CN111160004B - Method and device for establishing sentence-breaking model - Google Patents


Info

Publication number
CN111160004B
CN111160004B (application CN201811320993.5A)
Authority
CN
China
Prior art keywords: sentence, breaking, word, deep learning, learning model
Prior art date
Legal status
Active
Application number
CN201811320993.5A
Other languages
Chinese (zh)
Other versions
CN111160004A (en)
Inventor
李晓普
王阳阳
Current Assignee
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd
Priority to CN201811320993.5A
Publication of CN111160004A
Application granted
Publication of CN111160004B
Legal status: Active


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a method and a device for establishing a sentence-breaking model. The method includes: performing word-segmentation processing on each acquired corpus sentence and determining the words contained in the corpus sentence; determining the rare words among the words contained in the corpus sentence and splitting the rare words with a subword segmentation algorithm; inputting the word sequence formed by the words obtained after word segmentation and subword splitting into a deep learning model for sentence-break labeling; and adjusting the parameters of the deep learning model according to the original sentence-break marks of each corpus sentence and the sentence-break marks output for the corpus sentence by the deep learning model, thereby establishing the sentence-breaking model. A sentence carrying no sentence-break marks can then be processed with the established model, so that the user is no longer shown long unpunctuated sentences; the readability and understandability of the sentences are improved, and the user experience can thus be improved.

Description

Method and device for establishing sentence-breaking model
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for establishing a sentence-breaking model.
Background
In recent years, with the rapid development of speech recognition technology, the application fields of speech recognition, such as voice messaging, voice memo, simultaneous interpretation, etc., are increasing.
However, when voice information is recognized into a corresponding sentence (whose meaning may be complete or incomplete), the sentence carries no sentence-break marks, which hinders reading and understanding; and the longer the voice information, the more characters the recognized sentence contains and the harder it is for the user to read. Related researchers have therefore begun to study how to break sentences that carry no sentence-break marks, so as to improve the user experience.
Disclosure of Invention
The embodiments of the present application provide a method and a device for establishing a deep learning model for sentence breaking, which are used to break sentences that carry no sentence-break marks and thereby improve the user experience.
In a first aspect, a method for establishing a sentence-breaking model provided in an embodiment of the present application includes:
performing word segmentation processing on each acquired corpus sentence, and determining words contained in the corpus sentence;
determining rare words in words contained in the corpus sentence, and segmenting the rare words by using a subword segmentation algorithm;
inputting the word sequence formed by the words obtained after word segmentation and subword splitting into a deep learning model for sentence-break labeling;
and adjusting parameters of the deep learning model according to the original sentence-break marks of each corpus sentence and the sentence-break marks output for the corpus sentence by the deep learning model, thereby establishing the sentence-breaking model.
After the sentence-breaking model is established with this scheme, sentences without sentence-break marks can be broken by the established model, so that the user is no longer shown long sentences lacking sentence-break marks; the readability and understandability of the sentences are improved, and the user experience can therefore be improved.
In one possible implementation, the corpus sentence is obtained according to the following steps:
acquiring a preset number of sample sentences, wherein sentence ends of the sample sentences are provided with sentence breaking marks;
splicing part or all of the sample sentences;
and dividing each spliced sample sentence, and determining the divided sample sentences as the corpus sentences.
In one possible implementation manner, the splitting of each spliced sample sentence includes:
dividing each spliced sample sentence according to a set step length; or
randomly dividing each spliced sample sentence.
In one possible implementation, the subword segmentation algorithm is the byte-pair encoding (BPE) algorithm.
In one possible implementation manner, the deep learning model is controlled to label each word in the word sequence according to the following steps:
analyzing the contextual information of the word in the word sequence;
determining a first probability of marking the word with a sentence-breaking mark and a second probability of marking the word with a non-sentence-breaking mark according to the context information of the word;
and labeling the word with the mark corresponding to the larger of the first probability and the second probability.
In one possible implementation manner, after determining the first probability of marking the word with a sentence-breaking identifier and the second probability of marking the word with a non-sentence-breaking identifier according to the context information of the word, the deep learning model is further controlled to execute:
adjusting the first probability and the second probability according to the labeling results of the already-labeled words in the word sequence; and
labeling the word with the mark corresponding to the larger of the first probability and the second probability includes:
labeling the word with the mark corresponding to the larger of the adjusted first probability and the adjusted second probability.
In one possible implementation manner, according to the original sentence breaking identifier of each corpus sentence and the sentence breaking identifier corresponding to the corpus sentence output by the deep learning model, adjusting parameters of the deep learning model includes:
comparing, for each corpus sentence, the original position of its sentence-break mark with the position of the sentence-break mark output for the corpus sentence by the deep learning model, and adjusting the parameters of the deep learning model so that the position of the sentence-break mark output by the adjusted model is the same as the original position of the sentence-break mark of the corpus sentence.
In one possible implementation manner, after adjusting the parameters of the deep learning model, the method further includes:
inputting at least one test sentence into the adjusted deep learning model for sentence breaking and marking;
determining the marking accuracy of the adjusted deep learning model according to the original sentence-breaking mark of the test sentence and the sentence-breaking mark corresponding to the test sentence output by the adjusted deep learning model;
if the determined labeling accuracy is greater than or equal to the preset accuracy, determining the adjusted deep learning model as the sentence-breaking model;
and if the determined labeling accuracy is smaller than the preset accuracy, training the adjusted deep learning model according to at least one new corpus sentence to establish the sentence-breaking model.
In one possible implementation manner, after the sentence-breaking model is built, the method further includes:
performing sentence breaking processing on an input character sequence by using the sentence breaking model, wherein the character sequence is obtained by performing voice recognition processing on an acquired voice signal;
and outputting the character sequence after sentence breaking processing.
In a second aspect, an apparatus for creating a sentence-breaking model provided in an embodiment of the present application includes:
the preprocessing module is used for carrying out word segmentation processing on each acquired corpus sentence and determining words contained in the corpus sentence; determining rare words in words contained in the corpus sentence, and segmenting the rare words by using a subword segmentation algorithm;
the marking module is used for inputting word sequences formed by words obtained after word segmentation and segmentation into the deep learning model for sentence breaking marking;
and the adjustment module is used for adjusting parameters of the deep learning model according to the original sentence-breaking identification of each corpus sentence and the sentence-breaking identification corresponding to the corpus sentence output by the deep learning model, and establishing a sentence-breaking model.
In one possible implementation manner, the preprocessing module is specifically configured to obtain a corpus sentence according to the following steps:
acquiring a preset number of sample sentences, wherein sentence ends of the sample sentences are provided with sentence breaking marks;
splicing part or all of the sample sentences;
and dividing each spliced sample sentence, and determining the divided sample sentences as the corpus sentences.
In one possible implementation, the preprocessing module is specifically configured to:
dividing each spliced sample sentence according to a set step length; or
randomly dividing each spliced sample sentence.
In one possible implementation, the subword segmentation algorithm is the byte-pair encoding (BPE) algorithm.
In one possible implementation manner, the labeling module is specifically configured to control the deep learning model to label each word in the word sequence according to the following steps:
analyzing the contextual information of the word in the word sequence;
determining a first probability of marking the word with a sentence-breaking mark and a second probability of marking the word with a non-sentence-breaking mark according to the context information of the word;
and labeling the word with the mark corresponding to the larger of the first probability and the second probability.
In one possible implementation, the labeling module further controls the deep learning model to perform:
after determining, according to the context information of the word, a first probability of labeling the word with a sentence-break mark and a second probability of labeling it with a non-sentence-break mark, adjusting the first probability and the second probability according to the labeling results of the already-labeled words in the word sequence;
and labeling the word with the mark corresponding to the larger of the adjusted first probability and the adjusted second probability.
In one possible embodiment, the adjustment module is specifically configured to:
comparing, for each corpus sentence, the original position of its sentence-break mark with the position of the sentence-break mark output for the corpus sentence by the deep learning model, and adjusting the parameters of the deep learning model so that the position of the sentence-break mark output by the adjusted model is the same as the original position of the sentence-break mark of the corpus sentence.
In one possible implementation, the apparatus further comprises a test module configured to:
after the parameters of the deep learning model are adjusted, inputting at least one test sentence into the adjusted deep learning model for sentence breaking and marking;
determining the labeling accuracy of the adjusted deep learning model according to the original sentence-break marks of the test sentence and the sentence-break marks output for the test sentence by the adjusted deep learning model;
if the determined labeling accuracy is greater than or equal to the preset accuracy, determining the adjusted deep learning model as the sentence-breaking model;
and if the determined labeling accuracy is smaller than the preset accuracy, triggering the preprocessing module, the labeling module and the adjusting module, and training the adjusted deep learning model according to at least one new corpus sentence to establish the sentence-breaking model.
In one possible implementation, the apparatus further comprises a sentence-breaking module configured to:
after establishing a sentence-breaking model, performing sentence-breaking processing on an input character sequence by using the sentence-breaking model, wherein the character sequence is obtained by performing voice recognition processing on an acquired voice signal;
and outputting the character sequence after sentence breaking processing.
In a third aspect, an electronic device provided in an embodiment of the present application includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein:
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above method of establishing a deep learning model for sentence breaking.
In a fourth aspect, an embodiment of the present application provides a computer readable medium storing computer executable instructions for executing the above method for creating a deep learning model for sentence breaking.
In addition, for the technical effects of any design of the second to fourth aspects, reference may be made to the technical effects of the corresponding implementations of the first aspect, which are not repeated here.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a schematic structural diagram of a computing device of a method for creating a sentence-breaking model according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for creating a sentence-breaking model according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a device for creating a sentence-breaking model according to an embodiment of the present application.
Detailed Description
In order to break sentences without broken sentence identification and improve user experience, the embodiment of the application provides a method and a device for establishing a broken sentence model.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and are not intended to limit the present application, and embodiments and features of embodiments of the present application may be combined with each other without conflict.
In order to facilitate understanding of the present application, the present application refers to the technical terms:
the sentence breaking identifier may be a symbol for dividing a sentence, such as a punctuation symbol, for example, "/", or a punctuation symbol, for example, ". "? ".
The non-sentence-break mark is a symbol indicating that no sentence break occurs; it can be specified according to actual requirements, for example a space or a tab.
A word represents a semantic unit and may include one, two, three, or more characters; for example, "I", "want to", and "go to school" are each single words.
The method provided in the present application may be applied to a variety of computing devices, and fig. 1 shows a schematic structural diagram of a computing device, where the computing device 10 shown in fig. 1 is merely an example, and does not impose any limitation on the functions and application scope of the embodiments of the present application.
As shown in fig. 1, computing device 10 is embodied in the form of a general purpose computing device, and the components of computing device 10 may include, but are not limited to: at least one processing unit 101, at least one memory unit 102, a bus 103 connecting the different system components, including the memory unit 102 and the processing unit 101.
Bus 103 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a processor or local bus using any of a variety of bus architectures.
The storage unit 102 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 1021 and/or cache memory 1022, and may further include Read Only Memory (ROM) 1023.
Storage unit 102 may also include program/utility 1025 having a set (at least one) of program modules 1024, such program modules 1024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The computing device 10 may also communicate with one or more external devices 104 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with the computing device 10, and/or with any device (e.g., router, modem, etc.) that enables the computing device 10 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 105. Moreover, computing device 10 may also communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet, through network adapter 106. As shown in FIG. 1, the network adapter 106 communicates with the other modules of computing device 10 over bus 103. It should be appreciated that although not shown in fig. 1, other hardware and/or software modules may be used in connection with computing device 10, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It will be appreciated by those skilled in the art that FIG. 1 is merely an example of a computing device and is not intended to be limiting of the computing device, and may include more or fewer components than shown, or may combine certain components, or different components.
The sentence-breaking model established in the embodiments of the present application may be applied to any scenario requiring sentence breaking, such as simultaneous interpretation or real-time captioning. Referring to fig. 2, fig. 2 is a flow chart of a method for establishing a sentence-breaking model provided in an embodiment of the present application. The following description takes applying the method to the computing device 10 shown in fig. 1 as an example; the specific implementation flow of the method is as follows:
s201: and acquiring a preset number of sample sentences, wherein the sentence ends of the sample sentences are provided with sentence breaking marks.
Here, the sample sentences may be independent of each other or may have an association relationship.
S202: and splicing part or all of the sample sentences, dividing each spliced sample sentence, and determining the divided sample sentences as corpus sentences.
When voice data must be recognized in real time, part of the character sequence recognized this time may form one sentence together with the character sequence recognized last time. If sentence breaking is performed on such a character sequence, sentence-break marks are likely to appear in the middle of the sequence. To handle this situation well, the corpus sentences used to establish the sentence-breaking model should be diversified, with sentence-break marks not all appearing at the end of a sentence.
Therefore, after a preset number of sample sentences with sentence-break marks at the end is obtained, some or all of the sample sentences can be spliced, and each spliced sample sentence can be divided, for example according to a set step length or randomly. The divided sample sentences are then used as corpus sentences for establishing the sentence-breaking model. This reduces the probability that sentence-break marks appear only at the end of a sentence and increases the probability that they appear inside a sentence, which better matches the real scenario; consequently, when the established deep learning model is applied to that scenario, its sentence-breaking accuracy is higher.
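The splicing-and-redividing procedure of S201–S202 can be sketched as follows. The token-level representation, the chunk lengths, and the "/" break symbol are illustrative assumptions, not details fixed by the patent:

```python
import random

BREAK = "/"  # illustrative sentence-break mark

def build_corpus(samples, step=None, seed=0):
    """Splice sample sentences (each ending in a break mark) into one token
    stream, then re-divide it so break marks can fall mid-chunk."""
    tokens = []
    for s in samples:
        tokens.extend(s.split())
        tokens.append(BREAK)                     # break mark at the end of each sample
    rng = random.Random(seed)
    corpus, i = [], 0
    while i < len(tokens):
        n = step if step else rng.randint(3, 8)  # set step length, or random division
        corpus.append(tokens[i:i + n])
        i += n
    return corpus

chunks = build_corpus(["how are you", "I want to go to school"], step=5)
print(chunks[0])  # ['how', 'are', 'you', '/', 'I'] -- the break mark now sits mid-chunk
```

After redivision, break marks land inside the training chunks rather than only at chunk ends, which is exactly the diversity the text calls for.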
S203: performing word segmentation on each corpus sentence and determining the words contained in the corpus sentence; if the corpus sentence is determined to contain rare words, splitting each rare word again with a subword segmentation algorithm.
Here, rare words are words with a low occurrence frequency in the corpus sentences, for example, words occurring fewer than a set number of times.
In a specific implementation, tools such as jieba, SnowNLP, THULAC, and NLPIR, which perform the word segmentation of each corpus sentence, can also report which words are rare. If the words contained in a corpus sentence are determined to include rare words, each rare word can be split again with a subword segmentation algorithm, for example the byte-pair encoding (BPE) algorithm; this processing may also be called BPE processing. It reduces the influence of rare words on the corpus sentence, allows the meaning of the corpus sentence to be understood more fully, and improves the accuracy of segmenting the corpus sentence.
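A minimal sketch of the rare-word handling described above. In practice a segmenter such as jieba and a trained BPE vocabulary would be used; here word frequency stands in for rarity, and a simple two-piece split with an "@@" continuation mark (a common BPE convention, assumed here) stands in for a learned BPE merge:

```python
from collections import Counter

def split_rare(sentences, min_count=2):
    """Treat words occurring fewer than min_count times as rare and split
    them into subword pieces; a non-final piece is marked with "@@"."""
    freq = Counter(w for s in sentences for w in s)
    out = []
    for s in sentences:
        seq = []
        for w in s:
            if freq[w] >= min_count or len(w) == 1:
                seq.append(w)
            else:
                mid = len(w) // 2          # a trained BPE merge table would decide this
                seq.extend([w[:mid] + "@@", w[mid:]])
        out.append(seq)
    return out

sents = [["i", "want", "to", "go"], ["i", "want", "snorkeling"]]
print(split_rare(sents)[1])  # ['i', 'want', 'snork@@', 'eling']
```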
S204: and inputting word sequences formed by words obtained after word segmentation and segmentation of each corpus sentence into a deep learning model for sentence breaking and marking.
Here, each corpus sentence yields a plurality of words after word segmentation and subword splitting, and the word sequence of the corpus sentence can be formed according to the position of each word in the corpus sentence.
For example, a corpus sentence is "i want to go to school", and words obtained by performing word segmentation and segmentation on the corpus sentence are "i", "go to school", "want", and then a word sequence finally formed according to the appearance position of each word in the corpus sentence is { i, want, go to school }.
In a specific implementation, after the word sequence formed from each corpus sentence is input into the deep learning model, the model can, for each word in the word sequence, analyze the context information of the word, determine from that context a first probability of labeling the word with a sentence-break mark and a second probability of labeling it with a non-sentence-break mark, and then label the word with the mark corresponding to the larger of the two probabilities, for example by adding that mark after the word. After all words are labeled, the labeled corpus sentence is output.
Optionally, the sentence-break marks used to label the corpus sentences may be of only one type, such as "/", or of several types used at the same time, such as ",", ".", and "?". The procedure below is described using a single type of sentence-break mark as an example.
For example, suppose the word sequence formed from a corpus sentence is {word 1, word 2, word 3, word 4, word 5}, the non-sentence-break mark is "^", and the sentence-break mark is "/". If the probability of adding "^" after word 1 is 0.7 and of adding "/" is 0.3; after word 2, 0.4 for "^" and 0.6 for "/"; and after each of words 3, 4, and 5, 0.6 for "^" and 0.4 for "/", then the corpus sentence is labeled as: word 1 ^ word 2 / word 3 ^ word 4 ^ word 5 ^, and the labeled corpus sentence can be output. In actual processing, if the word sequence contains rare words, the output corpus sentence also contains BPE marks.
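The per-word labeling rule, append whichever mark has the larger probability, can be sketched as follows, with "^" as an illustrative non-sentence-break mark, "/" as the sentence-break mark, and made-up probabilities mirroring the example:

```python
NON_BREAK, BREAK = "^", "/"

def label(words, probs):
    """probs[i] = (p_break, p_non_break) for words[i]; append whichever
    mark has the larger probability after each word."""
    out = []
    for w, (p_brk, p_non) in zip(words, probs):
        out.append(w)
        out.append(BREAK if p_brk > p_non else NON_BREAK)
    return " ".join(out)

words = ["w1", "w2", "w3", "w4", "w5"]
probs = [(0.3, 0.7), (0.6, 0.4), (0.4, 0.6), (0.4, 0.6), (0.4, 0.6)]
print(label(words, probs))  # w1 ^ w2 / w3 ^ w4 ^ w5 ^
```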
In a specific implementation, to make the labeling of each word in the corpus sentence more accurate, after the first probability of labeling a word with a sentence-break mark and the second probability of labeling it with a non-sentence-break mark are determined from the word's context information, the two probabilities can be adjusted according to the labeling results of the already-labeled words in the word sequence, and the word is then labeled with the mark corresponding to the larger of the adjusted probabilities.
For example, suppose the word sequence formed from a corpus sentence is {word 1, word 2, word 3, word 4, word 5}, where the non-sentence-break mark "^" has already been added after word 1 and the sentence-break mark "/" after word 2. Taking word 3 as an example, after the probabilities of labeling word 3 with the sentence-break mark and the non-sentence-break mark are determined, the labeling information already added for words 1 and 2 can be analyzed: since a sentence-break mark has just been added after word 2, the probability of another sentence break immediately after word 3 should not be too high, i.e., the probability of the non-sentence-break mark is relatively high. If the determined probability of the non-sentence-break mark for word 3 is slightly smaller than that of the sentence-break mark, the non-sentence-break probability can be increased appropriately and the sentence-break probability decreased appropriately. By combining the labeling results of the already-labeled words in this way, the sentence-break marks added over the whole word sequence conform better to the actual situation, further improving the accuracy of sentence breaking.
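The context-dependent adjustment can be sketched by damping the break probability of the word immediately after an emitted break mark; the one-word window and the adjustment size are illustrative assumptions, not values from the patent:

```python
def label_with_context(words, probs, boost=0.15):
    """probs[i] = (p_break, p_non_break). After a break mark was just
    emitted, damp the next word's break probability before comparing."""
    out, just_broke = [], False
    for w, (p_brk, p_non) in zip(words, probs):
        if just_broke:                       # a break right after a break is unlikely
            p_brk, p_non = p_brk - boost, p_non + boost
        just_broke = p_brk > p_non
        out.append(w + (" /" if just_broke else " ^"))
    return " ".join(out)

probs = [(0.3, 0.7), (0.6, 0.4), (0.55, 0.45), (0.4, 0.6), (0.4, 0.6)]
print(label_with_context(["w1", "w2", "w3", "w4", "w5"], probs))
# w1 ^ w2 / w3 ^ w4 ^ w5 ^   (w3's raw 0.55 break probability is damped to 0.40)
```

Without the adjustment, w3's raw probabilities (0.55 vs. 0.45) would have produced a second break mark directly after the one following w2.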
S205: adjusting parameters of the deep learning model according to the original sentence-break marks of each corpus sentence and the sentence-break marks output for the corpus sentence by the deep learning model.
In implementation, for each corpus sentence, whether the position of the original sentence-break mark of the corpus sentence is the same as the position of the sentence-break mark output for it by the deep learning model can be compared; if not, the parameters of the deep learning model can be adjusted so that the position of the sentence-break mark output by the adjusted model is the same as the position of the original sentence-break mark of the corpus sentence.
For example, a loss function measuring the deviation between the original sentence-break marks of the corpus sentence and the sentence-break marks output for it by the deep learning model can be calculated, and the parameters of the deep learning model can then be adjusted with a gradient descent algorithm to reduce the loss function, stopping when the position of the sentence-break mark output by the adjusted model is the same as the original position of the sentence-break mark of the corpus sentence.
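A toy version of the parameter adjustment in S205: per-word break decisions are scored with a binary cross-entropy loss, and gradient descent on a single logistic parameter stands in for updating the full deep learning model (an assumption made to keep the sketch self-contained):

```python
import math

def bce_loss(p_pred, y_true):
    """Binary cross-entropy between predicted break probabilities and the
    original (gold) break marks, with y_true[i] in {0, 1}."""
    eps = 1e-9
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(p_pred, y_true)) / len(y_true)

def step(w, xs, ys, lr=0.5):
    """One gradient-descent step for a logistic model p = sigmoid(w * x);
    the gradient of BCE w.r.t. w is mean((sigmoid(w*x) - y) * x)."""
    grad = sum((1 / (1 + math.exp(-w * x)) - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad

xs, ys = [1.0, -1.0, 2.0, -2.0], [1, 0, 1, 0]  # toy per-word features and gold labels
w, losses = 0.0, []
for _ in range(50):
    preds = [1 / (1 + math.exp(-w * x)) for x in xs]
    losses.append(bce_loss(preds, ys))
    w = step(w, xs, ys)
assert losses[-1] < losses[0]  # the loss decreases as the parameter is adjusted
```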
S206: and testing the adjusted deep learning model by using the test sentence, and determining the labeling accuracy of the deep learning model according to the test result.
Wherein the test sentence is a sentence for which the break identification position is known.
S207: judging whether the labeling accuracy is smaller than the preset accuracy; if yes, proceeding to S208; if not, proceeding to S209.
S208: training the adjusted deep learning model according to at least one new corpus sentence, taking the trained deep learning model as the new adjusted deep learning model, and returning to S206.
Here, a new corpus sentence is a newly added corpus sentence, different from the corpus sentences previously used in training the sentence-breaking model.
S209: and taking the adjusted deep learning model as an established sentence-breaking model.
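The accuracy test of S206–S209 reduces to comparing, position by position, the model's break decisions with the original marks; representing the marks as sets of break positions is an assumption of this sketch:

```python
def labeling_accuracy(gold_breaks, pred_breaks, n_words):
    """Per-word accuracy: at each of the n_words positions, check whether the
    model's break/non-break decision matches the original marks."""
    correct = sum((i in gold_breaks) == (i in pred_breaks) for i in range(n_words))
    return correct / n_words

acc = labeling_accuracy({1, 4}, {1, 3}, n_words=5)
print(acc)  # 0.6 -- positions 0, 1, 2 agree; positions 3 and 4 disagree
```

If `acc` reaches the preset accuracy, the adjusted model is accepted as the sentence-breaking model; otherwise training continues on new corpus sentences, as in S208.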
S210: and performing sentence breaking processing on the input character sequence by using the established sentence breaking model, and outputting the character sequence subjected to the sentence breaking processing.
The input character sequence is obtained by performing voice recognition processing on the collected voice signals.
Specifically, word segmentation processing can be performed on the input character sequence; if the resulting words contain rare words, each rare word is split again with the subword segmentation algorithm. The word sequence formed by the words obtained after word segmentation and subword splitting is then input into the deep learning model for sentence-break labeling, and the model outputs the character sequence after sentence-break processing.
In a specific implementation, the character sequence output by the deep learning model carries several kinds of label information, such as sentence-break markers and non-break markers and, if rare words are present, the labels produced by BPE processing. After the sentence-broken character sequence is obtained from the deep learning model, the non-break markers can therefore be filtered out and the BPE segmentation reversed before the sentence-broken character sequence is displayed to the user. Because the entire process is invisible to the user, the user ultimately sees a clear and complete sentence with no trace of processing, which further improves the user experience.

According to the sentence-breaking model provided by this embodiment of the application, for each word in a corpus sentence, the probabilities of labeling the word with the sentence-break marker and with the non-break marker are determined from the word's context information, and before the word is labeled these probabilities can be further adjusted according to the labels already assigned to preceding words in the word sequence corresponding to the corpus sentence, so that the word is labeled with the marker of highest probability. This closely matches the characteristics of natural semantics, making the resulting sentence breaks more reasonable.
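The post-processing described above can be sketched as follows. The tag names ("B" for a break, "O" for no break) and the "@@" BPE continuation suffix are assumed conventions for the example, not notation fixed by the embodiment.

```python
def postprocess(tagged_tokens):
    """Filter non-break tags, reverse BPE subword splits, and render the breaks."""
    words, current = [], ""
    for token, tag in tagged_tokens:
        if token.endswith("@@"):     # reverse BPE: glue this piece to the next one
            current += token[:-2]
            continue
        current += token
        words.append(current + ("." if tag == "B" else ""))  # keep breaks, drop "O"
        current = ""
    return " ".join(words)

tagged = [("hello", "O"), ("wor@@", "O"), ("ld", "B"), ("how", "O"),
          ("are", "O"), ("you", "B")]
assert postprocess(tagged) == "hello world. how are you."
```

The user sees only the final string; the intermediate tags and subword pieces leave no trace, as the text describes.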
In addition, in this embodiment of the application, the language that the established sentence-breaking model can segment is determined by the language of the sample sentences used to build it: if the sample sentences are English, the model can break English text, and if they are Chinese, it can break Chinese text, so the approach has good universality.
In addition, this embodiment of the application also provides a network structure for the deep learning model, with the order of the layers indicated by arrows: the embedding layer encodes the semantics of each word in the word sequence formed from the corpus sentence; the bilstm layer analyzes the contextual semantics of each word from the semantic encodings of the several words before and after it in the word sequence; the softmax layer determines, from each word's contextual semantics, the probabilities of labeling the word with the sentence-break marker and with the non-break marker; and the crf layer adjusts the current word's probabilities according to the labels already assigned to preceding words in the word sequence, labels the word with whichever adjusted marker has the higher probability, and outputs the final sentence-break labeling result.
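The role of the crf layer, adjusting the softmax probabilities with label-transition scores and decoding the best overall label sequence, can be illustrated with a small Viterbi decoder. The labels, transition scores, and probabilities below are invented for the example; in the embodiment these would be learned parameters.

```python
import math

LABELS = ("B", "O")                           # sentence-break / non-break markers
TRANS = {("B", "B"): -2.0, ("B", "O"): 0.0,   # breaking twice in a row is penalized
         ("O", "B"): 0.0, ("O", "O"): 0.0}

def viterbi(word_probs):
    """word_probs: per word, a dict label -> softmax probability. Returns best path."""
    best = {lab: (math.log(word_probs[0][lab]), [lab]) for lab in LABELS}
    for probs in word_probs[1:]:
        new = {}
        for lab in LABELS:
            # adjust each word's score by the transition from the previous label
            score, path = max(
                (best[prev][0] + TRANS[(prev, lab)] + math.log(probs[lab]),
                 best[prev][1] + [lab])
                for prev in LABELS)
            new[lab] = (score, path)
        best = new
    return max(best.values())[1]

# Softmax alone would put a break after each of the last two words; the
# transition penalty keeps only one of them.
probs = [{"B": 0.1, "O": 0.9}, {"B": 0.6, "O": 0.4}, {"B": 0.7, "O": 0.3}]
assert viterbi(probs) == ["O", "O", "B"]
```

Compared with per-word argmax over the softmax outputs (which would yield O, B, B here), the decoded sequence reflects the labeling of the preceding words, as the crf layer description requires.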
When the methods provided in the embodiments of the present application are implemented in software, in hardware, or in a combination of software and hardware, the computing device 10 may include a plurality of functional modules, each of which may comprise software, hardware, or a combination of the two. Specifically, fig. 3 is a schematic structural diagram of a device 30 for establishing a sentence-breaking model according to an embodiment of the present application, which includes a preprocessing module 301, a labeling module 302, and an adjusting module 303.
The preprocessing module 301 is configured to perform word segmentation on each obtained corpus sentence and determine the words contained in the corpus sentence, and to determine the rare words among those words and segment the rare words with a subword segmentation algorithm;
the labeling module 302 is configured to input the word sequence formed from the words obtained after word segmentation and subword segmentation into the deep learning model for sentence-break labeling;
and the adjusting module 303 is configured to adjust parameters of the deep learning model according to the original sentence-break identifier of each corpus sentence and the sentence-break identifier corresponding to the corpus sentence output by the deep learning model, so as to establish the sentence-breaking model.
In one possible implementation manner, the preprocessing module 301 is specifically configured to obtain a corpus sentence according to the following steps:
acquiring a preset number of sample sentences, wherein the end of each sample sentence carries a sentence-break marker;
splicing some or all of the sample sentences;
and dividing each spliced sample sentence, and determining the divided sample sentences as the corpus sentences.
In one possible implementation, the preprocessing module 301 is specifically configured to:
dividing each spliced sample sentence according to a set step length; or
randomly dividing each spliced sample sentence.
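The corpus construction just described can be sketched as follows. The "|" break marker, whitespace tokenization, and the choice of exactly two random cut points are assumptions made for the example.

```python
import random

def make_corpus(samples, step=None, seed=0):
    """Splice sample sentences (each ending in a break marker), then divide."""
    words = []
    for s in samples:
        words.extend(s.split() + ["|"])   # "|" stands in for the break marker
    if step is not None:                   # division at a set step length
        cuts = range(step, len(words), step)
    else:                                  # random division
        rng = random.Random(seed)
        cuts = sorted(rng.sample(range(1, len(words)), k=2))
    corpus, start = [], 0
    for cut in list(cuts) + [len(words)]:
        corpus.append(" ".join(words[start:cut]))
        start = cut
    return corpus

samples = ["how are you", "fine thanks"]
assert make_corpus(samples, step=3) == ["how are you", "| fine thanks", "|"]
```

Note that after division the break markers can fall mid-sequence rather than only at the end, which gives the model training examples whose break positions are not trivially at the sentence boundary.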
In one possible implementation, the subword segmentation algorithm is the byte pair encoding (BPE) algorithm.
In one possible implementation manner, the labeling module 302 is specifically configured to control the deep learning model to label each word in the word sequence according to the following steps:
analyzing the context information of the word in the word sequence;
determining, according to the context information of the word, a first probability of labeling the word with the sentence-break marker and a second probability of labeling the word with the non-break marker;
and labeling the word with the marker corresponding to the larger of the first probability and the second probability.
In one possible implementation, the labeling module 302 further controls the deep learning model to perform:
after the first probability of labeling the word with the sentence-break marker and the second probability of labeling the word with the non-break marker are determined from the word's context information, adjusting the first probability and the second probability according to the labels already assigned to words in the word sequence;
and labeling the word with the marker corresponding to the larger of the adjusted first probability and the adjusted second probability.
In one possible implementation, the adjustment module 303 is specifically configured to:
and comparing the original position of the sentence breaking mark of each corpus sentence with the position of the sentence breaking mark corresponding to the corpus sentence output by the deep learning model, and adjusting the parameters of the deep learning model so that the adjusted position of the sentence breaking mark corresponding to the corpus sentence output by the deep learning model is the same as the original position of the sentence breaking mark of the corpus sentence.
In one possible implementation, the device further includes a test module 304 configured to:
after the parameters of the deep learning model are adjusted, inputting at least one test sentence into the adjusted deep learning model for sentence breaking and marking;
determining the marking accuracy of the adjusted deep learning model according to the original sentence-breaking mark of the test sentence and the sentence-breaking mark corresponding to the test sentence output by the adjusted deep learning model;
If the determined labeling accuracy is greater than or equal to the preset accuracy, determining the adjusted deep learning model as the sentence-breaking model;
and if the determined labeling accuracy is smaller than the preset accuracy, triggering the preprocessing module, the labeling module and the adjusting module, and training the adjusted deep learning model according to at least one new corpus sentence to establish the sentence-breaking model.
In one possible implementation, the device further includes a sentence-breaking module 305 configured to:
after establishing a sentence-breaking model, performing sentence-breaking processing on an input character sequence by using the sentence-breaking model, wherein the character sequence is obtained by performing voice recognition processing on an acquired voice signal;
and outputting the character sequence after sentence breaking processing.
The division into modules in the embodiments of the present application is merely one schematic division by logical function; other divisions are possible in actual implementation. In addition, the functional modules in the embodiments of the present application may be integrated into one processor, may exist separately and physically, or two or more modules may be integrated into one module. The coupling between modules may be realized through interfaces, which are typically electrical communication interfaces, though mechanical or other forms of interface are not excluded. Thus, modules illustrated as separate components may or may not be physically separate, and may be located in one place or distributed across different locations on the same or different devices. The integrated modules may be implemented in hardware or as software functional modules.
The embodiment of the application also provides electronic equipment, which comprises: at least one processor, and a memory communicatively coupled to the at least one processor, wherein:
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments described above.
The embodiment of the application also provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, cause the processor to perform the method of any one of the embodiments described above.
In some possible embodiments, various aspects of the methods for creating a deep learning model for sentence breaking provided herein may also be implemented in the form of a program product comprising program code for causing an electronic device to perform the steps of the methods for creating a deep learning model for sentence breaking described herein above according to various exemplary embodiments of the present application when the program product is run on the electronic device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application for performing the creation of a deep learning model of sentence breaking may employ a portable compact disc read only memory (CD-ROM) and include program code and may be run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this neither requires nor suggests that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (18)

1. A method for establishing a sentence-breaking model, characterized by comprising the following steps:
performing word segmentation processing on each acquired corpus sentence, and determining words contained in the corpus sentence;
determining rare words in words contained in the corpus sentence, and segmenting the rare words by using a subword segmentation algorithm;
inputting word sequences formed by words obtained after word segmentation and segmentation into a deep learning model for sentence breaking and marking;
adjusting parameters of the deep learning model according to the original sentence-break identifier of each corpus sentence and the sentence-break identifier corresponding to the corpus sentence output by the deep learning model, so as to establish the sentence-breaking model;
wherein adjusting the parameters of the deep learning model according to the original sentence-break identifier of each corpus sentence and the sentence-break identifier corresponding to the corpus sentence output by the deep learning model comprises:
comparing the original position of the sentence-break identifier of each corpus sentence with the position of the sentence-break identifier corresponding to the corpus sentence output by the deep learning model, and adjusting the parameters of the deep learning model so that the adjusted position of the sentence-break identifier corresponding to the corpus sentence output by the deep learning model is the same as the original position of the sentence-break identifier of the corpus sentence.
2. The method of claim 1, wherein the corpus sentence is obtained according to the steps of:
acquiring a preset number of sample sentences, wherein sentence ends of the sample sentences are provided with sentence breaking marks;
splicing part or all of the sample sentences;
and dividing each spliced sample sentence, and determining the divided sample sentences as the corpus sentences.
3. The method of claim 2, wherein segmenting each spliced sample sentence comprises:
dividing each spliced sample sentence according to a set step length; or
randomly dividing each spliced sample sentence.
4. The method of claim 1, wherein the subword segmentation algorithm is the byte pair encoding (BPE) algorithm.
5. The method of claim 1, wherein the deep learning model is controlled to annotate each term in the sequence of terms according to the steps of:
analyzing the context information of the word in the word sequence;
determining, according to the context information of the word, a first probability of labeling the word with the sentence-break identifier and a second probability of labeling the word with the non-break identifier;
and labeling the word with the identifier corresponding to the larger of the first probability and the second probability.
6. The method of claim 5, wherein, after the first probability of labeling the word with the sentence-break identifier and the second probability of labeling the word with the non-break identifier are determined according to the context information of the word, the deep learning model is further controlled to perform:
adjusting the first probability and the second probability according to the labeling of the words already labeled in the word sequence; and
wherein labeling the word with the identifier corresponding to the larger of the first probability and the second probability comprises:
labeling the word with the identifier corresponding to the larger of the adjusted first probability and the adjusted second probability.
7. The method of claim 1, wherein after adjusting parameters of the deep learning model, further comprising:
inputting at least one test sentence into the adjusted deep learning model for sentence breaking and marking;
Determining the marking accuracy of the adjusted deep learning model according to the original sentence-breaking mark of the test sentence and the sentence-breaking mark corresponding to the test sentence output by the adjusted deep learning model;
if the determined labeling accuracy is greater than or equal to the preset accuracy, determining the adjusted deep learning model as the sentence-breaking model;
and if the determined labeling accuracy is smaller than the preset accuracy, training the adjusted deep learning model according to at least one new corpus sentence to establish the sentence-breaking model.
8. The method of claim 1, further comprising, after building the sentence-breaking model:
performing sentence breaking processing on an input character sequence by using the sentence breaking model, wherein the character sequence is obtained by performing voice recognition processing on an acquired voice signal;
and outputting the character sequence after sentence breaking processing.
9. A device for establishing a sentence-breaking model, characterized by comprising:
the preprocessing module is used for carrying out word segmentation processing on each acquired corpus sentence and determining words contained in the corpus sentence; determining rare words in words contained in the corpus sentence, and segmenting the rare words by using a subword segmentation algorithm;
The marking module is used for inputting word sequences formed by words obtained after word segmentation and segmentation into the deep learning model for sentence breaking marking;
the adjustment module is used for adjusting parameters of the deep learning model according to the original sentence-breaking identification of each corpus sentence and the sentence-breaking label corresponding to the corpus sentence output by the deep learning model, and establishing a sentence-breaking model;
the adjusting module is specifically configured to:
and comparing the original position of the sentence breaking mark of each corpus sentence with the position of the sentence breaking mark corresponding to the corpus sentence output by the deep learning model, and adjusting the parameters of the deep learning model so that the adjusted position of the sentence breaking mark corresponding to the corpus sentence output by the deep learning model is the same as the original position of the sentence breaking mark of the corpus sentence.
10. The apparatus of claim 9, wherein the preprocessing module is specifically configured to obtain a corpus sentence according to the following steps:
acquiring a preset number of sample sentences, wherein sentence ends of the sample sentences are provided with sentence breaking marks;
splicing part or all of the sample sentences;
and dividing each spliced sample sentence, and determining the divided sample sentences as the corpus sentences.
11. The apparatus of claim 10, wherein the preprocessing module is specifically configured to:
dividing each spliced sample sentence according to a set step length; or
randomly dividing each spliced sample sentence.
12. The apparatus of claim 9, wherein the subword segmentation algorithm is the byte pair encoding (BPE) algorithm.
13. The apparatus of claim 9, wherein the labeling module is specifically configured to control the deep learning model to label each term in the sequence of terms according to:
analyzing the context information of the word in the word sequence;
determining, according to the context information of the word, a first probability of labeling the word with the sentence-break identifier and a second probability of labeling the word with the non-break identifier;
and labeling the word with the identifier corresponding to the larger of the first probability and the second probability.
14. The apparatus of claim 13, wherein the labeling module further controls the deep learning model to perform:
after the first probability of labeling the word with the sentence-break identifier and the second probability of labeling the word with the non-break identifier are determined according to the context information of the word, adjusting the first probability and the second probability according to the labeling of the words already labeled in the word sequence;
and labeling the word with the identifier corresponding to the larger of the adjusted first probability and the adjusted second probability.
15. The apparatus of claim 9, further comprising a test module to:
after the parameters of the deep learning model are adjusted, inputting at least one test sentence into the adjusted deep learning model for sentence breaking and marking;
determining the marking accuracy of the adjusted deep learning model according to the original sentence-breaking mark of the test sentence and the sentence-breaking mark corresponding to the test sentence output by the adjusted deep learning model;
if the determined labeling accuracy is greater than or equal to the preset accuracy, determining the adjusted deep learning model as the sentence-breaking model;
and if the determined labeling accuracy is smaller than the preset accuracy, triggering the preprocessing module, the labeling module and the adjusting module, and training the adjusted deep learning model according to at least one new corpus sentence to establish the sentence-breaking model.
16. The apparatus of claim 9, further comprising a sentence breaking module for:
after establishing a sentence-breaking model, performing sentence-breaking processing on an input character sequence by using the sentence-breaking model, wherein the character sequence is obtained by performing voice recognition processing on an acquired voice signal;
And outputting the character sequence after sentence breaking processing.
17. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein:
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8.
18. A computer readable medium storing computer executable instructions for performing the method of any one of claims 1 to 8.
CN201811320993.5A 2018-11-07 2018-11-07 Method and device for establishing sentence-breaking model Active CN111160004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811320993.5A CN111160004B (en) 2018-11-07 2018-11-07 Method and device for establishing sentence-breaking model

Publications (2)

Publication Number Publication Date
CN111160004A CN111160004A (en) 2020-05-15
CN111160004B true CN111160004B (en) 2023-06-27

Family

ID=70554565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811320993.5A Active CN111160004B (en) 2018-11-07 2018-11-07 Method and device for establishing sentence-breaking model

Country Status (1)

Country Link
CN (1) CN111160004B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737991B (en) * 2020-07-01 2023-12-12 携程计算机技术(上海)有限公司 Text sentence breaking position identification method and system, electronic equipment and storage medium
CN111753524A (en) * 2020-07-01 2020-10-09 携程计算机技术(上海)有限公司 Text sentence break position identification method and system, electronic device and storage medium
CN112397052B (en) * 2020-11-19 2024-06-28 康键信息技术(深圳)有限公司 VAD sentence breaking test method, device, computer equipment and storage medium
CN112632988B (en) * 2020-12-29 2024-07-19 文思海辉智科科技有限公司 Sentence segment breaking method and device and electronic equipment
CN113779964A (en) * 2021-09-02 2021-12-10 中联国智科技管理(北京)有限公司 Statement segmentation method and device
CN116052648B (en) * 2022-08-03 2023-10-20 荣耀终端有限公司 Training method, using method and training system of voice recognition model

Citations (3)

Publication number Priority date Publication date Assignee Title
CN107247706A (en) * 2017-06-16 2017-10-13 中国电子技术标准化研究院 Text punctuate method for establishing model, punctuate method, device and computer equipment
CN107305575A (en) * 2016-04-25 2017-10-31 北京京东尚科信息技术有限公司 The punctuate recognition methods of human-machine intelligence's question answering system and device
CN107578770A (en) * 2017-08-31 2018-01-12 百度在线网络技术(北京)有限公司 Networking telephone audio recognition method, device, computer equipment and storage medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN111291553B (en) * 2014-10-24 2023-11-21 谷歌有限责任公司 Neural machine translation system with rare word processing

Non-Patent Citations (1)

Title
Zhang He; Wang Xiaodong; Yang Jianyu; Zhou Weidong. A cascaded-CRF-based method for sentence breaking and punctuation marking of classical Chinese texts. Application Research of Computers, 2009, (09). *


Similar Documents

Publication Publication Date Title
CN111160004B (en) Method and device for establishing sentence-breaking model
CN113705187B (en) Method and device for generating pre-training language model, electronic equipment and storage medium
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
CN107657947B (en) Speech processing method and device based on artificial intelligence
CN107908635B (en) Method and device for establishing text classification model and text classification
CN113807098B (en) Model training method and device, electronic equipment and storage medium
CN107291828B (en) Spoken language query analysis method and device based on artificial intelligence and storage medium
CN109313719B (en) Dependency resolution for generating text segments using neural networks
CN111160003B (en) Sentence breaking method and sentence breaking device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN103544955A (en) Method of recognizing speech and electronic device thereof
JP2021192277A (en) Method for extracting information, method for training extraction model, device, and electronic apparatus
CN109726397B (en) Labeling method and device for Chinese named entities, storage medium and electronic equipment
CN112860919B (en) Data labeling method, device, equipment and storage medium based on generation model
CN112188311B (en) Method and apparatus for determining video material of news
CN113743101B (en) Text error correction method, apparatus, electronic device and computer storage medium
CN111753524A (en) Text sentence break position identification method and system, electronic device and storage medium
CN113221565A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN111276129A (en) Method, device and equipment for segmenting the audio of a television series
CN113449489A (en) Punctuation labeling method and device, computer equipment and storage medium
CN114218940B (en) Text information processing and model training method, device, equipment and storage medium
CN113160820A (en) Speech recognition method, and training method, device and equipment of speech recognition model
CN114429106B (en) Page information processing method and device, electronic equipment and storage medium
CN114398952B (en) Training text generation method and device, electronic equipment and storage medium
CN113641724A (en) Knowledge tag mining method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant