CN112632988A - Sentence segmentation method and device and electronic equipment - Google Patents

Publication number
CN112632988A
Authority
CN
China
Prior art keywords
sentence
segment
label
target
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011598556.7A
Other languages
Chinese (zh)
Inventor
陈海燕
钱开源
林怀谦
金喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wensihai Huizhike Technology Co ltd
Original Assignee
Wensihai Huizhike Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wensihai Huizhike Technology Co ltd
Priority to CN202011598556.7A
Publication of CN112632988A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention provides a sentence-breaking method and device for sentence segments, and electronic equipment. The method comprises the following steps: acquiring a sentence segment to be punctuated that contains sentence segment labels, together with a text sentence-breaking result of the text content to be punctuated corresponding to that sentence segment; breaking the sentence segment to be punctuated with reference to the text sentence-breaking result to obtain a segment sentence-breaking result; and adjusting the sentence segment labels and/or the sentence-break positions in the segment sentence-breaking result according to a preset adjustment strategy to obtain a target sentence-breaking result of the sentence segment to be punctuated, so that the sentence segment can be translated correctly according to the target sentence-breaking result. The method can reasonably break a sentence segment that contains sentence segment labels and allows it to be translated correctly according to its target sentence-breaking result, thereby alleviating the technical problem that the prior art cannot reasonably break sentence segments containing sentence segment labels.

Description

Sentence segmentation method and device and electronic equipment
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a sentence segmentation method and apparatus for sentence segments, and an electronic device.
Background
In the prior art, when a text sentence is to be broken, a sentence-breaking rule (e.g., a regular expression, RE) is usually preset according to the characteristics of the specific language, and the sentence-break positions of the text are then determined according to that rule. For example, for English, the rule may specify that the position after a "." is a sentence-break position when the "." is followed by a whitespace character. Under this rule, the text "The red costs $2.50. The blue costs $2.50." is broken into two sentences: "The red costs $2.50." and "The blue costs $2.50."
In practice, however, sentence segments often contain sentence segment labels (tags), for example: "The red costs <b>$2.50.</b> The blue costs <b>$2.50.</b>". In this case, because of the sentence segment labels, the period is no longer followed by a whitespace character, the preset sentence-breaking rule is not satisfied, and the sentence cannot be broken. In addition, improper handling of sentence segment labels also adversely affects the translation operation.
In summary, the prior art cannot reasonably break sentence segments that contain sentence segment labels.
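The failure described above can be reproduced with a minimal sketch. The rule below (break after a "." that is followed by whitespace) is only an assumed form of the preset rule, and the names `BREAK_RULE` and `split_by_rule` are hypothetical, introduced here for illustration.

```python
import re

# Assumed preset rule: a sentence break lies after a "." followed by whitespace.
BREAK_RULE = re.compile(r"(?<=\.)\s+")


def split_by_rule(text):
    """Split text at every position matched by the assumed preset rule."""
    return [part for part in BREAK_RULE.split(text) if part]


plain = "The red costs $2.50. The blue costs $2.50."
tagged = "The red costs <b>$2.50.</b> The blue costs <b>$2.50.</b>"

# The plain text is broken into two sentences.
print(split_by_rule(plain))
# In the tagged segment the "." is followed by "</b>", not whitespace,
# so the rule never fires and the segment stays in one piece.
print(split_by_rule(tagged))
```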
Disclosure of Invention
In view of the above, the present invention provides a sentence-breaking method and device for sentence segments, and an electronic device, so as to alleviate the technical problem that the prior art cannot reasonably break sentence segments containing sentence segment labels.
In a first aspect, an embodiment of the present invention provides a sentence-breaking method for a sentence segment, including:
acquiring a sentence segment to be punctuated that contains a sentence segment label, and a text sentence-breaking result of the text content to be punctuated corresponding to the sentence segment to be punctuated;
breaking the sentence segment to be punctuated with reference to the text sentence-breaking result to obtain a segment sentence-breaking result of the sentence segment to be punctuated;
and adjusting the sentence segment labels and/or the sentence-break positions in the segment sentence-breaking result according to a preset adjustment strategy to obtain a target sentence-breaking result of the sentence segment to be punctuated, so that the sentence segment to be punctuated can be translated correctly according to the target sentence-breaking result.
Further, obtaining the text sentence-breaking result of the text content to be punctuated corresponding to the sentence segment to be punctuated comprises:
deleting the sentence segment labels in the sentence segment to be punctuated to obtain the text content to be punctuated corresponding to the sentence segment to be punctuated;
and breaking the text content to be punctuated according to a preset sentence-breaking rule to obtain the text sentence-breaking result.
Further, the text sentence-breaking result includes at least one clause, and breaking the sentence segment to be punctuated with reference to the text sentence-breaking result includes:
aligning the characters of the sentence segment to be punctuated with the text sentence-breaking result according to the text content;
and performing a sentence-breaking operation on the sentence segment to be punctuated based on the character alignment result.
Further, performing the sentence-breaking operation on the sentence segment to be punctuated based on the character alignment result includes:
scanning the sentence segment to be punctuated until a first target character, corresponding to the first character of a target clause, is reached, and taking the content before the first target character as the pre-sentence content of the sentence-breaking unit corresponding to the target clause, wherein the target clause is the clause currently visited when the clauses of the text sentence-breaking result are traversed in order starting from the first clause;
continuing to scan the sentence segment to be punctuated until a second target character, corresponding to the last character of the target clause, is reached, and taking the content from the first target character to the second target character as the clause segment content of the sentence-breaking unit corresponding to the target clause;
setting the post-sentence content of the sentence-breaking unit corresponding to the target clause to null;
returning to the step of scanning the sentence segment to be punctuated, until the character corresponding to the last character of the last clause in the text sentence-breaking result has been scanned;
and, if unscanned characters remain in the sentence segment to be punctuated, taking the unscanned characters as the post-sentence content of the last sentence-breaking unit of the sentence segment to be punctuated.
Further, the segment sentence-breaking result includes at least one sentence-breaking unit, and each sentence-breaking unit includes pre-sentence content, clause segment content and post-sentence content; adjusting the sentence segment labels and/or the sentence-break positions in the segment sentence-breaking result according to the preset adjustment strategy comprises the following steps:
determining a first sentence-breaking attribute of each sentence segment label in the segment sentence-breaking result according to a preset correspondence between sentence segment labels and sentence-breaking attributes, wherein the sentence-breaking attributes comprise: must be excluded, excludable, text-following, and must be preserved;
adjusting the positions of the sentence segment labels in the segment sentence-breaking result and/or adding sentence-break positions based on the first sentence-breaking attribute of each sentence segment label, to obtain a segment sentence-breaking adjustment result;
and adding sentence segment labels to the segment sentence-breaking adjustment result according to a preset sentence segment label adding strategy to obtain the target sentence-breaking result.
Further, adjusting the positions of the sentence segment labels in the segment sentence-breaking result and/or adding sentence-break positions based on the first sentence-breaking attribute of each sentence segment label comprises:
if the first sentence-breaking attribute of a first target sentence segment label is excludable and the first target sentence segment label is an independent label, changing the first sentence-breaking attribute of the first target sentence segment label to text-following;
if the first sentence-breaking attribute of a second target sentence segment label is must be excluded and the second target sentence segment label lies within clause segment content, adding a sentence-break position at the second target sentence segment label, thereby obtaining a plurality of sentence-breaking units, and taking the second target sentence segment label as the pre-sentence content of a target sentence-breaking unit, wherein the target sentence-breaking unit is the sentence-breaking unit that follows the newly added sentence-break position;
checking the pre-sentence content of each sentence-breaking unit in a first order; if a sentence segment label having a target sentence-breaking attribute is found, moving that label and the characters after it from the pre-sentence content into the clause segment content, and continuing to check the pre-sentence content in the first order until a sentence segment label with the must-be-excluded attribute is found or the pre-sentence content has been fully checked, wherein the target sentence-breaking attributes comprise the must-be-preserved attribute and the text-following attribute;
and checking the post-sentence content of each sentence-breaking unit in a second order; if a sentence segment label having a target sentence-breaking attribute is found, moving that label and the characters before it from the post-sentence content into the clause segment content, and continuing to check the post-sentence content in the second order until a sentence segment label with the must-be-excluded attribute is found or the post-sentence content has been fully checked.
Further, adding sentence segment labels to the segment sentence-breaking adjustment result according to the preset sentence segment label adding strategy comprises:
judging whether any clause segment content in the segment sentence-breaking adjustment result is missing a sentence segment label;
if yes, adding sentence segment labels to the sentence-breaking unit to which the target clause segment content belongs according to the preset sentence segment label adding strategy, thereby obtaining the target sentence-breaking result, wherein the target clause segment content is the clause segment content in which a sentence segment label is missing;
if not, taking the segment sentence-breaking adjustment result as the target sentence-breaking result.
Further, adding sentence segment labels to the sentence-breaking unit to which the target clause segment content belongs according to the preset sentence segment label adding strategy includes:
determining the sentence segment labels to be added for the target clause segment content, wherein the sentence segment labels to be added comprise at least one of the following: a first to-be-added beginning sentence segment label, a second to-be-added ending sentence segment label, a third to-be-added ending sentence segment label and a fourth to-be-added beginning sentence segment label, wherein the second to-be-added ending sentence segment label is the reverse-order sentence segment label paired with the first to-be-added beginning sentence segment label, and the fourth to-be-added beginning sentence segment label is the reverse-order sentence segment label paired with the third to-be-added ending sentence segment label;
adding the first to-be-added beginning sentence segment label to the head of the post-sentence content of the sentence-breaking unit to which the target clause segment content belongs;
adding the second to-be-added ending sentence segment label to the end of the target clause segment content;
adding the third to-be-added ending sentence segment label to the tail of the pre-sentence content of the sentence-breaking unit to which the target clause segment content belongs;
and adding the fourth to-be-added beginning sentence segment label to the head of the target clause segment content.
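One possible reading of the label adding strategy above can be sketched as follows. This is an illustrative interpretation only, not the claimed implementation: it assumes XML-style labels such as `<b>` and `</b>`, treats self-closing labels as independent labels, and re-creates a missing beginning label from its name alone (attributes are not recovered). The names `TAG` and `add_missing_labels` are hypothetical.

```python
import re

# Assumed label syntax: <name ...>, </name>, or self-closing <name ... />.
TAG = re.compile(r"<(/?)(\w+)[^>]*>")


def add_missing_labels(pre, clause, post):
    """Balance the labels of one sentence-breaking unit (pre, clause, post)."""
    open_stack = []       # beginning labels in the clause with no ending label
    unmatched_close = []  # ending labels in the clause with no beginning label
    for m in TAG.finditer(clause):
        if m.group(0).endswith("/>"):
            continue  # independent label: nothing to pair
        name = m.group(2)
        if m.group(1) != "/":
            open_stack.append((name, m.group(0)))
        elif open_stack and open_stack[-1][0] == name:
            open_stack.pop()
        else:
            unmatched_close.append(name)
    # Second label: ending labels appended to the clause, in reverse order.
    closers = "".join(f"</{n}>" for n, _ in reversed(open_stack))
    # First label: the same beginning labels re-opened at the head of post.
    reopeners = "".join(tag for _, tag in open_stack)
    # Fourth label: beginning labels prepended to the clause, in reverse order.
    openers = "".join(f"<{n}>" for n in reversed(unmatched_close))
    # Third label: matching ending labels appended to the tail of pre.
    reclosers = "".join(f"</{n}>" for n in unmatched_close)
    return pre + reclosers, openers + clause + closers, reopeners + post
```

Under these assumptions, a unit whose clause segment content is "The red costs <b>$2.50." gains "</b>" at the end of the clause and "<b>" at the head of its post-sentence content, so each part remains well formed on its own.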
Further, correctly translating the sentence segment to be punctuated according to the target sentence-breaking result comprises:
translating each sentence-breaking unit in the target sentence-breaking result to obtain the translation results of a plurality of sentence-breaking units;
connecting the translation results of the sentence-breaking units in sequence to obtain an initial translation result of the sentence segment to be punctuated;
determining the sentence segment label pairs in the initial translation result;
if a target sentence segment label pair does not enclose any text content, marking the target sentence segment label pair with a deletion mark;
determining, for the sentence segment label pairs of the same sentence segment label, the minimum number of sentence segment label pairs that need to be retained, wherein the same sentence segment label refers to copies of one original sentence segment label;
and performing a supplementary deletion operation on the sentence segment label pairs of the same sentence segment label in the initial translation result, based on the target sentence segment label pairs bearing the deletion mark and the minimum number of sentence segment label pairs that need to be retained, to obtain the translation result of the sentence segment to be punctuated.
In a second aspect, an embodiment of the present invention further provides a sentence-breaking device for a sentence segment, including:
an acquisition unit, configured to acquire a sentence segment to be punctuated that contains a sentence segment label, and a text sentence-breaking result of the text content to be punctuated corresponding to the sentence segment to be punctuated;
a sentence-breaking unit, configured to break the sentence segment to be punctuated with reference to the text sentence-breaking result to obtain a segment sentence-breaking result of the sentence segment to be punctuated;
and an adjusting unit, configured to adjust the sentence segment labels and/or the sentence-break positions in the segment sentence-breaking result according to a preset adjustment strategy to obtain a target sentence-breaking result of the sentence segment to be punctuated, so that the sentence segment to be punctuated can be translated correctly according to the target sentence-breaking result.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method according to any one of the implementations of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable medium having non-volatile program code executable by a processor, where the program code causes the processor to perform the steps of the method according to any one of the implementations of the first aspect.
The embodiment of the invention provides a sentence-breaking method for sentence segments, which comprises the following steps: first, acquiring a sentence segment to be punctuated that contains a sentence segment label, and a text sentence-breaking result of the text content to be punctuated corresponding to the sentence segment to be punctuated; then, breaking the sentence segment to be punctuated with reference to the text sentence-breaking result to obtain a segment sentence-breaking result; finally, adjusting the sentence segment labels and/or the sentence-break positions in the segment sentence-breaking result according to a preset adjustment strategy to obtain a target sentence-breaking result of the sentence segment to be punctuated, so that the sentence segment can be translated correctly according to the target sentence-breaking result. The method can reasonably break a sentence segment that contains sentence segment labels and allows it to be translated correctly according to its target sentence-breaking result, thereby alleviating the technical problem that the prior art cannot reasonably break sentence segments containing sentence segment labels.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic diagram of an electronic device according to an embodiment of the present invention;
Fig. 2 is a flowchart of a sentence-breaking method for a sentence segment according to an embodiment of the present invention;
Fig. 3 is a flowchart of obtaining a text sentence-breaking result according to an embodiment of the present invention;
Fig. 4 is a flowchart of breaking a sentence segment to be punctuated with reference to a text sentence-breaking result according to an embodiment of the present invention;
Fig. 5 is a flowchart of adjusting the sentence segment labels and/or the sentence-break positions in a segment sentence-breaking result according to an embodiment of the present invention;
Fig. 6 is a flowchart of adjusting the positions of sentence segment labels and/or adding sentence-break positions in a segment sentence-breaking result based on the first sentence-breaking attribute of each sentence segment label according to an embodiment of the present invention;
Fig. 7 is a flowchart of adding sentence segment labels to the sentence-breaking unit to which the target clause segment content belongs according to an embodiment of the present invention;
Fig. 8 is a flowchart of correctly translating a sentence segment to be punctuated according to a target sentence-breaking result according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of a sentence-breaking device for a sentence segment according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
First, an electronic device 100 for implementing an embodiment of the present invention, which may be used to run the sentence-breaking method and device for sentence segments of the embodiments of the present invention, is described with reference to Fig. 1.
In the embodiment of the present application, the electronic device 100 may be a server, for example, a network server or a database server, or a terminal device, for example, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a Mobile Internet Device (MID), or the like.
Structurally, the electronic device 100 provided by the embodiments of the present application may include one or more processors 110 and one or more memories 120. These components may be interconnected, directly or indirectly, by a bus system and/or another type of connection mechanism (not shown) to enable data transfer or interaction; for example, the components may be electrically connected to each other via one or more communication buses or signal lines. The sentence-breaking device for sentence segments includes one or more software modules, which can be stored in the memory 120 in the form of software or firmware, or solidified in an operating system (OS) of the electronic device 100. The processor 110 is used for executing executable modules stored in the memory 120, such as the software functional modules and computer programs included in the sentence-breaking device, to implement the sentence-breaking method for sentence segments. The processor 110 may execute the computer program upon receiving an execution instruction.
The processor 110 may be an integrated circuit chip with signal processing capability, for example a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a discrete gate or transistor logic device, or a discrete hardware component, which may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor or the like.
The memory 120 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), or an electrically erasable programmable read-only memory (EEPROM). The memory 120 is used to store a program, and the processor 110 executes the program upon receiving an execution instruction.
It should be understood that the structure shown in fig. 1 is merely an illustration, and the electronic device 100 provided in the embodiment of the present application may have fewer or more components than those shown in fig. 1, or may have a different configuration than that shown in fig. 1.
Example 2:
In accordance with an embodiment of the present invention, an embodiment of a sentence-breaking method for sentence segments is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from the one given here.
Fig. 2 is a flowchart of a sentence-breaking method for a sentence segment according to an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
Step S102, acquiring a sentence segment to be punctuated that contains a sentence segment label, and a text sentence-breaking result of the text content to be punctuated corresponding to the sentence segment to be punctuated;
In the embodiment of the present invention, the sentence segment to be punctuated may be a sentence segment of an Extensible Markup Language (XML) document, or may be a sentence segment of a Word document.
The sentence segment to be punctuated comprises sentence segment labels as well as text content. A sentence segment label may be an independent label that refers to some content, or one of a pair of labels that formats the text content wrapped in the pair or plays a specific identifying role for that text content.
The text sentence-breaking result of the text content to be punctuated corresponding to the sentence segment to be punctuated is obtained by breaking the text content to be punctuated using conventional techniques; this process is described in detail below and is not repeated here.
Step S104, breaking the sentence segment to be punctuated with reference to the text sentence-breaking result to obtain a segment sentence-breaking result of the sentence segment to be punctuated;
This process performs a preliminary sentence break on the sentence segment to be punctuated. Because the break is made by referring to the text sentence-breaking result, the resulting segment sentence-breaking result may be unreasonable, that is, the sentence segment labels may be handled improperly. The sentence segment labels and/or the sentence-break positions in the segment sentence-breaking result are therefore further adjusted in step S106 to obtain a target sentence-breaking result that accurately reflects the sentence segment to be punctuated.
Step S106, adjusting the sentence segment labels and/or the sentence-break positions in the segment sentence-breaking result according to a preset adjustment strategy to obtain a target sentence-breaking result of the sentence segment to be punctuated, so that the sentence segment to be punctuated can be translated correctly according to the target sentence-breaking result.
Correct translation here means that, after translation is performed according to the target sentence-breaking result, the obtained translation result has a complete structure (i.e., no sentence component such as a subject or an object is missing), has a fluent word order (i.e., it conforms to the characteristics of the target language), and has a format that corresponds to the format of the sentence segment to be punctuated (e.g., if a certain word in the sentence segment to be punctuated is in bold, the corresponding word in the translation result is also in bold). In other words, the finally obtained target sentence-breaking result is accurate and appropriate, and accurately reflects the meaning and the various formats of the sentence segment to be punctuated.
The embodiment of the invention provides a sentence-breaking method for sentence segments, which comprises the following steps: first, acquiring a sentence segment to be punctuated that contains a sentence segment label, and a text sentence-breaking result of the text content to be punctuated corresponding to the sentence segment to be punctuated; then, breaking the sentence segment to be punctuated with reference to the text sentence-breaking result to obtain a segment sentence-breaking result; finally, adjusting the sentence segment labels and/or the sentence-break positions in the segment sentence-breaking result according to a preset adjustment strategy to obtain a target sentence-breaking result of the sentence segment to be punctuated, so that the sentence segment can be translated correctly according to the target sentence-breaking result. The method can reasonably break a sentence segment that contains sentence segment labels and allows it to be translated correctly according to its target sentence-breaking result, thereby alleviating the technical problem that the prior art cannot reasonably break sentence segments containing sentence segment labels.
The foregoing briefly outlines the sentence-breaking method for sentence segments of the present invention; its details are described below.
In an optional embodiment of the present invention, referring to Fig. 3, obtaining the text sentence-breaking result of the text content to be punctuated corresponding to the sentence segment to be punctuated in step S102 includes the following steps:
Step S301, deleting the sentence segment labels in the sentence segment to be punctuated to obtain the text content to be punctuated corresponding to the sentence segment to be punctuated;
Step S302, breaking the text content to be punctuated according to a preset sentence-breaking rule to obtain the text sentence-breaking result.
The preset sentence-breaking rule is the rule that conventional techniques use to break the text content to be punctuated. For example, for English text content, the preset sentence-breaking rule may be that the position after a "." followed by a whitespace character is a sentence-break position; for Chinese text content, the preset sentence-breaking rule may be that the position after a "。" is a sentence-break position. The embodiment of the invention does not enumerate the preset sentence-breaking rules one by one.
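Steps S301 and S302 can be sketched as follows, as a minimal illustration rather than the embodiment itself. The label pattern `<...>` and the English rule (break after a "." followed by whitespace) are assumptions, and the names `TAG`, `BREAK_RULE` and `text_sentence_breaks` are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")            # assumed syntax of a sentence segment label
BREAK_RULE = re.compile(r"(?<=\.)\s+")  # assumed English rule: "." then whitespace


def text_sentence_breaks(segment):
    """Return the clauses of a labelled segment.

    S301: delete the sentence segment labels to get the plain text content.
    S302: break the plain text with the preset rule.
    """
    text = TAG.sub("", segment)
    return [clause for clause in BREAK_RULE.split(text) if clause]


segment = "The red costs <b>$2.50.</b> The blue costs <b>$2.50 to.</b>"
for clause in text_sentence_breaks(segment):
    print(clause)
```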
After the text sentence-breaking result is obtained, the sentence segment to be punctuated is broken with reference to it. In an optional implementation, referring to Fig. 4, breaking the sentence segment to be punctuated with reference to the text sentence-breaking result in step S104 specifically includes the following steps:
Step S401, aligning the characters of the sentence segment to be punctuated with the text sentence-breaking result according to the text content;
Step S402, scanning the sentence segment to be punctuated until a first target character, corresponding to the first character of a target clause, is reached, and taking the content before the first target character as the pre-sentence content of the sentence-breaking unit corresponding to the target clause, wherein the target clause is the clause currently visited when the clauses of the text sentence-breaking result are traversed in order starting from the first clause;
Step S403, continuing to scan the sentence segment to be punctuated until a second target character, corresponding to the last character of the target clause, is reached, and taking the content from the first target character to the second target character as the clause segment content of the sentence-breaking unit corresponding to the target clause;
Step S404, setting the post-sentence content of the sentence-breaking unit corresponding to the target clause to null;
Step S405, returning to the step of scanning the sentence segment to be punctuated, until the character corresponding to the last character of the last clause in the text sentence-breaking result has been scanned;
Step S406, if unscanned characters remain in the sentence segment to be punctuated, taking the unscanned characters as the post-sentence content of the last sentence-breaking unit of the sentence segment to be punctuated.
To facilitate a better understanding of the process, the process is described below in a specific example:
suppose the sentence segment to be broken is: The red costs <b>$2.50.</b> The blue costs <b>$2.50 too.</b> Deleting the sentence segment labels from it gives the corresponding text content: The red costs $2.50. The blue costs $2.50 too. Breaking this text content according to the preset sentence-break rule gives the text sentence-break result: (The red costs $2.50.)(The blue costs $2.50 too.), where the content inside each ( ) represents one clause; that is, "The red costs $2.50." is one clause and "The blue costs $2.50 too." is another. A text sentence-break result contains at least one clause.
After the text sentence-break result (The red costs $2.50.)(The blue costs $2.50 too.) is obtained, the sentence segment to be broken, The red costs <b>$2.50.</b> The blue costs <b>$2.50 too.</b>, is aligned character by character with the text sentence-break result according to the text content.
Scanning of the sentence segment to be broken, The red costs <b>$2.50.</b> The blue costs <b>$2.50 too.</b>, then begins. The first scanned character is T, which corresponds to the first character T of the first clause of the text sentence-break result, so the content before it (here empty) is taken as the pre-sentence content of the sentence-break unit corresponding to the clause "The red costs $2.50.".
Scanning continues until the character "." corresponding to the last character of the clause "The red costs $2.50." is scanned, and the content from T to "." is taken as the clause segment content of the sentence-break unit corresponding to that clause; that is, the clause segment content of this sentence-break unit is: The red costs <b>$2.50.
The post-sentence content of the sentence-break unit corresponding to the clause "The red costs $2.50." is set to null.
The step of scanning the sentence segment to be broken, The red costs <b>$2.50.</b> The blue costs <b>$2.50 too.</b>, is then executed again: scanning continues until the character corresponding to the first character T of the clause "The blue costs $2.50 too." is scanned, and the content before that T is taken as the pre-sentence content of the sentence-break unit corresponding to this clause. The pre-sentence content of this sentence-break unit is therefore: </b> followed by a space.
Scanning continues until the character "." corresponding to the last character of the clause "The blue costs $2.50 too." is scanned, and the content from T to "." is taken as the clause segment content of the sentence-break unit corresponding to this clause; that is, the clause segment content of this sentence-break unit is: The blue costs <b>$2.50 too.
At this point the last character of the last clause, "The blue costs $2.50 too.", in the text sentence-break result (The red costs $2.50.)(The blue costs $2.50 too.) has been scanned, so scanning stops.
The sentence segment to be broken, The red costs <b>$2.50.</b> The blue costs <b>$2.50 too.</b>, still contains unscanned characters, and these are taken as the post-sentence content of the last sentence-break unit. That is, the post-sentence content of the sentence-break unit corresponding to the last clause, "The blue costs $2.50 too.", is: </b>
Therefore, for the sentence segment to be broken, The red costs <b>$2.50.</b> The blue costs <b>$2.50 too.</b>, the sentence segment break result is: {[The red costs <b>$2.50.]}{</b> [The blue costs <b>$2.50 too.]</b>}. Here { } denotes a sentence-break unit and [ ] denotes a sentence-break position: { marks the start of a sentence-break unit, } marks its end, [ marks the start of the clause segment content, and ] marks its end. One sentence-break unit can thus be expressed as: {pre-sentence content [clause segment content] post-sentence content}, where the clause segment content is the content to be translated.
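The scan of steps S401 to S406 can be sketched in Python as follows. This is only an illustration under stated assumptions: labels are matched by a simple regular expression (tags such as <b>, </b>, <b/> and the placeholder %s), the clauses are assumed to occur in the segment's text in order and character for character, and the names tokenize, break_segment and render are hypothetical.

```python
import re

LABEL = re.compile(r"</?\w+/?>|%s")  # assumed label syntax


def tokenize(segment):
    """Return (token, is_label) pairs; text is split into single characters."""
    tokens, pos = [], 0
    for m in LABEL.finditer(segment):
        tokens += [(ch, False) for ch in segment[pos:m.start()]]
        tokens.append((m.group(), True))
        pos = m.end()
    tokens += [(ch, False) for ch in segment[pos:]]
    return tokens


def break_segment(segment, clauses):
    """Build sentence-break units by aligning the segment with the clauses."""
    tokens, i, units = tokenize(segment), 0, []
    for clause in clauses:
        # S402: everything before the clause's first character is pre-sentence content.
        pre = []
        while tokens[i][1] or tokens[i][0] != clause[0]:
            pre.append(tokens[i][0])
            i += 1
        # S403: consume tokens until every character of the clause is matched;
        # labels met on the way stay inside the clause segment content.
        body, matched = [], 0
        while matched < len(clause):
            tok, is_label = tokens[i]
            body.append(tok)
            matched += 0 if is_label else 1
            i += 1
        # S404: the post-sentence content starts out empty.
        units.append({"pre": "".join(pre), "clause": "".join(body), "post": ""})
    # S406: leftover tokens become the post-sentence content of the last unit.
    if units and i < len(tokens):
        units[-1]["post"] = "".join(tok for tok, _ in tokens[i:])
    return units


def render(units):
    """Write units in the {pre[clause]post} notation used in the text."""
    return "".join("{%s[%s]%s}" % (u["pre"], u["clause"], u["post"]) for u in units)


units = break_segment(
    "The red costs <b>$2.50.</b> The blue costs <b>$2.50 too.</b>",
    ["The red costs $2.50.", "The blue costs $2.50 too."],
)
```

For the example above, render(units) gives the same two units as the walk-through: the first with empty pre-sentence content, the second with "</b> " before the clause and "</b>" after it.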
The above describes in detail the process of breaking the sentence segment to be broken with reference to the text sentence-break result; the following describes in detail the process of adjusting the sentence segment labels and/or the sentence-break positions in the sentence segment break result.
In an optional embodiment of the present invention, the sentence segment break result includes at least one sentence-break unit, and each sentence-break unit includes pre-sentence content, clause segment content and post-sentence content. Referring to fig. 5, in step S106, adjusting the sentence segment labels and/or the sentence-break positions in the sentence segment break result according to the preset adjustment strategy specifically includes the following steps:
step S501, determining a first sentence break attribute of each sentence segment label in the sentence segment and sentence break result according to the preset corresponding relation between the sentence segment label and the sentence break attribute, wherein the sentence break attribute comprises: must be excluded, excludable, text-following, and must be preserved;
in the embodiment of the present invention, the preset corresponding relationship between the sentence fragment tag and the sentence fragment attribute is shown in the following table:
[Table BDA0002868902970000141: correspondence between sentence segment labels and sentence-break attributes]
step S502, adjusting the position of the sentence segment label in the sentence segment and sentence segment result and/or increasing the sentence segment position based on the first sentence segment attribute of each sentence segment label to obtain a sentence segment and sentence segment adjustment result;
referring to fig. 6, the method specifically includes the following steps:
step S601, if the first sentence-break attribute of a first target sentence segment label is excludable and the first target sentence segment label is an independent label, changing the first sentence-break attribute of the first target sentence segment label to text-following;
the meaning of an independent label is explained below. Sentence segment labels with the excludable attribute typically occur in pairs. For example, in the label pair that represents bold formatting, the beginning label is <b> and the ending label is </b>; text content is generally enclosed between the pair, and the pair means that the enclosed text is rendered in bold. If there is no text content between the pair, the pair can be regarded as an independent label, which in practice is often abbreviated as <b/>; that is, a label of the form <b/> is an independent label. Sentence segment labels with the text-following attribute, by contrast, usually represent placeholders and are likewise independent labels; for example, %s is also an independent label.
In step S601, changing the sentence-break attribute of independent labels with the excludable attribute to text-following reduces the complexity of the subsequent processing of these labels, that is, it simplifies the subsequent adjustment process.
Step S602, if the first sentence-break attribute of a second target sentence segment label is must-be-excluded and the second target sentence segment label is located in the clause segment content, adding a sentence-break position at the second target sentence segment label so as to obtain a plurality of sentence-break units, and taking the second target sentence segment label as the pre-sentence content of the target sentence-break unit, where the target sentence-break unit is the sentence-break unit that follows the newly added sentence-break position;
for example, if the sentence segment to be broken is: The red <p>costs</p> $2.50., the corresponding sentence segment break result is: {[The red <p>costs</p> $2.50.]}, where <p> and </p> are structural labels whose sentence-break attribute is must-be-excluded and which are located in the clause segment content. Sentence-break positions are therefore added at <p> and </p>, giving a plurality of sentence-break units, and <p> and </p> are each taken as the pre-sentence content of the sentence-break unit that follows the newly added position. The result is: {[The red ]}{<p>[costs]}{</p>[ $2.50.]}.
It can be seen that, after step S602, the sentence segment labels in the clause segment content can only be labels with the must-be-preserved, text-following or excludable attribute, while the inter-sentence content (i.e., the pre-sentence content and the post-sentence content) may contain labels with the must-be-excluded, must-be-preserved, text-following or excludable attribute, as well as whitespace characters.
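Step S602 can be sketched as a small Python function that splits one sentence-break unit at every must-be-excluded label found inside its clause segment content. The dictionary layout of a unit, the function name split_at_excluded and the set of excluded label names passed in are assumptions for illustration. Whitespace is left where it is, and consecutive must-be-excluded labels (which would produce a unit with empty clause segment content) are not treated specially in this sketch.

```python
import re

# Group 1 is the whole label, group 2 the tag name (None for %s).
LABEL = re.compile(r"(</?(\w+)/?>|%s)")


def split_at_excluded(unit, excluded_names):
    """Add a sentence-break position at each must-be-excluded label in the clause."""
    clause = unit["clause"]
    units = [{"pre": unit["pre"], "clause": "", "post": ""}]
    pos = 0
    for m in LABEL.finditer(clause):
        if m.group(2) in excluded_names:
            # Close the current unit just before the label ...
            units[-1]["clause"] += clause[pos:m.start()]
            # ... and open a new unit whose pre-sentence content is the label.
            units.append({"pre": m.group(1), "clause": "", "post": ""})
            pos = m.end()
    units[-1]["clause"] += clause[pos:]
    # The original post-sentence content stays with the last unit.
    units[-1]["post"] = unit["post"]
    return units


result = split_at_excluded(
    {"pre": "", "clause": "The red <p>costs</p> $2.50.", "post": ""},
    {"p"},
)
```

Labels whose names are not in excluded_names (for example <b>) are not split on and remain inside the clause segment content.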
Step S603, checking the pre-sentence content of each sentence-break unit in a first order; if a sentence segment label with a target sentence-break attribute is found, moving that label and the characters after it from the pre-sentence content into the clause segment content, and continuing to check the pre-sentence content in the first order until a sentence segment label with the must-be-excluded attribute is encountered or the whole pre-sentence content has been checked, where the target sentence-break attributes include the must-be-preserved attribute and the text-following attribute;
the first order is from back to front with respect to the direction of the text flow.
If the sentence segment to be broken is: <html><body><p>%s is my favorite.</p></body></html>, and the corresponding sentence segment break result is: {<html><body><p>%s [is my favorite.]</p></body></html>}, the result is clearly unreasonable: the clause segment content "is my favorite." lacks its subject, so the sentence segment break result needs to be adjusted.
The sentence segment break result is adjusted according to step S603. Specifically, the pre-sentence content <html><body><p>%s of the sentence-break unit {<html><body><p>%s [is my favorite.]</p></body></html>} is checked from back to front. The label %s is found first; its sentence-break attribute is text-following, which is a target sentence-break attribute, so %s and the characters after it are moved from the pre-sentence content into the clause segment content "is my favorite.", giving {<html><body><p>[%s is my favorite.]</p></body></html>}. Checking of the remaining pre-sentence content <html><body><p> continues from back to front; the label <p> is found, its sentence-break attribute is must-be-excluded, and the check terminates. The adjusted result is therefore {<html><body><p>[%s is my favorite.]</p></body></html>}, whose clause segment content is "%s is my favorite.", with %s as the subject, so the sentence-break result is more accurate.
Step S604, checking the post-sentence content of each sentence-break unit in a second order; if a sentence segment label with a target sentence-break attribute is found, moving that label and the characters before it from the post-sentence content into the clause segment content, and continuing to check the post-sentence content in the second order until a sentence segment label with the must-be-excluded attribute is encountered or the whole post-sentence content has been checked.
The second order is from front to back with respect to the direction of the text flow. The process is described in detail below using a specific example.
If the sentence segment to be broken is: <b>Close %s</b>, and the corresponding sentence segment break result is: {<b>[Close] %s</b>}, the clause segment content clearly lacks the object of "Close", which is unsuitable for translation into Japanese, so the sentence segment break result needs to be adjusted.
The sentence segment break result is adjusted according to step S604. Specifically, the post-sentence content of the sentence-break unit {<b>[Close] %s</b>} is checked from front to back. The label %s is found first; its sentence-break attribute is text-following, which is a target sentence-break attribute, so %s and the characters before it are moved from the post-sentence content into the clause segment content, giving {<b>[Close %s]</b>}. Checking of the post-sentence content continues from front to back and the label </b> is found; its sentence-break attribute is excludable, which is not a target sentence-break attribute, so </b> does not need to be moved into the clause segment content. At this point the whole post-sentence content has been checked and the check terminates. The adjusted result is therefore {<b>[Close %s]</b>}, whose clause segment content is "Close %s", with %s as the object. In the Japanese translation, %s is followed by the particle を and precedes the verb (%s を ... ます), which matches the characteristics of Japanese; that is, the sentence-break result {<b>[Close %s]</b>} is more accurate.
It can be seen that, after steps S603 and S604, the sentence segment labels in the clause segment content can only be labels with the must-be-preserved, text-following or excludable attribute, while the inter-sentence content (i.e., the pre-sentence content and the post-sentence content) may contain labels with the must-be-excluded, text-following or excludable attribute, as well as whitespace characters.
After the process of step S502, the sentence completeness of the translation result translated according to the sentence segment and sentence break adjustment result can be ensured.
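Steps S601, S603 and S604 can be sketched together in Python. The attribute table below is hypothetical: the patent's actual correspondence table is not reproduced in this text, so the entries are inferred only from the examples above (<html>, <body> and <p> as must-be-excluded, <b> and <i> as excludable, %s as text-following). The names ATTRIBUTE, TARGET, split_tokens, attribute and adjust are likewise illustrative.

```python
import re

# Hypothetical attribute table, inferred from the worked examples only.
ATTRIBUTE = {"html": "must_exclude", "body": "must_exclude", "p": "must_exclude",
             "b": "excludable", "i": "excludable"}
TARGET = {"text_following", "must_preserve"}


def split_tokens(text):
    """Split text into labels and runs of plain text, in order."""
    return [t for t in re.split(r"(</?\w+/?>|%s)", text) if t]


def attribute(token):
    """Sentence-break attribute of a token, or None for plain text."""
    if token == "%s":
        return "text_following"
    m = re.fullmatch(r"</?(\w+)/?>", token)
    if not m:
        return None
    attr = ATTRIBUTE.get(m.group(1))
    # S601: an independent excludable label such as <b/> becomes text-following.
    if attr == "excludable" and token.endswith("/>"):
        return "text_following"
    return attr


def adjust(unit):
    pre, post = split_tokens(unit["pre"]), split_tokens(unit["post"])
    # S603: check the pre-sentence content from back to front.
    cut = len(pre)
    for idx in range(len(pre) - 1, -1, -1):
        attr = attribute(pre[idx])
        if attr == "must_exclude":
            break
        if attr in TARGET:
            cut = idx  # move this label and everything after it
    head, pre = "".join(pre[cut:]), pre[:cut]
    # S604: check the post-sentence content from front to back.
    cut = 0
    for idx, tok in enumerate(post):
        attr = attribute(tok)
        if attr == "must_exclude":
            break
        if attr in TARGET:
            cut = idx + 1  # move this label and everything before it
    tail, post = "".join(post[:cut]), post[cut:]
    return {"pre": "".join(pre), "clause": head + unit["clause"] + tail,
            "post": "".join(post)}


favorite = adjust({"pre": "<html><body><p>%s ", "clause": "is my favorite.",
                   "post": "</p></body></html>"})
close = adjust({"pre": "<b>", "clause": "Close", "post": " %s</b>"})
```

In both examples an excludable label next to the clause (<b>, </b>) is passed over without being moved, and the scan stops as soon as a must-be-excluded label is met.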
Step S503, adding sentence segment labels in the sentence segment and sentence segment adjustment result according to the preset strategy for adding sentence segment labels to obtain the target sentence segment result.
The method specifically comprises the following steps:
(1) judging whether the content of the clause segment in the sentence segment and punctuation adjustment result has the condition of sentence segment label loss;
(2) if yes, adding a sentence segment label in a sentence break unit to which the target sub-sentence segment content belongs according to a preset sentence segment label adding strategy, and further obtaining a target sentence break result, wherein the target sub-sentence segment content is the sub-sentence segment content with the sentence segment label missing;
(3) if not, the sentence segment and sentence break adjustment result is used as the target sentence break result.
The following describes in detail a process of adding sentence fragment tags in a sentence break unit to which a target sentence fragment content belongs according to a preset sentence fragment tag adding strategy, with reference to fig. 7, which specifically includes the following steps:
step S701, determining a sentence segment label to be added corresponding to the target clause content, where the sentence segment label to be added includes at least one of the following: a first to-be-added beginning sentence segment label, a second to-be-added ending sentence segment label, a third to-be-added ending sentence segment label and a fourth to-be-added beginning sentence segment label, wherein the second to-be-added ending sentence segment label is a reverse sequence sentence segment label paired with the first to-be-added beginning sentence segment label, and the fourth to-be-added beginning sentence segment label is a reverse sequence sentence segment label paired with the third to-be-added ending sentence segment label;
for the convenience of understanding, the following detailed description will be given of a specific example of the process of determining the sentence segment tags to be added corresponding to the target clause content:
if the sentence segment to be broken is: <i>The red</i> costs <b><f>$2.50.</f></b> <i>The blue</i> costs <b>$2.50 too.</b>, where the sentence-break attribute of <f> and <b> is excludable.
The result obtained after step S502 of the sentence-breaking method of the present invention is: {<i>[The red</i> costs <b><f>$2.50.]}{</f></b> <i>[The blue</i> costs <b>$2.50 too.]</b>}.
That is, two sentence-break units are obtained. The clause segment content of the first sentence-break unit is: The red</i> costs <b><f>$2.50., in which </i>, <b> and <f> are unpaired; in other words, the clause segment content of the first sentence-break unit has missing sentence segment labels, and sentence segment labels need to be added to complete the pairs (the second sentence-break unit is handled similarly and is not described further).
The first sentence-break unit {<i>[The red</i> costs <b><f>$2.50.]} is taken as an example:
first, the sentence segment labels to be added for the target clause segment content, The red</i> costs <b><f>$2.50., are determined, which can be done in either of the following two ways:
Method one: take a stack structure, denoted So (stack of open tags) and written [ ], with the bottom of the stack on the left and the top on the right. Take an array structure, denoted Ac (array of closing tags). Scan the target clause segment content; whenever a sentence segment label is encountered: if it is a beginning label, push it onto the stack So; if it is an ending label that matches the beginning label currently on top of So, pop that beginning label from So; if it is an ending label that does not match the beginning label currently on top of So, append it to the tail of Ac.
Scanning The red</i> costs <b><f>$2.50. according to this principle, the first label encountered is </i>, which is an ending label and does not match a beginning label on top of So (the stack is still empty), so it is appended to the tail of Ac, giving Ac: [</i>];
scanning continues; the next label encountered is <b>, a beginning label, which is pushed onto So, giving So: [<b>];
scanning continues; the next label encountered is <f>, a beginning label, which is pushed onto So, giving So: [<b><f>];
scanning then finishes, with the final result So: [<b><f>], Ac: [</i>].
The <b><f> obtained in So are the first to-be-added beginning sentence segment labels, and the </i> obtained in Ac is the third to-be-added ending sentence segment label. The corresponding second to-be-added ending sentence segment labels are the labels paired with the first to-be-added beginning labels, in reverse order, namely </f></b>; the corresponding fourth to-be-added beginning sentence segment label is the label paired with the third to-be-added ending label, in reverse order, namely <i>. The sentence segment labels to be added are thus obtained.
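Method one can be sketched in Python as follows, with a list used as the stack So and another list as the array Ac. The regular expression assumes simple paired tags of the form <x> and </x>; the function names scan_with_stack and labels_to_add are hypothetical.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")  # assumed form of paired labels


def scan_with_stack(clause):
    """Return (So, Ac): unclosed beginning labels and unmatched ending labels."""
    so, ac = [], []
    for closing, name in TAG.findall(clause):
        if not closing:
            so.append(name)            # beginning label: push onto So
        elif so and so[-1] == name:
            so.pop()                   # matches the top of So: pop it
        else:
            ac.append(name)            # no match: append to the tail of Ac
    return so, ac


def labels_to_add(so, ac):
    """Derive the four groups of labels to be added from So and Ac."""
    first = "".join(f"<{n}>" for n in so)              # beginning labels from So
    second = "".join(f"</{n}>" for n in reversed(so))  # their partners, reversed
    third = "".join(f"</{n}>" for n in ac)             # ending labels from Ac
    fourth = "".join(f"<{n}>" for n in reversed(ac))   # their partners, reversed
    return first, second, third, fourth


so, ac = scan_with_stack("The red</i> costs <b><f>$2.50.")
added = labels_to_add(so, ac)
```

For the example clause this reproduces So: [<b><f>] and Ac: [</i>], and the four label groups <b><f>, </f></b>, </i> and <i>.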
Method two: take a linked list structure, denoted Co (chain of open tags) and written [ ]. Take another linked list structure, denoted Cc (chain of closing tags) (an array can also be used here). Scan the target clause segment content; whenever a sentence segment label is encountered: if it is a beginning label, append it to the tail of the linked list Co; if it is an ending label, search Co from back to front for the first beginning label that matches it; if such a beginning label is found, remove it from Co; if not, append the ending label to the tail of the linked list Cc.
Scanning The red</i> costs <b><f>$2.50. according to this principle finally gives: Co: [<b><f>], Cc: [</i>].
The linked list structure is useful when the target clause segment content contains sentence segment labels that are not strictly nested (for example, the format labels <b><i></b></i>, whose pairs overlap): the stack structure cannot recognize such overlapping pairs and would cause a large number of sentence segment labels to be added, whereas the linked list structure pairs them correctly.
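Method two can be sketched in Python in the same way. A plain Python list stands in for each linked list here, which is an implementation shortcut rather than something the text prescribes; the function name scan_with_chain and the tag pattern are assumptions.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")  # assumed form of paired labels


def scan_with_chain(clause):
    """Return (Co, Cc), pairing each ending label with the nearest earlier match."""
    co, cc = [], []
    for closing, name in TAG.findall(clause):
        if not closing:
            co.append(name)  # beginning label: append to the tail of Co
            continue
        # Ending label: search Co from back to front for a matching beginning label.
        for idx in range(len(co) - 1, -1, -1):
            if co[idx] == name:
                del co[idx]
                break
        else:
            cc.append(name)  # nothing matched: append to the tail of Cc
    return co, cc


overlapping = scan_with_chain("<b><i>text</b></i>")
example = scan_with_chain("The red</i> costs <b><f>$2.50.")
```

For the overlapping input <b><i>text</b></i>, both lists end up empty, so no labels need to be added. A stack-based scan of the same input would leave <b> in So and </b> in Ac, because </b> does not match the <i> on top of the stack.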
Step S702, adding the first to-be-added beginning sentence segment labels to the head of the post-sentence content of the sentence-break unit to which the target clause segment content belongs;
continuing the example of step S701, the result obtained after step S702 is: {<i>[The red</i> costs <b><f>$2.50.]<b><f>}.
Step S703, adding the second to-be-added ending sentence segment labels to the end of the target clause segment content;
continuing the example of step S702, the result obtained after step S703 is: {<i>[The red</i> costs <b><f>$2.50.</f></b>]<b><f>}.
Step S704, adding the third to-be-added ending sentence segment label to the end of the pre-sentence content of the sentence-break unit to which the target clause segment content belongs;
continuing the example of step S703, the result obtained after step S704 is: {<i></i>[The red</i> costs <b><f>$2.50.</f></b>]<b><f>}.
In step S705, the fourth to-be-added beginning sentence segment label is added to the head of the target clause segment content.
Continuing the example of step S704, the result obtained after step S705 is: {<i></i>[<i>The red</i> costs <b><f>$2.50.</f></b>]<b><f>}.
Therefore, on the premise of not destroying the original sentence segment label matching relation of the whole sentence segment to be broken, the unpaired sentence segment label in the sentence segment unit is supplemented, and the translation processing of an interpreter is facilitated.
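Steps S702 to S705 amount to four string insertions, which can be sketched in Python as follows. The unit layout and the function name complete_unit are assumptions; so and ac are the lists of unclosed beginning labels and unmatched ending labels (by tag name) produced by the scan of step S701.

```python
def complete_unit(unit, so, ac):
    """Add the missing labels around one sentence-break unit (steps S702 to S705)."""
    first = "".join(f"<{n}>" for n in so)              # to reopen after the clause
    second = "".join(f"</{n}>" for n in reversed(so))  # to close inside the clause
    third = "".join(f"</{n}>" for n in ac)             # to close before the clause
    fourth = "".join(f"<{n}>" for n in reversed(ac))   # to reopen inside the clause
    return {
        "pre": unit["pre"] + third,                    # S704: end of pre-sentence content
        "clause": fourth + unit["clause"] + second,    # S705 head, S703 end
        "post": first + unit["post"],                  # S702: head of post-sentence content
    }


unit = {"pre": "<i>", "clause": "The red</i> costs <b><f>$2.50.", "post": ""}
completed = complete_unit(unit, so=["b", "f"], ac=["i"])
```

Every label added inside the clause segment content is mirrored by its partner in the pre-sentence or post-sentence content, which is why the label pairing of the whole sentence segment is left intact.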
In addition, after the target sentence-break result is obtained, the sentence segment label pairs can be simplified and cancelled. For example, the result is:
{<i></i>[<i>The red</i> costs <b><f>$2.50.</f></b>]<b><f>}{</f></b> <i></i>[<i>The blue</i> costs <b>$2.50 too.</b>]<b></b>}
where the label pairs marked with a deletion line can be deleted.
Step S503 can make the translation processing of the target sentence-breaking result correspond to the format of the sentence to be broken.
The following is illustrated by way of example:
suppose the sentence segment to be broken is: The red costs <b>$2.50.</b> The blue costs <b>$2.50 too.</b> According to the sentence-breaking process of step S403, the sentence segment break adjustment result obtained is: {[The red costs <b>$2.50.]}{</b> [The blue costs <b>$2.50 too.]</b>}. If this adjustment result is translated directly, the translation (glossed here in English) is: {[The red version costs <b>$2.50.]}{</b> [The blue version <b>also costs $2.50.]</b>}. The translation of the first sentence, "The red version costs <b>$2.50.", corresponds to the format of "The red costs <b>$2.50." in the sentence segment to be broken. In the translation of the second sentence, "The blue version <b>also costs $2.50.</b>", however, the text between <b> and </b> is "also costs $2.50.", so the translation renders "also costs $2.50." in bold, whereas in the sentence segment to be broken, The blue costs <b>$2.50 too.</b>, the text between <b> and </b> is only "$2.50" and "too". The translation obtained from the adjustment result therefore does not correspond to the format of the sentence segment to be broken, and sentence segment labels need to be added as described in step S503. The final target sentence-break result is: {[The red costs <b>$2.50.</b>]}{[The blue costs <b>$2.50 too.</b>]}, and the corresponding translation is: {[The red version costs <b>$2.50.</b>]}{[The blue version <b>also</b> costs <b>$2.50.</b>]}. This translation corresponds to the format of the sentence segment to be broken; that is, the target sentence-break result {[The red costs <b>$2.50.</b>]}{[The blue costs <b>$2.50 too.</b>]} accurately and properly reflects the sentence segment to be broken, The red costs <b>$2.50.</b> The blue costs <b>$2.50 too.</b>
In an alternative embodiment of the present invention, referring to fig. 8, the correct translation of the sentence fragment to be punctuated according to the target sentence fragment result comprises the following steps:
step S801, translating each punctuation unit in the target punctuation result to obtain translation results of a plurality of punctuation units;
step S802, connecting the translation results of the sentence interruption units in sequence to obtain an initial translation result of the sentence segment to be interrupted;
when they are connected in order, it is necessary to process the white characters (spaces, paragraph marks, etc. are all white characters) of the inter-sentence content. For example: in the translation, since chinese does not require the use of spaces between sentences, the spaces between sentences need to be removed.
Step S803, determining sentence fragment label pairs in the initial translation result;
as described in step S701 above, sentence segment tag pairs can be searched through a stack structure or a linked list structure, and after sentence segment tag pairs are searched, the pairing relationship between the two pairs is recorded.
Step S804, if the text content is not contained between the target sentence fragment label pair, the target sentence fragment label pair is marked with a deletion mark;
specifically, whether text content or independent sentence segment labels are contained between the beginning sentence segment label and the ending sentence segment label of the sentence segment label pair is checked in sequence, and if not, the sentence segment label pair is marked with a deletion mark.
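Steps S803 and S804 can be sketched in Python by pairing labels with a stack and marking every pair that encloses nothing but other paired labels. Two assumptions are made here: independent labels such as <b/> and %s are not matched by the tag pattern and therefore count as content, and whitespace between a pair also counts as content (the text does not say how whitespace-only pairs are treated). The function name mark_empty_pairs is hypothetical.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")  # paired labels only; <x/> and %s are not matched


def mark_empty_pairs(text):
    """Return (start, end) spans of label pairs that enclose no content."""
    stack, marked = [], []
    for m in TAG.finditer(text):
        closing, name = m.group(1), m.group(2)
        if not closing:
            stack.append((name, m.start(), m.end()))
        elif stack and stack[-1][0] == name:
            _, start, inner_start = stack.pop()
            # Remove nested paired labels; anything left over is real content.
            inner = TAG.sub("", text[inner_start:m.start()])
            if inner == "":
                marked.append((start, m.end()))
    return marked


text = "<i></i><i>The red</i> costs <b>%s</b><b></b>"
spans = mark_empty_pairs(text)
marked_pairs = [text[s:e] for s, e in spans]
```

Here the pair around "The red" and the pair around %s are kept, while the two pairs with nothing between them receive the deletion mark.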
Step S805, determining the minimum number of the sentence segment label pairs needing to be reserved for the sentence segment label pairs of the same sentence segment label, wherein the same sentence segment label refers to the copy of the original sentence segment label;
specifically, the minimum number of sentence segment tag pairs to be retained is determined according to the preset requirement, and the copy may be a sentence segment tag added before.
For example, if the preset requirement is that no pair needs to be reserved, at least 0 pairs are reserved and the minimum number is 0; if the preset requirement is that at least 1 pair is reserved, the minimum number is 1; if the preset requirement is that the number of reserved pairs is not less than the number of corresponding sentence segment labels in the sentence segment to be broken, at least n pairs are reserved, where n is the number of corresponding sentence segment labels in the sentence segment to be broken.
Step S806, performing a supplementary deletion operation on the sentence segment label pair of the same sentence segment label in the initial translation result based on the target sentence segment label pair with the deletion mark and the minimum number of the sentence segment label pairs that need to be preserved, to obtain a translation result of the sentence segment to be broken.
The specific process is as follows: let r be the number of sentence segment label pairs without a deletion mark in the sentence segment to be broken, d the number of sentence segment label pairs with a deletion mark in the sentence segment to be broken, and m the minimum number of sentence segment label pairs to be reserved. If r > m, all sentence segment label pairs with a deletion mark are deleted; if r < m and r + d > m, r + d - m of the sentence segment label pairs with a deletion mark are deleted; if r + d < m, m - r - d sentence segment label pairs are appended at the end of the sentence segment to be broken.
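The case analysis of step S806 can be written as a short Python function. The function name plan_pairs and its return convention are assumptions. The text only states the strict cases (r > m, r + d > m, r + d < m); this sketch treats the boundary cases r = m and r + d = m with the same formulas, which is an assumption that gives the same counts as either neighbouring case.

```python
def plan_pairs(r: int, d: int, m: int) -> tuple[int, int]:
    """Return (pairs_to_delete, pairs_to_append) for one kind of label.

    r: pairs without a deletion mark
    d: pairs with a deletion mark
    m: minimum number of pairs that must be reserved
    """
    if r >= m:
        return d, 0              # enough unmarked pairs: delete every marked pair
    if r + d >= m:
        return r + d - m, 0      # keep just enough marked pairs to reach m
    return 0, m - r - d          # too few pairs overall: append the shortfall


examples = [plan_pairs(2, 1, 1), plan_pairs(0, 3, 1), plan_pairs(0, 1, 3)]
```

In each case the number of pairs left afterwards, r + d - deleted + appended, is at least m.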
The sentence-breaking method of the present invention extends the sentence-breaking method of the conventional technology: on the basis of the text sentence-break result, it processes the sentence segment labels reasonably (including adding sentence-break positions, adjusting the positions of sentence segment labels and adding sentence segment labels), thereby achieving proper sentence breaking of sentence segments that contain sentence segment labels and effectively avoiding the translation difficulties caused by improper handling of sentence segment labels.
Example 3:
The embodiment of the present invention further provides a sentence-break device for sentence segments, which is mainly used to execute the sentence-break method for sentence segments provided above. The device is described in detail below.
Fig. 9 is a schematic diagram of a sentence-break device for sentence segments according to an embodiment of the present invention. As shown in Fig. 9, the device mainly includes an obtaining unit 10, a sentence-break unit 20 and an adjusting unit 30, wherein:
the obtaining unit is configured to obtain a sentence segment to be broken that contains sentence segment labels, and a text sentence-break result of the to-be-broken text content corresponding to the sentence segment to be broken;
the sentence-break unit is configured to break the sentence segment to be broken with reference to the text sentence-break result, to obtain a sentence-break result of the sentence segment;
and the adjusting unit is configured to adjust the sentence segment labels and/or the sentence break positions in the sentence-break result of the sentence segment according to a preset adjustment strategy, to obtain a target sentence-break result of the sentence segment to be broken, so that the sentence segment to be broken is correctly translated according to the target sentence-break result.
In an embodiment of the present invention, a sentence-break device for sentence segments is provided, which operates as follows: first, a sentence segment to be broken that contains sentence segment labels is obtained, together with a text sentence-break result of the to-be-broken text content corresponding to it; then, the sentence segment to be broken is broken with reference to the text sentence-break result, to obtain a sentence-break result of the sentence segment; finally, the sentence segment labels and/or the sentence break positions in the sentence-break result of the sentence segment are adjusted according to a preset adjustment strategy, to obtain a target sentence-break result, according to which the sentence segment to be broken is correctly translated. The device can thus break a sentence segment containing sentence segment labels reasonably and translate it correctly according to the target sentence-break result, which solves the technical problem that the prior art cannot reasonably break sentence segments containing sentence segment labels.
Optionally, the obtaining unit is further configured to: delete the sentence segment labels from the sentence segment to be broken, to obtain the to-be-broken text content corresponding to the sentence segment to be broken; and break the to-be-broken text content according to a preset sentence-break rule, to obtain the text sentence-break result.
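As a minimal sketch of these two steps, the snippet below strips labels and then splits the remaining text into clauses. Both the label syntax (angle-bracket labels) and the sentence-break rule (break after ., !, ? and their full-width forms) are assumptions made for illustration; the text only says that a preset rule is used.

```python
import re
from typing import List

# Assumed label syntax: anything of the form <...>.
LABEL = re.compile(r"<[^<>]+>")
# Assumed sentence-break rule: a clause ends at ., !, ? or a full-width equivalent.
CLAUSE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def text_sentence_break(segment: str) -> List[str]:
    """Delete the labels, then break the remaining text into clauses."""
    text = LABEL.sub("", segment)
    clauses = (c.strip() for c in CLAUSE.findall(text))
    return [c for c in clauses if c]
```

With these assumptions, `text_sentence_break("Hello <b>world. Good</b> day.")` yields the two clauses `"Hello world."` and `"Good day."`.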
Optionally, the text sentence-break result includes at least one clause, and the sentence-break unit is further configured to: align the characters of the sentence segment to be broken with those of the text sentence-break result according to the text content; and perform the sentence-break operation on the sentence segment to be broken based on the character alignment result.
Optionally, the sentence-break unit is further configured to: scan the sentence segment to be broken until a first target character corresponding to the first character of a target clause is reached, and take the content before the first target character as the pre-sentence content of the sentence break unit corresponding to the target clause, wherein the target clause is the clause currently reached when the clauses of the text sentence-break result are traversed in order from the first clause; continue scanning the sentence segment to be broken until a second target character corresponding to the last character of the target clause is reached, and take the content from the first target character to the second target character as the clause segment content of the sentence break unit corresponding to the target clause; set the post-sentence content of the sentence break unit corresponding to the target clause to null; return to the scanning step until the character corresponding to the last character of the last clause in the text sentence-break result has been scanned; and, if unscanned characters remain in the sentence segment to be broken, take them as the post-sentence content of the last sentence break unit of the sentence segment to be broken.
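The scanning procedure can be sketched as follows. The names `BreakUnit` and `break_segment` are hypothetical, labels are assumed to look like `<...>`, each clause is assumed to be non-empty, and the non-label characters of the segment are assumed to match the clause characters one for one.

```python
import re
from dataclasses import dataclass
from typing import List

LABEL = re.compile(r"<[^<>]+>")  # assumed label syntax


@dataclass
class BreakUnit:
    pre: str = ""     # pre-sentence content
    clause: str = ""  # clause segment content
    post: str = ""    # post-sentence content


def break_segment(segment: str, clauses: List[str]) -> List[BreakUnit]:
    # A label is one token; every other character is its own token.
    toks = re.findall(r"<[^<>]+>|.", segment, flags=re.S)
    units: List[BreakUnit] = []
    i = 0
    for clause in clauses:
        unit = BreakUnit()
        # Everything before the clause's first character is pre-sentence content.
        while i < len(toks) and toks[i] != clause[0]:
            unit.pre += toks[i]
            i += 1
        # Consume tokens until every clause character has been matched;
        # labels met on the way stay inside the clause segment content.
        matched = 0
        while i < len(toks) and matched < len(clause):
            if not LABEL.fullmatch(toks[i]):
                matched += 1
            unit.clause += toks[i]
            i += 1
        units.append(unit)
    # Whatever is left after the last clause becomes its post-sentence content.
    if units and i < len(toks):
        units[-1].post = "".join(toks[i:])
    return units
```

For the segment `<i>Hello <b>world. Good</b> day.</i>` and the clauses `Hello world.` and `Good day.`, this gives one unit with pre-sentence content `<i>` and clause content `Hello <b>world.`, and a second unit with clause content `Good</b> day.` and post-sentence content `</i>`.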
Optionally, the sentence-break result of the sentence segment includes at least one sentence break unit, and the sentence break unit includes pre-sentence content, clause segment content and post-sentence content; the adjusting unit is further configured to: determine a first sentence-break attribute of each sentence segment label in the sentence-break result of the sentence segment according to a preset correspondence between sentence segment labels and sentence-break attributes, wherein the sentence-break attributes include: must be excluded, excludable, follows text, and must be preserved; adjust the positions of the sentence segment labels and/or add sentence break positions in the sentence-break result of the sentence segment based on the first sentence-break attribute of each sentence segment label, to obtain a sentence-break adjustment result; and add sentence segment labels to the sentence-break adjustment result according to a preset sentence segment label adding strategy, to obtain the target sentence-break result.
Optionally, the adjusting unit is further configured to: if the first sentence-break attribute of a first target sentence segment label is excludable and the first target sentence segment label is an independent label, change the first sentence-break attribute of the first target sentence segment label to follows text; if the first sentence-break attribute of a second target sentence segment label is must be excluded and the second target sentence segment label lies within the clause segment content, add a sentence break position at the second target sentence segment label, thereby obtaining a plurality of sentence break units, and take the second target sentence segment label as the pre-sentence content of the target sentence break unit, wherein the target sentence break unit is the sentence break unit following the newly added sentence break position; check the pre-sentence content of each sentence break unit in a first order, and if a sentence segment label with a target sentence-break attribute is found, move that label and the characters after it from the pre-sentence content into the clause segment content, and continue checking the pre-sentence content in the first order until a sentence segment label with the must-be-excluded attribute is found or the pre-sentence content has been fully checked, wherein the target sentence-break attributes include the must-be-preserved attribute and the follows-text attribute; and check the post-sentence content of each sentence break unit in a second order, and if a sentence segment label with a target sentence-break attribute is found, move that label and the characters before it from the post-sentence content into the clause segment content, and continue checking the post-sentence content in the second order until a sentence segment label with the must-be-excluded attribute is found or the post-sentence content has been fully checked.
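The pre-sentence check can be illustrated with the sketch below. Several points are assumptions rather than statements of the embodiment: the "first order" is taken to be right to left (from the clause outward), the attribute table `ATTRS` and the helper names are hypothetical, and labels are assumed to be HTML-like.

```python
import re
from enum import Enum
from typing import Tuple


class Attr(Enum):
    MUST_EXCLUDE = 1
    EXCLUDABLE = 2
    FOLLOW_TEXT = 3
    MUST_PRESERVE = 4


LABEL = re.compile(r"<[^<>]+>")
# Hypothetical correspondence between label names and sentence-break attributes.
ATTRS = {"br": Attr.MUST_EXCLUDE, "b": Attr.FOLLOW_TEXT}


def attr_of(label: str) -> Attr:
    m = re.match(r"</?([A-Za-z][\w-]*)", label)
    return ATTRS.get(m.group(1) if m else "", Attr.EXCLUDABLE)


def pull_pre_into_clause(pre: str, clause: str) -> Tuple[str, str]:
    """Move labels that must stay with the text from pre-sentence content into the clause."""
    toks = re.findall(r"<[^<>]+>|.", pre, flags=re.S)
    cut = len(toks)  # tokens from index `cut` onward move into the clause
    for i in range(len(toks) - 1, -1, -1):  # assumed first order: right to left
        if not LABEL.fullmatch(toks[i]):
            continue
        attr = attr_of(toks[i])
        if attr is Attr.MUST_EXCLUDE:
            break  # stop at a label that must be excluded
        if attr in (Attr.MUST_PRESERVE, Attr.FOLLOW_TEXT):
            cut = i  # this label and everything after it joins the clause
    return "".join(toks[:cut]), "".join(toks[cut:]) + clause
```

The post-sentence check would mirror this, scanning in the opposite direction and moving the label together with the characters before it.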
Optionally, the adjusting unit is further configured to: determine whether any clause segment content in the sentence-break adjustment result is missing a sentence segment label; if so, add sentence segment labels in the sentence break unit to which the target clause segment content belongs according to the preset sentence segment label adding strategy, to obtain the target sentence-break result, wherein the target clause segment content is the clause segment content that is missing a sentence segment label; if not, take the sentence-break adjustment result as the target sentence-break result.
Optionally, the adjusting unit is further configured to: determine the sentence segment labels to be added for the target clause segment content, wherein the sentence segment labels to be added include at least one of the following: a first to-be-added beginning label, a second to-be-added ending label, a third to-be-added ending label and a fourth to-be-added beginning label, wherein the second to-be-added ending label is the reverse-order label paired with the first to-be-added beginning label, and the fourth to-be-added beginning label is the reverse-order label paired with the third to-be-added ending label; add the first to-be-added beginning label to the head of the post-sentence content of the sentence break unit to which the target clause segment content belongs; add the second to-be-added ending label to the tail of the target clause segment content; add the third to-be-added ending label to the tail of the pre-sentence content of the sentence break unit to which the target clause segment content belongs; and add the fourth to-be-added beginning label to the head of the target clause segment content.
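One way to picture the four added labels is the sketch below, which balances the labels inside one clause segment content. It assumes HTML-like paired labels; the function name is hypothetical, and a beginning label added for an orphaned ending label is written as a bare `<name>` because its original attributes are not visible inside the clause.

```python
import re
from typing import List, Tuple

LABEL = re.compile(r"<(/?)([A-Za-z][\w-]*)[^<>]*>")  # assumed label syntax


def complete_labels(pre: str, clause: str, post: str) -> Tuple[str, str, str]:
    unclosed: List[Tuple[str, str]] = []  # (label text, name) opened but not closed in the clause
    orphans: List[str] = []               # names closed in the clause without a beginning label
    for m in LABEL.finditer(clause):
        text, closing, name = m.group(0), m.group(1) == "/", m.group(2)
        if text.endswith("/>"):
            continue  # self-closing labels need no partner
        if not closing:
            unclosed.append((text, name))
        elif unclosed and unclosed[-1][1] == name:
            unclosed.pop()
        else:
            orphans.append(name)
    # Second label: ending labels at the clause tail, in reverse order of opening.
    ends_for_clause = "".join(f"</{n}>" for _, n in reversed(unclosed))
    # First label: copies of the beginning labels at the head of the post-sentence content.
    begins_for_post = "".join(t for t, _ in unclosed)
    # Third label: ending labels at the tail of the pre-sentence content.
    ends_for_pre = "".join(f"</{n}>" for n in orphans)
    # Fourth label: beginning labels at the clause head, in reverse order.
    begins_for_clause = "".join(f"<{n}>" for n in reversed(orphans))
    return (pre + ends_for_pre,
            begins_for_clause + clause + ends_for_clause,
            begins_for_post + post)
```

Applied to the clause `Hello <b>world.`, this closes the label inside the clause and reopens it in the post-sentence content, so each sentence break unit carries balanced labels when it is translated on its own.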
Optionally, the device is further configured to: translate each sentence break unit in the target sentence-break result, to obtain the translation results of a plurality of sentence break units; concatenate the translation results of the sentence break units in order, to obtain an initial translation result of the sentence segment to be broken; determine the sentence segment label pairs in the initial translation result; if a target sentence segment label pair contains no text content, mark the target sentence segment label pair with a deletion mark; determine, for the sentence segment label pairs of the same sentence segment label, the minimum number of sentence segment label pairs to be retained, wherein the same sentence segment label refers to copies of one original sentence segment label; and, based on the target sentence segment label pairs carrying a deletion mark and the minimum number of sentence segment label pairs to be retained, perform a supplement-or-delete operation on the sentence segment label pairs of the same sentence segment label in the initial translation result, to obtain the translation result of the sentence segment to be broken.
The implementation principle and technical effect of the sentence-break device provided by the embodiment of the present invention are the same as those of the method embodiment in embodiment 2. For brevity, where the device embodiment does not mention something, reference may be made to the corresponding content in the method embodiment.
In another embodiment, a computer-readable medium is also provided, storing non-volatile program code executable by a processor, the program code causing the processor to perform the steps of the method described in embodiment 2 above.
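The step that marks empty label pairs can be sketched as below. This is an illustration under assumptions: labels are HTML-like, pairs are not nested inside a pair of the same name, and a pair containing only whitespace or other labels is treated as having no text content. The function name is hypothetical.

```python
import re
from typing import List, Tuple

# A beginning label, its content, and the matching ending label (assumed syntax).
PAIR = re.compile(r"<([A-Za-z][\w-]*)\b[^<>]*>(.*?)</\1>", re.S)
LABEL = re.compile(r"<[^<>]+>")


def mark_label_pairs(translated: str) -> List[Tuple[str, str, bool]]:
    """Return (label name, inner content, has_deletion_mark) for each label pair found."""
    pairs = []
    for m in PAIR.finditer(translated):
        inner_text = LABEL.sub("", m.group(2))
        # A pair gets a deletion mark when nothing but whitespace remains inside it.
        pairs.append((m.group(1), m.group(2), inner_text.strip() == ""))
    return pairs
```

For the concatenated result `Hello <b>world.</b><b> </b><b>Good</b> day.`, the middle pair would be marked for deletion, giving two unmarked pairs and one marked pair for the label `b`.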
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: for example, a connection may be fixed, removable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. Those skilled in the art can understand the specific meanings of the above terms in the present invention according to the specific circumstances.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of one logic function, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that, within the technical scope of the present disclosure, anyone skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A sentence-break method for sentence segments, characterized by comprising the following steps:
obtaining a sentence segment to be broken that contains sentence segment labels, and a text sentence-break result of the to-be-broken text content corresponding to the sentence segment to be broken;
breaking the sentence segment to be broken with reference to the text sentence-break result, to obtain a sentence-break result of the sentence segment;
and adjusting the sentence segment labels and/or the sentence break positions in the sentence-break result of the sentence segment according to a preset adjustment strategy, to obtain a target sentence-break result of the sentence segment to be broken, so that the sentence segment to be broken is correctly translated according to the target sentence-break result.
2. The method of claim 1, wherein obtaining text sentence break results for the text content of the sentence to be broken corresponding to the sentence to be broken comprises:
deleting the sentence segment labels in the sentence segments to be punctuated to obtain the text contents of the sentences to be punctuated corresponding to the sentence segments to be punctuated;
and carrying out sentence breaking on the text content of the sentence to be broken according to a preset sentence breaking rule to obtain a text sentence breaking result.
3. The method of claim 1, wherein the text sentence-breaking result comprises at least one clause, and wherein breaking the to-be-broken sentence segment with reference to the text sentence-breaking result comprises:
aligning the sentence segment to be punctuated with the text punctuation result according to the text content;
and executing sentence breaking operation on the sentence segment to be broken based on the character alignment result.
4. The method of claim 3, wherein performing a sentence break operation on the sentence segment to be broken based on the result of the character alignment comprises:
scanning the sentence segment to be broken until a first target character corresponding to the first character of a target clause is reached, and taking the content before the first target character as the pre-sentence content of the sentence break unit corresponding to the target clause, wherein the target clause is the clause currently reached when the clauses of the text sentence-break result are traversed in order from the first clause;
continuing to scan the sentence segment to be broken until a second target character corresponding to the last character of the target clause is reached, and taking the content from the first target character to the second target character as the clause segment content of the sentence break unit corresponding to the target clause;
setting the post-sentence content of the sentence break unit corresponding to the target clause to null;
returning to the step of scanning the sentence segment to be broken until the character corresponding to the last character of the last clause in the text sentence-break result has been scanned;
and if unscanned characters remain in the sentence segment to be broken, taking the unscanned characters as the post-sentence content of the last sentence break unit of the sentence segment to be broken.
5. The method of claim 1, wherein the sentence-break result of the sentence segment comprises at least one sentence break unit, the sentence break unit comprising pre-sentence content, clause segment content and post-sentence content, and wherein adjusting the sentence segment labels and/or the sentence break positions in the sentence-break result of the sentence segment according to the preset adjustment strategy comprises:
determining a first sentence-break attribute of each sentence segment label in the sentence-break result of the sentence segment according to a preset correspondence between sentence segment labels and sentence-break attributes, wherein the sentence-break attributes comprise: must be excluded, excludable, follows text, and must be preserved;
adjusting the positions of the sentence segment labels and/or adding sentence break positions in the sentence-break result of the sentence segment based on the first sentence-break attribute of each sentence segment label, to obtain a sentence-break adjustment result;
and adding sentence segment labels to the sentence-break adjustment result according to a preset sentence segment label adding strategy, to obtain the target sentence-break result.
6. The method of claim 5, wherein adjusting the positions of the sentence segment labels and/or adding sentence break positions in the sentence-break result of the sentence segment based on the first sentence-break attribute of each sentence segment label comprises:
if the first sentence-break attribute of a first target sentence segment label is excludable and the first target sentence segment label is an independent label, changing the first sentence-break attribute of the first target sentence segment label to follows text;
if the first sentence-break attribute of a second target sentence segment label is must be excluded and the second target sentence segment label lies within the clause segment content, adding a sentence break position at the second target sentence segment label, thereby obtaining a plurality of sentence break units, and taking the second target sentence segment label as the pre-sentence content of the target sentence break unit, wherein the target sentence break unit is the sentence break unit following the newly added sentence break position;
checking the pre-sentence content of each sentence break unit in a first order, and if a sentence segment label with a target sentence-break attribute is found, moving that label and the characters after it from the pre-sentence content into the clause segment content, and continuing to check the pre-sentence content in the first order until a sentence segment label with the must-be-excluded attribute is found or the pre-sentence content has been fully checked, wherein the target sentence-break attributes comprise the must-be-preserved attribute and the follows-text attribute;
and checking the post-sentence content of each sentence break unit in a second order, and if a sentence segment label with a target sentence-break attribute is found, moving that label and the characters before it from the post-sentence content into the clause segment content, and continuing to check the post-sentence content in the second order until a sentence segment label with the must-be-excluded attribute is found or the post-sentence content has been fully checked.
7. The method of claim 5, wherein adding sentence segment labels to the sentence-break adjustment result according to the preset sentence segment label adding strategy comprises:
determining whether any clause segment content in the sentence-break adjustment result is missing a sentence segment label;
if so, adding sentence segment labels in the sentence break unit to which the target clause segment content belongs according to the preset sentence segment label adding strategy, to obtain the target sentence-break result, wherein the target clause segment content is the clause segment content that is missing a sentence segment label;
if not, taking the sentence-break adjustment result as the target sentence-break result.
8. The method of claim 7, wherein adding sentence fragment tags in sentence break units to which the target sub-sentence fragment content belongs according to the preset sentence fragment tag adding strategy comprises:
determining a sentence segment label to be added corresponding to the target sub-sentence segment content, wherein the sentence segment label to be added comprises at least one of the following: a first to-be-added beginning sentence segment label, a second to-be-added ending sentence segment label, a third to-be-added ending sentence segment label and a fourth to-be-added beginning sentence segment label, wherein the second to-be-added ending sentence segment label is a reverse-sequence sentence segment label paired with the first to-be-added beginning sentence segment label, and the fourth to-be-added beginning sentence segment label is a reverse-sequence sentence segment label paired with the third to-be-added ending sentence segment label;
adding the first to-be-added beginning sentence segment label to the head of the post-sentence content of the sentence break unit to which the target sub-sentence segment content belongs;
adding the second to-be-added end sentence segment label to the end of the target clause segment content;
adding the third to-be-added end sentence fragment label to the tail of the sentence front content of the sentence break unit to which the target sub-sentence fragment content belongs;
and adding the fourth to-be-added beginning sentence segment label to the head of the target clause segment content.
9. The method of claim 1, wherein correctly translating the sentence segment to be punctuated according to the target sentence-breaking result comprises:
translating each punctuation unit in the target punctuation result to obtain translation results of a plurality of punctuation units;
concatenating the translation results of the sentence break units in order, to obtain an initial translation result of the sentence segment to be broken;
determining sentence segment label pairs in the initial translation result;
if the target sentence fragment label pair does not contain text content, marking the target sentence fragment label pair with a deletion mark;
determining the minimum number of sentence segment label pairs needing to be reserved for the sentence segment label pairs of the same sentence segment label, wherein the same sentence segment label refers to a copy of the original sentence segment label;
and performing supplementary deletion operation on the sentence segment label pair of the same sentence segment label in the initial translation result based on the target sentence segment label pair with the deletion mark and the minimum number of the sentence segment label pairs needing to be reserved to obtain the translation result of the sentence segment to be broken.
10. A sentence-break device for sentence segments, comprising:
an obtaining unit, configured to obtain a sentence segment to be broken that contains sentence segment labels, and a text sentence-break result of the to-be-broken text content corresponding to the sentence segment to be broken;
a sentence-break unit, configured to break the sentence segment to be broken with reference to the text sentence-break result, to obtain a sentence-break result of the sentence segment;
and an adjusting unit, configured to adjust the sentence segment labels and/or the sentence break positions in the sentence-break result of the sentence segment according to a preset adjustment strategy, to obtain a target sentence-break result of the sentence segment to be broken, so that the sentence segment to be broken is correctly translated according to the target sentence-break result.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of the preceding claims 1 to 9 are implemented when the computer program is executed by the processor.
12. A computer-readable medium having non-volatile program code executable by a processor, characterized in that the program code causes the processor to perform the steps of the method of any of the preceding claims 1 to 9.
CN202011598556.7A 2020-12-29 2020-12-29 Sentence segmentation method and device and electronic equipment Pending CN112632988A (en)

Publications (1)

Publication Number Publication Date
CN112632988A true CN112632988A (en) 2021-04-09

Family

ID=75286747

Similar Documents

Publication Publication Date Title
US10650192B2 (en) Method and device for recognizing domain named entity
CN102737012B (en) Text information comparison method and system
KR20220133141A (en) Text extraction method, text extraction model training method, apparatus and device
JPH0250478B2 (en)
EP2790111A1 (en) Method and device for acquiring structured information in layout file
US9613005B2 (en) Method and apparatus for bidirectional typesetting
CN110162782B (en) Entity extraction method, device and equipment based on medical dictionary and storage medium
CN111144100B (en) Question text recognition method and device, electronic equipment and storage medium
CN111797630B (en) PDF-format-paper-oriented biomedical entity identification method
CN110837788A (en) PDF document processing method and device
CN108664471B (en) Character recognition error correction method, device, equipment and computer readable storage medium
WO2021174786A1 (en) Training sample production method and apparatus, computer device, and readable storage medium
CN109871544B (en) Entity identification method, device, equipment and storage medium based on Chinese medical record
CN115546809A (en) Table structure identification method based on cell constraint and application thereof
CN109670461A (en) PDF text extraction method, device, computer equipment and storage medium
CN112632988A (en) Sentence segmentation method and device and electronic equipment
CN113283231B (en) Method for acquiring signature bit, setting system, signature system and storage medium
CN114579796A (en) Machine reading understanding method and device
CN110807322B (en) Method, device, server and storage medium for identifying new words based on information entropy
CN111679825A (en) Cascading style sheet generation method and device, computer equipment and storage medium
CN111708891B (en) Food material entity linking method and device between multi-source food material data
KR101721536B1 (en) Statistical word alignment method for applying alignment tendency between word class and machine translation apparatus using the same
CN113779218B (en) Question-answer pair construction method, question-answer pair construction device, computer equipment and storage medium
CN113255369B (en) Text similarity analysis method and device and storage medium
CN117173725B (en) Table information processing method, apparatus, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination