CN112861513B - Text segmentation method, device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN112861513B
Authority
CN
China
Prior art keywords
clauses
clause
text
output
dividing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110164368.1A
Other languages
Chinese (zh)
Other versions
CN112861513A (en
Inventor
常炎隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110164368.1A priority Critical patent/CN112861513B/en
Publication of CN112861513A publication Critical patent/CN112861513A/en
Application granted granted Critical
Publication of CN112861513B publication Critical patent/CN112861513B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text segmentation method, a text segmentation device, electronic equipment and a storage medium, and relates to the field of information processing. The specific implementation scheme is as follows: dividing the text to be processed based on punctuation marks to obtain L first clauses, where L is an integer greater than or equal to 1; determining M clauses to be output based on the L first clauses, and taking the M clauses to be output as the segmentation result of the text to be processed, where M is an integer greater than or equal to 1; wherein determining the M clauses to be output based on the L first clauses includes: when the length of the i-th first clause among the L first clauses is greater than a preset length threshold value, processing the i-th first clause based on a matching rule to obtain the clause to be output, where i is an integer greater than or equal to 1 and less than or equal to L.

Description

Text segmentation method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of information processing, and in particular, to the field of text information processing.
Background
In the prior art, text is usually segmented by first performing word segmentation and then generating sentences according to part of speech. This workflow is redundant and heavyweight: word segmentation relies on large lexicons and word segmentation algorithms; after word segmentation, part-of-speech regression must be performed on the resulting words before sentences can be generated, which relies on large part-of-speech models; and the combined sentences may still contain errors caused by insufficient part-of-speech coverage or part-of-speech conflicts.
Disclosure of Invention
The disclosure provides a text segmentation method, a text segmentation device, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a text segmentation method including:
dividing the text to be processed based on punctuation marks to obtain L first clauses; L is an integer greater than or equal to 1;
determining M to-be-output clauses based on the L first clauses, and taking the M to-be-output clauses as a segmentation result of the to-be-processed text; M is an integer greater than or equal to 1;
wherein the determining M to-be-output clauses based on the L first clauses includes:
processing the i-th first clause based on a matching rule, when the length of the i-th first clause among the L first clauses is greater than a preset length threshold value, to obtain the clause to be output; i is an integer greater than or equal to 1 and less than or equal to L.
According to another aspect of the present disclosure, there is provided a text segmentation apparatus including:
the first dividing module is used for dividing the text to be processed based on punctuation marks to obtain L first clauses; L is an integer greater than or equal to 1;
the second dividing module is used for determining M to-be-output clauses based on the L first clauses, and taking the M to-be-output clauses as a segmentation result of the to-be-processed text;
the second dividing module is configured to process, based on a matching rule, the i-th first clause to obtain the clause to be output when the length of the i-th first clause among the L first clauses is greater than a preset length threshold value; i is an integer greater than or equal to 1 and less than or equal to L.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods provided by any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method provided by any of the embodiments of the present disclosure.
According to the technical scheme, the text to be processed can be segmented without adopting a complex model, and meanwhile, the accuracy and the processing efficiency of the text segmentation processing are ensured.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flowchart of a text segmentation method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a process flow for obtaining first clauses in a text segmentation method according to one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a process flow for dividing second clauses in a text segmentation method according to another embodiment of the present disclosure;
FIG. 4 is a second flowchart of a text segmentation method according to another embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a text segmentation apparatus according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of another text segmentation apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing a text segmentation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a flowchart of a text segmentation method provided in a first embodiment of the present disclosure. As shown in FIG. 1, the method includes:
S101: dividing the text to be processed based on punctuation marks to obtain L first clauses; L is an integer greater than or equal to 1;
S102: determining M to-be-output clauses based on the L first clauses, and taking the M to-be-output clauses as a segmentation result of the to-be-processed text; M is an integer greater than or equal to 1;
wherein determining M to-be-output clauses based on the L first clauses includes:
processing the i-th first clause based on a matching rule, when the length of the i-th first clause among the L first clauses is greater than a preset length threshold value, to obtain the clause to be output; i is an integer greater than or equal to 1 and less than or equal to L.
Here, the text to be processed may be an article, and the language used may include Chinese and/or a foreign language.
The punctuation marks may include any type of punctuation mark, such as periods, commas, ellipses, and the like, which are not exhaustively listed here.
The number L of the first clauses may be 1 or more, which is not limited in this embodiment.
Some of the L first clauses may have lengths greater than the preset length threshold value; these first clauses need to be further processed according to a matching rule to obtain the final clauses to be output. The matching rules can be set according to actual conditions and may comprise one or more different rules, each of which further divides a clause.
In addition, some of the L first clauses may have lengths not greater than the preset length threshold value; these first clauses may be used directly as clauses to be output without further processing.
That is, it may be judged in turn whether the length of each of the L first clauses is greater than the preset length threshold value. If the first clause currently being judged is referred to as the i-th first clause, determining the M clauses to be output based on the L first clauses includes:
processing the ith first clause based on a matching rule under the condition that the length of the ith first clause in the L first clauses is larger than a preset length threshold value to obtain the clause to be output; i is an integer of 1 or more and L or less;
And taking the ith first clause as the clause to be output under the condition that the length of the ith first clause in the L first clauses is not more than the preset length threshold value.
Here, the preset length threshold value may be set according to actual situations, for example, may be 15 characters, etc.
M may be an integer greater than or equal to 1, and is preferably an integer greater than or equal to 2. Taking the M clauses to be output as the segmentation result of the text to be processed may specifically mean arranging the M clauses to be output in order as the segmentation result of the text to be processed.
Therefore, with the above scheme, the text to be processed can be divided at punctuation marks to obtain a plurality of clauses, and those clauses whose lengths are greater than the preset length threshold value are then processed with a matching rule to obtain the final segmentation result of the text to be processed. The segmentation of the text to be processed can thus be realized without a complex model, while the accuracy and the processing efficiency of the text segmentation are ensured.
In addition, because punctuation marks have certain relevance with the semantics of the text to be processed, the first clause obtained through division has no influence on the expression of the semantics.
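The flow of S101 and S102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the punctuation set, the function names, and the `apply_rules` callback (which stands in for the unspecified matching rules) are all hypothetical, and the threshold of 15 is only the example value given in the text.

```python
import re
from typing import Callable, List

# Example value from the description; the real threshold is configurable.
LENGTH_THRESHOLD = 15
# Hypothetical punctuation set; the description does not fix a list.
PUNCTUATION = r"[。！？，；、.!?,;]"


def split_by_punctuation(text: str) -> List[str]:
    """S101: divide the text at punctuation marks, dropping empty pieces."""
    return [p.strip() for p in re.split(PUNCTUATION, text) if p.strip()]


def segment(text: str,
            apply_rules: Callable[[str], List[str]],
            threshold: int = LENGTH_THRESHOLD) -> List[str]:
    """S102: keep short clauses as-is, hand long ones to the matching rules."""
    output: List[str] = []
    for clause in split_by_punctuation(text):
        if len(clause) > threshold:
            # Long clause: further divided by the (assumed) rule callback.
            output.extend(apply_rules(clause))
        else:
            # Short clause: used directly as a clause to be output.
            output.append(clause)
    return output
```

Passing the rules in as a callback keeps the top-level loop independent of whichever matching rules are configured.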
In the text segmentation method provided in the second embodiment of the present disclosure, as shown in FIG. 2, dividing the text to be processed based on punctuation marks to obtain L first clauses includes:
S201: dividing the text to be processed based on first category punctuation marks to obtain K paragraphs; K is an integer greater than or equal to 1;
S202: processing second category punctuation marks in the K paragraphs to obtain K processed paragraphs;
S203: dividing the K processed paragraphs based on third category punctuation marks to obtain the L first clauses.
Here, the first category, second category, and third category punctuation marks are all different from one another.
For example, the first category punctuation marks may be marks that indicate the end of a sentence, such as periods, question marks, exclamation marks, ellipses, and so on.
The second category punctuation marks may be preset symbols that have no influence on the semantics of the text as a whole, for example square brackets, angle brackets, and the like, which are not exhaustively listed here.
The third category punctuation marks may be marks that divide a long sentence semantically, or marks that denote a pause within a sentence, such as commas, enumeration commas (、), semicolons, and the like.
Dividing the text to be processed based on the first category punctuation marks to obtain K paragraphs may specifically be: segmenting the text to be processed at the first category punctuation marks to obtain K paragraphs.
Processing the second category punctuation marks in the K paragraphs to obtain K processed paragraphs may include:
filtering out the second category punctuation marks in the r-th paragraph, when such marks exist in the r-th paragraph of the K paragraphs, to obtain the processed r-th paragraph; r is an integer greater than or equal to 1 and less than or equal to K;
and directly taking the r-th paragraph as the processed r-th paragraph when no second category punctuation mark exists in the r-th paragraph of the K paragraphs.
That is, each of the K paragraphs is taken in turn as the current paragraph, and it is judged whether the current paragraph contains second category punctuation marks; if it does, the second category punctuation marks in the current paragraph are filtered out to obtain the processed current paragraph; if it does not, the current paragraph is used directly as the processed paragraph.
Dividing the K processed paragraphs based on the third category punctuation marks to obtain the L first clauses may include:
dividing the processed r-th paragraph at the third category punctuation marks, when such marks exist in the processed r-th paragraph, to obtain a plurality of first clauses;
and taking the processed r-th paragraph as a first clause when no third category punctuation mark exists in the processed r-th paragraph.
Here, it should be noted that the number of paragraphs obtained by finally dividing different texts to be processed may be different, and the number of divided paragraphs is not limited in this embodiment.
Finally, the L first clauses divided from the K paragraphs are obtained, where L may be an integer greater than or equal to 2. The number of first clauses obtained from any one paragraph is not limited here; all first clauses obtained from all K paragraphs are collectively referred to as the L first clauses.
Therefore, a plurality of clauses can be obtained by dividing the text to be processed through different types of punctuation marks, and the division mode can be realized without a complex model, so that the method can be more efficient; in addition, because punctuation marks have certain relevance with the semantics of the text to be processed, the first clause obtained through division has no influence on the expression of the semantics. In addition, the invalid punctuation marks are further filtered on the divided paragraphs, so that influence factors are reduced in subsequent processing, the processing efficiency is improved, and the semantics are not influenced.
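Steps S201 to S203 can be sketched as a three-stage pipeline. The three character sets below are assumptions chosen from the examples in the text (sentence-ending marks, brackets, and pause marks); the description leaves the exact membership of each category open, and the function name is hypothetical.

```python
import re
from typing import List

# Assumed category contents, based on the examples in the description.
FIRST_CATEGORY = r"[。！？….!?]+"   # marks that end a sentence
SECOND_CATEGORY = r"[\[\]<>]"        # marks with no semantic effect, filtered out
THIRD_CATEGORY = r"[，；、,;]"        # marks that denote a pause inside a sentence


def split_into_first_clauses(text: str) -> List[str]:
    # S201: divide the text into paragraphs at first category marks.
    paragraphs = [p.strip() for p in re.split(FIRST_CATEGORY, text) if p.strip()]
    # S202: filter second category marks out of every paragraph.
    processed = [re.sub(SECOND_CATEGORY, "", p) for p in paragraphs]
    # S203: divide each processed paragraph at third category marks.
    clauses: List[str] = []
    for paragraph in processed:
        clauses.extend(c.strip() for c in re.split(THIRD_CATEGORY, paragraph) if c.strip())
    return clauses
```

A paragraph without any third category mark passes through S203 unchanged and becomes a single first clause, as the description requires.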
As shown in FIG. 3, in the text segmentation method provided in the third embodiment of the present disclosure, processing the i-th first clause based on a matching rule to obtain the clause to be output includes:
S301: obtaining a target text that matches a text matching rule from the i-th first clause, and dividing the i-th first clause into a plurality of second clauses based on the target text;
S302: when the plurality of second clauses include a second clause to be adjusted whose length is greater than the preset length threshold value, determining a first class separator in the second clause to be adjusted based on a separator matching rule, and processing the second clause to be adjusted based on the first class separator to obtain the clause to be output.
It should be understood that, although this embodiment describes the processing of the i-th first clause, the i-th first clause is, as explained above, any first clause among the L first clauses whose length is greater than the preset length threshold value. Every such first clause can therefore be processed using the scheme provided in this embodiment, and the description is not repeated for each one.
The preset length threshold value can be set according to practical situations, for example, 15 characters and the like.
Here, the text matching rule is used to match the corresponding target text in the i-th first clause. The text matching rule may be represented by a first regular expression. For example, the first regular expression may be "A*B", which expresses that a piece of text beginning with the character "A" and ending with the character "B" is taken as the target text.
Dividing the ith first clause into a plurality of second clauses based on the target text may be: extracting the target text as one of the plurality of second clauses; if the content contained in the remaining ith first clause is divided into two parts by the target text, respectively taking the two parts as two second clauses; if the content contained in the remaining ith first clause is a part, the part is taken as another second clause.
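The division around the target text can be sketched as below. One assumption is made explicit: the rule written "A*B" in the text is interpreted here as the Python regular expression `A.*?B` (a run starting with "A" and ending at the first following "B"); the function name is hypothetical.

```python
import re
from typing import List


def split_on_target(clause: str, pattern: str) -> List[str]:
    """Divide a first clause into second clauses around the matched target text."""
    match = re.search(pattern, clause)
    if match is None:
        # No target text: the clause is passed on undivided.
        return [clause]
    before = clause[:match.start()]
    target = match.group(0)
    after = clause[match.end():]
    # The target is one second clause; each non-empty remainder is another.
    return [part for part in (before, target, after) if part]
```

For instance, `split_on_target("xxA123Byy", r"A.*?B")` yields the target plus the two surrounding parts, while a clause that begins with the target yields only two second clauses.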
The separator matching rule may be used to match a separator of a preset type, which is referred to as the first class separator in this embodiment. For example, the separator matching rule may be set to ([0-9]+%), in which case the corresponding first class separator is a percentage; of course, there may be a plurality of separator matching rules, that is, one or more kinds of first class separators may be matched, and the possible separator matching rules are not exhaustively listed here.
Note that the processing of this embodiment may further include:
judging whether a target text matching the text matching rule exists in the i-th first clause, and if not, taking the i-th first clause as a second clause to be adjusted;
and judging whether a first class separator matching the separator matching rule exists in the second clause to be adjusted, and if not, taking the second clause to be adjusted as an initial third clause for subsequent processing.
The method can obtain the clause to be output based on at least one matching rule division through the processing, and the matching mode based on the matching rule does not need to depend on a complex language model, so that the processing complexity of the whole flow can be reduced, and the accuracy of the finally obtained clause to be output is ensured.
Wherein the dividing the ith first clause into a plurality of second clauses based on the target text includes:
dividing the i-th first clause into a plurality of initial second clauses and a second class separator when a second class separator is located immediately after the target text; and adding the second class separator to one of the plurality of initial second clauses to obtain the plurality of second clauses.
That is, it is judged whether a target text matching the text matching rule exists in the i-th first clause; if not, the i-th first clause is taken as a second clause to be adjusted;
if so, it is judged whether a second class separator exists at the position immediately after the target text; if it does, the i-th first clause is divided into a plurality of initial second clauses and the second class separator, and the second class separator is added to one of the initial second clauses to obtain the plurality of second clauses;
if no second class separator exists, the target text is extracted as one second clause; if the remaining content of the i-th first clause other than the target text is divided into two parts by the target text, the two parts are taken as two second clauses; and if the remaining content is a single part, that part is taken as another second clause.
Specifically, when a second class separator is located immediately after the target text, dividing the i-th first clause into a plurality of initial second clauses and the second class separator, and adding the second class separator to one of the plurality of initial second clauses to obtain the plurality of second clauses, may be:
when the position immediately after the target text is a second class separator, extracting the second class separator and dividing the remainder of the i-th first clause into a plurality of initial second clauses, wherein the target text is one of the plurality of initial second clauses;
judging, based on a first preset rule, whether to add the second class separator to the plurality of initial second clauses; if it is determined not to add it, directly deleting the second class separator and taking the plurality of initial second clauses as the plurality of second clauses; if it is determined to add it, adding the second class separator, based on a second preset rule, to the first initial second clause or the second initial second clause adjacent to it; if it is determined to add the second class separator to the first initial second clause, it is added to the end of the first initial second clause; otherwise, it is added to the beginning of the second initial second clause.
The first initial second clause is an initial second clause before the second class separator, and the second initial second clause is an initial second clause after the second class separator.
The types of the second class separator and the first class separator may be preset, but they are different, for example, the second class separator may be $, # and so on, which is not exhaustive herein.
For example, suppose the content of the i-th first clause is [1234Y567]. Extracting the target text from the i-th first clause with the text matching rule [1*4] yields [1234], and the second class separator [Y] immediately follows the target text [1234]; the i-th first clause is therefore divided into the initial second clauses [1234] and [567] and the second class separator [Y]. Whether the separator is added back is judged based on the first preset rule; if it is, the position at which it is added is determined based on the second preset rule, and if the second class separator is to be added to the first initial second clause, the resulting second clauses are [1234Y] and [567].
Through the above processing, the first clause is divided into second clauses based on the text matching rule. During this division, a second class separator that follows the target text may be left outside the target text; whether to add it back is decided based on the preset rules, so that the accuracy of the clauses finally output can be ensured.
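The handling of a second class separator that directly follows the target text can be sketched as follows. The first and second preset rules are not specified in the text, so they are modelled here as two plain boolean parameters (`keep` and `attach_to_first`); those names, the function name, and the interpretation of "1*4" as the regular expression `1.*?4` are assumptions. The separator set uses the `$` and `#` examples from the description.

```python
import re
from typing import List

# Example second class separators named in the description.
SECOND_CLASS_SEPARATORS = "$#"


def split_with_second_class_separator(clause: str, pattern: str,
                                      keep: bool = True,
                                      attach_to_first: bool = True) -> List[str]:
    match = re.search(pattern, clause)
    if match is None:
        return [clause]
    before = clause[:match.start()]
    target = match.group(0)
    after = clause[match.end():]
    # Ordinary division when no second class separator follows the target.
    if not after or after[0] not in SECOND_CLASS_SEPARATORS:
        return [p for p in (before, target, after) if p]
    separator, rest = after[0], after[1:]
    if keep:  # stand-in for the first preset rule
        if attach_to_first or not rest:  # stand-in for the second preset rule
            target += separator          # append to the clause before it
        else:
            rest = separator + rest      # prepend to the clause after it
    # When keep is False the separator is simply dropped.
    return [p for p in (before, target, rest) if p]
```

With `keep=False` the separator is deleted and the initial second clauses are returned unchanged, which mirrors the "determined not to add" branch in the text.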
The scheme provided by this embodiment further includes: marking the text to be processed based on unit texts of a preset type to obtain marked target unit texts in the text to be processed.
Obtaining the target text that matches the text matching rule from the i-th first clause, and dividing the i-th first clause into a plurality of second clauses based on the target text, further includes: determining the matched target text in the i-th first clause based on the text matching rule, and dividing the i-th first clause into a plurality of second clauses based on the target text when the target text does not cut through any target unit text.
In addition, the scheme may further include: when the target text cuts through a target unit text, not dividing the i-th first clause, directly taking the i-th first clause as a second clause to be adjusted, and performing the subsequent judgment based on the separator matching rule.
There may be one or more target unit texts. A target unit text is specifically a unit of text that must not be split, and any given target unit text belongs to one of one or more preset types of unit text. For example, a preset type of unit text may be any one of the following: a phrase within book-title marks (《》), a phrase within double quotation marks, a phrase within brackets, and the like.
That is, when the target text is determined based on the text matching rule, it may be judged, before the target text is extracted, whether the target text includes only part of the content of a target unit text; if so, it is determined that the target unit text would be split; otherwise, it is determined that the target unit text would not be split.
For example, suppose the preset type of unit text is text enclosed in double quotation marks, and the i-th first clause is [123"ab-4"567]. Based on the preset type of unit text, the "ab-4" contained in the i-th first clause is marked in advance as a target unit text. With the text matching rule [1*a], the target text matched from the i-th first clause is [123"a], which clearly cuts through the target unit text. The division is therefore not carried out: the i-th first clause is not divided, and is instead taken as the second clause to be adjusted for subsequent processing.
Through the above processing, a check on whether a unit text would be split is added to the division of second clauses based on the preset matching rule, so that the final division result is more consistent with the actual semantics and the accuracy of the clauses is ensured.
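The check on whether a target text would cut through an indivisible unit text can be sketched with character spans. The unit-text patterns below (curly double quotation marks, book-title marks, round brackets) are assumptions drawn from the examples in the text, and both function names are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed preset unit-text types: quoted text, book titles, bracketed text.
UNIT_TEXT_PATTERNS = [r"“[^”]*”", r"《[^》]*》", r"\([^)]*\)"]


def find_unit_spans(clause: str) -> List[Tuple[int, int]]:
    """Mark every target unit text as a (start, end) span in the clause."""
    spans: List[Tuple[int, int]] = []
    for pattern in UNIT_TEXT_PATTERNS:
        spans.extend(m.span() for m in re.finditer(pattern, clause))
    return spans


def cuts_unit_text(target_span: Tuple[int, int],
                   unit_spans: List[Tuple[int, int]]) -> bool:
    """True if a boundary of the target text falls strictly inside a unit text."""
    start, end = target_span
    return any(u_start < start < u_end or u_start < end < u_end
               for u_start, u_end in unit_spans)
```

When `cuts_unit_text` returns `True`, the clause would be left undivided and passed on as the second clause to be adjusted.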
After the above processing is completed, if there is no second clause to be adjusted, that is, if there is no second clause to be adjusted with a length greater than the preset length threshold value in the plurality of second clauses, the plurality of second clauses obtained by dividing the ith first clause are all used as the clause to be output.
And under the condition that second clauses to be adjusted, the length of which is larger than the preset length threshold value, exist in the plurality of second clauses, determining a first class separator from the second clauses to be adjusted based on a separator matching rule, and processing the second clauses to be adjusted based on the first class separator to obtain the clauses to be output.
The preset length threshold value can be set according to practical situations, for example, 15 characters and the like.
The foregoing embodiments have been described with respect to the manner in which the first-type separator is determined from the second clause to be adjusted based on the separator matching rule, and thus, a description thereof will not be repeated here.
After determining the first class separator, processing the second clause to be adjusted based on the first class separator to obtain the clause to be output may include:
Dividing the second clause to be adjusted into a plurality of initial third clauses and the first class separators based on the first class separators, and taking the plurality of initial third clauses and the first class separators as a plurality of third clauses; and under the condition that a candidate third clause with the length not larger than the preset length threshold value exists in the plurality of third clauses, taking the candidate third clause as the clause to be output.
Dividing the second clause to be adjusted into a plurality of initial third clauses and the first class delimiters based on the first class delimiters, taking the plurality of initial third clauses and the first class delimiters as a plurality of third clauses may include:
extracting the first class separator from the second clause to be adjusted, and dividing the remaining text of the second clause to be adjusted into one or more initial third clauses.
For example, suppose the second clause to be adjusted is [67% of XXXXXX in XXX] and the separator matching rule is ([0-9]+%); the split unit, that is, the first class separator, is then "67%". After the split unit is extracted, the remaining text consists of [in XXX] and [XXXXXX], and the split unit itself is taken as one third clause, so that 3 third clauses are finally obtained, namely [in XXX], [67%], and [XXXXXX].
Further, judging whether the length of each third clause of the plurality of third clauses is larger than the preset length threshold value in sequence, and if so, taking the third clause as a third clause to be adjusted; if not, the third clause is used as a candidate third clause, and the candidate third clause is directly used as the clause to be output.
Therefore, at least one clause divided based on the preset separator matching rule can be obtained, and the matching mode based on the preset rule does not need to depend on a complex language model, so that the processing complexity of the whole flow can be reduced, and the semantic accuracy of finally output clauses to be output is ensured.
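The division by a first class separator can be sketched with `re.split` and a capturing group, which keeps the separator itself as one of the resulting pieces. The separator rule is the `[0-9]+%` example from the text; the function name and the returned `(clause, needs_adjustment)` pairs are hypothetical conveniences, and 15 is only the example threshold.

```python
import re
from typing import List, Tuple

# Example separator matching rule from the description: a percentage.
SEPARATOR_RULE = r"[0-9]+%"


def split_into_third_clauses(clause: str,
                             threshold: int = 15) -> List[Tuple[str, bool]]:
    """Split at first class separators, keeping each separator as its own clause.

    Each result is paired with a flag that is True when the clause is still
    longer than the threshold and therefore needs further adjustment.
    """
    # The capturing group makes re.split return the separators as well.
    pieces = re.split(f"({SEPARATOR_RULE})", clause)
    third_clauses = [p.strip() for p in pieces if p.strip()]
    return [(c, len(c) > threshold) for c in third_clauses]
```

Clauses flagged `False` can be output directly; clauses flagged `True` correspond to third clauses to be adjusted.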
Further, the method may further include: when the plurality of third clauses include a third clause to be adjusted whose length is greater than the preset length threshold value, performing word segmentation on the third clause to be adjusted to obtain a plurality of phrases; and combining the phrases to obtain a plurality of clauses to be output, each of which is smaller than the preset length threshold value.
Here, the preset length threshold value may be set according to actual situations, for example, may be 15 characters, etc.
All of the plurality of third clauses may have lengths not greater than the preset length threshold value, in which case the plurality of third clauses may be directly used as clauses to be output; alternatively, if some of the third clauses have lengths greater than the preset length threshold value, those third clauses are treated as third clauses to be adjusted, and the third clauses whose lengths are not greater than the preset length threshold value are treated as clauses to be output.
That is, the number of third clauses to be adjusted may be 0, or may be 1 or more, which is not limited in this embodiment. As long as there is at least one third clause to be adjusted, word segmentation is performed on it to obtain a plurality of phrases, and the phrases are combined to obtain a plurality of clauses to be output, each smaller than the preset length threshold value.
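The length check described here amounts to partitioning the third clauses into two groups. A minimal sketch follows, assuming that length is measured in characters and using the example threshold of 15; the function name `partition_by_length` is hypothetical.

```python
def partition_by_length(clauses, max_len=15):
    """Separate clauses that can be output directly from those needing adjustment."""
    to_output = [c for c in clauses if len(c) <= max_len]
    to_adjust = [c for c in clauses if len(c) > max_len]
    return to_output, to_adjust

ready, pending = partition_by_length(["short clause", "a clause that is clearly too long"])
```

When `pending` is empty, no word segmentation is needed at all, which corresponds to the case where the number of third clauses to be adjusted is 0.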
The method for word segmentation of the third clause to be adjusted may be based on any word segmentation method, such as semantic-based segmentation, and the like.
It should be noted that the combining based on the plurality of phrases may further require that the word number of the last phrase in each combination be greater than or equal to a preset word number threshold value; the preset word number threshold value may be set according to the actual situation, for example, 2 characters.
That is, when the phrases are combined, the preset length threshold value is applied directly to each resulting clause to be output; in addition, because a single character is unlikely to express complete semantics, a further limit is placed on the word number of the last phrase in each clause to be output.
For example, suppose a third clause to be adjusted is "our goal this year is abcd and keep climbing" (lengths are counted in the characters of the original-language text), and its word segmentation result includes a plurality of phrases such as "we", "this year", "goal", "is", "abcd", "keep" and "climbing". With a preset length threshold value of 8, combining the phrases yields three combinations; however, if the last phrase in the first combination does not meet the preset word number threshold value, the division needs to be performed again, and the clauses to be output may finally be "our goal this year", "is abcd" and "and keep climbing".
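The combination step can be sketched as a greedy packing followed by a correction pass. This is only one possible reading of the description, not the method of the disclosure: the function name `combine_phrases`, the two-pass strategy, the use of "not greater than" for the length check, and the toy phrases are all assumptions.

```python
def combine_phrases(phrases, max_len=8, min_last=2):
    """Combine segmented phrases into clauses of at most max_len characters.

    Pass 1 packs phrases greedily. Pass 2 moves a trailing phrase shorter than
    min_last to the front of the next group, but only when that group still
    fits within max_len afterwards.
    """
    groups = []
    for phrase in phrases:
        if groups and sum(map(len, groups[-1])) + len(phrase) <= max_len:
            groups[-1].append(phrase)
        else:
            # A phrase longer than max_len simply forms its own group.
            groups.append([phrase])

    for cur, nxt in zip(groups, groups[1:]):
        while (len(cur) > 1 and len(cur[-1]) < min_last
               and sum(map(len, nxt)) + len(cur[-1]) <= max_len):
            nxt.insert(0, cur.pop())

    # Phrases are joined without spaces, as for text without word delimiters.
    return ["".join(g) for g in groups]

clauses = combine_phrases(["ab", "cd", "e", "fgh", "ij"], max_len=6, min_last=2)
print(clauses)
```

In the toy input, the single-character phrase "e" would end the first group after greedy packing, so it is moved to the start of the second group. The last group of all is not corrected in this sketch, since there is no following group to receive a short trailing phrase.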
Thus, on the basis of the at least one third clause, if a third clause to be adjusted is longer than the preset length threshold value, word segmentation is performed on it to obtain phrases, and the phrases are combined to finally obtain the clauses to be output. Although word segmentation may be used in this embodiment, the amount of data subjected to word segmentation is greatly reduced compared with the prior art, so the overall processing efficiency is not affected; and because the scheme of this embodiment does not rely on a large language model, the accuracy of the final segmentation result is ensured while the complexity of the overall processing flow is reduced.
The scheme provided in this embodiment is described with reference to fig. 4, by way of example:
s401: marking a text to be processed based on a unit text of a preset type to obtain a marked target unit text in the text to be processed; wherein, the preset type unit text refers to text combinations which are inseparable in the process.
S402: dividing the text to be processed based on the first punctuation mark to obtain K paragraphs.
S403: and processing the second punctuation mark in the K paragraphs to obtain processed K paragraphs.
S404: dividing the processed K paragraphs based on a third category punctuation mark to obtain L first clauses.
S405: extracting an ith first clause from the L first clauses, judging whether the length of the ith first clause is larger than a preset length threshold value, and executing S406 if the length of the ith first clause is larger than the preset length threshold value; otherwise, the ith first clause is used as the clause to be output.
In addition, after the ith first clause is taken as the clause to be output, the method may further include: extracting the (i+1) th first clause from the L first clauses, and repeatedly executing S405 by taking the (i+1) th clause as the (i) th first clause.
S406: and acquiring target text matched with a text matching rule from the ith first clause, and dividing the ith first clause into a plurality of second clauses based on the target text.
S407: judging whether the length of each second clause in the plurality of second clauses is larger than a preset length threshold value, if so, taking the second clause as a second clause to be adjusted, and executing S408; otherwise, the second clause with the length not larger than the preset length threshold value is used as the clause to be output.
S408: and determining a first class separator from the second clauses to be adjusted based on a separator matching rule, dividing the second clauses to be adjusted into a plurality of initial third clauses and the first class separator based on the first class separator, and taking the plurality of initial third clauses and the first class separator as a plurality of third clauses.
S409: judging whether the length of each third clause of the plurality of third clauses is greater than a preset threshold value, if so, taking the third clause with the length greater than the preset threshold value as a third clause to be adjusted, and executing S410; otherwise, directly taking the third clause with the length not longer than the preset threshold value as the third clause to be output.
S410: and word segmentation is carried out on the third clause to be adjusted to obtain a plurality of phrases, and a plurality of clauses to be output, which are smaller than the preset length threshold value, are obtained based on the plurality of phrases.
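Steps S405 to S410 form a cascade: a clause is handed to the next, finer splitting rule only while it is still longer than the threshold. A hedged sketch of that control flow follows; the function name `cascade_split` and the two toy splitters are hypothetical stand-ins for the text matching rule, the separator matching rule and the word segmentation step.

```python
def cascade_split(clauses, splitters, max_len=15):
    """Apply successively finer splitters only to clauses that are still too long.

    `splitters` is an ordered list of functions, each mapping one clause to a
    list of smaller clauses. When no splitter is left, the clause is emitted as is.
    """
    output = []
    for clause in clauses:
        if len(clause) <= max_len or not splitters:
            output.append(clause)
        else:
            pieces = splitters[0](clause)
            output.extend(cascade_split(pieces, splitters[1:], max_len))
    return output

# Toy splitters standing in for the rule-based stages (assumptions).
by_comma = lambda s: s.split(",")
by_space = lambda s: s.split(" ")

result = cascade_split(["aaaa,bbbbbbbb cc", "dd"], [by_comma, by_space], max_len=5)
print(result)
```

Because short clauses exit early, the more expensive later stages only ever see the small fraction of text that the cheaper rules could not shorten.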
A fourth embodiment of the present disclosure provides a text segmentation apparatus, as shown in fig. 5, including:
the first dividing module 501 is configured to divide a text to be processed based on punctuation marks, so as to obtain L first clauses; L is an integer of 1 or more;
the second dividing module 502 is configured to determine M clauses to be output based on the L first clauses, and use the M clauses to be output as a segmentation result of the text to be processed;
the second dividing module 502 is configured to process, based on a matching rule, the ith first clause to obtain the clause to be output when the length of the ith first clause in the L first clauses is greater than a preset length threshold value; i is an integer of 1 or more and L or less.
The first dividing module 501 is configured to divide the text to be processed based on a first punctuation mark, so as to obtain K paragraphs, K being an integer greater than or equal to 1; process the second punctuation mark in the K paragraphs to obtain K processed paragraphs; and divide the K processed paragraphs based on a third category punctuation mark to obtain L first clauses.
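The three punctuation-based operations of the first dividing module can be sketched as follows. This passage does not say which marks belong to each category, so every default below is an assumption made purely for illustration: a newline as the first punctuation mark, quotation marks and brackets as second punctuation marks that are removed, and sentence or clause punctuation as the third category. The function name `first_clauses` is likewise hypothetical.

```python
import re

def first_clauses(text, para_sep="\n", strip_chars='"“”()（）',
                  clause_punct="。！？；，,.!?;"):
    """Split text into paragraphs, clean them, then split them into first clauses."""
    # Divide into K paragraphs on the (assumed) first punctuation mark.
    paragraphs = [p for p in text.split(para_sep) if p.strip()]
    # Process the (assumed) second punctuation marks by deleting them.
    table = str.maketrans("", "", strip_chars)
    cleaned = [p.translate(table) for p in paragraphs]
    # Divide each paragraph on the (assumed) third category punctuation marks.
    pattern = "[" + re.escape(clause_punct) + "]"
    clauses = []
    for paragraph in cleaned:
        clauses.extend(s.strip() for s in re.split(pattern, paragraph) if s.strip())
    return clauses

print(first_clauses('Hello, "world". Foo\nBar!'))
```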
The second dividing module 502 is configured to obtain, from the ith first clause, a target text that matches the text matching rule, and divide the ith first clause into a plurality of second clauses based on the target text; and under the condition that second clauses to be adjusted, the length of which is larger than the preset length threshold value, exist in the plurality of second clauses, determining a first class separator from the second clauses to be adjusted based on a separator matching rule, and processing the second clauses to be adjusted based on the first class separator to obtain the clauses to be output.
The second dividing module 502 is configured to divide the ith first clause into a plurality of initial second clauses and a second class separator if the second class separator is located at a neighboring position after the target text; and adding the second class separator to one of the plurality of initial second clauses to obtain the plurality of second clauses.
As shown in fig. 6, the apparatus further includes:
a marking module 503, configured to mark the text to be processed based on a unit text of a preset type, so as to obtain a marked target unit text in the text to be processed;
The second dividing module 502 is configured to determine, based on the text matching rule, the target text that matches from the ith first clause, and divide the ith first clause into a plurality of second clauses based on the target text when the target unit text is not segmented in the target text.
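One way to honor the condition that the target unit text is not segmented is to record the character spans of the marked unit texts and to skip any split match that would cut into such a span. The sketch below is an assumption-laden illustration: the unit pattern (numbers with an optional decimal part and percent sign), the split pattern and all function names are hypothetical.

```python
import re

def mark_units(text, unit_pattern=r"\d+(?:\.\d+)?%?"):
    """Return (start, end) spans of unit texts that must stay in one piece."""
    return [m.span() for m in re.finditer(unit_pattern, text)]

def cuts_unit(start, end, spans):
    """True if the match [start, end) overlaps a span without fully containing it."""
    return any(s < end and start < e and not (start <= s and e <= end)
               for s, e in spans)

def safe_split(text, split_pattern=r"[.,]"):
    """Split on split_pattern, ignoring matches that would segment a unit text."""
    spans = mark_units(text)
    pieces, last = [], 0
    for m in re.finditer(split_pattern, text):
        if cuts_unit(m.start(), m.end(), spans):
            continue  # e.g. the decimal point inside "3.14"
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return [p.strip() for p in pieces if p.strip()]

print(safe_split("pi is 3.14, ok"))
```

Here the decimal point is skipped because it lies inside the marked span "3.14", whereas the comma after it is a valid split position.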
The second dividing module 502 is configured to divide the second clause to be adjusted into a plurality of initial third clauses and the first class separators based on the first class separators, and take the plurality of initial third clauses and the first class separators as a plurality of third clauses; and under the condition that a candidate third clause with the length not larger than the preset length threshold value exists in the plurality of third clauses, taking the candidate third clause as the clause to be output.
The second dividing module 502 is configured to, when a third clause to be adjusted, whose length is greater than the preset length threshold value, exists in the plurality of third clauses, perform word segmentation on the third clause to be adjusted to obtain a plurality of phrases; and combine the phrases to obtain a plurality of clauses to be output, each smaller than the preset length threshold value.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, ROM 702, and RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, the text segmentation method. For example, in some embodiments, the text segmentation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the text segmentation method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the text segmentation method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed technical solutions are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (8)

1. A text segmentation method, comprising:
dividing the text to be processed based on punctuation marks to obtain L first clauses; L is an integer of 1 or more;
determining M to-be-output clauses based on the L first clauses, and taking the M to-be-output clauses as a segmentation result of the to-be-processed text; M is an integer greater than or equal to 1;
wherein the determining M to-be-output clauses based on the L first clauses includes:
processing the ith first clause based on a matching rule under the condition that the length of the ith first clause in the L first clauses is larger than a preset length threshold value to obtain the clause to be output; i is an integer of 1 or more and L or less;
the processing the ith first clause based on the matching rule to obtain the clause to be output comprises the following steps:
marking the text to be processed based on a unit text of a preset type to obtain a marked target unit text in the text to be processed;
determining a matched target text from the ith first clause based on a text matching rule, and dividing the ith first clause into a plurality of second clauses based on the target text under the condition that the target unit text is not segmented in the target text;
determining a first class separator from a second clause to be adjusted based on a separator matching rule under the condition that the second clause to be adjusted, whose length is greater than the preset length threshold value, exists in the plurality of second clauses, dividing the second clause to be adjusted into a plurality of initial third clauses and the first class separator based on the first class separator, and taking the plurality of initial third clauses and the first class separator as a plurality of third clauses;
under the condition that a candidate third clause whose length is not greater than the preset length threshold value exists in the plurality of third clauses, taking the candidate third clause as the clause to be output;
under the condition that a third clause to be adjusted, whose length is greater than the preset length threshold value, exists in the plurality of third clauses, performing word segmentation on the third clause to be adjusted to obtain a plurality of phrases; and combining the phrases to obtain a plurality of clauses to be output, each smaller than the preset length threshold value.
2. The method of claim 1, wherein the dividing the text to be processed based on punctuation to obtain L first clauses comprises:
dividing the text to be processed based on the first punctuation mark to obtain K paragraphs; K is an integer greater than or equal to 1;
processing the second punctuation mark in the K paragraphs to obtain processed K paragraphs;
dividing the processed K paragraphs based on a third category punctuation mark to obtain L first clauses.
3. The method of claim 1, wherein the dividing the i-th first clause into a plurality of second clauses based on the target text comprises:
dividing the ith first clause into a plurality of initial second clauses and a second class separator in a case that the second class separator is at a neighboring position behind the target text; and adding the second class separator to one of the plurality of initial second clauses to obtain the plurality of second clauses.
4. A text segmentation apparatus comprising:
the first dividing module is used for dividing the text to be processed based on punctuation marks to obtain L first clauses; L is an integer of 1 or more;
The second dividing module is used for determining M to-be-output clauses based on the L first clauses, and taking the M to-be-output clauses as a segmentation result of the to-be-processed text;
the second dividing module is configured to process, based on a matching rule, the ith first clause to obtain the clause to be output when the length of the ith first clause in the L first clauses is greater than a preset length threshold value; i is an integer of 1 or more and L or less;
the second dividing module is further configured to process the ith first clause based on a matching rule to obtain the clause to be output by executing the following steps:
marking the text to be processed based on a unit text of a preset type to obtain a marked target unit text in the text to be processed;
determining a matched target text from the ith first clause based on a text matching rule, and dividing the ith first clause into a plurality of second clauses based on the target text under the condition that the target unit text is not segmented in the target text;
determining a first class separator from a second clause to be adjusted based on a separator matching rule under the condition that the second clause to be adjusted, whose length is greater than the preset length threshold value, exists in the plurality of second clauses, dividing the second clause to be adjusted into a plurality of initial third clauses and the first class separator based on the first class separator, and taking the plurality of initial third clauses and the first class separator as a plurality of third clauses;
under the condition that a candidate third clause whose length is not greater than the preset length threshold value exists in the plurality of third clauses, taking the candidate third clause as the clause to be output;
under the condition that a third clause to be adjusted, whose length is greater than the preset length threshold value, exists in the plurality of third clauses, performing word segmentation on the third clause to be adjusted to obtain a plurality of phrases; and combining the phrases to obtain a plurality of clauses to be output, each smaller than the preset length threshold value.
5. The apparatus of claim 4, wherein the first dividing module is configured to divide the text to be processed based on a first punctuation mark to obtain K paragraphs, K being an integer greater than or equal to 1; process the second punctuation mark in the K paragraphs to obtain K processed paragraphs; and divide the K processed paragraphs based on a third category punctuation mark to obtain L first clauses.
6. The apparatus of claim 4, wherein the second dividing module is configured to divide the i-th first clause into a plurality of initial second clauses and the second class separator if the second class separator is at a neighboring position after the target text; and adding the second class separator to one of the plurality of initial second clauses to obtain the plurality of second clauses.
7. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions for execution by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-3.
8. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-3.
CN202110164368.1A 2021-02-05 2021-02-05 Text segmentation method, device, electronic equipment and storage medium Active CN112861513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110164368.1A CN112861513B (en) 2021-02-05 2021-02-05 Text segmentation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110164368.1A CN112861513B (en) 2021-02-05 2021-02-05 Text segmentation method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112861513A CN112861513A (en) 2021-05-28
CN112861513B true CN112861513B (en) 2024-02-06

Family

ID=75989671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110164368.1A Active CN112861513B (en) 2021-02-05 2021-02-05 Text segmentation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112861513B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02103662A (en) * 1988-10-12 1990-04-16 Ricoh Co Ltd Sentence dividing system
CN106126496A (en) * 2016-06-17 2016-11-16 联动优势科技有限公司 A kind of information segmenting method and device
CN106802886A (en) * 2016-12-30 2017-06-06 语联网(武汉)信息技术有限公司 A kind of cutting word method of multi-lingual text
CN107992475A (en) * 2017-11-27 2018-05-04 武汉中海庭数据技术有限公司 A kind of multilingual segmenting method and device based on automatic navigator full-text search
CN108874780A (en) * 2018-06-27 2018-11-23 清远墨墨教育科技有限公司 A kind of segmentation methods system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7774192B2 (en) * 2005-01-03 2010-08-10 Industrial Technology Research Institute Method for extracting translations from translated texts using punctuation-based sub-sentential alignment
CN101221558A (en) * 2008-01-22 2008-07-16 安徽科大讯飞信息科技股份有限公司 Method for automatically extracting sentence template
CN110046348B (en) * 2019-03-19 2021-05-25 西安理工大学 Method for recognizing main body in subway design specification based on rules and dictionaries

Also Published As

Publication number Publication date
CN112861513A (en) 2021-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant