CN114662487A - Text segmentation method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN114662487A
Authority
CN
China
Prior art keywords
word
target
word frequency
clause
paragraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011540880.3A
Other languages
Chinese (zh)
Inventor
付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Guoshuang Software Co ltd
Original Assignee
Suzhou Guoshuang Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Guoshuang Software Co ltd
Priority to CN202011540880.3A
Publication of CN114662487A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The text segmentation method, text segmentation device, electronic device, and readable storage medium determine, for any target clause in a text to be segmented, the word frequency vectors corresponding to the target clause based on a first tuple set and a second tuple set constructed in advance. Whether the target clause is a paragraph ending sentence is then determined from the word frequency vectors, and when it is, the text to be segmented is divided into paragraphs based on the target clause. Compared with segmentation based on a deep learning algorithm, the method requires no model training, is simpler, places relatively lower demands on computer performance, and is easier to implement.

Description

Text segmentation method and device, electronic equipment and readable storage medium
Technical Field
The embodiment of the invention relates to the technical field of text processing, in particular to a text segmentation method and device, electronic equipment and a readable storage medium.
Background
At present, many service scenarios require sorting or analyzing OCR text content. OCR text content has no explicit paragraph division marks and is often stored or transmitted as text blocks, which makes subsequent processing (such as parsing and paragraph-by-paragraph display) difficult to carry out or unlikely to achieve the expected effect.
Existing text segmentation schemes generally use a deep learning algorithm (for example, LSTM): a segmentation discrimination model is obtained by supervised training on text, and the text is then segmented based on that model.
However, a model obtained by a deep learning algorithm behaves largely as a black box, and its prediction results are difficult to interpret. In addition, training the model involves optimizing a large number of parameters, which is time-consuming and places higher demands on computer performance.
The above description of the discovery process of the problems is only for the purpose of assisting understanding of the technical solutions of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
In order to solve the technical problems or at least partially solve the technical problems, embodiments of the present invention provide a text segmentation method, an apparatus, an electronic device, and a readable storage medium.
In view of this, in a first aspect, an embodiment of the present invention provides a text segmentation method, including:
for any target clause in the text to be segmented, determining a word frequency vector corresponding to the target clause based on a first tuple set and a second tuple set which are constructed in advance;
determining whether the target clause is a paragraph ending sentence according to the word frequency vector corresponding to the target clause;
and when the target clause is determined to be a paragraph ending sentence, performing paragraph division on the text to be segmented based on the target clause.
As a possible implementation manner, the first tuple set and the second tuple set are constructed in the following manner:
obtaining a corpus, wherein the corpus is a text with paragraph marks and sentence division marks;
performing paragraph division on the corpus according to paragraph marks to obtain a plurality of paragraphs;
performing sentence division on each paragraph according to the sentence division marks to obtain a plurality of clauses;
dividing the plurality of clauses into a first sentence set and a second sentence set, wherein the first sentence set is composed of paragraph ending sentences in the plurality of clauses, and the second sentence set is composed of non-paragraph ending sentences in the plurality of clauses;
determining a first average word frequency of each word in the first sentence set and a second average word frequency of each word in the second sentence set;
taking each word in the first sentence set and the first average word frequency corresponding to each word as a first tuple to form a first tuple set;
and taking each word in the second sentence set and the second average word frequency corresponding to each word as a second tuple to form a second tuple set.
As a possible implementation manner, the word frequency vector includes a target word frequency vector, a first word frequency vector, and a second word frequency vector;
the determining the word frequency vector corresponding to the target clause based on the preset first tuple set and the second tuple set comprises:
performing word division on the target clauses to obtain corresponding target word sets;
determining a target word frequency vector according to the word frequency of each word in the target word set in the target clause;
searching the first tuple set for the first average word frequency corresponding to each word in the target word set, and generating the first word frequency vector corresponding to the target clause according to the first average word frequencies found;
and searching the second tuple set for the second average word frequency corresponding to each word in the target word set, and generating the second word frequency vector corresponding to the target clause according to the second average word frequencies found.
As a possible implementation manner, determining whether the target clause is an end-of-paragraph clause according to the word frequency vector corresponding to the target clause includes:
determining a first similarity of the target word frequency vector and the first word frequency vector;
determining a second similarity of the target word frequency vector and the second word frequency vector;
if the first similarity is greater than the second similarity, determining that the target clause is a paragraph ending clause;
and if the first similarity is smaller than the second similarity, determining that the target clause is not a paragraph ending clause.
As a possible implementation, the first average word frequency of the words in the first sentence set is determined by:
determining a first word frequency of a word in the first sentence set;
determining a first ratio of the first word frequency to the number of clauses contained in the first sentence set;
and taking the first ratio as a first average word frequency corresponding to the word.
As a possible implementation, the second average word frequency of the words in the second sentence set is determined by:
determining a second word frequency of a word in the second sentence set;
determining a second ratio of the second word frequency to the number of clauses contained in the second sentence set;
and taking the second ratio as a second average word frequency corresponding to the word.
As a possible implementation manner, performing word division on the target clause to obtain a corresponding target word set, including:
performing word division on the target clause to obtain a plurality of words;
removing preset words in the plurality of words to obtain a plurality of target words;
and forming the target words into a target word set.
In a second aspect, an embodiment of the present invention further provides a text segmenting apparatus, including:
the word frequency vector determining module is used for determining a word frequency vector corresponding to a target clause based on a first tuple set and a second tuple set which are constructed in advance aiming at any target clause in a text to be segmented;
the final sentence determining module is used for determining whether the target clause is a paragraph final sentence or not according to the word frequency vector corresponding to the target clause;
and the paragraph dividing module is used for carrying out paragraph division on the text to be segmented based on the target clause when the target clause is determined to be the paragraph ending clause.
In a third aspect, an embodiment of the present invention further provides an electronic device, including at least one processor, and at least one memory and a bus that are connected to the processor; the processor and the memory complete mutual communication through a bus; the processor is adapted to invoke program instructions in the memory to perform the steps of the text segmentation method of the first aspect.
In a fourth aspect, the embodiments of the present invention further provide a readable storage medium, which stores computer instructions, where the computer instructions cause a computer to execute the steps of the text segmentation method according to the first aspect.
Compared with the prior art, the text segmentation method provided by the embodiment of the invention determines, for any target clause in the text to be segmented, the word frequency vectors corresponding to the target clause based on the pre-constructed first tuple set and second tuple set, determines from the word frequency vectors whether the target clause is a paragraph ending sentence, and, when it is, performs paragraph division on the text to be segmented based on the target clause. Compared with segmentation based on a deep learning algorithm, the method requires no model training, is simpler, places relatively lower demands on computer performance, and is easier to implement.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive labor.
Fig. 1 is a flowchart of a text segmentation method according to an embodiment of the present invention;
Fig. 2 is a flowchart of another text segmentation method according to an embodiment of the present invention;
Fig. 3 is a flowchart of a tuple set constructing method according to an embodiment of the present invention;
Fig. 4 is a flowchart of a word frequency vector determining method according to an embodiment of the present invention;
Fig. 5 is a block diagram of a text segmentation apparatus according to an embodiment of the present invention;
Fig. 6 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention provides a text segmentation method capable of segmenting an OCR text based on a statistical method, aiming at solving the problem of difficult segmentation of the conventional OCR text.
The text segmentation method provided by the invention is explained in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a text segmentation method according to an embodiment of the present invention, and as shown in fig. 1, the method may include the following steps:
S11. For any target clause in the text to be segmented, determining a word frequency vector corresponding to the target clause based on a first tuple set and a second tuple set which are constructed in advance.
In the text segmenting method provided in this embodiment, paragraphs of the text to be segmented are divided according to the paragraph ending sentence in the text to be segmented. Therefore, when the text to be segmented is segmented, the text to be segmented needs to be subjected to sentence division to obtain a plurality of clauses corresponding to the text to be segmented, and then whether each clause is a paragraph ending sentence is judged respectively.
As an alternative implementation manner, a natural language processing technique may be adopted to divide the text to be segmented into sentences according to preset punctuation marks, where a preset punctuation mark may be a punctuation mark that represents the semantic end of a sentence, for example ".", "?", "!", and the like.
As another alternative implementation, the text to be segmented may be divided into sentences using regular expressions (re).
It should be noted that the two sentence division manners are only exemplary, and besides the two sentence division manners, the sentence division may also be performed in other manners, and the embodiment of the present invention is not particularly limited.
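The regular-expression option can be illustrated with a minimal sketch. The function name split_clauses and the exact set of sentence-ending marks (ASCII and full-width forms) are assumptions made for illustration; the patent does not fix either.

```python
import re

# Assumed set of sentence-ending marks (ASCII and full-width forms).
SENTENCE_END = ".?!。？！"
_END = re.escape(SENTENCE_END)
# One clause = a run of non-terminators followed by any trailing terminators.
_CLAUSE_RE = re.compile(f"[^{_END}]+[{_END}]*")


def split_clauses(text):
    """Split text into clauses, keeping each terminating mark with its clause."""
    return [m.strip() for m in _CLAUSE_RE.findall(text) if m.strip()]


if __name__ == "__main__":
    print(split_clauses("It rained. Did we go? Yes!"))
```

A pattern this simple also splits at decimal points and abbreviations, so real OCR text would likely need a more careful rule.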
In one embodiment, when determining whether the clause is an end-of-paragraph sentence, the clause may be taken as a target clause, and then steps S11-S12 may be performed with respect to the target clause. When segmenting the text, each clause can be used as a target clause to execute corresponding steps, so that the judgment on whether all clauses in the text are paragraph ending clauses is finished.
As an embodiment, the first tuple set is a set of words and first average word frequency of the words constructed according to a plurality of paragraph ending sentences, and the second tuple set is a set of words and second average word frequency of the words constructed according to a plurality of non-paragraph ending sentences.
Based on this, the word frequency vectors determined for the target clause may include: a target word frequency vector determined from the target clause itself, a first word frequency vector determined from the first tuple set, and a second word frequency vector determined from the second tuple set.
S12, determining whether the target clause is a paragraph ending clause or not according to the word frequency vector corresponding to the target clause.
According to the embodiment of the invention, the likelihood that the target clause is a paragraph ending sentence and the likelihood that it is a non-paragraph ending sentence are determined from the word frequency vectors of the target clause, and the two likelihoods are compared to decide whether the target clause is a paragraph ending sentence. Specifically, if the likelihood that the target clause is a paragraph ending sentence is greater than the likelihood that it is a non-paragraph ending sentence, the target clause is determined to be a paragraph ending sentence; if it is smaller, the target clause is determined not to be a paragraph ending sentence.
The case where the two likelihoods are equal does not usually occur; if it does, whether the target clause is a paragraph ending sentence can be determined manually. Of course, modes other than manual determination can also be adopted, and the embodiment of the invention is not particularly limited.
S13. When the target clause is determined to be a paragraph ending sentence, performing paragraph division on the text to be segmented based on the target clause.
As an embodiment, when the target clause is determined to be an end-of-paragraph clause, paragraph division is performed with the end of the target clause as a division point.
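Using the end of each paragraph ending sentence as a division point can be sketched as follows. The function name build_paragraphs, the boolean flag list, and the sep parameter are illustrative assumptions and are not defined in the patent.

```python
def build_paragraphs(clauses, is_end_flags, sep=""):
    """Group clauses into paragraphs, closing a paragraph after each flagged clause.

    clauses      -- clauses in reading order
    is_end_flags -- one boolean per clause; True marks a paragraph ending sentence
    sep          -- string placed between clauses ("" suits Chinese, " " suits English)
    """
    paragraphs, current = [], []
    for clause, is_end in zip(clauses, is_end_flags):
        current.append(clause)
        if is_end:
            # The end of this clause is the division point.
            paragraphs.append(sep.join(current))
            current = []
    if current:
        # Trailing clauses with no detected ending form a final paragraph.
        paragraphs.append(sep.join(current))
    return paragraphs
```

Keeping the trailing clauses as a final paragraph is a design choice of this sketch, so that no text is dropped when the last clause is not flagged.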
The text segmentation method provided by the embodiment of the invention determines, for any target clause in the text to be segmented, the word frequency vectors corresponding to the target clause based on a first tuple set and a second tuple set which are constructed in advance, determines from the word frequency vectors whether the target clause is a paragraph ending sentence, and, when it is, performs paragraph division on the text to be segmented based on the target clause. Compared with segmentation based on a deep learning algorithm, the method requires no model training, is simpler, places relatively lower demands on computer performance, and is easier to implement.
Fig. 2 is a flowchart of another text segmentation method according to an embodiment of the present invention, and as shown in fig. 2, the method may include the following steps:
S21. Constructing a first tuple set and a second tuple set.
As an example, as shown in Fig. 3, constructing the first tuple set and the second tuple set may comprise the following steps:
S31. Obtaining the corpus.
In this embodiment, the corpus is a text with paragraph marks (e.g., first line indentation, etc.) and sentence division marks (e.g., punctuation marks representing the semantic end of a sentence).
As an alternative implementation, the corpus may be obtained by crawling from the network using web crawler technology.
As another possible implementation manner, the corpus can be obtained by means of input of a user or an external device, and the external device can perform input in a wired or wireless manner.
S32. Performing paragraph division on the corpus according to the paragraph marks to obtain a plurality of paragraphs.
S33. Performing sentence division on each paragraph according to the sentence division marks to obtain a plurality of clauses.
In one embodiment, sentence division is performed in units of complete sentences, that is, each clause ends with a punctuation mark that marks the semantic end of a sentence, such as ".", "?", or "!".
S34, dividing the multiple clauses into a first sentence set and a second sentence set, wherein the first sentence set is composed of paragraph ending sentences in the multiple clauses, and the second sentence set is composed of non-paragraph ending sentences in the multiple clauses.
As an embodiment, when performing sentence division in step S33, a paragraph ending sentence identifier (e.g., is_end = true) may be added to each paragraph ending sentence, and a non-paragraph ending sentence identifier (e.g., is_end = false) may be added to each non-paragraph ending sentence. The clauses carrying the paragraph ending sentence identifier are then grouped into the first sentence set, and the clauses carrying the non-paragraph ending sentence identifier are grouped into the second sentence set.
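Steps S32 to S34 can be sketched as below. This is an illustration under stated assumptions: paragraphs in the corpus are taken to be separated by blank lines (the patent mentions first-line indentation as one possible paragraph mark), the sentence-ending marks are an assumed set, and the function name split_corpus is hypothetical.

```python
import re

# Assumed clause pattern: non-terminators followed by trailing terminators.
_CLAUSE_RE = re.compile(r"[^.?!。？！]+[.?!。？！]*")


def split_corpus(corpus):
    """Return (first_sentence_set, second_sentence_set) from a marked corpus.

    The last clause of every paragraph is treated as is_end = true and goes
    to the first set; every other clause is is_end = false and goes to the
    second set.
    """
    end_sentences, other_sentences = [], []
    # Assumption: a blank line is the paragraph mark.
    for paragraph in re.split(r"\n\s*\n", corpus):
        clauses = [c.strip() for c in _CLAUSE_RE.findall(paragraph) if c.strip()]
        for index, clause in enumerate(clauses):
            is_end = index == len(clauses) - 1
            (end_sentences if is_end else other_sentences).append(clause)
    return end_sentences, other_sentences
```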
S35, determining a first average word frequency of each word in the first sentence set and a second average word frequency of each word in the second sentence set.
In this embodiment, the first average word frequency of a word is used to represent the probability of the word appearing in the end-of-paragraph sentence, and the second average word frequency of the word is used to represent the probability of the word appearing in the non-end-of-paragraph sentence.
As an embodiment, when determining the first average word frequency and the second average word frequency, words included in the first sentence set and the second sentence set are determined first. Based on the above, the sentences contained in the first sentence set and the second sentence set are segmented to obtain a first word set corresponding to the first sentence set and a second word set corresponding to the second sentence set, and then a first average word frequency corresponding to each word in the first word set and a second average word frequency corresponding to each word in the second word set are determined respectively.
Optionally, a jieba chinese word segmenter may be used for word segmentation.
As an example, the following may be used to determine the first average word frequency for any word in the first set of words:
determining a first word frequency of a word in the first sentence set, determining a first ratio of the first word frequency to the number of clauses contained in the first sentence set, and taking the first ratio as a first average word frequency corresponding to the word.
And determining the corresponding first average word frequency by respectively adopting the above mode aiming at each word in the first word set, so as to obtain the first average word frequency corresponding to each word in the first word set.
As an example, the following may be employed for determining the second average word frequency for any word in the second set of words:
determining a second word frequency of a word in the second sentence set, determining a second ratio of the second word frequency to the number of clauses contained in the second sentence set, and taking the second ratio as a second average word frequency corresponding to the word.
And determining corresponding second average word frequency by respectively adopting the above mode for each word in the second word set, so as to obtain the second average word frequency corresponding to each word in the second word set.
In this embodiment, since the number of clauses contained in the first sentence set differs from the number of clauses contained in the second sentence set, using the average word frequency removes the effect of set size and normalizes the two sets to a comparable scale.
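The normalization can be written compactly. The symbols below are introduced here for illustration and do not appear in the original: let $c_k(w)$ be the number of occurrences of word $w$ in sentence set $k$ ($k = 1$ for paragraph ending sentences, $k = 2$ for non-paragraph ending sentences), and let $N_k$ be the number of clauses in that set. Then

$$\bar{f}_k(w) = \frac{c_k(w)}{N_k}, \qquad k \in \{1, 2\}$$

For instance, with hypothetical counts, a word that occurs 30 times in 100 paragraph ending sentences and 90 times in 900 non-paragraph ending sentences has $\bar{f}_1(w) = 0.3$ and $\bar{f}_2(w) = 0.1$, even though its raw count is larger in the second set.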
S36. Taking each word in the first sentence set and the first average word frequency corresponding to the word as a first tuple, to form the first tuple set.
As an embodiment, a first tuple is generated for each word in the format (word, first average word frequency corresponding to the word), with one word corresponding to one first tuple. For example, if the first sentence set contains "us" and the first average word frequency corresponding to "us" is 2, the first tuple corresponding to "us" is (us, 2). The first tuples corresponding to all the words in the first sentence set jointly form the first tuple set.
S37. Taking each word in the second sentence set and the second average word frequency corresponding to the word as a second tuple, to form the second tuple set.
Similar to the first tuple set, a second tuple is generated for each word in the format (word, second average word frequency corresponding to the word), with one word corresponding to one second tuple, and the second tuples corresponding to all the words in the second sentence set jointly form the second tuple set.
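Steps S35 to S37 can be sketched for one sentence set as follows. The sketch assumes the sentences have already been segmented into word lists (for example by a word segmenter such as jieba, as mentioned above); the function name build_tuple_set is hypothetical.

```python
from collections import Counter


def build_tuple_set(tokenized_sentences):
    """Return a list of (word, average word frequency) tuples for one sentence set.

    tokenized_sentences -- list of sentences, each already split into words.
    The average word frequency of a word is its total count in the set
    divided by the number of clauses in the set.
    """
    clause_count = len(tokenized_sentences)
    if clause_count == 0:
        return []
    counts = Counter()
    for words in tokenized_sentences:
        counts.update(words)
    return [(word, count / clause_count) for word, count in counts.items()]
```

Calling the function once on the paragraph ending sentences and once on the remaining sentences yields the two tuple sets; converting a result with dict(...) gives direct lookup by word.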
S22, aiming at any target clause in the text to be segmented, determining a target word frequency vector, a first word frequency vector and a second word frequency vector corresponding to the target clause based on the first tuple set and the second tuple set.
The following describes, with reference to fig. 4, determining a target word frequency vector, a first word frequency vector, and a second word frequency vector corresponding to the target clause based on the first tuple set and the second tuple set.
As shown in fig. 4, the following steps may be included:
S41. Performing word division on the target clause to obtain a corresponding target word set.
As an alternative implementation, all words obtained by word division may be combined into the target word set.
As another optional implementation manner, word division may be performed on the target clause to obtain multiple words, preset words contained in the multiple words are removed, the remaining words are used as target words, and the target words are then combined into the target word set. The preset words can be stop words (such as 'ones', etc.); filtering out the stop words makes the final paragraph ending sentence judgment more accurate.
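The stop-word variant can be sketched as below. The stop-word list and the function name target_word_set are placeholders chosen for illustration; the patent does not specify a list.

```python
# Hypothetical stop-word list; a real system would load a curated one.
STOP_WORDS = frozenset({"the", "a", "of"})


def target_word_set(words, stop_words=STOP_WORDS):
    """Return the distinct non-stop words of a clause in order of first appearance."""
    kept = (word for word in words if word not in stop_words)
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(kept))
```

Preserving the order of first appearance matters because the word frequency vectors are later generated in the order of the words in the target word set.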
S42. Determining a target word frequency vector according to the word frequency of each word in the target word set in the target clause.
As an embodiment, the target word frequency vector is generated from the word frequency of each word in the target clause, following the order of the words in the target word set. For example, if the target clause (rendered in English) is "light-poise is a emotion, all along reason, all along heart", the corresponding target word set is (light-poise, is, a, emotion, all, along reason, along heart), where the word frequency of "light-poise" is 1, of "is" is 1, of "a" is 1, of "emotion" is 1, of "all" is 2, of "along reason" is 1, and of "along heart" is 1, so the generated target word frequency vector is (1, 1, 1, 1, 2, 1, 1).
S43. Searching the first tuple set for the first average word frequency corresponding to each word in the target word set, and generating the first word frequency vector corresponding to the target clause from the first average word frequencies found.
In the first tuple set, each word and its first average word frequency are stored together as a tuple, so when the word is known, the corresponding first average word frequency can easily be determined by lookup.
After the first average word frequency of each word in the target word set is determined, the first average word frequencies are arranged in the order of the words in the target word set to generate the first word frequency vector.
S44. Searching the second tuple set for the second average word frequency corresponding to each word in the target word set, and generating the second word frequency vector corresponding to the target clause from the second average word frequencies found.
As an embodiment, similarly to step S43, the second average word frequency corresponding to each word in the target word set is determined by lookup in the second tuple set, and the second average word frequencies found are arranged in the order of the words in the target word set to generate the second word frequency vector.
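Steps S42 to S44 can be sketched together. Two points are assumptions of this sketch rather than statements of the patent: the function name word_frequency_vectors, and the use of 0.0 for a word that is absent from a tuple set (the patent does not say how missing words are handled).

```python
from collections import Counter


def word_frequency_vectors(words, first_tuples, second_tuples):
    """Return (target_vector, first_vector, second_vector) for one target clause.

    words         -- words of the target clause, in order
    first_tuples  -- iterable of (word, first average word frequency)
    second_tuples -- iterable of (word, second average word frequency)
    """
    first_lookup = dict(first_tuples)
    second_lookup = dict(second_tuples)
    counts = Counter(words)
    # Counter keeps insertion order, so this is the order of first appearance.
    word_order = list(counts)
    target_vector = [counts[word] for word in word_order]
    # Assumption: a word missing from a tuple set contributes 0.0.
    first_vector = [first_lookup.get(word, 0.0) for word in word_order]
    second_vector = [second_lookup.get(word, 0.0) for word in word_order]
    return target_vector, first_vector, second_vector
```

All three vectors share the same word order and length, which is what makes an element-wise similarity between them meaningful.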
S23, determining the first similarity of the target word frequency vector and the first word frequency vector.
S24. Determining a second similarity of the target word frequency vector and the second word frequency vector.
In this embodiment, S23 and S24 have no specific sequence, and S23 may be executed first, or S24 may be executed first, or S23 and S24 are executed simultaneously.
As an embodiment, the first similarity and the second similarity may be calculated using a cosine similarity calculation method.
Specifically, the cosine similarity calculation formula is as follows:
$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where A and B are the two vectors whose similarity is calculated. When the first similarity is calculated, A represents the target word frequency vector, B represents the first word frequency vector, $A_i$ represents the i-th element of the target word frequency vector, $B_i$ represents the i-th element of the first word frequency vector, and n represents the number of elements in the target/first word frequency vector. When the second similarity is calculated, A represents the target word frequency vector, B represents the second word frequency vector, $A_i$ represents the i-th element of the target word frequency vector, $B_i$ represents the i-th element of the second word frequency vector, and n represents the number of elements in the target/second word frequency vector.
Besides the cosine similarity, other similarity calculation methods may be used to calculate the first similarity and the second similarity, which are not listed here.
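A minimal sketch of the cosine similarity calculation is given below. Returning 0.0 when either vector is all zeros is an assumption of this sketch; the patent does not address that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an all-zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)
```

The first similarity would be obtained by passing the target and first word frequency vectors, and the second similarity by passing the target and second word frequency vectors.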
S25. Comparing the first similarity with the second similarity; if the first similarity is greater than the second similarity, executing S26, and if the first similarity is smaller than the second similarity, executing S28.
As an embodiment, the greater the first similarity, the more likely the target clause is a paragraph ending sentence; likewise, the greater the second similarity, the more likely the target clause is a non-paragraph ending sentence. Therefore, if the first similarity is greater than the second similarity, the target clause is more likely a paragraph ending sentence and S26 is executed; if the first similarity is smaller than the second similarity, the target clause is more likely a non-paragraph ending sentence and S28 is executed.
S26. determine that the target clause is an end-of-paragraph clause, and perform S27.
And S27, paragraph division is carried out on the text to be segmented based on the target clause.
As one embodiment, paragraph division is performed based on taking the end of the target clause as a paragraph division point when the target clause is an end-of-paragraph sentence.
S28, determining that the target clause is not an end-of-paragraph clause.
After determining whether the current target clause is the paragraph ending clause, the process is carried out on other target clauses until all target clauses in the text to be segmented determine whether the target clauses are the paragraph ending clauses.
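The comparison and division steps above can be sketched in Python as follows. This is a minimal sketch, not the applicant's implementation: the function name divide_paragraphs and its input format (one pair of similarities per clause) are hypothetical. Two behaviours are assumptions because the description does not cover them: equal similarities are treated as "not an end-of-paragraph clause", and any clauses left after the last division point form a final paragraph.

```python
from typing import List, Sequence, Tuple


def divide_paragraphs(
    clauses: Sequence[str],
    similarities: Sequence[Tuple[float, float]],
) -> List[List[str]]:
    """Group clauses into paragraphs using (first, second) similarity pairs."""
    if len(clauses) != len(similarities):
        raise ValueError("one similarity pair is required per clause")
    paragraphs: List[List[str]] = []
    current: List[str] = []
    for clause, (first_sim, second_sim) in zip(clauses, similarities):
        current.append(clause)
        # Comparison step: the clause ends a paragraph only when the first
        # similarity is strictly greater (ties are an assumption, see above).
        if first_sim > second_sim:
            # Division step: the end of this clause is the division point.
            paragraphs.append(current)
            current = []
    if current:
        # Assumption: trailing clauses form the last paragraph.
        paragraphs.append(current)
    return paragraphs


# Hypothetical clauses and similarity pairs.
sample_clauses = ["c1", "c2", "c3", "c4"]
sample_similarities = [(0.2, 0.7), (0.9, 0.4), (0.1, 0.3), (0.5, 0.5)]
print(divide_paragraphs(sample_clauses, sample_similarities))
```

With these sample values only the second clause is judged to end a paragraph, so the text is divided after it.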
Another embodiment of the present invention further provides a text segmentation apparatus. As shown in FIG. 5, the apparatus may include a word frequency vector determining module 501, an end sentence determining module 502 and a paragraph dividing module 503.
The word frequency vector determining module 501 is configured to determine, for any target clause in the text to be segmented, a word frequency vector corresponding to the target clause based on a first tuple set and a second tuple set that are constructed in advance.
The end sentence determining module 502 is configured to determine whether the target clause is an end-of-paragraph clause according to the word frequency vector corresponding to the target clause.
The paragraph dividing module 503 is configured to perform paragraph division on the text to be segmented based on the target clause when the target clause is determined to be an end-of-paragraph clause.
As an embodiment, the apparatus may further include a tuple set building module (not shown in the figure), which is specifically configured to perform the following:
obtaining a corpus, wherein the corpus is a text with paragraph marks and sentence dividing marks;
performing paragraph division on the corpus according to paragraph marks to obtain a plurality of paragraphs;
performing sentence division on each paragraph according to the sentence division marks to obtain a plurality of clauses;
dividing the plurality of clauses into a first sentence set and a second sentence set, wherein the first sentence set is composed of paragraph ending sentences in the plurality of clauses, and the second sentence set is composed of non-paragraph ending sentences in the plurality of clauses;
determining a first average word frequency of each word in the first sentence set and a second average word frequency of each word in the second sentence set;
taking each word in the first sentence set, together with the first average word frequency corresponding to that word, as a first tuple to form a first tuple set;
and taking each word in the second sentence set, together with the second average word frequency corresponding to that word, as a second tuple to form a second tuple set.
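The corpus-splitting part of the steps above might look like the following Python sketch. The function name split_corpus is hypothetical, and the concrete marks are assumptions: a newline is used as the paragraph mark and ".", "!" and "?" as sentence division marks, whereas the description only requires that the corpus carry such marks.

```python
import re
from typing import List, Tuple


def split_corpus(
    corpus: str,
    paragraph_mark: str = "\n",   # assumed paragraph mark
    sentence_marks: str = ".!?",  # assumed sentence division marks
) -> Tuple[List[str], List[str]]:
    """Return (first sentence set, second sentence set) for a marked corpus."""
    first_sentence_set: List[str] = []   # end-of-paragraph clauses
    second_sentence_set: List[str] = []  # all other clauses
    pattern = "[" + re.escape(sentence_marks) + "]"
    for paragraph in corpus.split(paragraph_mark):
        clauses = [c.strip() for c in re.split(pattern, paragraph) if c.strip()]
        if not clauses:
            continue  # skip empty paragraphs
        second_sentence_set.extend(clauses[:-1])
        first_sentence_set.append(clauses[-1])
    return first_sentence_set, second_sentence_set


# Hypothetical two-paragraph corpus.
sample_corpus = (
    "It rained. We stayed in. Thanks for reading.\n"
    "Next day was sunny. See you soon."
)
first_set, second_set = split_corpus(sample_corpus)
print(first_set)
print(second_set)
```

The last clause of each paragraph goes to the first sentence set and every other clause to the second sentence set; the average word frequencies are then computed per set.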
As an example, the first average word frequency of words in the first sentence set is determined in the following manner:
determining a first word frequency of a word in the first sentence set;
determining a first ratio of the first word frequency to the number of clauses contained in the first sentence set;
and taking the first ratio as a first average word frequency corresponding to the word.
As an example, the second average word frequency of words in the second sentence set is determined in the following manner:
determining a second word frequency of a word in the second sentence set;
determining a second ratio of the second word frequency to the number of clauses contained in the second sentence set;
and taking the second ratio as a second average word frequency corresponding to the word.
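The average word frequency calculation described above can be sketched as follows. The name build_tuple_set is hypothetical, each clause is assumed to be already divided into words, and "word frequency" is read here as the total number of occurrences of the word across the sentence set; the same function serves for both the first and the second sentence set.

```python
from collections import Counter
from typing import Dict, List


def build_tuple_set(sentence_set: List[List[str]]) -> Dict[str, float]:
    """Map each word to (occurrences in the set) / (number of clauses in the set)."""
    if not sentence_set:
        return {}
    # Word frequency: occurrences of each word over all clauses in the set.
    counts = Counter(word for clause in sentence_set for word in clause)
    clause_count = len(sentence_set)
    # Each (word, average word frequency) item corresponds to one tuple.
    return {word: freq / clause_count for word, freq in counts.items()}


# Hypothetical first sentence set containing two tokenized clauses.
sample_first_set = [["thanks", "for", "reading"], ["thanks", "again"]]
print(build_tuple_set(sample_first_set))
```

Here "thanks" occurs twice in two clauses, giving an average word frequency of 1.0, while each of the other words occurs once, giving 0.5.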
As an embodiment, the word frequency vector includes a target word frequency vector, a first word frequency vector, and a second word frequency vector.
The word frequency vector determining module 501 is specifically configured to:
performing word division on the target clause to obtain a corresponding target word set; determining a target word frequency vector according to the word frequency, in the target clause, of each word in the target word set; searching the first tuple set for the first average word frequency corresponding to each word in the target word set, and generating a first word frequency vector corresponding to the target clause according to the first average word frequencies found; and searching the second tuple set for the second average word frequency corresponding to each word in the target word set, and generating a second word frequency vector corresponding to the target clause according to the second average word frequencies found.
As an embodiment, the end sentence determining module 502 is specifically configured to:
determining a first similarity between the target word frequency vector and the first word frequency vector; determining a second similarity between the target word frequency vector and the second word frequency vector; if the first similarity is greater than the second similarity, determining that the target clause is an end-of-paragraph clause; and if the first similarity is less than the second similarity, determining that the target clause is not an end-of-paragraph clause.
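A minimal Python sketch of this vector construction follows. The function name word_frequency_vectors is hypothetical. Two points are assumptions rather than statements of the application: the vector elements are ordered by sorting the distinct target words, and a word missing from a tuple set contributes 0.0.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def word_frequency_vectors(
    target_words: Sequence[str],
    first_tuple_set: Dict[str, float],
    second_tuple_set: Dict[str, float],
) -> Tuple[List[float], List[float], List[float]]:
    """Return (target, first, second) word frequency vectors for one clause."""
    counts = Counter(target_words)
    # Assumption: a fixed (sorted) word order keeps the three vectors aligned.
    vocabulary = sorted(counts)
    target_vector = [float(counts[word]) for word in vocabulary]
    # Assumption: words absent from a tuple set are given frequency 0.0.
    first_vector = [first_tuple_set.get(word, 0.0) for word in vocabulary]
    second_vector = [second_tuple_set.get(word, 0.0) for word in vocabulary]
    return target_vector, first_vector, second_vector


# Hypothetical clause words and tuple sets.
sample_words = ["thanks", "for", "reading", "thanks"]
sample_first = {"thanks": 1.0, "for": 0.5}
sample_second = {"for": 0.8, "reading": 0.1}
print(word_frequency_vectors(sample_words, sample_first, sample_second))
```

All three vectors have one element per distinct target word, which is what allows them to be compared element by element.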
As an embodiment, performing word division on the target clause to obtain a corresponding target word set includes:
performing word division on the target clause to obtain a plurality of words;
removing preset words in the plurality of words to obtain a plurality of target words;
and forming the target words into a target word set.
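The word division steps above might be sketched in Python as follows. This is only an illustration under stated assumptions: a regular expression stands in for a real word segmenter (which Chinese text would need), the PRESET_WORDS list is a hypothetical example of the preset words to remove, and removing repeated words reflects one reading of "set".

```python
import re
from typing import FrozenSet, List

# Hypothetical preset words to remove (for example, common function words).
PRESET_WORDS: FrozenSet[str] = frozenset({"the", "a", "of", "and"})


def target_word_set(
    clause: str, preset_words: FrozenSet[str] = PRESET_WORDS
) -> List[str]:
    """Divide a clause into words, drop preset words, keep each word once."""
    # Assumption: lower-cased runs of word characters stand in for the
    # output of a proper word segmenter.
    words = re.findall(r"\w+", clause.lower())
    seen = set()
    result: List[str] = []
    for word in words:
        if word in preset_words or word in seen:
            continue
        seen.add(word)
        result.append(word)  # first-occurrence order is preserved
    return result


print(target_word_set("Thanks to all of the readers and thanks again"))
```

The word frequency of each retained word is still counted over the original clause, so this step only decides which words become elements of the vectors.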
The text segmentation apparatus comprises a processor and a memory. The word frequency vector determining module 501, the end sentence determining module 502, the paragraph dividing module 503 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, the end-of-paragraph clauses in the text to be segmented are determined, and the text to be segmented is divided into paragraphs according to those clauses.
An embodiment of the present invention provides a storage medium on which a program is stored, which, when executed by a processor, implements the text segmentation method.
The embodiment of the invention provides a processor, which is used for running a program, wherein the text segmentation method is executed when the program runs.
An embodiment of the present invention provides a device 60. As shown in FIG. 6, the device 60 includes at least one processor 601, and at least one memory 602 and a bus 603 connected to the processor 601; the processor 601 and the memory 602 communicate with each other through the bus 603; the processor 601 is configured to call program instructions in the memory 602 to perform the text segmentation method described above. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
for any target clause in the text to be segmented, determining a word frequency vector corresponding to the target clause based on a first tuple set and a second tuple set which are constructed in advance;
determining whether the target clause is a paragraph ending clause or not according to the word frequency vector corresponding to the target clause;
and when the target clause is determined to be a paragraph ending clause, paragraph division is carried out on the text to be segmented based on the target clause.
The first tuple set and the second tuple set are constructed in the following way:
obtaining a corpus, wherein the corpus is a text with paragraph marks and sentence division marks;
performing paragraph division on the corpus according to paragraph marks to obtain a plurality of paragraphs;
performing sentence division on each paragraph according to the sentence division marks to obtain a plurality of clauses;
dividing the plurality of clauses into a first sentence set and a second sentence set, wherein the first sentence set is composed of paragraph ending sentences in the plurality of clauses, and the second sentence set is composed of non-paragraph ending sentences in the plurality of clauses;
determining a first average word frequency of each word in the first sentence set and a second average word frequency of each word in the second sentence set;
taking each word in the first sentence set, together with the first average word frequency corresponding to that word, as a first tuple to form a first tuple set;
and taking each word in the second sentence set, together with the second average word frequency corresponding to that word, as a second tuple to form a second tuple set.
The word frequency vector comprises a target word frequency vector, a first word frequency vector and a second word frequency vector;
the determining the word frequency vector corresponding to the target clause based on the preset first tuple set and the preset second tuple set comprises:
performing word division on the target clause to obtain a corresponding target word set;
determining a target word frequency vector according to the word frequency, in the target clause, of each word in the target word set;
searching the first tuple set for the first average word frequency corresponding to each word in the target word set, and generating a first word frequency vector corresponding to the target clause according to the first average word frequencies found;
and searching the second tuple set for the second average word frequency corresponding to each word in the target word set, and generating a second word frequency vector corresponding to the target clause according to the second average word frequencies found.
Determining whether the target clause is a paragraph ending clause according to the word frequency vector corresponding to the target clause, including:
determining a first similarity of the target word frequency vector and the first word frequency vector;
determining a second similarity of the target word frequency vector and the second word frequency vector;
if the first similarity is greater than the second similarity, determining that the target clause is a paragraph ending clause;
and if the first similarity is smaller than the second similarity, determining that the target clause is not a paragraph ending clause.
The first average word frequency of the words in the first sentence set is determined by the following method:
determining a first word frequency of a word in the first sentence set;
determining a first ratio of the first word frequency to the number of clauses contained in the first sentence set;
and taking the first ratio as a first average word frequency corresponding to the word.
The second average word frequency of the words in the second sentence set is determined in the following manner:
determining a second word frequency of a word in the second sentence set;
determining a second ratio of the second word frequency to the number of clauses contained in the second sentence set;
and taking the second ratio as a second average word frequency corresponding to the word.
Performing word division on the target clause to obtain a corresponding target word set, including:
performing word division on the target clause to obtain a plurality of words;
removing preset words in the plurality of words to obtain a plurality of target words;
and forming the target words into a target word set.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include forms of volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of text segmentation, comprising:
for any target clause in the text to be segmented, determining a word frequency vector corresponding to the target clause based on a first tuple set and a second tuple set which are constructed in advance;
determining whether the target clause is a paragraph ending clause or not according to the word frequency vector corresponding to the target clause;
and when the target clause is determined to be a paragraph ending clause, paragraph division is carried out on the text to be segmented based on the target clause.
2. The method of claim 1, wherein the first tuple set and the second tuple set are constructed by:
obtaining a corpus, wherein the corpus is a text with paragraph marks and sentence division marks;
performing paragraph division on the corpus according to paragraph marks to obtain a plurality of paragraphs;
performing sentence division on each paragraph according to the sentence division marks to obtain a plurality of clauses;
dividing the plurality of clauses into a first sentence set and a second sentence set, wherein the first sentence set is composed of paragraph ending sentences in the plurality of clauses, and the second sentence set is composed of non-paragraph ending sentences in the plurality of clauses;
determining a first average word frequency of each word in the first sentence set and a second average word frequency of each word in the second sentence set;
taking each word in the first sentence set, together with the first average word frequency corresponding to that word, as a first tuple to form a first tuple set;
and taking each word in the second sentence set, together with the second average word frequency corresponding to that word, as a second tuple to form a second tuple set.
3. The method of claim 2, wherein the word frequency vector comprises a target word frequency vector, a first word frequency vector, and a second word frequency vector;
the determining the word frequency vector corresponding to the target clause based on the preset first tuple set and the preset second tuple set comprises:
performing word division on the target clause to obtain a corresponding target word set;
determining a target word frequency vector according to the word frequency of each word in the target word set in the target clause;
searching the first tuple set for the first average word frequency corresponding to each word in the target word set, and generating a first word frequency vector corresponding to the target clause according to the first average word frequencies found;
and searching the second tuple set for the second average word frequency corresponding to each word in the target word set, and generating a second word frequency vector corresponding to the target clause according to the second average word frequencies found.
4. The method of claim 3, wherein determining whether the target clause is an end-of-paragraph clause according to the word frequency vector corresponding to the target clause comprises:
determining a first similarity of the target word frequency vector and the first word frequency vector;
determining a second similarity of the target word frequency vector and the second word frequency vector;
if the first similarity is greater than the second similarity, determining that the target clause is a paragraph ending clause;
and if the first similarity is smaller than the second similarity, determining that the target clause is not a paragraph ending clause.
5. The method of claim 2, wherein the first average word frequency for a word in the first sentence set is determined by:
determining a first word frequency of a word in the first sentence set;
determining a first ratio of the first word frequency to the number of clauses contained in the first sentence set;
and taking the first ratio as a first average word frequency corresponding to the word.
6. The method of claim 2, wherein the second average word frequency for a word in the second sentence set is determined by:
determining a second word frequency of a word in the second sentence set;
determining a second ratio of the second word frequency to the number of clauses contained in the second sentence set;
and taking the second ratio as a second average word frequency corresponding to the word.
7. The method of claim 3, wherein performing word segmentation on the target clause to obtain a corresponding target word set comprises:
performing word division on the target clause to obtain a plurality of words;
removing preset words in the plurality of words to obtain a plurality of target words;
and forming the target words into a target word set.
8. A text segmentation apparatus, comprising:
the word frequency vector determining module is used for determining a word frequency vector corresponding to a target clause based on a first tuple set and a second tuple set which are constructed in advance aiming at any target clause in a text to be segmented;
the final sentence determining module is used for determining whether the target clause is a paragraph final sentence or not according to the word frequency vector corresponding to the target clause;
and the paragraph dividing module is used for carrying out paragraph division on the text to be segmented based on the target clause when the target clause is determined to be the paragraph ending clause.
9. An electronic device, comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to invoke program instructions in the memory to perform the steps of the text segmentation method of any one of claims 1-7.
10. A readable storage medium storing computer instructions for causing a computer to perform the steps of the text segmentation method of any one of claims 1-7.
CN202011540880.3A 2020-12-23 2020-12-23 Text segmentation method and device, electronic equipment and readable storage medium Pending CN114662487A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011540880.3A CN114662487A (en) 2020-12-23 2020-12-23 Text segmentation method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011540880.3A CN114662487A (en) 2020-12-23 2020-12-23 Text segmentation method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114662487A true CN114662487A (en) 2022-06-24

Family

ID=82025099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011540880.3A Pending CN114662487A (en) 2020-12-23 2020-12-23 Text segmentation method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114662487A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115269795A (en) * 2022-07-20 2022-11-01 北京新纽科技有限公司 Segmentation method of electronic medical record

Similar Documents

Publication Publication Date Title
CN109635273B (en) Text keyword extraction method, device, equipment and storage medium
AU2018214675B2 (en) Systems and methods for automatic semantic token tagging
US10878173B2 (en) Object recognition and tagging based on fusion deep learning models
US9460117B2 (en) Image searching
CN112685565A (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN111797210A (en) Information recommendation method, device and equipment based on user portrait and storage medium
EP3926531A1 (en) Method and system for visio-linguistic understanding using contextual language model reasoners
CN110533018B (en) Image classification method and device
EP3872652B1 (en) Method and apparatus for processing video, electronic device, medium and product
JP2018501579A (en) Semantic representation of image content
CN110781687B (en) Same intention statement acquisition method and device
CN111078842A (en) Method, device, server and storage medium for determining query result
CN115544303A (en) Method, apparatus, device and medium for determining label of video
CN109597982B (en) Abstract text recognition method and device
CN114662487A (en) Text segmentation method and device, electronic equipment and readable storage medium
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium
CN117312535A (en) Method, device, equipment and medium for processing problem data based on artificial intelligence
CN112765976A (en) Text similarity calculation method, device and equipment and storage medium
CN116932730A (en) Document question-answering method and related equipment based on multi-way tree and large-scale language model
CN114547257B (en) Class matching method and device, computer equipment and storage medium
CN110826488B (en) Image identification method and device for electronic document and storage equipment
CN113222167A (en) Image processing method and device
CN112580358A (en) Text information extraction method, device, storage medium and equipment
CN111538813B (en) Classification detection method, device, equipment and storage medium
CN111191689B (en) Sample data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination