CN114444475A - Word segmentation method and device based on corpus - Google Patents

Info

Publication number
CN114444475A
Authority
CN
China
Prior art keywords
corpus
word
character combinations
sub
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111333372.2A
Other languages
Chinese (zh)
Inventor
李森和
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGZHOU JIANHE NETWORK TECHNOLOGY CO LTD
Original Assignee
GUANGZHOU JIANHE NETWORK TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGZHOU JIANHE NETWORK TECHNOLOGY CO LTD
Priority to CN202111333372.2A
Publication of CN114444475A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention relates to a corpus-based word segmentation method, which comprises the following steps: acquiring a target corpus; splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks; intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths; counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period; and segmenting the target corpus according to the respective occurrence counts. The invention segments words intelligently, discovers new words faster, and does not generate word segmentation ambiguity.

Description

Word segmentation method and device based on corpus
Technical Field
The invention relates to the field of artificial intelligence, in particular to a corpus-based word segmentation method and device.
Background
Word segmentation is the basis of current text analysis, and almost all applications based on text analysis require it.
Existing word segmentation is based on a word bank, and qualified keywords must be filtered and screened manually and periodically. Segmentation generally relies on a word bank such as mmseg, ik, or coding, and this approach has the following defects: the word bank must be maintained manually, new words are discovered slowly, and word segmentation ambiguity arises easily.
Disclosure of Invention
The present invention aims to solve at least one of the deficiencies of the prior art, and provides a corpus-based word segmentation method and device.
To achieve this purpose, the invention adopts the following technical solution:
specifically, a corpus-based word segmentation method is provided, which comprises the following steps:
acquiring a target corpus;
splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period;
and segmenting the target corpus according to the respective occurrence counts.
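The splitting step listed above can be illustrated with a minimal Python sketch. The punctuation set and the function name `split_sub_corpus` are assumptions made for illustration; the source does not say which punctuation marks are used.

```python
import re

# Assumed punctuation set (common Chinese and ASCII marks); not specified by the source.
_MARKS = "，。！？；：、“”‘’（）《》,.!?;:()'\""
_PUNCT = re.compile("[" + re.escape(_MARKS) + r"\s]+")


def split_sub_corpus(text: str) -> list[str]:
    """Split a target corpus into sub-corpus segments at punctuation marks."""
    # Empty pieces (e.g. from leading or trailing punctuation) are dropped.
    return [seg for seg in _PUNCT.split(text) if seg]


print(split_sub_corpus("abc，def。ghi!"))
```

Each returned segment contains no punctuation, so character combinations taken from it never straddle a punctuation boundary.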
Further, intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths follows this rule:
each sub-corpus segment is assumed to be ABCA'B'C' and is divided into runs of 2 adjacent characters and runs of 3 adjacent characters, yielding 2-character and 3-character combinations with AB, BC, and ABC as the basic structures, where A, B, and C are each any single character.
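The interception rule can be sketched as a sliding window over one segment. This is a minimal illustration; the function name `char_combinations` and the `lengths` parameter are hypothetical.

```python
def char_combinations(segment: str, lengths=(2, 3)) -> list[str]:
    """Return every run of adjacent characters of the requested lengths."""
    combos = []
    for n in lengths:
        # Slide a window of width n across the segment.
        for i in range(len(segment) - n + 1):
            combos.append(segment[i:i + n])
    return combos


print(char_combinations("ABCD"))
```

For the segment "ABCD" this yields the 2-character combinations AB, BC, CD followed by the 3-character combinations ABC, BCD.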
Further, the user comments in the preset time period are the comments of all users within half a year.
Further, segmenting the target corpus according to the respective occurrence counts includes analyzing each basic structure to obtain the number n0 of occurrences of the ABC character combination in the preset time period, the number n1 of occurrences of the AB character combination, and the number n2 of occurrences of the BC character combination;
if n1 minus n2 is greater than a first threshold, AB is determined to be a word;
if n2 minus n1 is greater than the first threshold, A is determined to be a single-character word, and either BC or BCA' is a word;
and if the absolute value of the difference between n1 and n2 is not greater than the first threshold, either ABC or ABCA' is determined to be a word.
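The three-way comparison above can be written as a small function. This is a sketch only: the function name, the returned labels, and the threshold value 1000 used in the usage line are assumptions, since the source leaves the first threshold to the user.

```python
def classify(n1: int, n2: int, first_threshold: int) -> str:
    """Compare the AB count (n1) with the BC count (n2).

    Returns "AB" when AB is judged a word, "A+BC" when A is a single-character
    word and BC (or BCA') is a word, and "ABC" when ABC (or ABCA') is a word.
    """
    if n1 - n2 > first_threshold:
        return "AB"
    if n2 - n1 > first_threshold:
        return "A+BC"
    return "ABC"


print(classify(10530, 4442, 1000))
```

The three branches are mutually exclusive and exhaustive: the last one is reached exactly when the absolute difference does not exceed the first threshold.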
Further, the method also comprises:
when n2 minus n1 is greater than the first threshold, acquiring the count of BC and the count of BCA'; if the difference between the count of BC and the count of BCA' is lower than a second threshold, BCA' is determined to be a word, and otherwise BC is determined to be a word;
and when the absolute value of the difference between n1 and n2 is not greater than the first threshold, acquiring the count of ABC and the count of ABCA'; if the difference between the count of ABC and the count of ABCA' is lower than the second threshold, ABCA' is determined to be a word, and otherwise ABC is determined to be a word.
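The second-threshold check decides whether a candidate should be extended by the following character A'. A minimal sketch, with a hypothetical function name and hypothetical counts in the usage line:

```python
def prefer_longer(count_short: int, count_long: int, second_threshold: int) -> bool:
    """Return True when the longer combination (e.g. BCA') should be the word.

    count_short is the count of the shorter combination (BC or ABC) and
    count_long the count of the same combination followed by A'. A small
    difference means the shorter form is almost always followed by A'.
    """
    return (count_short - count_long) < second_threshold


print(prefer_longer(100, 95, 10))
```

The same function covers both branches of the rule: pass the counts of BC and BCA' in the first branch, and the counts of ABC and ABCA' in the second.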
Further, the method also comprises: while intercepting adjacent characters of each sub-corpus segment according to the preset rule to obtain the character combinations of different lengths, forming the character combinations into a corresponding hash structure, and using that hash structure to quickly count the number of times each character combination occurs in the user comments within the preset time period.
The invention also provides a corpus-based word segmentation device, characterized by comprising the following modules:
a target corpus acquisition module, used for acquiring a target corpus;
a sub-corpus splitting module, used for splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
a character combination acquisition module, used for intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
a counting module, used for counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period;
and a word segmentation module, used for segmenting the target corpus according to the respective occurrence counts.
Further, the device also comprises
a hash structure generation module, used for forming the character combinations into a corresponding hash structure while adjacent characters of each sub-corpus segment are intercepted according to the preset rule, and for quickly counting, according to that hash structure, the number of times each character combination occurs in the user comments within the preset time period.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, carries out the steps of any one of the methods described above.
The beneficial effects of the invention are as follows:
a mathematical model for word segmentation is established for the corpus; adjacent characters of each sub-corpus segment are intercepted according to a preset rule to obtain character combinations of different lengths; the number of times each character combination occurs in user comments within a preset time period is counted; and the target corpus is segmented according to those counts. Words are thereby segmented intelligently, new words are discovered faster, and word segmentation ambiguity is not generated.
Drawings
The foregoing and other features of the present disclosure will become more apparent from the detailed description of the embodiments given in conjunction with the drawings, in which like reference characters designate the same or similar elements. The drawings described below are merely some examples of the present disclosure, and those skilled in the art may derive other drawings from them without inventive effort:
FIG. 1 is a flow chart of the corpus-based word segmentation method according to the present invention;
Detailed Description
The conception, the specific structure and the technical effects produced by the present invention will be clearly and completely described in conjunction with the embodiments and the attached drawings, so as to fully understand the objects, the schemes and the effects of the present invention. It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict. The same reference numbers will be used throughout the drawings to refer to the same or like parts.
Referring to FIG. 1, Embodiment 1 provides a corpus-based word segmentation method, including the following steps:
step 110, acquiring a target corpus;
step 120, splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
step 130, intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
step 140, counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period;
step 150, segmenting the target corpus according to the respective occurrence counts.
As a preferred embodiment of the present invention, intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths follows this rule:
each sub-corpus segment is assumed to be ABCA'B'C' and is divided into runs of 2 adjacent characters and runs of 3 adjacent characters, yielding 2-character and 3-character combinations with AB, BC, and ABC as the basic structures, where A, B, and C are each any single character.
Specifically, the user comments in the preset time period are the comments of all users within half a year.
As a preferred embodiment of the present invention, segmenting the target corpus according to the respective occurrence counts includes:
analyzing each basic structure to obtain the number n0 of occurrences of the ABC character combination in the preset time period, the number n1 of occurrences of the AB character combination, and the number n2 of occurrences of the BC character combination;
if n1 minus n2 is greater than a first threshold, AB is determined to be a word;
if n2 minus n1 is greater than the first threshold, A is determined to be a single-character word, and either BC or BCA' is a word;
and if the absolute value of the difference between n1 and n2 is not greater than the first threshold, either ABC or ABCA' is determined to be a word.
The first threshold is defined by the user; in general, when the counts differ by a factor of 5 or more, the combination may be considered more reasonable than the separation.
In this preferred embodiment, the corpus analyzed is: "Everyone thinks that 'burden reduction' is not only the reduction of students' homework."
This corpus is first split according to punctuation marks.
Establishment of the mathematical model
Any three consecutive characters are represented by the letters A, B, and C; the three characters ABC can then be cut into two 2-character combinations, AB and BC.
The counts obtained from the corpus statistics are recorded as shown in Table 1.
Figure RE-GDA0003535442300000041
TABLE 1
Case one:
n1 is far greater than n2, which indicates that AB is very likely a word.
Case two:
n1 is far smaller than n2, which indicates that A is likely a single-character word; BC is likely, but not certain, to be a word, since it may be part of a later word.
Case three:
if n1 and n2 do not differ much, ABC may be a word or part of a longer word. ABC cannot yet be judged a word by itself, and the following character must be added so that the whole can be analyzed.
Specifically,
Example analysis one: see Table 2
Character combination    Count
大家都 (ABC)    2855
大家 (AB)    10530
家都 (BC)    4442
TABLE 2
In the corpus statistics, "大家" ("everyone", AB) occurs far more often than "家都" (BC). "大家" is therefore more like a word, and "都" ("all") becomes a single-character word. That is,
"大家都" should be segmented as "大家" / "都".
Example analysis two: see Table 3
Character combination    Count
都认为 (ABC)    101
都认 (AB)    271
认为 (BC)    3547
TABLE 3
Similarly, in the corpus statistics "认为" ("think", BC) occurs far more often than "都认" (AB); that is, "认为" is more like a word, and "都" is more like a single-character word. The analysis is the same as in example analysis one, which makes recursive processing by a program convenient.
Result of the example analysis:
"大家都认为" ("everyone all thinks") should be segmented into "大家", "都", and "认为". The word segmentation result is correct.
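The two example analyses can be replayed with a short Python sketch that walks a 3-character window across the text. Several things here are assumptions rather than the source's method: the Chinese strings are taken to be the originals behind the translated entries of Tables 2 and 3, the first threshold of 1000 is hypothetical, a loop stands in for the recursion the text mentions, and the third case simply emits ABC without the extension to the following character.

```python
# Counts taken from Tables 2 and 3; the Chinese keys are assumed originals.
COUNTS = {
    "大家都": 2855, "大家": 10530, "家都": 4442,
    "都认为": 101, "都认": 271, "认为": 3547,
}


def segment(text: str, counts: dict, first_threshold: int) -> list[str]:
    """Greedy left-to-right segmentation using the AB/BC comparison."""
    words = []
    i = 0
    while i < len(text):
        window = text[i:i + 3]
        if len(window) < 3:
            # Fewer than three characters remain: emit them as one unit.
            words.append(window)
            break
        n1 = counts.get(window[:2], 0)   # count of AB
        n2 = counts.get(window[1:], 0)   # count of BC
        if n1 - n2 > first_threshold:
            words.append(window[:2])     # AB is a word
            i += 2
        elif n2 - n1 > first_threshold:
            words.append(window[0])      # A is a single-character word
            i += 1                       # BC is re-examined in the next pass
        else:
            words.append(window)         # simplified: treat ABC as a word
            i += 3
    return words


print(segment("大家都认为", COUNTS, 1000))
```

With these counts the first window splits off "大家", the second splits off "都", and the remaining two characters "认为" are emitted together, matching the result stated above.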
Continuing with the corpus above as the example:
Example analysis three (4 characters): see Table 4
Character combination    Count
不仅仅 (ABC)    387
不仅 (AB)    977
仅仅 (BC)    1302
TABLE 4
In this case the counts of "不仅" (AB) and "仅仅" (BC) do not differ greatly, and neither count is small, so the following character must be introduced and the analysis above repeated. According to the corpus above,
the recombination is "不仅仅" plus the following character; any two adjacent characters are then treated as a whole and reassembled into the ABC mathematical model structure, as shown in Table 5.
Figure RE-GDA0003535442300000061
TABLE 5
Final analysis of the results
From the above analysis,
the AB combination is more suitable than the BC combination; that is, "not only" should be understood as one word. The word segmentation is correct.
As a preferred embodiment of the present invention, the method further comprises:
when n2 minus n1 is greater than the first threshold, acquiring the count of BC and the count of BCA'; if the difference between the count of BC and the count of BCA' is lower than a second threshold, BCA' is determined to be a word, and otherwise BC is determined to be a word;
and when the absolute value of the difference between n1 and n2 is not greater than the first threshold, acquiring the count of ABC and the count of ABCA'; if the difference between the count of ABC and the count of ABCA' is lower than the second threshold, ABCA' is determined to be a word, and otherwise ABC is determined to be a word.
As a preferred embodiment of the present invention, the method further includes: while intercepting adjacent characters of each sub-corpus segment according to the preset rule to obtain the character combinations of different lengths, forming the character combinations into a corresponding hash structure, and using that hash structure to count the character combinations quickly.
In this preferred embodiment,
segmenting the corpus purely by traversal matching would be a very slow process. Building the character combinations of the corpus into a hash structure speeds up the segmentation calculation, so that the analysis speed is comparable to that of word-bank-based segmentation. Combinations of adjacent characters are generated from the corpus. Taking the corpus above again:
"burden reduction is not only the reduction of students' homework"
For this corpus, character combinations of length 2 to 4 are generated. See Table 6.
Figure RE-GDA0003535442300000071
Figure RE-GDA0003535442300000081
TABLE 6
Generating the combinations in this way builds the hash structure for the corpus, and the counts can then be obtained very quickly.
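A hash structure of this kind can be sketched in Python with `collections.Counter`, which is a dictionary subclass backed by a hash table. The function name `build_counts` and its parameters are hypothetical, and the input is assumed to be sub-corpus segments that have already been split at punctuation marks.

```python
from collections import Counter


def build_counts(segments, min_len: int = 2, max_len: int = 4) -> Counter:
    """Count every adjacent-character combination of length min_len..max_len."""
    counts = Counter()
    for seg in segments:
        for n in range(min_len, max_len + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    return counts


counts = build_counts(["abcab"])
print(counts["ab"], counts["abc"], counts["zz"])
```

Building the table takes one pass over the comments; afterwards each lookup such as `counts["ab"]` is an average constant-time hash access, and a combination that never occurred returns 0 rather than raising an error.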
The invention also provides a corpus-based word segmentation device, characterized by comprising the following modules:
a target corpus acquisition module, used for acquiring a target corpus;
a sub-corpus splitting module, used for splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
a character combination acquisition module, used for intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
a counting module, used for counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period;
and a word segmentation module, used for segmenting the target corpus according to the respective occurrence counts.
As a preferred embodiment of the present invention, the device further comprises
a hash structure generation module, used for forming the character combinations into a corresponding hash structure while adjacent characters of each sub-corpus segment are intercepted according to the preset rule, and for quickly counting, according to that hash structure, the number of times each character combination occurs in the user comments within the preset time period.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, carries out the steps of any one of the methods described above.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and which implements the steps of the above-described method embodiments when executed by a processor. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier signals, telecommunications signals, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be suitably increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
While the present invention has been described in considerable detail and with particular reference to a few illustrative embodiments thereof, it is not intended to be limited to any such details or embodiments or any particular embodiments, but it is to be construed as effectively covering the intended scope of the invention by providing a broad, potential interpretation of such claims in view of the prior art with reference to the appended claims. Furthermore, the foregoing describes the invention in terms of embodiments foreseen by the inventor for which an enabling description was available, notwithstanding that insubstantial modifications of the invention, not presently foreseen, may nonetheless represent equivalent modifications thereto.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment, and the present invention shall fall within the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.

Claims (9)

1. A corpus-based word segmentation method, characterized by comprising the following steps:
acquiring a target corpus;
splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period;
and segmenting the target corpus according to the respective occurrence counts.
2. The corpus-based word segmentation method according to claim 1, wherein intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths follows this rule:
each sub-corpus segment is assumed to be ABCA'B'C' and is divided into runs of 2 adjacent characters and runs of 3 adjacent characters, yielding 2-character and 3-character combinations with AB, BC, and ABC as the basic structures, where A, B, and C are each any single character.
3. The corpus-based word segmentation method according to claim 2, wherein the user comments in the preset time period are the comments of all users within half a year.
4. The corpus-based word segmentation method according to claim 2, wherein segmenting the target corpus according to the respective occurrence counts comprises:
analyzing each basic structure to obtain the number n0 of occurrences of the ABC character combination in the preset time period, the number n1 of occurrences of the AB character combination, and the number n2 of occurrences of the BC character combination;
if n1 minus n2 is greater than a first threshold, AB is determined to be a word;
if n2 minus n1 is greater than the first threshold, A is determined to be a single-character word, and either BC or BCA' is a word;
and if the absolute value of the difference between n1 and n2 is not greater than the first threshold, either ABC or ABCA' is determined to be a word.
5. The corpus-based word segmentation method according to claim 4, wherein the method further comprises:
when n2 minus n1 is greater than the first threshold, acquiring the count of BC and the count of BCA'; if the difference between the count of BC and the count of BCA' is lower than a second threshold, BCA' is determined to be a word, and otherwise BC is determined to be a word;
and when the absolute value of the difference between n1 and n2 is not greater than the first threshold, acquiring the count of ABC and the count of ABCA'; if the difference between the count of ABC and the count of ABCA' is lower than the second threshold, ABCA' is determined to be a word, and otherwise ABC is determined to be a word.
6. The corpus-based word segmentation method according to claim 1, wherein the method further comprises: while intercepting adjacent characters of each sub-corpus segment according to the preset rule to obtain the character combinations of different lengths, forming the character combinations into a corresponding hash structure, and using that hash structure to quickly count the number of times each character combination occurs in the user comments within the preset time period.
7. A corpus-based word segmentation device, characterized by comprising the following modules:
a target corpus acquisition module, used for acquiring a target corpus;
a sub-corpus splitting module, used for splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
a character combination acquisition module, used for intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
a counting module, used for counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period;
and a word segmentation module, used for segmenting the target corpus according to the respective occurrence counts.
8. The corpus-based word segmentation device according to claim 7, further comprising
a hash structure generation module, used for forming the character combinations into a corresponding hash structure while adjacent characters of each sub-corpus segment are intercepted according to the preset rule, and for quickly counting, according to that hash structure, the number of times each character combination occurs in the user comments within the preset time period.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202111333372.2A 2021-11-11 2021-11-11 Word segmentation method and device based on corpus Pending CN114444475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111333372.2A CN114444475A (en) 2021-11-11 2021-11-11 Word segmentation method and device based on corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111333372.2A CN114444475A (en) 2021-11-11 2021-11-11 Word segmentation method and device based on corpus

Publications (1)

Publication Number Publication Date
CN114444475A true CN114444475A (en) 2022-05-06

Family

ID=81364001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111333372.2A Pending CN114444475A (en) 2021-11-11 2021-11-11 Word segmentation method and device based on corpus

Country Status (1)

Country Link
CN (1) CN114444475A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115130472A (en) * 2022-08-31 2022-09-30 北京澜舟科技有限公司 Method, system and readable storage medium for segmenting subwords based on BPE

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination