CN114444475A

CN114444475A - Word segmentation method and device based on corpus

Info

Publication number: CN114444475A
Application number: CN202111333372.2A
Authority: CN
Inventors: 李森和
Original assignee: GUANGZHOU JIANHE NETWORK TECHNOLOGY CO LTD
Current assignee: GUANGZHOU JIANHE NETWORK TECHNOLOGY CO LTD
Priority date: 2021-11-11
Filing date: 2021-11-11
Publication date: 2022-05-06

Abstract

The invention relates to a corpus-based word segmentation method, which comprises the following steps: acquiring a target corpus; splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks; intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths; counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period; and segmenting the target corpus according to the respective numbers of occurrences. The invention segments words intelligently, discovers new words faster, and does not produce word segmentation ambiguity.

Description

Word segmentation method and device based on corpus

Technical Field

The invention relates to the field of artificial intelligence, in particular to a corpus-based word segmentation method and device.

Background

Word segmentation is the basis of current text analysis, and almost all applications based on text analysis require it.

Existing word segmentation approaches rely on a word bank (lexicon), in which qualified keywords must be filtered and screened manually at regular intervals. Segmentation is generally performed with lexicon-based tools such as mmseg, ik, and coding. Lexicon-based word segmentation has the following defects: the lexicon must be maintained manually, new words are discovered slowly, and word segmentation ambiguity arises easily.

Disclosure of Invention

The present invention aims to solve at least one of the deficiencies of the prior art, and provides a corpus-based word segmentation method and device.

To achieve this purpose, the invention adopts the following technical scheme:

Specifically, a corpus-based word segmentation method is provided, which comprises the following steps:

acquiring a target corpus;

splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;

intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;

counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period;

and segmenting the target corpus according to the respective numbers of occurrences.
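The punctuation-splitting step listed above can be sketched as follows. This is a minimal illustration only: the source does not say which punctuation marks are used, so the character set in `PUNCTUATION` and the function name `split_into_segments` are assumptions.

```python
import re

# Hypothetical punctuation set (ASCII and common Chinese marks plus whitespace);
# the source does not list the marks it splits on.
PUNCTUATION = re.compile(r"[\s,.;:!?,。;:!?、]+")

def split_into_segments(corpus: str) -> list[str]:
    """Split a corpus at punctuation marks and drop empty pieces."""
    return [segment for segment in PUNCTUATION.split(corpus) if segment]
```

Each returned piece plays the role of one sub-corpus segment in the later steps.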

Further, intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths specifically follows this rule:

each sub-corpus segment is assumed to be ABCA'B'C' and is divided into runs of 2 adjacent characters and 3 adjacent characters, yielding a plurality of 2-character and 3-character combinations with AB, BC, and ABC as basic structures, where A, B, and C each denote any single character.
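The interception rule above amounts to sliding a window of 2 and then 3 characters over each segment. A minimal sketch, assuming the hypothetical function name `char_combinations` and a `lengths` parameter that defaults to the 2- and 3-character runs named in the rule:

```python
def char_combinations(segment: str, lengths=(2, 3)) -> list[str]:
    """Return every run of adjacent characters of the requested lengths."""
    combos = []
    for n in lengths:
        # A segment shorter than n yields no combinations of that length.
        for start in range(len(segment) - n + 1):
            combos.append(segment[start:start + n])
    return combos

print(char_combinations("ABCD"))  # ['AB', 'BC', 'CD', 'ABC', 'BCD']
```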

Further, the user comments in the preset time period are specifically the comments of all users within the past half year.
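Restricting the statistics to a half-year window could look like the sketch below. The `(timestamp, text)` record layout, the function name, and the approximation of half a year as 183 days are all assumptions made for illustration; the source specifies none of them.

```python
from datetime import datetime, timedelta

# Assumption: "half a year" is approximated as 183 days.
HALF_YEAR = timedelta(days=183)

def comments_in_window(comments, now, window=HALF_YEAR):
    """Keep the text of comments whose timestamp lies in the window ending at `now`.

    `comments` is an iterable of (timestamp, text) pairs (hypothetical layout).
    """
    cutoff = now - window
    return [text for timestamp, text in comments if cutoff <= timestamp <= now]
```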

Further, segmenting the target corpus according to the respective numbers of occurrences specifically includes analyzing each basic structure to obtain the number of times n0 that the character combination ABC occurs in the preset time period, the number of times n1 that the character combination AB occurs, and the number of times n2 that the character combination BC occurs;

if n1 exceeds n2 by more than a first threshold, AB is determined to be a word;

if n2 exceeds n1 by more than the first threshold, A is judged to be a single-character word, and either BC or BCA' is a word;

and if the absolute value of the difference between n1 and n2 is not greater than the first threshold, either ABC or ABCA' is judged to be a word.
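The three-way decision above can be written as a small function. This is a sketch only: the function name and the string labels it returns are hypothetical, and the threshold values in the usage lines are arbitrary.

```python
def classify(n1: int, n2: int, first_threshold: int) -> str:
    """Apply the first-threshold rule to the counts of AB (n1) and BC (n2)."""
    if n1 - n2 > first_threshold:
        return "AB"      # AB is a word
    if n2 - n1 > first_threshold:
        return "A+BC"    # A is a single-character word; BC or BCA' is a word
    return "ABC"         # counts are close: ABC or ABCA' is a word

print(classify(100, 10, 50))  # AB
print(classify(10, 100, 50))  # A+BC
print(classify(60, 70, 50))   # ABC
```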

Further, the method also comprises:

when n2 exceeds n1 by more than the first threshold, acquiring the number of occurrences of BC and of BCA'; if the difference between the two is lower than a second threshold, BCA' is judged to be a word, and otherwise BC is judged to be a word;

and when the absolute value of the difference between n1 and n2 is not greater than the first threshold, acquiring the number of occurrences of ABC and of ABCA'; if the difference between the two is lower than the second threshold, ABCA' is judged to be a word, and otherwise ABC is judged to be a word.
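Both branches of the second-threshold rule compare a candidate (BC or ABC) with the same candidate extended by the next character A'. A minimal sketch of that comparison follows; the function name and return labels are hypothetical. Because every occurrence of the extended string contains the base string, the base count is never smaller than the extended count, so the difference is non-negative.

```python
def resolve_extension(count_base: int, count_extended: int, second_threshold: int) -> str:
    """Choose between a candidate word and its one-character extension.

    count_base:     occurrences of BC (or ABC)
    count_extended: occurrences of BCA' (or ABCA')
    """
    if count_base - count_extended < second_threshold:
        return "extended"  # the longer string almost always appears: BCA' / ABCA'
    return "base"          # the shorter string often appears alone: BC / ABC
```

A small difference means the base string rarely occurs without the extra character, which is why the longer form is preferred in that case.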

Further, the method also comprises, while intercepting adjacent characters of each sub-corpus segment according to the preset rule to obtain the character combinations of different lengths, forming the character combinations into corresponding hash structures, and using the hash structures to quickly count the number of times each character combination occurs in the user comments within the preset time period.
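One way to realize such a hash structure is a hash map from character combination to occurrence count. The sketch below uses Python's `collections.Counter` for this; the choice of `Counter`, the function name, and the simplification of not splitting comments at punctuation first are assumptions, not details given in the source.

```python
from collections import Counter

def count_combinations(comments, lengths=(2, 3)) -> Counter:
    """Build a hash map from each character combination to its occurrence count."""
    counts = Counter()
    for text in comments:
        for n in lengths:
            for start in range(len(text) - n + 1):
                counts[text[start:start + n]] += 1
    return counts

counts = count_combinations(["ABAB", "AB"])
print(counts["AB"])  # 3
print(counts["XY"])  # 0 (combinations never seen count as zero)
```

Once built, looking up any combination is a single hash access rather than a scan over the comments.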

The invention also provides a corpus-based word segmentation device, comprising the following modules:

the target corpus acquiring module, used for acquiring a target corpus;

the sub-corpus splitting module, used for splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;

the character combination acquisition module, used for intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;

the quantity counting module, used for counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period;

and the word segmentation module, used for segmenting the target corpus according to the respective numbers of occurrences.

Further, the device also comprises

a hash structure generation module, used for, while adjacent characters of each sub-corpus segment are intercepted according to the preset rule to obtain the character combinations of different lengths, forming the character combinations into corresponding hash structures, and quickly counting the number of times each character combination occurs in the user comments within the preset time period according to the hash structures.

The invention also proposes a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of the preceding claims.

The beneficial effects of the invention are as follows:

a mathematical word segmentation model is established for the corpus; adjacent characters of each sub-corpus segment are intercepted according to a preset rule to obtain a plurality of character combinations of different lengths; the number of times each of the character combinations occurs in user comments within a preset time period is counted; and the target corpus is segmented according to the respective numbers of occurrences. Words are thereby segmented intelligently, new words are discovered faster, and word segmentation ambiguity is not produced.

Drawings

The foregoing and other features of the present disclosure will become more apparent from the detailed description of the embodiments given in conjunction with the drawings, in which like reference characters designate the same or similar elements throughout. The drawings described below are merely some examples of the present disclosure, and those skilled in the art may derive other drawings from them without inventive effort. In the drawings:

FIG. 1 is a flow chart of the corpus-based word segmentation method according to the present invention.

Detailed Description

The conception, the specific structure and the technical effects produced by the present invention will be clearly and completely described in conjunction with the embodiments and the attached drawings, so as to fully understand the objects, the schemes and the effects of the present invention. It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict. The same reference numbers will be used throughout the drawings to refer to the same or like parts.

Referring to FIG. 1, Embodiment 1 provides a corpus-based word segmentation method, including the following steps:

step 110, acquiring a target corpus;

step 120, splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;

step 130, intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;

step 140, counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period;

step 150, segmenting the target corpus according to the respective numbers of occurrences.
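Steps 110 to 140 can be chained as in the sketch below, which returns the occurrence count of every candidate combination of the target corpus and leaves the decision of step 150 to the caller. All names, the punctuation set, and the choice to split the comments at punctuation as well are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical punctuation set; the source does not list the marks used.
PUNCTUATION = re.compile(r"[\s,.;:!?,。;:!?、]+")

def combinations(segment, lengths=(2, 3)):
    """Step 130: runs of adjacent characters of the given lengths."""
    return [segment[i:i + n] for n in lengths for i in range(len(segment) - n + 1)]

def candidate_counts(target_corpus, comments):
    """Steps 110-140: map each combination of the target corpus to its count."""
    # Step 140: count every combination that occurs in the user comments.
    seen = Counter(
        combo
        for comment in comments
        for segment in PUNCTUATION.split(comment)
        for combo in combinations(segment)
    )
    # Steps 120 and 130: split the target corpus and look up each combination.
    return {
        combo: seen[combo]
        for segment in PUNCTUATION.split(target_corpus)
        for combo in combinations(segment)
    }

print(candidate_counts("AB,CD", ["ABAB", "CDX,AB"]))  # {'AB': 3, 'CD': 1}
```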

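As a preferred embodiment of the present invention, adjacent characters of each sub-corpus segment are intercepted according to a preset rule to obtain a plurality of character combinations of different lengths, following this rule: each sub-corpus segment is assumed to be ABCA'B'C' and is divided into runs of 2 and 3 adjacent characters, yielding 2-character and 3-character combinations with AB, BC, and ABC as basic structures, where A, B, and C each denote any single character.

Specifically, the user comments in the preset time period are the comments of all users within the past half year.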

As a preferred embodiment of the present invention, segmenting the target corpus according to the respective numbers of occurrences specifically includes:

analyzing each basic structure, and acquiring the number of times n0 that the character combination ABC occurs in the preset time period, the number of times n1 that AB occurs, and the number of times n2 that BC occurs.

The first threshold is defined by the user. In general, when one count is 5 or more times the other, combining the characters may be considered more reasonable than separating them.

In this preferred embodiment, the corpus analyzed is the sentence: Everyone considers that "burden reduction" is not only the reduction of students' homework.

The corpus is first split at punctuation marks, for example into "everyone considers", "burden reduction", and "is not only the reduction of students' homework".

Establishment of mathematical model

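Each character is represented by a letter, so that three adjacent characters are written ABC.

The three characters ABC can be cut into two combinations, AB and BC.

The occurrence counts obtained from corpus statistics can be recorded as in Table 1:

Character combination	Number of occurrences
ABC	n0
AB	n1
BC	n2

TABLE 1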

Case one:

n1 is far greater than n2, indicating that AB is very likely a word.

Case two:

n1 is far smaller than n2, indicating that A is likely a single-character word, and that BC is likely, but not certainly, a word; it may instead be part of a longer word that follows.

Case three:

n1 and n2 do not differ much, so ABC may be a word or part of a longer word. At this point ABC cannot yet be confirmed as a word, and the following character needs to be added so that the whole can be analyzed.

Specifically,

Example analysis one: see Table 2

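Character combination	Number of occurrences
"everyone" plus the following character (ABC)	2855
"everyone" (AB)	10530
second character of "everyone" plus the following character (BC)	4442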

TABLE 2

The corpus statistics show that "everyone" (AB) occurs far more often than the combination BC. It can therefore be assumed that "everyone" is more like a word, and the following character, "all", becomes a single-character word. That is,

the three characters ABC should be segmented as "everyone" and "all".

Example analysis two: see Table 3

Character combination	Number of occurrences
"all" plus "consider" (ABC)	101
"all" plus the first character of "consider" (AB)	271
"consider" (BC)	3547

TABLE 3

Similarly, the corpus analysis shows that "consider" (BC) occurs far more often than the combination AB; that is, "consider" is more like a word, and "all" is more like a single-character word. The analysis proceeds in the same way as in example analysis one, which makes it convenient to process recursively with a program.

Example analysis results:

"everyone considers" should be segmented into "everyone", "all", and "consider". The word segmentation result is correct.
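The two example analyses can be replayed with a short left-to-right loop. In the sketch below the Latin letters A to E stand in for the five characters of "everyone considers", and the counts are the AB and BC values from Tables 2 and 3. The function name, the threshold of 1000, the handling of a tail shorter than three characters, and the omission of the second-threshold check are all simplifying assumptions.

```python
def segment(text, counts, first_threshold):
    """Greedy left-to-right segmentation using the first-threshold rule."""
    words, i = [], 0
    while len(text) - i >= 3:
        n1 = counts.get(text[i:i + 2], 0)      # count of AB
        n2 = counts.get(text[i + 1:i + 3], 0)  # count of BC
        if n1 - n2 > first_threshold:
            words.append(text[i:i + 2])        # case one: AB is a word
            i += 2
        elif n2 - n1 > first_threshold:
            words.append(text[i])              # case two: A stands alone
            i += 1
        else:
            words.append(text[i:i + 3])        # case three, simplified: take ABC
            i += 3
    if i < len(text):
        words.append(text[i:])                 # assumption: keep a short tail whole
    return words

# Counts from Table 2 (AB, BC) and Table 3 (CD, DE), keyed by placeholder letters.
counts = {"AB": 10530, "BC": 4442, "CD": 271, "DE": 3547}
print(segment("ABCDE", counts, 1000))  # ['AB', 'C', 'DE']
```

The output corresponds to "everyone", "all", and "consider", matching the result stated above.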

The corpus above is again used as an example.

Example analysis three (4 characters): see Table 4

Character combination	Number of occurrences
Not only (ABC)	387
Not only (AB)	977
Only (BC)	1302

TABLE 4

In this case, the occurrence counts of AB ("not only") and BC ("only") do not differ greatly, and neither count is small, so the following character must be introduced and the above analysis repeated. According to the corpus above,

the recombination is the four-character phrase "not only". Any two adjacent characters are then treated as a whole, and the phrase is reassembled into the ABC mathematical model structure for re-analysis; see Table 5.

TABLE 5

Final analysis of the results

From the above analysis,

the AB combination is more suitable than the BC combination; that is, "not only" should be understood as a word. The word segmentation is correct.

As a preferred embodiment of the present invention, the method further comprises:

when n2 exceeds n1 by more than the first threshold, acquiring the number of occurrences of BC and of BCA'; if the difference between the two is lower than a second threshold, BCA' is judged to be a word, and otherwise BC is judged to be a word.

As a preferred embodiment of the present invention, the method further includes, while intercepting adjacent characters of each sub-corpus segment according to the preset rule to obtain a plurality of character combinations of different lengths, forming the character combinations into corresponding hash structures, and quickly counting the character combinations using the hash structures.

In this preferred embodiment,

segmenting the corpus purely by traversal matching would be a very slow process. Building the character combinations of the corpus into a hash structure speeds up the word segmentation, so that the analysis speed is comparable to that of lexicon-based segmentation. The characters of the corpus are combined arbitrarily. Again taking the corpus above,

"burden reduction" is not only the reduction of students' homework,

character combinations of length 2 to 4 are generated from this corpus; see Table 6.

TABLE 6

Generating the combinations in this way builds the hash structure over the corpus, and the counts can then be obtained very quickly.

the target corpus acquiring module is used for acquiring target corpora;

As a preferred embodiment of the present invention, the apparatus further comprises,

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.

The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and which implements the steps of the above-described method embodiments when executed by a processor. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be suitably increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.

While the present invention has been described in considerable detail and with particular reference to a few illustrative embodiments thereof, it is not intended to be limited to any such details or embodiments or any particular embodiments, but it is to be construed as effectively covering the intended scope of the invention by providing a broad, potential interpretation of such claims in view of the prior art with reference to the appended claims. Furthermore, the foregoing describes the invention in terms of embodiments foreseen by the inventor for which an enabling description was available, notwithstanding that insubstantial modifications of the invention, not presently foreseen, may nonetheless represent equivalent modifications thereto.

The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment; any solution that achieves the technical effects of the present invention by the same means shall fall within the protection scope of the present invention. Within that protection scope, the technical solution and/or its implementation may be modified and varied in other ways.

Claims

1. A corpus-based word segmentation method, characterized by comprising the following steps:

acquiring a target corpus;

intercepting adjacent characters of each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths; counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period;

2. The corpus-based word segmentation method according to claim 1, wherein adjacent characters of each sub-corpus segment are intercepted according to a preset rule to obtain a plurality of character combinations of different lengths, following this rule,

3. The corpus-based word segmentation method according to claim 2, wherein the user comments in the preset time period are the comments of all users within the past half year.

4. The corpus-based word segmentation method according to claim 2, wherein the target corpus is segmented according to the respective numbers of occurrences, comprising,

5. The corpus-based word segmentation method according to claim 4, wherein the method further comprises,

6. The corpus-based word segmentation method according to claim 1, wherein the method further includes, while intercepting adjacent characters according to a preset rule for each sub-corpus segment to obtain a plurality of character combinations with different lengths, forming the character combinations into corresponding hash structures, and performing fast statistics on respective occurrence times of the character combinations in user comments in a preset time period according to the formed hash structures.

7. A corpus-based word segmentation device, characterized by comprising the following modules:

the target corpus acquiring module is used for acquiring target corpora;

the quantity counting module, used for counting the number of times each of the character combinations of different lengths occurs in user comments within a preset time period;

8. The corpus-based word segmentation device according to claim 7, further comprising,

9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.