CN113449082A - New word discovery method, system, electronic device and medium
- Publication number: CN113449082A
- Application number: CN202110805642.9A
- Authority: CN (China)
- Prior art keywords: word, candidate, words, information entropy, new
- Prior art date: 2021-07-16
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
- G06F16/3346—Query execution using probabilistic model
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
Abstract
The application discloses a new word discovery method, system, electronic device and medium, wherein the new word discovery method comprises the following steps: a candidate word cohesion calculation step: after the candidate word frequency and the split word frequencies are calculated, the candidate word cohesion is calculated according to the candidate word frequency and the split word frequencies; a candidate word degree of freedom calculation step: the left-neighbor character information entropy and the right-neighbor character information entropy of the candidate word are calculated, and the smaller of the two entropy values is selected as the candidate word degree of freedom; and a new word judgment step: a vocabulary score is calculated according to the candidate word cohesion and the candidate word degree of freedom, candidate words whose vocabulary score is greater than a vocabulary score threshold are selected as words, the words are compared with the words in a word bank, and new words are obtained according to the comparison result. The method improves the accuracy of new word discovery and makes the new word discovery process more logically sound.
Description
Technical Field
The present application relates to the field of data capability technologies, and in particular, to a method, a system, an electronic device, and a medium for discovering new words.
Background
In the field of Chinese word segmentation, new word discovery is a very important NLP topic. On the one hand, against the background of people's ever-growing material and cultural needs, vocabulary develops extremely rapidly and a large number of new words appear every year; on the other hand, these new words are generated without any fixed rules, so how can a computer recognize newly appearing words such as person names, place names, organization names, brand names, proper nouns, abbreviations and new internet words? The Chinese word segmentation field has concentrated on overcoming this difficulty for the last decade, and the discovery and identification of new words has become a key link. The traditional method for finding new words relies on an existing word segmenter to segment the text and then guesses that the remaining fragments that were not successfully matched are the new words. But this method has a logical hole: the accuracy of the segmentation itself depends on the completeness of the word bank, and if the word bank does not contain the new words at all, the segmentation result is understandably unreliable, so the new words found are of poor quality and may not even be real words. Therefore, the prior art cannot provide a new word discovery method that is both efficient and highly accurate.
Disclosure of Invention
The embodiments of the application provide a new word discovery method, system, electronic device and medium, which at least solve the problems that the new word discovery process depends on an existing word bank, that the accuracy of new word discovery is low, and that the logic of the new word discovery method is weak.
The invention provides a new word discovery method, which comprises the following steps:
a candidate word cohesion calculation step: after the candidate word frequency and the split word frequencies are calculated, the candidate word cohesion is calculated according to the candidate word frequency and the split word frequencies;
a candidate word degree of freedom calculation step: the left-neighbor character information entropy and the right-neighbor character information entropy of the candidate word are calculated, and the smaller of the two entropy values is selected as the candidate word degree of freedom;
and a new word judgment step: a vocabulary score is calculated according to the candidate word cohesion and the candidate word degree of freedom, candidate words whose vocabulary score is greater than a vocabulary score threshold are selected as words, the words are compared with the words in a word bank, and new words are obtained according to the comparison result.
In the above new word discovery method, the candidate word cohesion calculation step includes:
a candidate word obtaining step: after a word frequency threshold and a word length threshold are preset, corpus fragments whose occurrence count is greater than the word frequency threshold and whose word length is less than the word length threshold are extracted from the corpus to obtain the candidate words (a minimal sketch of this extraction is given after these sub-steps);
a candidate word ratio calculation step: after the candidate word frequency of the candidate word in the corpus is calculated, the candidate word is split and the split word frequencies of the split parts in the corpus are calculated;
and a candidate word cohesion obtaining step: the candidate word cohesion is calculated according to the candidate word frequency and the split word frequencies.
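By way of example and not limitation, the candidate word obtaining step can be sketched in Python as follows; the function name, the threshold values and the assumption that a candidate has at least two characters are illustrative choices, not a reference implementation of the claimed method:

```python
from collections import Counter

def extract_candidates(corpus: str, freq_threshold: int = 5, length_threshold: int = 5) -> dict:
    """Candidate word obtaining step (sketch): enumerate the character n-grams of the
    corpus whose length is below the word length threshold, and keep those whose
    occurrence count exceeds the word frequency threshold."""
    counts = Counter()
    for n in range(2, length_threshold):  # word length < word length threshold
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    # occurrence count > word frequency threshold
    return {fragment: count for fragment, count in counts.items() if count > freq_threshold}
```

The returned mapping doubles as a candidate word frequency table for the subsequent sub-steps.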
In the above new word discovery method, the candidate word degree of freedom calculation step includes:
a word set acquisition step: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
and (3) information entropy calculation: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word freedom degree obtaining step: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
In the above method for discovering new words, the step of determining new words includes:
selecting words: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
a new word obtaining step: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
The present invention also provides a new word discovery system, which is suitable for the new word discovery method described above, and the new word discovery system includes:
a candidate word cohesion calculation unit: after candidate word frequency and split word frequency are calculated, candidate word cohesion is calculated according to the candidate word frequency and the split word frequency;
a candidate word degree of freedom calculation unit: calculating left adjacent character information entropy and right adjacent character information entropy of the candidate words, and selecting information entropy with small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as candidate word freedom;
and the new word judgment unit is used for calculating vocabulary scores according to the cohesion degree and the freedom degree of the candidate words, selecting the vocabulary with the vocabulary score larger than a vocabulary score threshold value from the candidate words to obtain words, comparing the words with the words in the word bank, and obtaining new words according to the comparison result.
In the above new word discovery system, the candidate word cohesion calculation unit includes:
a candidate word acquisition module: after a word frequency threshold value and a word length threshold value are preset, extracting corpus fragments of which the occurrence frequency is greater than the word frequency threshold value and the word length is less than the word length threshold value from the corpus to obtain the candidate words;
candidate word ratio calculation module: after the candidate word frequency of the candidate word in the corpus is calculated, splitting the candidate word and calculating the split word frequency of the split candidate word in the corpus;
candidate word cohesion acquisition module: and calculating the cohesion degree of the candidate words according to the frequency of the candidate words and the frequency of the split words.
In the above new word discovery system, the candidate word degree of freedom calculation unit includes:
a word set acquisition module: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
the information entropy calculation module: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word degree of freedom acquisition module: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
In the above system for discovering new words, the new word determining unit includes:
the word selecting module: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
a new word acquisition module: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
The present invention also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements any of the new word discovery methods described above when executing the computer program.
The present invention also provides an electronic device readable storage medium having stored thereon computer program instructions, which, when executed by the processor, implement any of the new word discovery methods described above.
Compared with the prior art, the new word discovery method, system, electronic device and medium provided by the invention do not depend on any existing word bank for extraction: candidate words are extracted directly from the corpus, and only then are all extracted words compared with the existing word bank to find the newly appeared words, which improves the accuracy of new word discovery, the logical soundness of the new word discovery method, and the data mining capability.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow diagram of a new word discovery method according to an embodiment of the application;
FIG. 2 is a schematic diagram of the configuration of the neologism discovery system of the present invention;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present application.
Wherein the reference numerals are:
a candidate word cohesion calculation unit: 51;
a candidate word degree of freedom calculation unit: 52;
a new word judgment unit: 53;
a candidate word acquisition module: 511;
a candidate word ratio calculation module: 512;
a candidate word cohesion acquisition module: 513;
a word set acquisition module: 521;
an information entropy calculation module: 522;
a candidate word degree of freedom acquisition module: 523;
a word selection module: 531;
a new word acquisition module: 532;
a bus: 80;
a processor: 81;
a memory: 82;
a communication interface: 83.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that such a development effort might be complex and tedious, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as a limitation of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The method does not depend on any existing word bank: based only on the common characteristics of words, it extracts from a large-scale corpus all text fragments that could form words, and then compares all the extracted words with the existing word bank to find the newly appeared words.
The present invention will be described with reference to specific examples.
Example one
The present embodiment provides a new word discovery method. Referring to fig. 1, fig. 1 is a flowchart of a new word discovery method according to an embodiment of the present application, and as shown in fig. 1, the new word discovery method includes the following steps:
candidate word cohesion calculation step S1: after the candidate word frequency and the split word frequencies are calculated, the candidate word cohesion is calculated according to the candidate word frequency and the split word frequencies;
candidate word degree of freedom calculation step S2: the left-neighbor character information entropy and the right-neighbor character information entropy of the candidate word are calculated, and the smaller of the two entropy values is selected as the candidate word degree of freedom;
and new word judgment step S3: a vocabulary score is calculated according to the candidate word cohesion and the candidate word degree of freedom, candidate words whose vocabulary score is greater than the vocabulary score threshold are selected as words, the words are compared with the words in the word bank, and new words are obtained according to the comparison result.
In an embodiment, the candidate word cohesion calculation step S1 includes:
candidate word obtaining step S11: after a word frequency threshold value and a word length threshold value are preset, extracting corpus fragments of which the occurrence frequency is greater than the word frequency threshold value and the word length is less than the word length threshold value from the corpus to obtain the candidate words;
candidate word ratio calculating step S12: after the candidate word frequency of the candidate word in the corpus is calculated, splitting the candidate word and calculating the split word frequency of the split candidate word in the corpus;
candidate word cohesion degree obtaining step S13: and calculating the cohesion degree of the candidate words according to the frequency of the candidate words and the frequency of the split words.
In a specific implementation, a word frequency threshold frequency and a word length threshold length are set, and all corpus segments whose occurrence count is greater than frequency and whose word length is less than length are extracted from the corpus as candidate words. The candidate word frequency of each candidate word in the corpus, denoted R, is then calculated; the candidate word is split, and the split word frequencies of the split parts in the corpus, denoted r1 and r2, are calculated respectively. For example, the candidate word "cinema" can be split into the two parts "movie" and "hospital", where the frequency of "cinema" is R and the frequencies of "movie" and "hospital" are r1 and r2 respectively. The ratio p = R/(r1·r2) is then calculated, and when p is greater than the threshold m, the cohesion between the split parts is considered high, indicating that the split parts are related to each other rather than unrelated.
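By way of example and not limitation, the cohesion calculation can be sketched as follows. The sketch assumes a Counter holding the occurrence counts of all corpus fragments, including single characters; taking the minimum ratio over every binary split point is an assumption for candidates longer than two characters (the "cinema" example above shows a single split), and the counts in the usage lines are made up for illustration:

```python
from collections import Counter

def cohesion(word: str, counts: Counter, total: int) -> float:
    """Cohesion p = R / (r1 * r2) over relative frequencies, where R is the candidate
    word frequency and r1, r2 are the frequencies of the split parts; the minimum
    over all binary splits is returned for words longer than two characters."""
    p_word = counts[word] / total
    best = float("inf")
    for i in range(1, len(word)):
        r1 = counts[word[:i]] / total
        r2 = counts[word[i:]] / total
        if r1 > 0 and r2 > 0:
            best = min(best, p_word / (r1 * r2))
    return best if best != float("inf") else 0.0

# Illustrative usage: a candidate whose parts rarely occur apart gets a high ratio p,
# which is then compared against the threshold m.
counts = Counter({"电影院": 30, "电影": 40, "院": 60, "电": 45, "影院": 31, "影": 42})
print(cohesion("电影院", counts, total=10000))
```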
In an embodiment, the candidate word degree of freedom calculating step S2 includes:
word set acquisition step S21: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
information entropy calculation step S22: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word degree of freedom acquisition step S23: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
In a specific implementation, the left-neighbor character set and the right-neighbor character set of the candidate word are collected. For example, in the sentence "eat grapes without spitting out the grape skins; don't eat grapes yet spit out the grape skins", the word "grape" appears four times; its left-neighbor character set is {eat, spit, eat, spit} and its right-neighbor character set is {not, skin, instead, skin}. The left-neighbor information entropy is then calculated: for example, the left-neighbor information entropy of "grape" is -(1/2)·log(1/2) - (1/2)·log(1/2) ≈ 0.693. The right-neighbor information entropy is calculated likewise: for example, the right-neighbor information entropy of "grape" is -(1/2)·log(1/2) - (1/4)·log(1/4) - (1/4)·log(1/4) ≈ 1.04. The degree of freedom q of a candidate word is defined as the smaller of the left-neighbor and right-neighbor information entropies; for example, the degree of freedom q of "grape" is its left entropy, 0.693. A degree-of-freedom threshold n is set: once the degree of freedom q of a candidate word is greater than the threshold n, the characters before and after the word are richly varied and the word is more likely to be an independent word. Conversely, if the entropy of a fragment is smaller than the threshold, its collocations with the characters on its left and right are very fixed and it is less likely to be a word on its own; for example, a fragment such as "ancestor" that occurs almost only in a few fixed collocations ("this ancestor", "next ancestor", "eight ancestors", "several ancestors", "two ancestors") has a very small entropy and is unlikely to be an independent word.
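By way of example and not limitation, the degree of freedom calculation can be sketched as follows. The function names are illustrative, the entropy uses the natural logarithm (which reproduces the 0.693 and 1.04 values above), and the corpus string in the usage line is the well-known tongue twister that the translated "grape" example appears to correspond to:

```python
import math
from collections import Counter

def neighbor_entropy(neighbors: list) -> float:
    """Shannon entropy (natural logarithm) of a multiset of neighboring characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def degree_of_freedom(word: str, corpus: str) -> float:
    """Degree of freedom q: the smaller of the left-neighbor and right-neighbor
    information entropies of the candidate word over all its occurrences."""
    left, right = [], []
    pos = corpus.find(word)
    while pos != -1:
        if pos > 0:
            left.append(corpus[pos - 1])
        end = pos + len(word)
        if end < len(corpus):
            right.append(corpus[end])
        pos = corpus.find(word, pos + 1)
    if not left or not right:
        return 0.0
    return min(neighbor_entropy(left), neighbor_entropy(right))

# Illustrative usage: left entropy ≈ 0.693, right entropy ≈ 1.04, so q ≈ 0.693.
print(degree_of_freedom("葡萄", "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"))
```

The value q is then compared against the degree-of-freedom threshold n described above.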
In an embodiment, the new word judging step S3 includes:
word selecting step S31: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
new word obtaining step S32: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
In a specific implementation, the vocabulary score is calculated by combining the cohesion and the degree of freedom: score = w1·p + w2·q, where w1 and w2 are the cohesion weight and the degree-of-freedom weight. A score threshold k is set, and when score > k the candidate is considered a word. The selected words are then checked against the word bank in sequence, and the words that do not exist in the word bank are the new words.
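By way of example and not limitation, the word selection and new word judgment can be sketched as follows; the weights w1 and w2, the score threshold and the example inputs are illustrative assumptions, since the description leaves their concrete values open:

```python
def select_new_words(candidates: dict, lexicon: set,
                     w1: float = 0.5, w2: float = 0.5,
                     score_threshold: float = 1.0) -> list:
    """candidates maps each candidate word to its (cohesion p, degree of freedom q).
    A candidate is kept as a word when score = w1*p + w2*q exceeds the threshold,
    and reported as a new word when it is not already in the word bank."""
    new_words = []
    for word, (p, q) in candidates.items():
        score = w1 * p + w2 * q
        if score > score_threshold and word not in lexicon:
            new_words.append(word)
    return new_words

# Illustrative usage with made-up cohesion/freedom values and a tiny word bank.
print(select_new_words({"葡萄": (120.0, 0.693), "的电": (0.3, 0.1)}, lexicon={"电影"}))
```

In practice the p and q values would come from the cohesion and degree of freedom calculations of steps S1 and S2 above.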
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a new word discovery system according to the present invention. As shown in fig. 2, the new word discovery system of the present invention, which is suitable for the above new word discovery method, includes:
the candidate word cohesion calculation unit 51: after candidate word frequency and split word frequency are calculated, candidate word cohesion is calculated according to the candidate word frequency and the split word frequency;
candidate word degree of freedom calculation unit 52: calculating left adjacent character information entropy and right adjacent character information entropy of the candidate words, and selecting information entropy with small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as candidate word freedom;
and the new word judging unit 53 calculates vocabulary scores according to the cohesion degree and the freedom degree of the candidate words, selects the vocabulary with the vocabulary score larger than the vocabulary score threshold value from the candidate words to obtain words, compares the words with the words in the word bank, and obtains new words according to the comparison result.
In an embodiment, the candidate word cohesion calculation unit 51 includes:
the candidate word obtaining module 511: after a word frequency threshold value and a word length threshold value are preset, extracting corpus fragments of which the occurrence frequency is greater than the word frequency threshold value and the word length is less than the word length threshold value from the corpus to obtain the candidate words;
candidate word ratio calculation module 512: after the candidate word frequency of the candidate word in the corpus is calculated, splitting the candidate word and calculating the split word frequency of the split candidate word in the corpus;
candidate word cohesion acquisition module 513: and calculating the cohesion degree of the candidate words according to the frequency of the candidate words and the frequency of the split words.
In an embodiment, the candidate word degree of freedom calculation unit 52 includes:
word set acquisition module 521: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
the information entropy calculation module 522: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word degree of freedom acquisition module 523: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
In an embodiment, the new word judgment unit 53 includes:
word selection module 531: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
the new word obtaining module 532: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
EXAMPLE III
Referring to fig. 3, this embodiment discloses an embodiment of an electronic device. The electronic device may include a processor 81 and a memory 82 storing computer program instructions.
Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 implements any of the new word discovery methods in the above embodiments by reading and executing computer program instructions stored in the memory 82.
In some of these embodiments, the electronic device may also include a communication interface 83 and a bus 80. As shown in fig. 3, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.
The communication interface 83 is used to implement communication between the modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also be used for data communication with external components, such as external devices, image/abnormal-data monitoring devices, a database, external storage, an image/abnormal-data monitoring workstation, and the like.
The bus 80 includes hardware, software, or both, and couples the components of the electronic device to one another. The bus 80 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example and not limitation, the bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. The bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The electronic device may connect to the new word discovery system to implement the method in conjunction with fig. 1.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
In summary, the invention does not depend on any existing word bank for extraction: based only on the common characteristics of words, all text fragments in a large-scale corpus that could form words are extracted, and the extracted words are then compared with the existing word bank to find the newly appeared words. The invention thereby at least solves the problems that the new word discovery process depends on an existing word bank, that the accuracy of new word discovery is low, and that the logic of the new word discovery method is weak.
The above-mentioned embodiments only express several embodiments of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.
Claims (10)
1. A method for discovering new words, comprising:
calculating the cohesion of the candidate words: after candidate word frequency and split word frequency are calculated, candidate word cohesion is calculated according to the candidate word frequency and the split word frequency;
calculating the freedom degree of the candidate words: calculating left adjacent character information entropy and right adjacent character information entropy of the candidate words, and selecting information entropy with small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as candidate word freedom;
and a new word judgment step, namely calculating vocabulary scores according to the cohesion degree and the freedom degree of the candidate words, selecting the vocabulary with the vocabulary score larger than a vocabulary score threshold value from the candidate words to obtain words, comparing the words with the words in a word bank, and then obtaining new words according to a comparison result.
2. The new word discovery method according to claim 1, wherein the candidate word cohesion calculation step includes:
a candidate word obtaining step: after a word frequency threshold value and a word length threshold value are preset, extracting corpus fragments of which the occurrence frequency is greater than the word frequency threshold value and the word length is less than the word length threshold value from the corpus to obtain the candidate words;
candidate word ratio calculation step: after the candidate word frequency of the candidate word in the corpus is calculated, splitting the candidate word and calculating the split word frequency of the split candidate word in the corpus;
acquiring the cohesion degree of the candidate words: and calculating the cohesion degree of the candidate words according to the frequency of the candidate words and the frequency of the split words.
3. The new word discovery method according to claim 1, wherein said candidate word degree of freedom calculation step includes:
a word set acquisition step: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
and (3) information entropy calculation: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word freedom degree obtaining step: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
4. The method according to claim 1, wherein the new word judgment step comprises:
selecting words: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
a new word obtaining step: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
5. A new word discovery system, adapted to the new word discovery method according to any one of claims 1 to 4, said new word discovery system comprising:
a candidate word cohesion calculation unit: after candidate word frequency and split word frequency are calculated, candidate word cohesion is calculated according to the candidate word frequency and the split word frequency;
a candidate word degree of freedom calculation unit: calculating left adjacent character information entropy and right adjacent character information entropy of the candidate words, and selecting information entropy with small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as candidate word freedom;
and the new word judgment unit is used for calculating vocabulary scores according to the cohesion degree and the freedom degree of the candidate words, selecting the vocabulary with the vocabulary score larger than a vocabulary score threshold value from the candidate words to obtain words, comparing the words with the words in the word bank, and obtaining new words according to the comparison result.
6. The new word discovery system according to claim 5, wherein the candidate word cohesion calculation unit includes:
a candidate word acquisition module: after a word frequency threshold value and a word length threshold value are preset, extracting corpus fragments of which the occurrence frequency is greater than the word frequency threshold value and the word length is less than the word length threshold value from the corpus to obtain the candidate words;
candidate word ratio calculation module: after the candidate word frequency of the candidate word in the corpus is calculated, splitting the candidate word and calculating the split word frequency of the split candidate word in the corpus;
candidate word cohesion acquisition module: and calculating the cohesion degree of the candidate words according to the frequency of the candidate words and the frequency of the split words.
7. The new word discovery system according to claim 6, wherein said candidate word degree of freedom calculation unit includes:
a word set acquisition module: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
the information entropy calculation module: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word degree of freedom acquisition module: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
8. The system according to claim 7, wherein the new word judgment unit includes:
the word selecting module: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
a new word acquisition module: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the new word discovery method of any one of claims 1 to 4 when executing the computer program.
10. An electronic device-readable storage medium having stored thereon computer program instructions which, when executed by the processor, implement the new word discovery method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110805642.9A CN113449082A (en) | 2021-07-16 | 2021-07-16 | New word discovery method, system, electronic device and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110805642.9A CN113449082A (en) | 2021-07-16 | 2021-07-16 | New word discovery method, system, electronic device and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113449082A true CN113449082A (en) | 2021-09-28 |
Family
ID=77816393
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110805642.9A Pending CN113449082A (en) | 2021-07-16 | 2021-07-16 | New word discovery method, system, electronic device and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449082A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114218938A (en) * | 2021-12-13 | 2022-03-22 | 北京智齿众服技术咨询有限公司 | Word segmentation method and device, electronic equipment and storage medium |
CN115034211A (en) * | 2022-05-19 | 2022-09-09 | 一点灵犀信息技术(广州)有限公司 | Unknown word discovery method and device, electronic equipment and storage medium |
CN117077670A (en) * | 2023-10-16 | 2023-11-17 | 深圳市东信时代信息技术有限公司 | New word determining method, device, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102930055A (en) * | 2012-11-18 | 2013-02-13 | 浙江大学 | New network word discovery method in combination with internal polymerization degree and external discrete information entropy |
CN105260362A (en) * | 2015-10-30 | 2016-01-20 | 小米科技有限责任公司 | New word extraction method and device |
CN108038119A (en) * | 2017-11-01 | 2018-05-15 | 平安科技(深圳)有限公司 | Utilize the method, apparatus and storage medium of new word discovery investment target |
CN109408818A (en) * | 2018-10-12 | 2019-03-01 | 平安科技(深圳)有限公司 | New word identification method, device, computer equipment and storage medium |
CN110110322A (en) * | 2019-03-29 | 2019-08-09 | 泰康保险集团股份有限公司 | Network new word discovery method, apparatus, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |