CN113449082A - New word discovery method, system, electronic device and medium
- Publication number: CN113449082A
- Application number: CN202110805642.9A
- Authority: CN (China)
- Prior art keywords: word, candidate, words, information entropy, new
- Prior art date: 2021-07-16
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
- G06F16/3346—Query execution using probabilistic model
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
Abstract
The application discloses a new word discovery method, system, electronic device and medium, wherein the new word discovery method comprises the following steps: a candidate word cohesion calculation step: after the candidate word frequency and the split word frequencies are calculated, the candidate word cohesion is calculated according to the candidate word frequency and the split word frequencies; a candidate word degree of freedom calculation step: the left-neighbor character information entropy and the right-neighbor character information entropy of the candidate word are calculated, and the smaller of the two entropy values is selected as the candidate word degree of freedom; and a new word judgment step: a vocabulary score is calculated according to the candidate word cohesion and the candidate word degree of freedom, candidate words whose vocabulary score is greater than a vocabulary score threshold are selected as words, the words are compared with the words in a word bank, and new words are obtained according to the comparison result. The method improves the accuracy of new word discovery and makes the new word discovery process more logically sound.
Description
Technical Field
The present application relates to the field of data capability technologies, and in particular, to a method, a system, an electronic device, and a medium for discovering new words.
Background
In the field of Chinese word segmentation, new word discovery is a very important NLP topic. On the one hand, against the background of people's ever-growing material and cultural needs, vocabulary develops extremely rapidly and a large number of new words appear every year; on the other hand, these new words are generated without any fixed rules, so how can a computer recognize newly appearing words such as person names, place names, organization names, brand names, proper nouns, abbreviations and new internet words? The Chinese word segmentation field has concentrated on overcoming this difficulty for the last decade, and the discovery and identification of new words has become a key link. The traditional method for finding new words relies on an existing word segmenter to segment the text and then guesses that the remaining fragments that were not successfully matched are the new words. But this method has a logical hole: the accuracy of the segmentation itself depends on the completeness of the word bank, and if the word bank does not contain the new words at all, the segmentation result is understandably unreliable, so the new words found are of poor quality and may not even be real words. Therefore, the prior art cannot provide a new word discovery method that is both efficient and highly accurate.
Disclosure of Invention
The embodiments of the application provide a new word discovery method, system, electronic device and medium, which at least solve the problems that the new word discovery process depends on an existing word bank, that the accuracy of new word discovery is low, and that the logic of the new word discovery method is weak.
The invention provides a new word discovery method, which comprises the following steps:
a candidate word cohesion calculation step: after the candidate word frequency and the split word frequencies are calculated, the candidate word cohesion is calculated according to the candidate word frequency and the split word frequencies;
a candidate word degree of freedom calculation step: the left-neighbor character information entropy and the right-neighbor character information entropy of the candidate word are calculated, and the smaller of the two entropy values is selected as the candidate word degree of freedom;
and a new word judgment step: a vocabulary score is calculated according to the candidate word cohesion and the candidate word degree of freedom, candidate words whose vocabulary score is greater than a vocabulary score threshold are selected as words, the words are compared with the words in a word bank, and new words are obtained according to the comparison result.
In the above new word discovery method, the candidate word cohesion calculation step includes:
a candidate word obtaining step: after a word frequency threshold and a word length threshold are preset, corpus fragments whose occurrence count is greater than the word frequency threshold and whose word length is less than the word length threshold are extracted from the corpus to obtain the candidate words (a minimal sketch of this extraction is given after these sub-steps);
a candidate word ratio calculation step: after the candidate word frequency of the candidate word in the corpus is calculated, the candidate word is split and the split word frequencies of the split parts in the corpus are calculated;
and a candidate word cohesion obtaining step: the candidate word cohesion is calculated according to the candidate word frequency and the split word frequencies.
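By way of example and not limitation, the candidate word obtaining step can be sketched in Python as follows; the function name, the threshold values and the assumption that a candidate has at least two characters are illustrative choices, not a reference implementation of the claimed method:

```python
from collections import Counter

def extract_candidates(corpus: str, freq_threshold: int = 5, length_threshold: int = 5) -> dict:
    """Candidate word obtaining step (sketch): enumerate the character n-grams of the
    corpus whose length is below the word length threshold, and keep those whose
    occurrence count exceeds the word frequency threshold."""
    counts = Counter()
    for n in range(2, length_threshold):  # word length < word length threshold
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    # occurrence count > word frequency threshold
    return {fragment: count for fragment, count in counts.items() if count > freq_threshold}
```

The returned mapping doubles as a candidate word frequency table for the subsequent sub-steps.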
In the above new word discovery method, the candidate word degree of freedom calculation step includes:
a word set acquisition step: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
and (3) information entropy calculation: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word freedom degree obtaining step: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
In the above method for discovering new words, the step of determining new words includes:
selecting words: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
a new word obtaining step: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
The present invention also provides a new word discovery system, which is suitable for the new word discovery method described above, and the new word discovery system includes:
a candidate word cohesion calculation unit: after candidate word frequency and split word frequency are calculated, candidate word cohesion is calculated according to the candidate word frequency and the split word frequency;
a candidate word degree of freedom calculation unit: calculating left adjacent character information entropy and right adjacent character information entropy of the candidate words, and selecting information entropy with small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as candidate word freedom;
and the new word judgment unit is used for calculating vocabulary scores according to the cohesion degree and the freedom degree of the candidate words, selecting the vocabulary with the vocabulary score larger than a vocabulary score threshold value from the candidate words to obtain words, comparing the words with the words in the word bank, and obtaining new words according to the comparison result.
In the above new word discovery system, the candidate word cohesion calculation unit includes:
a candidate word acquisition module: after a word frequency threshold value and a word length threshold value are preset, extracting corpus fragments of which the occurrence frequency is greater than the word frequency threshold value and the word length is less than the word length threshold value from the corpus to obtain the candidate words;
candidate word ratio calculation module: after the candidate word frequency of the candidate word in the corpus is calculated, splitting the candidate word and calculating the split word frequency of the split candidate word in the corpus;
candidate word cohesion acquisition module: and calculating the cohesion degree of the candidate words according to the frequency of the candidate words and the frequency of the split words.
In the above new word discovery system, the candidate word degree of freedom calculation unit includes:
a word set acquisition module: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
the information entropy calculation module: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word degree of freedom acquisition module: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
In the above system for discovering new words, the new word determining unit includes:
the word selecting module: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
a new word acquisition module: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
The present invention also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements any of the new word discovery methods described above when executing the computer program.
The present invention also provides an electronic device readable storage medium having stored thereon computer program instructions, which, when executed by the processor, implement any of the new word discovery methods described above.
Compared with the prior art, the new word discovery method, system, electronic device and medium provided by the invention do not depend on any existing word bank for extraction: candidate words are extracted directly from the corpus, and only then are all extracted words compared with the existing word bank to find the newly appeared words, which improves the accuracy of new word discovery, the logical soundness of the new word discovery method, and the data mining capability.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow diagram of a new word discovery method according to an embodiment of the application;
FIG. 2 is a schematic diagram of the configuration of the neologism discovery system of the present invention;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present application.
Wherein the reference numerals are:
a candidate word cohesion calculation unit: 51;
a candidate word degree of freedom calculation unit: 52;
a new word judgment unit: 53;
a candidate word acquisition module: 511;
a candidate word ratio calculation module: 512;
a candidate word cohesion acquisition module: 513;
a word set acquisition module: 521;
an information entropy calculation module: 522;
a candidate word degree of freedom acquisition module: 523;
a word selection module: 531;
a new word acquisition module: 532;
a bus: 80;
a processor: 81;
a memory: 82;
a communication interface: 83.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that such a development effort might be complex and tedious, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as a limitation of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The method does not depend on any existing word bank: based only on the common characteristics of words, it extracts from a large-scale corpus all text fragments that could form words, and then compares all the extracted words with the existing word bank to find the newly appeared words.
The present invention will be described with reference to specific examples.
Example one
The present embodiment provides a new word discovery method. Referring to fig. 1, fig. 1 is a flowchart of a new word discovery method according to an embodiment of the present application, and as shown in fig. 1, the new word discovery method includes the following steps:
candidate word cohesion calculation step S1: after the candidate word frequency and the split word frequencies are calculated, the candidate word cohesion is calculated according to the candidate word frequency and the split word frequencies;
candidate word degree of freedom calculation step S2: the left-neighbor character information entropy and the right-neighbor character information entropy of the candidate word are calculated, and the smaller of the two entropy values is selected as the candidate word degree of freedom;
and new word judgment step S3: a vocabulary score is calculated according to the candidate word cohesion and the candidate word degree of freedom, candidate words whose vocabulary score is greater than the vocabulary score threshold are selected as words, the words are compared with the words in the word bank, and new words are obtained according to the comparison result.
In an embodiment, the candidate word cohesion calculation step S1 includes:
candidate word obtaining step S11: after a word frequency threshold value and a word length threshold value are preset, extracting corpus fragments of which the occurrence frequency is greater than the word frequency threshold value and the word length is less than the word length threshold value from the corpus to obtain the candidate words;
candidate word ratio calculating step S12: after the candidate word frequency of the candidate word in the corpus is calculated, splitting the candidate word and calculating the split word frequency of the split candidate word in the corpus;
candidate word cohesion degree obtaining step S13: and calculating the cohesion degree of the candidate words according to the frequency of the candidate words and the frequency of the split words.
In a specific implementation, a word frequency threshold frequency and a word length threshold length are set, and all corpus segments whose occurrence count is greater than frequency and whose word length is less than length are extracted from the corpus as candidate words. The candidate word frequency of each candidate word in the corpus, denoted R, is then calculated; the candidate word is split, and the split word frequencies of the split parts in the corpus, denoted r1 and r2, are calculated respectively. For example, the candidate word "cinema" can be split into the two parts "movie" and "hospital", where the frequency of "cinema" is R and the frequencies of "movie" and "hospital" are r1 and r2 respectively. The ratio p = R/(r1·r2) is then calculated, and when p is greater than the threshold m, the cohesion between the split parts is considered high, indicating that the split parts are related to each other rather than unrelated.
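By way of example and not limitation, the cohesion calculation can be sketched as follows. The sketch assumes a Counter holding the occurrence counts of all corpus fragments, including single characters; taking the minimum ratio over every binary split point is an assumption for candidates longer than two characters (the "cinema" example above shows a single split), and the counts in the usage lines are made up for illustration:

```python
from collections import Counter

def cohesion(word: str, counts: Counter, total: int) -> float:
    """Cohesion p = R / (r1 * r2) over relative frequencies, where R is the candidate
    word frequency and r1, r2 are the frequencies of the split parts; the minimum
    over all binary splits is returned for words longer than two characters."""
    p_word = counts[word] / total
    best = float("inf")
    for i in range(1, len(word)):
        r1 = counts[word[:i]] / total
        r2 = counts[word[i:]] / total
        if r1 > 0 and r2 > 0:
            best = min(best, p_word / (r1 * r2))
    return best if best != float("inf") else 0.0

# Illustrative usage: a candidate whose parts rarely occur apart gets a high ratio p,
# which is then compared against the threshold m.
counts = Counter({"电影院": 30, "电影": 40, "院": 60, "电": 45, "影院": 31, "影": 42})
print(cohesion("电影院", counts, total=10000))
```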
In an embodiment, the candidate word degree of freedom calculating step S2 includes:
word set acquisition step S21: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
information entropy calculation step S22: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word degree of freedom acquisition step S23: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
In a specific implementation, the left-neighbor character set and the right-neighbor character set of the candidate word are collected. For example, in the sentence "eat grapes without spitting out the grape skins; don't eat grapes yet spit out the grape skins", the word "grape" appears four times; its left-neighbor character set is {eat, spit, eat, spit} and its right-neighbor character set is {not, skin, instead, skin}. The left-neighbor information entropy is then calculated: for example, the left-neighbor information entropy of "grape" is -(1/2)·log(1/2) - (1/2)·log(1/2) ≈ 0.693. The right-neighbor information entropy is calculated likewise: for example, the right-neighbor information entropy of "grape" is -(1/2)·log(1/2) - (1/4)·log(1/4) - (1/4)·log(1/4) ≈ 1.04. The degree of freedom q of a candidate word is defined as the smaller of the left-neighbor and right-neighbor information entropies; for example, the degree of freedom q of "grape" is its left entropy, 0.693. A degree-of-freedom threshold n is set: once the degree of freedom q of a candidate word is greater than the threshold n, the characters before and after the word are richly varied and the word is more likely to be an independent word. Conversely, if the entropy of a fragment is smaller than the threshold, its collocations with the characters on its left and right are very fixed and it is less likely to be a word on its own; for example, a fragment such as "ancestor" that occurs almost only in a few fixed collocations ("this ancestor", "next ancestor", "eight ancestors", "several ancestors", "two ancestors") has a very small entropy and is unlikely to be an independent word.
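By way of example and not limitation, the degree of freedom calculation can be sketched as follows. The function names are illustrative, the entropy uses the natural logarithm (which reproduces the 0.693 and 1.04 values above), and the corpus string in the usage line is the well-known tongue twister that the translated "grape" example appears to correspond to:

```python
import math
from collections import Counter

def neighbor_entropy(neighbors: list) -> float:
    """Shannon entropy (natural logarithm) of a multiset of neighboring characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def degree_of_freedom(word: str, corpus: str) -> float:
    """Degree of freedom q: the smaller of the left-neighbor and right-neighbor
    information entropies of the candidate word over all its occurrences."""
    left, right = [], []
    pos = corpus.find(word)
    while pos != -1:
        if pos > 0:
            left.append(corpus[pos - 1])
        end = pos + len(word)
        if end < len(corpus):
            right.append(corpus[end])
        pos = corpus.find(word, pos + 1)
    if not left or not right:
        return 0.0
    return min(neighbor_entropy(left), neighbor_entropy(right))

# Illustrative usage: left entropy ≈ 0.693, right entropy ≈ 1.04, so q ≈ 0.693.
print(degree_of_freedom("葡萄", "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"))
```

The value q is then compared against the degree-of-freedom threshold n described above.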
In an embodiment, the new word judging step S3 includes:
word selecting step S31: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
new word obtaining step S32: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
In a specific implementation, the vocabulary score is calculated by combining the cohesion and the degree of freedom: score = w1·p + w2·q, where w1 and w2 are the cohesion weight and the degree-of-freedom weight. A score threshold k is set, and when score > k the candidate is considered a word. The selected words are then checked against the word bank in sequence, and the words that do not exist in the word bank are the new words.
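By way of example and not limitation, the word selection and new word judgment can be sketched as follows; the weights w1 and w2, the score threshold and the example inputs are illustrative assumptions, since the description leaves their concrete values open:

```python
def select_new_words(candidates: dict, lexicon: set,
                     w1: float = 0.5, w2: float = 0.5,
                     score_threshold: float = 1.0) -> list:
    """candidates maps each candidate word to its (cohesion p, degree of freedom q).
    A candidate is kept as a word when score = w1*p + w2*q exceeds the threshold,
    and reported as a new word when it is not already in the word bank."""
    new_words = []
    for word, (p, q) in candidates.items():
        score = w1 * p + w2 * q
        if score > score_threshold and word not in lexicon:
            new_words.append(word)
    return new_words

# Illustrative usage with made-up cohesion/freedom values and a tiny word bank.
print(select_new_words({"葡萄": (120.0, 0.693), "的电": (0.3, 0.1)}, lexicon={"电影"}))
```

In practice the p and q values would come from the cohesion and degree of freedom calculations of steps S1 and S2 above.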
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a new word discovery system according to the present invention. As shown in fig. 2, the new word discovery system of the present invention, which is suitable for the above new word discovery method, includes:
the candidate word cohesion calculation unit 51: after candidate word frequency and split word frequency are calculated, candidate word cohesion is calculated according to the candidate word frequency and the split word frequency;
candidate word degree of freedom calculation unit 52: calculating left adjacent character information entropy and right adjacent character information entropy of the candidate words, and selecting information entropy with small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as candidate word freedom;
and the new word judging unit 53 calculates vocabulary scores according to the cohesion degree and the freedom degree of the candidate words, selects the vocabulary with the vocabulary score larger than the vocabulary score threshold value from the candidate words to obtain words, compares the words with the words in the word bank, and obtains new words according to the comparison result.
In an embodiment, the candidate word cohesion calculation unit 51 includes:
the candidate word obtaining module 511: after a word frequency threshold value and a word length threshold value are preset, extracting corpus fragments of which the occurrence frequency is greater than the word frequency threshold value and the word length is less than the word length threshold value from the corpus to obtain the candidate words;
candidate word ratio calculation module 512: after the candidate word frequency of the candidate word in the corpus is calculated, splitting the candidate word and calculating the split word frequency of the split candidate word in the corpus;
candidate word cohesion acquisition module 513: and calculating the cohesion degree of the candidate words according to the frequency of the candidate words and the frequency of the split words.
In an embodiment, the candidate word degree of freedom calculation unit 52 includes:
word set acquisition module 521: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
the information entropy calculation module 522: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word degree of freedom acquisition module 523: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
In an embodiment, the new word judgment unit 53 includes:
word selection module 531: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
the new word obtaining module 532: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
EXAMPLE III
Referring to fig. 3, this embodiment discloses an embodiment of an electronic device. The electronic device may include a processor 81 and a memory 82 storing computer program instructions.
Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 implements any of the new word discovery methods in the above embodiments by reading and executing computer program instructions stored in the memory 82.
In some of these embodiments, the electronic device may also include a communication interface 83 and a bus 80. As shown in fig. 3, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.
The communication interface 83 is used to implement communication between the modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also be used for data communication with external components, such as external devices, image/abnormal-data monitoring devices, a database, external storage, an image/abnormal-data monitoring workstation, and the like.
The bus 80 includes hardware, software, or both, and couples the components of the electronic device to one another. The bus 80 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example and not limitation, the bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. The bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The electronic device may connect to the new word discovery system to implement the method in conjunction with fig. 1.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
In summary, the invention does not depend on any existing word bank for extraction: based only on the common characteristics of words, all text fragments in a large-scale corpus that could form words are extracted, and the extracted words are then compared with the existing word bank to find the newly appeared words. The invention thereby at least solves the problems that the new word discovery process depends on an existing word bank, that the accuracy of new word discovery is low, and that the logic of the new word discovery method is weak.
The above-mentioned embodiments only express several embodiments of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.
Claims (10)
1. A method for discovering new words, comprising:
calculating the cohesion of the candidate words: after candidate word frequency and split word frequency are calculated, candidate word cohesion is calculated according to the candidate word frequency and the split word frequency;
calculating the freedom degree of the candidate words: calculating left adjacent character information entropy and right adjacent character information entropy of the candidate words, and selecting information entropy with small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as candidate word freedom;
and a new word judgment step, namely calculating vocabulary scores according to the cohesion degree and the freedom degree of the candidate words, selecting the vocabulary with the vocabulary score larger than a vocabulary score threshold value from the candidate words to obtain words, comparing the words with the words in a word bank, and then obtaining new words according to a comparison result.
2. The new word discovery method according to claim 1, wherein the candidate word cohesion calculation step includes:
a candidate word obtaining step: after a word frequency threshold value and a word length threshold value are preset, extracting corpus fragments of which the occurrence frequency is greater than the word frequency threshold value and the word length is less than the word length threshold value from the corpus to obtain the candidate words;
candidate word ratio calculation step: after the candidate word frequency of the candidate word in the corpus is calculated, splitting the candidate word and calculating the split word frequency of the split candidate word in the corpus;
acquiring the cohesion degree of the candidate words: and calculating the cohesion degree of the candidate words according to the frequency of the candidate words and the frequency of the split words.
3. The new word discovery method according to claim 1, wherein said candidate word degree of freedom calculation step includes:
a word set acquisition step: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
and (3) information entropy calculation: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word freedom degree obtaining step: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
4. The method according to claim 1, wherein the new word judgment step comprises:
selecting words: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
a new word obtaining step: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
5. A new word discovery system, adapted to the new word discovery method according to any one of claims 1 to 4, said new word discovery system comprising:
a candidate word cohesion calculation unit: after candidate word frequency and split word frequency are calculated, candidate word cohesion is calculated according to the candidate word frequency and the split word frequency;
a candidate word degree of freedom calculation unit: calculating left adjacent character information entropy and right adjacent character information entropy of the candidate words, and selecting information entropy with small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as candidate word freedom;
and the new word judgment unit is used for calculating vocabulary scores according to the cohesion degree and the freedom degree of the candidate words, selecting the vocabulary with the vocabulary score larger than a vocabulary score threshold value from the candidate words to obtain words, comparing the words with the words in the word bank, and obtaining new words according to the comparison result.
6. The new word discovery system according to claim 5, wherein the candidate word cohesion calculation unit includes:
a candidate word acquisition module: after a word frequency threshold value and a word length threshold value are preset, extracting corpus fragments of which the occurrence frequency is greater than the word frequency threshold value and the word length is less than the word length threshold value from the corpus to obtain the candidate words;
candidate word ratio calculation module: after the candidate word frequency of the candidate word in the corpus is calculated, splitting the candidate word and calculating the split word frequency of the split candidate word in the corpus;
candidate word cohesion acquisition module: and calculating the cohesion degree of the candidate words according to the frequency of the candidate words and the frequency of the split words.
7. The new word discovery system according to claim 6, wherein said candidate word degree of freedom calculation unit includes:
a word set acquisition module: summarizing characters appearing on the left and right of the candidate words into a left adjacent character set and a right adjacent character set;
the information entropy calculation module: calculating the left adjacent character information entropy and the right adjacent character information entropy of the candidate words;
candidate word degree of freedom acquisition module: and selecting the information entropy with a small information entropy value from the left adjacent character information entropy and the right adjacent character information entropy as the candidate word freedom.
8. The system according to claim 7, wherein the new word judgment unit includes:
the word selecting module: setting a candidate word cohesion degree weight and a candidate word freedom degree weight, calculating a word score according to the candidate word cohesion degree, the candidate word freedom degree, the candidate word cohesion degree weight and the candidate word freedom degree weight, setting a word score threshold value, and selecting words with the word score larger than the word score threshold value from the candidate words as words;
a new word acquisition module: and comparing the words with the words in the word stock, and judging the words as new words when the words are not in the word stock.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the new word discovery method of any one of claims 1 to 4 when executing the computer program.
10. An electronic device-readable storage medium having stored thereon computer program instructions which, when executed by the processor, implement the new word discovery method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110805642.9A CN113449082A (en) | 2021-07-16 | 2021-07-16 | New word discovery method, system, electronic device and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110805642.9A CN113449082A (en) | 2021-07-16 | 2021-07-16 | New word discovery method, system, electronic device and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113449082A true CN113449082A (en) | 2021-09-28 |
Family
ID=77816393
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110805642.9A Pending CN113449082A (en) | 2021-07-16 | 2021-07-16 | New word discovery method, system, electronic device and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449082A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114218938A (en) * | 2021-12-13 | 2022-03-22 | 北京智齿众服技术咨询有限公司 | Word segmentation method and device, electronic equipment and storage medium |
CN115034211A (en) * | 2022-05-19 | 2022-09-09 | 一点灵犀信息技术(广州)有限公司 | Unknown word discovery method and device, electronic equipment and storage medium |
CN117077670A (en) * | 2023-10-16 | 2023-11-17 | 深圳市东信时代信息技术有限公司 | New word determining method, device, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102930055A (en) * | 2012-11-18 | 2013-02-13 | 浙江大学 | New network word discovery method in combination with internal polymerization degree and external discrete information entropy |
CN105260362A (en) * | 2015-10-30 | 2016-01-20 | 小米科技有限责任公司 | New word extraction method and device |
CN108038119A (en) * | 2017-11-01 | 2018-05-15 | 平安科技(深圳)有限公司 | Utilize the method, apparatus and storage medium of new word discovery investment target |
CN109408818A (en) * | 2018-10-12 | 2019-03-01 | 平安科技(深圳)有限公司 | New word identification method, device, computer equipment and storage medium |
CN110110322A (en) * | 2019-03-29 | 2019-08-09 | 泰康保险集团股份有限公司 | Network new word discovery method, apparatus, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |