CN110929009B - Method and device for acquiring new words - Google Patents


Info

Publication number
CN110929009B
Authority
CN
China
Prior art keywords
word
words
library
corpus
segmented
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911162192.5A
Other languages
Chinese (zh)
Other versions
CN110929009A (en
Inventor
崔小波
陈奇宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN201911162192.5A priority Critical patent/CN110929009B/en
Publication of CN110929009A publication Critical patent/CN110929009A/en
Application granted granted Critical
Publication of CN110929009B publication Critical patent/CN110929009B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines

Abstract

The invention provides a method and a device for acquiring new words. The method comprises: segmenting a target corpus according to a preset first word length, screening the segmented words with a coagulability algorithm and a left-right entropy algorithm, and selecting the words in the target corpus that meet the requirements for new words to form a corpus word library; and acquiring words from the network, screening them to obtain new network words, and using the new network words to extract new words from the target corpus, so as to recover new words that were screened out by the coagulability algorithm and the left-right entropy algorithm and thereby find as many of the new words contained in the target corpus as possible. The method has high accuracy and is unlikely to miss new words.

Description

Method and device for acquiring new words
Technical Field
The invention relates to the technical field of new word expansion, in particular to a method and a device for acquiring new words.
Background
As society develops, new words and phrases continually emerge in daily life. According to statistics compiled by linguists, more than 800 new words have been produced each year on average since the reform and opening-up. These new words were mainly loanwords and spread through newspapers, television and other media. With the rise of the internet, individual creativity has gained many platforms for expression, so that even more new words are created and spread rapidly over the internet. Because of these new words, word segmentation results contain too many fragmented strings and are therefore wrong. Recent studies have shown that 60% of word segmentation errors are caused by new words. Existing new word acquisition methods generally segment a target corpus according to word length, calculate the external solidity of each word from the frequency of the segmented words, and select words with high external solidity as new words.
Disclosure of Invention
In view of this, the present invention provides a method and a device for acquiring new words, so as to improve the accuracy of the new words acquired from a target corpus and to reduce missed words.
In a first aspect, an embodiment of the present invention provides a method for obtaining a new word, including the following steps:
segmenting the target corpus according to a preset first word length to obtain a first segmentation word library;
screening the first segmentation word library based on a coagulability algorithm and a left-right entropy algorithm to obtain a corpus word library;
acquiring a new network word library contained in the network word library according to a currently captured network word library, a last captured network word library and a pre-stored local word library;
extracting words matched with the network new word library from the target corpus to obtain a corpus new word library;
and merging the corpus new word library and the corpus word library, and deleting from the merged word library the substring words contained in the corpus new word library, so as to obtain the new words contained in the target corpus, wherein a substring word is a word that is not itself contained in the corpus word library but whose character string is part of the character string of a word in the corpus word library.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where a reverse maximum matching segmentation algorithm is used to extract the words that match the network new word library from the target corpus.
With reference to the first aspect, a second possible implementation manner of the first aspect is provided in an embodiment of the present invention, where the obtaining a new network word library included in the network word library according to a currently captured network word library, a last captured network word library, and a pre-stored local word library includes:
acquiring network words in a currently captured network word library, and screening out network words with the frequency within a preset network word threshold value;
deleting the network words matched with the local word library from the screened network words to obtain filtered network words;
and deleting the filtered network words matched with the last captured network word library from the filtered network words to obtain the new network word library.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the first word length includes: word minimum length and word maximum length; the target corpus is segmented according to a preset first word length to obtain a first segmentation word library, and the method comprises the following steps:
segmenting the target corpus, and acquiring, from the segmented words obtained, the segmented words whose length is between the word minimum length and the word maximum length;
counting the frequency of the segmented words in the target corpus;
and constructing the first segmentation word library based on the segmentation words and the frequency of the segmentation words in the target corpus.
With reference to the first aspect and any one of the first to third possible implementation manners of the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where the screening of the first segmentation word library based on a coagulability algorithm and a left-right entropy algorithm to obtain a corpus word library includes:
calculating the degree of solidification of each segmented word in the first segmentation word library based on the frequency of the segmented word in the target corpus;
for each word length, extracting from the segmented words of that length those whose degree of solidification is within the solidification degree threshold corresponding to that word length, to obtain an initially screened segmented word library;
calculating the left and right entropies, in the target corpus, of each initially screened word in the initially screened segmented word library, and extracting the initially screened words whose left and right entropies are within a preset left-right entropy threshold, to obtain a rescreened segmented word library;
segmenting the target corpus in sequence according to a preset second word length to obtain a second segmentation word library;
extracting, from the second segmentation word library, the second segmented words matched with the rescreened segmented word library, and acquiring, from the extracted second segmented words, pairs of second segmented words that are adjacent in the target corpus and in which the character string at the tail of the preceding second segmented word is the same as the character string at the head of the following second segmented word;
combining each acquired preceding and following second segmented word to obtain a second segmentation combination word;
segmenting each second segmentation combination word according to the second word length to obtain a third segmentation word library corresponding to each second segmentation combination word;
comparing each third segmentation word library with the rescreened segmented word library, and if every third segmented word in a third segmentation word library is contained in the rescreened segmented word library, placing the second segmentation combination word corresponding to that third segmentation word library in a potential word library;
and merging the potential word library and the rescreened segmented word library to obtain the corpus word library.
With reference to the fourth possible implementation manner of the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where after the first segmentation word library is obtained, the method further includes:
comparing a preset stop word bank with the first segmentation word bank, and removing words matched with the stop word bank from the first segmentation word bank so as to update the first segmentation word bank;
after the corpus word bank is obtained, the method further comprises the following steps:
and comparing the corpus word library with a pre-stored common word library, and removing words matched with the common word library so as to update the corpus word library.
In a second aspect, an embodiment of the present invention further provides an apparatus for acquiring a new word, including:
the corpus length segmentation module is used for segmenting the target corpus according to a preset first word length to obtain a first segmentation word library;
the word operation screening module is used for screening the first segmentation word library based on a coagulability algorithm and a left-right entropy algorithm to obtain a corpus word library;
the network new word acquisition module is used for acquiring a network new word library contained in the network word library according to the currently captured network word library, the last captured network word library and a pre-stored local word library;
the network new word segmentation module is used for extracting words matched with the network new word library from the target corpus to obtain a corpus new word library;
and the merging statistical module is used for merging the corpus new word library and the corpus word library, and deleting from the merged word library the substring words contained in the corpus new word library, so as to obtain the new words contained in the target corpus, wherein a substring word is a word that is not itself contained in the corpus word library but whose character string is part of the character string of a word in the corpus word library.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation manner of the second aspect, where the network new word obtaining module includes:
the network word capturing unit is used for acquiring the network words in the currently captured network word library and screening out the network words with the frequency within a preset network word threshold value;
the local word screening unit is used for deleting the network words matched with the local word library from the screened network words to obtain filtered network words;
and the history new word screening unit is used for deleting the filtered network words matched with the last captured network word library from the filtered network words to obtain the network new word library.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, performs the steps of the method described above.
According to the method and the device for acquiring new words provided by the embodiments of the invention, the target corpus is segmented according to a preset first word length, the segmented words are screened with a coagulability algorithm and a left-right entropy algorithm, and the words in the target corpus that meet the requirements for new words are selected to form a corpus word library; words are acquired from the network and screened to obtain new network words, and the new network words are used to extract new words from the target corpus, so as to recover new words that were screened out by the coagulability algorithm and the left-right entropy algorithm and thereby find as many of the new words contained in the target corpus as possible. The method has high accuracy and is unlikely to miss new words.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of a method for acquiring new words according to an embodiment of the present invention;
fig. 2 is a schematic flow chart illustrating a process of segmenting a target corpus to obtain a first segmentation word corpus in the method according to the embodiment of the present invention;
fig. 3 is a schematic flow chart illustrating a process of screening the first segmented word corpus to obtain a corpus word corpus in the method according to the embodiment of the present invention;
fig. 4 is a schematic flow chart illustrating the process of extracting the network new word library in the method according to the embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a device for acquiring new words according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a method and a device for acquiring new words, which are described by the embodiment below.
As shown in fig. 1, the method for acquiring a new word provided in this embodiment includes the following steps:
S100: segmenting the target corpus according to a preset first word length to obtain a first segmentation word library;
S200: screening the first segmentation word library based on a coagulability algorithm and a left-right entropy algorithm to obtain a corpus word library;
S300: acquiring a network new word library contained in the network word library according to a currently captured network word library, a last captured network word library and a pre-stored local word library;
S400: extracting words matched with the network new word library from the target corpus to obtain a corpus new word library;
S500: merging the corpus new word library and the corpus word library, and deleting from the merged word library the substring words contained in the corpus new word library, so as to obtain the new words contained in the target corpus, wherein a substring word is a word that is not itself contained in the corpus word library but whose character string is part of the character string of a word in the corpus word library.
In S400, the words matched with the network new word library are extracted from the target corpus using a reverse maximum matching word segmentation algorithm. With this algorithm, words matching the network new word library can be screened from the target corpus quickly, which shortens the processing time.
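The reverse maximum matching used in S400 can be illustrated with a minimal Python sketch. The function name `reverse_max_match`, its parameters, and the choice to skip one character when no library word ends at the current position are assumptions made for illustration; the patent does not specify these details.

```python
def reverse_max_match(text, lexicon, max_len):
    """Scan text from right to left, always trying the longest candidate first.

    lexicon stands in for the network new word library (hypothetical name).
    Returns the matched words in their original left-to-right order.
    """
    found = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if piece in lexicon:
                found.append(piece)
                end -= size
                break
        else:
            # No library word ends at this position: skip one character.
            end -= 1
    found.reverse()
    return found


matches = reverse_max_match("xxabcyyde", {"abc", "bc", "de"}, max_len=3)
```

Because the longest candidate is tried first, "abc" is matched in preference to its suffix "bc" in this example.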
As shown in fig. 2, the first word length in this embodiment includes a word minimum length and a word maximum length, and S100, segmenting the target corpus according to a preset first word length to obtain a first segmentation word library, includes the following steps:
S101: segmenting the target corpus, and acquiring, from the segmented words obtained, the segmented words whose length is between the word minimum length and the word maximum length;
S102: counting the frequency of the segmented words in the target corpus;
S103: constructing the first segmentation word library based on the segmented words and the frequency of the segmented words in the target corpus.
In S100, the Nagao algorithm is used to segment the target corpus.
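Steps S101 to S103 can be sketched as follows. This is a brute-force substring count, not the Nagao algorithm named in the embodiment, and it ignores punctuation and sentence boundaries; the function name `build_first_library` and the default lengths are illustrative assumptions.

```python
from collections import Counter


def build_first_library(corpus, min_len=2, max_len=5):
    """Count every substring whose length lies in [min_len, max_len].

    The returned Counter maps each segmented word to its frequency in the
    corpus, playing the role of the first segmentation word library.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    return counts


library = build_first_library("abcabc", min_len=2, max_len=3)
```

A production implementation would more likely use sorted suffix structures, as the Nagao algorithm does, to avoid materializing every substring.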
As shown in fig. 3, in the present embodiment, S200, screening the first segmentation word library based on a coagulability algorithm and a left-right entropy algorithm to obtain a corpus word library, specifically includes the following steps:
S201: calculating the degree of solidification of each segmented word in the first segmentation word library based on the frequency of the segmented word in the target corpus;
S202: for each word length, extracting from the segmented words of that length those whose degree of solidification is within the solidification degree threshold corresponding to that word length, to obtain an initially screened segmented word library;
S203: calculating the left and right entropies, in the target corpus, of each initially screened word in the initially screened segmented word library, and extracting the initially screened words whose left and right entropies are within a preset left-right entropy threshold, to obtain a rescreened segmented word library;
S204: segmenting the target corpus in sequence according to a preset second word length to obtain a second segmentation word library;
S205: extracting, from the second segmentation word library, the second segmented words matched with the rescreened segmented word library, and acquiring, from the extracted second segmented words, pairs of second segmented words that are adjacent in the target corpus and in which the character string at the tail of the preceding second segmented word is the same as the character string at the head of the following second segmented word; combining each acquired preceding and following second segmented word to obtain a second segmentation combination word;
S206: segmenting each second segmentation combination word according to the second word length to obtain a third segmentation word library corresponding to each second segmentation combination word;
S207: comparing each third segmentation word library with the rescreened segmented word library, and if every third segmented word in a third segmentation word library is contained in the rescreened segmented word library, placing the second segmentation combination word corresponding to that third segmentation word library in a potential word library;
S208: merging the potential word library and the rescreened segmented word library to obtain the corpus word library.
In this embodiment, a corresponding third segmentation word library is obtained for each second segmentation combination word. In S202, the solidification degree thresholds corresponding to segmented words of different word lengths are different; that is, during extraction, the threshold that applies to a segmented word is selected according to its word length, and the segmented words of that length whose degree of solidification is within the threshold are extracted.
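One possible reading of the combination and back-check steps (S204 to S207) is sketched below for a single second word length n. The embodiment allows a range of lengths and does not fix how adjacency is detected, so the function name `potential_words`, the single-length simplification, and the overlap test by window position are all assumptions.

```python
def potential_words(corpus, rescreened, n):
    """Combine overlapping length-n windows that are both in `rescreened`.

    Two windows starting at i and j (i < j < i + n) overlap in the corpus,
    so the tail of the first equals the head of the second. Their
    combination corpus[i:j + n] is kept only if every length-n piece of it
    is also in the rescreened library (the back-check of S206 and S207).
    """
    hits = [i for i in range(len(corpus) - n + 1)
            if corpus[i:i + n] in rescreened]
    hit_set = set(hits)
    potential = set()
    for i in hits:
        for shift in range(1, n):
            j = i + shift
            if j not in hit_set:
                continue
            combined = corpus[i:j + n]
            pieces = [combined[k:k + n]
                      for k in range(len(combined) - n + 1)]
            if all(p in rescreened for p in pieces):
                potential.add(combined)
    return potential


accepted = potential_words("abcdxx", {"abc", "bcd"}, 3)
rejected = potential_words("abcdx", {"abc", "cdx"}, 3)
```

In the second call the combination "abcdx" is discarded because its middle piece "bcd" is not in the rescreened library, which is the kind of wrongly synthesized word the back-check is meant to prevent.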
In this embodiment, the degree of solidification of each segmented word is calculated from the frequency of the segmented word in the target corpus. Taking a three-character word abc as an example, the internal solidity of the segmented word is calculated as:
$$D(abc) = \min\left\{\frac{P(abc)}{P(ab)\,P(c)},\ \frac{P(abc)}{P(a)\,P(bc)}\right\}$$
where P(abc) is the probability of the three characters appearing together, P(ab) and P(bc) are the probabilities of the two characters appearing together, and P(a) and P(c) are the probabilities of a single character appearing alone.
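The formula can be generalized to a word of any length by taking the minimum over every split point, as in the sketch below. The probability estimate (occurrence count divided by corpus length) is an assumption, since the patent does not state how P is normalized; `ngram_probs` and `cohesion` are illustrative names.

```python
from collections import Counter


def ngram_probs(corpus, max_len):
    """Estimate P(s) as count(s) / len(corpus) for substrings up to max_len.

    This normalization is an assumption made for the sketch.
    """
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    total = len(corpus)
    return {gram: c / total for gram, c in counts.items()}


def cohesion(word, prob):
    """Minimum of P(word) / (P(left) * P(right)) over all split points."""
    return min(prob[word] / (prob[word[:k]] * prob[word[k:]])
               for k in range(1, len(word)))


prob = ngram_probs("abcabc", max_len=3)
d_abc = cohesion("abc", prob)
```

For a three-character word the two split points reproduce exactly the two ratios in D(abc).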
In this embodiment, the left entropy and the right entropy of each initially screened word in the initially screened segmented word library are calculated; the left entropy is denoted E(W_l) and the right entropy is denoted E(W_r). The left entropy is the entropy of the left words (the words immediately adjacent on the left), and the right entropy is the entropy of the right words (the words immediately adjacent on the right). The information entropy is calculated as:
$$E(W) = -\sum_{w \in A} P(w) \log P(w)$$
where w denotes a left word or a right word, A is the de-duplicated set of all left words (or right words), and P(w) is the probability of the left word (or right word); w is a left word when the left entropy is calculated and a right word when the right entropy is calculated. For example, to calculate the left entropy of "national grid", the left word list T of "national grid" is: {congratulatory, announcement, presence, and, holding, presence, congratulatory, right, and issue, ...}. T lists every occurrence of a left word, i.e. every left word that appears is added to T without de-duplication. The left word probability distribution is calculated as:
$$P(w) = \frac{C(w)}{S}$$
where C(w) is the frequency of the word w in T, and S is the total number of words in T. Thus, when the left entropy of "national grid" is calculated, P(w) is calculated for each word in A and substituted into the information entropy formula. The right entropy is calculated in the same way as the left entropy: all right words are counted to obtain a right word list, from which the right entropy of the character string is calculated.
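A minimal sketch of the left and right entropy calculation follows. It treats the single character on each side of an occurrence as the left or right word and uses the natural logarithm; both choices are assumptions, as are the function names.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Entropy of a neighbor list T, with P(w) = C(w) / S."""
    counts = Counter(neighbors)
    total = len(neighbors)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def left_right_entropy(corpus, word):
    """Collect the neighbors of every occurrence of word, then score them."""
    left, right = [], []
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left.append(corpus[start - 1])
        end = start + len(word)
        if end < len(corpus):
            right.append(corpus[end])
        start = corpus.find(word, start + 1)
    return neighbor_entropy(left), neighbor_entropy(right)


e_left, e_right = left_right_entropy("xabyabzab", "ab")
```

A string whose neighbors vary widely on both sides gets high entropies, which suggests it behaves as an independent word rather than a fragment of a longer one.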
As shown in fig. 4, in the present embodiment, S300, acquiring a network new word library contained in the network word library according to a currently captured network word library, a last captured network word library and a pre-stored local word library, includes the following steps:
S301: acquiring network words in the currently captured network word library, and screening out the network words whose frequency is within a preset network word threshold;
S302: deleting the network words matched with the local word library from the screened network words to obtain filtered network words;
S303: deleting the filtered network words matched with the last captured network word library from the filtered network words to obtain the network new word library.
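Steps S301 to S303 amount to a frequency filter followed by two set differences, as the sketch below shows. Treating the "network word threshold" as an inclusive frequency range is an assumption, and the function and parameter names are illustrative.

```python
def network_new_words(current, previous, local, min_freq, max_freq):
    """Return the words that are new in the current network capture.

    current  : dict mapping each captured network word to its frequency
    previous : set of words from the last capture
    local    : set of words in the pre-stored local word library
    """
    # S301: keep words whose frequency is within the assumed range.
    screened = {w for w, f in current.items() if min_freq <= f <= max_freq}
    # S302: drop words already known locally.
    filtered = screened - local
    # S303: drop words already seen in the last capture.
    return filtered - previous


new_words = network_new_words(
    current={"a": 10, "b": 1, "c": 10, "d": 10},
    previous={"d"},
    local={"c"},
    min_freq=5,
    max_freq=100,
)
```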
In this embodiment, the network term library is a public network term library, and the selection of the network term library may be selected by a person skilled in the art according to actual needs, which is not described in detail herein.
In the embodiment, network words are captured, and then the local word bank and the last captured network words are respectively compared with the captured network words so as to select new words in the network words to form a network new word bank, and the words in the screened network new word bank are ensured to be network new words through two comparisons. The time interval between the current capture and the last capture is set by a person skilled in the art according to actual needs, generally one month, and certainly may be any time interval such as one week, two weeks, two months, or one year.
In this embodiment, after the first segmentation word library is obtained, the method further includes: comparing a preset stop word bank with the first segmentation word library, and removing words matched with the stop word bank from the first segmentation word library so as to update the first segmentation word library. Updating the first segmentation word library with the stop word bank removes these common words from it, reduces the amount of computation in the coagulability algorithm and the left-right entropy algorithm, and helps speed up the acquisition of new words.
In this embodiment, after the corpus word library is obtained, the method further includes: comparing the corpus word library with a pre-stored common word library, and removing words matched with the common word library so as to update the corpus word library. Because the common words are filtered after the corpus word library is obtained, rather than directly from the first segmentation word library, the common words among the second segmentation combination words are also removed, which ensures the accuracy of the new words finally obtained.
In this embodiment, the stop word bank is defined as follows: in information retrieval and Chinese text processing, some characters or words are treated as occurring independently rather than as part of a word; these characters or words are called stop words, and the word bank that collects them is the stop word bank. Filtering the words against the stop word bank removes the noise words among them (e.g., yes, and then, vice versa, that is, o, etc.).
In this embodiment, the first word length and the second word length have the same word minimum length and word maximum length. The word minimum length is usually 2, and the word maximum length can be set to any number greater than 2 according to actual needs; for example, when the word maximum length is 5, the first word length and the second word length both take the values 2, 3, 4 and 5. The word minimum length and the word maximum length may be set by those skilled in the art according to actual needs, provided that the word maximum length is not smaller than the word minimum length, and are not described in detail here.
Screening with the coagulability algorithm (internal solidification degree) and the left-right entropy algorithm, together with screening against the stop word bank and the pre-stored common word bank, ensures the accuracy of the selected words; this repeated screening from several directions ensures that the screened words are all new words. In addition, the target corpus is segmented again, and the resulting segmented words are screened and combined using the new words already selected by the coagulability algorithm and the left-right entropy algorithm, which prevents new words from being missed because of the strict solidification degree and left-right entropy requirements. The combined new words are then checked in reverse to prevent wrongly synthesized new words. Finally, in addition to the frequency-based screening of the target corpus, the target corpus is segmented using the network new words, and the corpus new word library obtained with the network new word library is merged with the corpus word library. During merging the substring words are removed, where a substring word is not a word in the corpus word library but its character string is part of the character string of some word in the corpus word library. This further improves the accuracy of the new words acquired from the target corpus.
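The merging step with substring removal (S500) can be sketched as below, under the reading that a substring word is a word found via the network new word library that is absent from the corpus word library yet occurs inside one of its words. The function name `merge_libraries` is illustrative, and this reading is an interpretation of the text rather than a confirmed implementation.

```python
def merge_libraries(corpus_new_words, corpus_words):
    """Merge both libraries, then drop the substring words.

    A word is dropped if it comes from corpus_new_words, is not itself in
    corpus_words, and is contained in some word of corpus_words.
    """
    merged = set(corpus_new_words) | set(corpus_words)
    substring_words = {
        w for w in corpus_new_words
        if w not in corpus_words and any(w in longer for longer in corpus_words)
    }
    return merged - substring_words


result = merge_libraries({"ab", "xyz"}, {"abc", "pq"})
```

Here "ab" is removed because it is only a fragment of "abc", while "xyz" is kept because no word in the corpus word library contains it.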
As shown in fig. 5, an embodiment of the present application further provides an apparatus for acquiring a new word, including:
the corpus length segmentation module 601 is configured to segment a target corpus according to a preset first word length to obtain a first segmentation word library;
a first segmentation word library updating module 611, configured to compare a preset stop word library with the first segmentation word library, and remove words matching the stop word library from the first segmentation word library to update the first segmentation word library;
the word operation screening module 621 is configured to screen the first segmentation word library based on a coagulability (degree-of-solidification) algorithm and a left-right entropy algorithm to obtain a corpus word library;
a corpus word library updating module 631, configured to compare the corpus word library with a pre-stored common word library, and remove words matched with the common word library to update the corpus word library;
a network new word acquisition module 641, configured to obtain a network new word library included in the network word library according to the currently captured network word library, the last captured network word library, and a local word library stored in advance;
a network new word segmentation module 651, configured to extract, from the target corpus, words matched with the network new word library, so as to obtain a corpus new word library;
a merging statistics module 661, configured to merge the corpus new word library and the corpus word library, and delete a substring word included in the corpus new word library from the merged word library, where the substring word is not included in the corpus word library but a character string corresponding to the substring word is included in the corpus word library, so as to obtain a new word included in the target corpus.
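The merging step of module 661 may be sketched, by way of example only, as follows. The function name `merge_libraries` and the use of Python sets for the word libraries are assumptions for illustration; the substring test follows the definition given above, namely that a substring word is absent from the corpus word library but is contained in the character string of some word in that library.

```python
def merge_libraries(corpus_new_words, corpus_words):
    """Merge the two libraries and drop substring words.

    A word from corpus_new_words is dropped when it is not itself in
    corpus_words but occurs inside some word of corpus_words.
    """
    corpus_words = set(corpus_words)

    def is_substring_word(word):
        return word not in corpus_words and any(word in w for w in corpus_words)

    kept = {w for w in corpus_new_words if not is_substring_word(w)}
    return corpus_words | kept
```

For instance, with a corpus word library of {"abcd", "xy"} and a corpus new word library of {"bc", "xy", "zz"}, the word "bc" is removed because it only appears as part of "abcd", while "xy" and "zz" are retained.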
As shown in fig. 5, the network new word acquisition module 641 in this embodiment includes:
the network word capturing unit is used for acquiring the network words in the currently captured network word library and selecting the network words whose frequency is within a preset network word threshold value;
the local word screening unit is used for deleting the network words matched with the local word library from the selected network words to obtain filtered network words;
and the history new word screening unit is used for deleting the filtered network words matched with the last captured network word library from the filtered network words to obtain the network new word library.
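The three units above may be illustrated with the following minimal sketch. It assumes that the currently captured network word library is available as a mapping from word to frequency, and that a frequency "within" the network word threshold means a frequency not lower than `min_freq`; both the data layout and the name `network_new_words` are hypothetical.

```python
def network_new_words(current_capture, last_capture, local_words, min_freq=5):
    """Return network words that are frequent, not local, and not seen last time.

    current_capture: dict mapping each captured network word to its frequency.
    last_capture:    iterable of words from the previous capture.
    local_words:     iterable of words in the pre-stored local word library.
    """
    # Network word capturing unit: keep words meeting the frequency threshold.
    frequent = {w for w, freq in current_capture.items() if freq >= min_freq}
    # Local word screening unit: drop words already in the local word library.
    filtered = frequent - set(local_words)
    # History new word screening unit: drop words present in the last capture.
    return filtered - set(last_capture)
```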
As shown in fig. 5, the corpus length segmentation module 601 in this embodiment includes:
the first word segmentation unit is used for segmenting the target corpus and acquiring segmented words with word lengths between the minimum word length and the maximum word length from the segmented words obtained by segmentation;
the frequency calculation unit is used for counting the frequency of the segmented words in the target corpus; and constructing the first segmentation word library based on the segmentation words and the frequency of the segmentation words in the target corpus.
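By way of example, the first word segmentation unit and the frequency calculation unit may be sketched as below. The sketch assumes that segmentation is performed with a sliding window over the characters of the target corpus, which is one possible reading of the text; the function name `build_first_segmentation_library` and its default lengths are illustrative only.

```python
from collections import Counter


def build_first_segmentation_library(corpus, min_len=2, max_len=5):
    """Map each segmented word of length min_len..max_len to its frequency."""
    if max_len < min_len:
        raise ValueError("max_len must not be smaller than min_len")
    library = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of n characters over the corpus and count each word.
        for i in range(len(corpus) - n + 1):
            library[corpus[i:i + n]] += 1
    return library
```

The returned mapping pairs each segmented word with its frequency in the target corpus, which is the information the later degree-of-solidification calculation relies on.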
As shown in fig. 5, the word operation filtering module 621 in this embodiment includes:
the degree of solidification calculation unit is used for calculating the degree of solidification of each segmented word in the first segmentation word library based on the frequency of the segmented word in the target corpus;
the degree of solidification screening unit is used for extracting, according to the word length of each segmented word, the segmented words whose degree of solidification is within the degree-of-solidification threshold value corresponding to that word length, to obtain an initially screened segmented word library;
the left-right entropy calculation unit is used for calculating the left-right entropy, in the target corpus, of each initially screened segmented word in the initially screened segmented word library;
the left-right entropy screening unit is used for extracting the initially screened segmented words whose left-right entropy is within a preset left-right entropy threshold value to obtain a rescreened segmented word library;
the second word segmentation unit is used for sequentially segmenting the target corpus according to a preset second word length to obtain a second segmentation word library; extracting second segmented words matched with the rescreened segmented word library from the second segmentation word library, and acquiring, from the extracted second segmented words, pairs of second segmented words which are adjacent in the target corpus and for which the character string at the tail of the preceding second segmented word is the same as the character string at the head of the following second segmented word; merging each acquired pair of preceding and following second segmented words to obtain a second segmentation combination word;
the backward-pushing detection unit is used for segmenting each second segmentation combination word according to the second word length to obtain a third segmentation word library corresponding to each second segmentation combination word; comparing each third segmentation word library with the rescreened segmented word library respectively, and if every third segmented word in a third segmentation word library is contained in the rescreened segmented word library, placing the second segmentation combination word corresponding to that third segmentation word library in a potential word library;
and the corpus word merging unit is used for merging the potential word library and the rescreened segmented word library to obtain the corpus word library.
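The cooperation of the second word segmentation unit, the backward-pushing detection unit and the corpus word merging unit may be illustrated with the following sketch. It assumes that two matched words are "adjacent" when the second one starts inside the first and ends beyond it in the target corpus, so that the tail of the first equals the head of the second, and that the backward check segments a combination word into all pieces of the configured lengths that are shorter than the combination word itself. These interpretations and the names `merge_overlapping`, `backward_check` and `build_corpus_word_library` are assumptions, not details confirmed by this application.

```python
def merge_overlapping(corpus, rescreened, lengths):
    """Merge pairs of rescreened words that overlap at adjacent positions."""
    # Locate every occurrence (start, end) of a rescreened word of a given length.
    hits = [
        (i, i + n)
        for n in lengths
        for i in range(len(corpus) - n + 1)
        if corpus[i:i + n] in rescreened
    ]
    merged = set()
    for s1, e1 in hits:
        for s2, e2 in hits:
            # The second word starts inside the first and ends after it,
            # so corpus[s2:e1] is both the tail of one and the head of the other.
            if s1 < s2 < e1 < e2:
                merged.add(corpus[s1:e2])
    return merged


def backward_check(combo, rescreened, lengths):
    """True if every shorter piece of the combination word was rescreened."""
    parts = [
        combo[i:i + n]
        for n in lengths
        if n < len(combo)
        for i in range(len(combo) - n + 1)
    ]
    return bool(parts) and all(p in rescreened for p in parts)


def build_corpus_word_library(corpus, rescreened, lengths):
    """Union of the rescreened words and the combinations that pass the check."""
    potential = {
        c for c in merge_overlapping(corpus, rescreened, lengths)
        if backward_check(c, rescreened, lengths)
    }
    return potential | set(rescreened)
```

The pairwise loop is quadratic in the number of matches and is written for clarity rather than efficiency.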
As shown in fig. 5, in the corpus length segmentation module of this embodiment, the first word segmentation unit is electrically connected to the frequency calculation unit. In the word operation screening module, the degree of solidification calculation unit is electrically connected to the degree of solidification screening unit; the left-right entropy calculation unit is electrically connected to the left-right entropy screening unit; the second word segmentation unit and the corpus word merging unit are each electrically connected to the backward-pushing detection unit; the degree of solidification screening unit is also electrically connected to the left-right entropy calculation unit; and the left-right entropy screening unit is also electrically connected to the second word segmentation unit. In the network new word acquisition module, the network word capturing unit and the history new word screening unit are each electrically connected to the local word screening unit. Between modules, the frequency calculation unit of the corpus length segmentation module and the degree of solidification calculation unit of the word operation screening module are each electrically connected to the first segmentation word library updating module; the history new word screening unit of the network new word acquisition module is electrically connected to the network new word segmentation module; the corpus word merging unit of the word operation screening module is electrically connected to the corpus word library updating module; and the network new word segmentation module and the corpus word library updating module are each electrically connected to the merging statistics module.
As shown in fig. 6, an embodiment of the present application provides a computer device 700 for executing the method for obtaining a new word in fig. 1. The device includes a memory 701, a processor 702, and a computer program stored on the memory 701 and executable on the processor 702, where the processor 702 implements the steps of the method for obtaining a new word when executing the computer program.
Specifically, the memory 701 and the processor 702 can be general-purpose memory and processor, which are not limited in particular, and when the processor 702 runs the computer program stored in the memory 701, the method for obtaining a new word can be performed.
Corresponding to the method for acquiring new words in fig. 1, the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the method for acquiring new words.
In particular, the storage medium can be a general-purpose storage medium, such as a removable disk, a hard disk, or the like, and when the computer program on the storage medium is executed, the method for acquiring the new word can be executed.
In the embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described system embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and there may be other divisions in actual implementation, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of systems or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solutions of the present application, or the portions thereof that contribute to the prior art, may in essence be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures, and moreover, the terms "first," "second," "third," etc. are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate the technical solutions of the present application rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present application, they can still modify or change the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A method for obtaining new words is characterized by comprising the following steps:
segmenting the target corpus according to a preset first word length to obtain a first segmentation word library;
screening the first segmentation word library based on a coagulability algorithm and a left-right entropy algorithm to obtain a corpus word library;
acquiring a network new word library contained in the network word library according to a currently captured network word library, a last captured network word library and a pre-stored local word library;
extracting words matched with the network new word library from the target corpus to obtain a corpus new word library;
merging the corpus new word library and the corpus word library, and deleting the substring words contained in the corpus new word library from the merged word library, wherein the substring words are not contained in the corpus word library but the character strings corresponding to the substring words are contained in the corpus word library, so as to obtain the new words contained in the target corpus;
wherein the screening of the first segmentation word library based on the coagulability algorithm and the left-right entropy algorithm to obtain the corpus word library comprises:
calculating the degree of solidification of each segmented word in the first segmentation word library based on the frequency of the segmented word in the target corpus;
extracting, according to the word length of each segmented word, the segmented words whose degree of solidification is within the degree-of-solidification threshold value corresponding to that word length, to obtain an initially screened segmented word library;
calculating the left-right entropy, in the target corpus, of each initially screened segmented word in the initially screened segmented word library, and extracting the initially screened segmented words whose left-right entropy is within a preset left-right entropy threshold value to obtain a rescreened segmented word library;
segmenting the target corpus in sequence according to a preset second word length to obtain a second segmented word library;
extracting second segmented words matched with the rescreened segmented word library from the second segmented word library, and acquiring, from the extracted second segmented words, pairs of second segmented words which are adjacent in the target corpus and for which the character string at the tail of the preceding second segmented word is the same as the character string at the head of the following second segmented word;
merging the obtained front and back second segmentation words to obtain a second segmentation combination word;
segmenting each second segmentation combination word according to the second word length to obtain a third segmentation word library corresponding to each second segmentation combination word;
comparing each third segmented word library with the rescreened segmented word library respectively, and if every third segmented word in a third segmented word library is contained in the rescreened segmented word library, placing the second segmentation combination word corresponding to that third segmented word library in a potential word library;
and merging the potential word library and the rescreened segmented word library to obtain the corpus word library.
2. The method according to claim 1, wherein a reverse maximum matching segmentation algorithm is used to extract, from the target corpus, the words matched with the network new word library.
3. The method according to claim 1, wherein said acquiring a network new word library contained in the network word library according to a currently captured network word library, a last captured network word library and a pre-stored local word library comprises:
acquiring network words in the currently captured network word library, and selecting the network words whose frequency is within a preset network word threshold value;
deleting the network words matched with the local word library from the selected network words to obtain filtered network words;
and deleting the filtered network words matched with the last captured network word library from the filtered network words to obtain the network new word library.
4. The method of claim 1, wherein the first word length comprises: word minimum length and word maximum length; the target corpus is segmented according to a preset first word length to obtain a first segmentation word library, and the method comprises the following steps:
segmenting the target corpus, and acquiring segmented words with the word length between the minimum word length and the maximum word length from segmented words obtained by segmentation;
counting the frequency of the segmented words in the target corpus;
and constructing the first segmentation word library based on the segmentation words and the frequency of the segmentation words in the target corpus.
5. The method of claim 1, wherein after said obtaining of the first segmentation word library, the method further comprises:
comparing a preset stop word library with the first segmentation word library, and removing words matched with the stop word library from the first segmentation word library so as to update the first segmentation word library;
after the obtaining of the corpus word library, further comprising:
and comparing the corpus word library with a pre-stored common word library, and removing words matched with the common word library so as to update the corpus word library.
6. An apparatus for obtaining new words, comprising:
the corpus length segmentation module is used for segmenting the target corpus according to a preset first word length to obtain a first segmentation word library;
the word operation screening module is used for screening the first segmentation word library based on a coagulability algorithm and a left-right entropy algorithm to obtain a corpus word library;
the network new word acquisition module is used for acquiring a network new word library contained in the network word library according to the currently captured network word library, the last captured network word library and a pre-stored local word library;
the network new word segmentation module is used for extracting words matched with the network new word library from the target corpus to obtain a corpus new word library;
a merging statistics module, configured to merge the corpus new word library and the corpus word library, and delete a substring word included in the corpus new word library from the merged word library, where the substring word is not included in the corpus word library but a character string corresponding to the substring word is included in the corpus word library, so as to obtain a new word included in the target corpus;
wherein, when screening the first segmentation word library based on the coagulability algorithm and the left-right entropy algorithm to obtain the corpus word library, the word operation screening module is specifically used for:
calculating the degree of solidification of each segmented word in the first segmentation word library based on the frequency of the segmented word in the target corpus;
extracting, according to the word length of each segmented word, the segmented words whose degree of solidification is within the degree-of-solidification threshold value corresponding to that word length, to obtain an initially screened segmented word library;
calculating the left-right entropy, in the target corpus, of each initially screened segmented word in the initially screened segmented word library, and extracting the initially screened segmented words whose left-right entropy is within a preset left-right entropy threshold value to obtain a rescreened segmented word library;
segmenting the target corpus in sequence according to a preset second word length to obtain a second segmentation word library;
extracting second segmented words matched with the rescreened segmented word library from the second segmentation word library, and acquiring, from the extracted second segmented words, pairs of second segmented words which are adjacent in the target corpus and for which the character string at the tail of the preceding second segmented word is the same as the character string at the head of the following second segmented word;
merging the obtained front and back second segmentation words to obtain a second segmentation combination word;
segmenting each second segmentation combination word according to the second word length to obtain a third segmentation word library corresponding to each second segmentation combination word;
comparing each third segmented word library with the rescreened segmented word library respectively, and if every third segmented word in a third segmented word library is contained in the rescreened segmented word library, placing the second segmentation combination word corresponding to that third segmented word library in a potential word library;
and merging the potential word library and the rescreened segmented word library to obtain the corpus word library.
7. The apparatus of claim 6, wherein the network new word obtaining module comprises:
the network word capturing unit is used for acquiring the network words in the currently captured network word library and screening out the network words with the frequency within a preset network word threshold value;
the local word screening unit is used for deleting the network words matched with the local word library from the screened network words to obtain filtered network words;
and the history new word screening unit is used for deleting the filtered network words matched with the last captured network word library from the filtered network words to obtain the network new word library.
8. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the method for obtaining new words as claimed in any one of claims 1 to 5.
9. A computer-readable storage medium, having stored thereon a computer program for performing, when being executed by a processor, the steps of the method for obtaining new words according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911162192.5A CN110929009B (en) 2019-11-25 2019-11-25 Method and device for acquiring new words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911162192.5A CN110929009B (en) 2019-11-25 2019-11-25 Method and device for acquiring new words

Publications (2)

Publication Number Publication Date
CN110929009A CN110929009A (en) 2020-03-27
CN110929009B true CN110929009B (en) 2023-04-07

Family

ID=69850811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911162192.5A Active CN110929009B (en) 2019-11-25 2019-11-25 Method and device for acquiring new words

Country Status (1)

Country Link
CN (1) CN110929009B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114218938A (en) * 2021-12-13 2022-03-22 北京智齿众服技术咨询有限公司 Word segmentation method and device, electronic equipment and storage medium
CN115034211B (en) * 2022-05-19 2023-04-18 一点灵犀信息技术(广州)有限公司 Unknown word discovery method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635296A (en) * 2018-12-08 2019-04-16 广州荔支网络技术有限公司 Neologisms method for digging, device computer equipment and storage medium
CN110110322A (en) * 2019-03-29 2019-08-09 泰康保险集团股份有限公司 Network new word discovery method, apparatus, electronic equipment and storage medium
CN110222157A (en) * 2019-06-20 2019-09-10 贵州电网有限责任公司 A kind of new word discovery method based on mass text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wu Zhuang (武装). Network Public Opinion Analysis in the Big Data Era (《大数据时代的网络舆情分析》). Beijing Institute of Technology Press (北京理工大学出版社), 2018, pp. 217-218. *

Also Published As

Publication number Publication date
CN110929009A (en) 2020-03-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant