WO2020073523A1 - New word recognition method and apparatus, computer device, and computer readable storage medium - Google Patents


Info

Publication number
WO2020073523A1
WO2020073523A1 · PCT/CN2018/124797 · CN2018124797W
Authority
WO
WIPO (PCT)
Prior art keywords
word
candidate
preset
new
words
Prior art date
Application number
PCT/CN2018/124797
Other languages
French (fr)
Chinese (zh)
Inventor
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020073523A1 publication Critical patent/WO2020073523A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to the technical field of natural language processing, and in particular, to a new word recognition method, device, computer equipment, and computer-readable storage medium.
  • Chinese word segmentation is a foundational technology of current NLP (Natural Language Processing) projects, and its accuracy directly affects a project's final performance.
  • New word discovery has a direct impact on the accuracy of a word segmentation system. In traditional new word discovery, the text is segmented first, and the leftover fragments that fail to match the lexicon are guessed to be new words; but segmentation accuracy depends on the completeness of the lexicon, so this approach to new word discovery performs poorly.
  • The embodiments of the present application provide a new word recognition method, device, computer equipment, and computer-readable storage medium, which can solve the problem of the poor performance of new word discovery in traditional technologies.
  • An embodiment of the present application provides a new word recognition method, including: obtaining a text corpus, and dividing the text corpus into candidate words with a length of 2 to N through N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2, and a candidate word is a text segment obtained by dividing the text corpus; determining whether the candidate word meets a preset condition; if the candidate word meets the preset condition, determining the candidate word as a candidate new word; determining whether the candidate new word is included in a preset thesaurus; and, if the candidate new word is not included in the preset thesaurus, determining the candidate new word as a new word.
  • An embodiment of the present application further provides a new word recognition device, including: a segmentation unit, configured to obtain a text corpus and segment it into candidate words with a length of 2 to N through N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2, a candidate word being a text segment obtained by segmenting the text corpus; a judgment unit, configured to judge whether a candidate word meets a preset condition; a first recognition unit, configured to determine the candidate word as a candidate new word if it meets the preset condition; a filtering unit, configured to determine whether the candidate new word is included in a preset thesaurus; and a second recognition unit, configured to determine the candidate new word as a new word if it is not included in the preset thesaurus.
  • An embodiment of the present application further provides a computer device, which includes a memory and a processor; a computer program is stored in the memory, and the processor implements the new word recognition method when executing the computer program.
  • An embodiment of the present application further provides a computer-readable storage medium that stores a computer program which, when executed by a processor, causes the processor to execute the new word recognition method.
  • FIG. 1 is a schematic diagram of an application scenario of a new word recognition method provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a new word recognition method provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a new word recognition method provided by another embodiment of this application.
  • FIG. 4 is a schematic block diagram of a new word recognition device provided by an embodiment of this application.
  • FIG. 5 is a schematic block diagram of a new word recognition device provided by another embodiment of this application.
  • FIG. 6 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of an application scenario of a new word recognition method provided by an embodiment of the present application.
  • the application scenarios include:
  • The computer device shown in FIG. 1 is a device for recognizing new words; an application for recognizing new words is installed on it, and a user operates the computer device.
  • the computer device may be an electronic device such as a notebook computer, a tablet computer, a desktop computer, or a server.
  • The working process of the subjects in FIG. 1 is as follows: a user uses the computer device, on which a new word recognition application is installed, for new word recognition. The computer device obtains a text corpus and divides it into candidate words with a length of 2 to N through N-ary segmentation according to preset sentence endpoints; it determines whether each candidate word meets a preset condition and, if so, determines the candidate word as a candidate new word; it then determines whether the candidate new word is included in a preset thesaurus and, if not, determines it as a new word. The computer device displays the recognition result to the user, completing the recognition of new words in the text corpus.
  • FIG. 1 only illustrates a desktop computer as a computer device.
  • the type of computer device is not limited to that shown in FIG. 1.
  • the computer device may also be an electronic device such as a notebook computer or tablet computer.
  • the application scenario of the above new word recognition method is only used to explain the technical solution of the present application, and is not used to limit the technical solution of the present application.
  • FIG. 2 is a schematic flowchart of a new word recognition method provided by an embodiment of the present application.
  • the new word recognition method is applied to the terminal in FIG. 1 to complete all or part of the functions of the new word recognition method.
  • As shown in FIG. 2, the method includes the following steps S210-S250:
  • S210: Obtain a text corpus, and divide the text corpus into candidate words with a length of 2 to N through N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2, and a candidate word is a text segment obtained by segmenting the text corpus.
  • A new word is defined as follows: given a piece of text, take any segment at random; if the segment has independent meaning and is not included in an existing thesaurus or dictionary, that is, it is not a known word, the segment is judged to be a new word.
  • Whether a segment of the text is a new word can be judged as follows: if the segment expresses a complete meaning and its internal composition is very fixed, that is, the segment often appears as a fixed whole, the segment can be judged to be a word; if that word does not exist in the existing dictionary, the segment is a new word.
  • N-ary segmentation refers to sequentially dividing the text corpus into segments of N adjacent Chinese characters, each resulting text segment containing N characters. 2-ary segmentation sequentially divides the corpus into segments of two adjacent characters, 3-ary segmentation into segments of three adjacent characters, and so on.
  • For example, 2-ary segmentation of the text corpus "我是一个人" ("I am a person") yields the fragments "我是", "是一", "一个" and "个人", while 3-ary segmentation yields "我是一", "是一个" and "一个人".
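The 2-ary and 3-ary segmentation described above can be sketched in a few lines (a minimal illustration under our own function names, not the patent's actual implementation):

```python
def ngrams(text, n):
    """All contiguous fragments of exactly n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def n_ary_segmentation(text, max_n):
    """N-ary segmentation: every fragment of length 2..max_n."""
    out = []
    for n in range(2, max_n + 1):
        out.extend(ngrams(text, n))
    return out

# 2-ary segmentation of "我是一个人" gives 我是 / 是一 / 一个 / 个人;
# 3-ary segmentation gives 我是一 / 是一个 / 一个人.
```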
  • A text corpus is the language material on which new word recognition is performed.
  • the text corpus may be a piece of text, an article, a web page of a website, or a book.
  • the text corpus may be an electronic book or text stored in a mobile memory, a computer device, or the Internet, for example, text saved in Word format, or a web page of a designated website.
  • the candidate word refers to a text segment obtained by segmenting the text corpus. After segmenting the text corpus according to the preset sentence endpoints, multiple text fragments may be obtained.
  • the text fragments may be words or may not be words, and need to be filtered according to preset conditions to determine whether they are words.
  • If a text segment satisfies the preset condition, it is determined to be a word; if it does not, it is determined not to be a word. Since a text segment is in a candidate state for becoming a word, it is called a candidate word.
  • Candidate words of length 2 to N are candidate words containing 2, 3, 4, ..., N Chinese characters. For example, candidate words of length 2 to 5 have lengths of 2, 3, 4, and 5 respectively, that is, they contain 2, 3, 4, or 5 Chinese characters.
  • the preset sentence endpoint refers to setting the word boundaries of the candidate words in advance, and using these word boundaries as endpoints to segment the text corpus to obtain candidate words.
  • the preset sentence endpoint includes a punctuation mark and a preset segmentation endpoint.
  • the preset segmentation endpoint refers to a component of the text corpus that is previously set as a sentence endpoint except for punctuation.
  • The fixed components with independent meaning in the corpus are used as word boundaries for segmenting the text corpus, and serve as artificially designated sentence endpoints.
  • the text corpus generally includes text, punctuation marks, carriage returns, spaces and other components.
  • Punctuation marks are the symbols that punctuate sentences in the text corpus once a complete meaning has been expressed; they act as pauses that break sentences, such as the comma, semicolon, double quotation mark, and period.
  • The preset segmentation endpoints include, besides punctuation, symbols in the text corpus that have a pausing or sentence-breaking function, such as space characters and carriage returns, as well as pause words and stop words with independent semantics.
  • A pause word is a word used for a pause in the text corpus, and a stop word is a word in the text that can likewise be filtered out as a boundary.
  • Pause words and stop words generally have independent meanings. For example, commonly used pause words include words such as "you", "me", "her" and "de", while stop words include words such as "we", "based", "said" and "act".
  • the preset segmentation endpoints can be regarded as an extension of punctuation marks. Punctuation symbols are generally used to form sentence breaks between sentences.
  • the preset segmentation endpoints can be regarded as the formation of pauses or sentence breaks between sentence components within sentences.
  • Word boundaries can be identified as endpoints of sentences like punctuation marks.
  • In specific implementation, common words and common phrases are used as independent features in segmenting the text: pause words and stop words are used to split the text corpus and obtain the left and right boundaries of candidate words. Function words such as "我" ("I"), and fixed, inseparable words with independent semantics such as "we", "you" and "these", serve as the left or right word boundaries of candidate words. Segmenting through preset segmentation endpoints in this way can effectively improve the accuracy of new word discovery and the efficiency of new word recognition.
  • In specific implementation, the text corpus in which new words are to be recognized is obtained; the text corpus may be a piece of text, an article, a web page of a website, or a book.
  • The text corpus is then segmented according to preset sentence endpoints, including punctuation marks, spaces, carriage returns, pause words and stop words, and is divided by N-ary segmentation into candidate words with a length of 2 to N, where N is a natural number and N ≥ 2. For example, if N is 5, the text corpus is divided into candidate words of lengths 2, 3, 4, and 5, that is, candidate words of two, three, four, and five characters.
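Step S210 can be sketched as follows. The tiny endpoint inventory and the sample strings are our own illustrative assumptions; a real system would use a full list of punctuation marks, pause words, and stop words:

```python
import re

# Hypothetical, tiny endpoint inventory for illustration only.
PUNCTUATION = "，。；：！？、\n "
STOP_WORDS = ["我们", "的"]

def split_by_endpoints(corpus):
    """Cut the corpus at every preset sentence endpoint."""
    pattern = "|".join([re.escape(w) for w in STOP_WORDS] +
                       ["[" + re.escape(PUNCTUATION) + "]"])
    return [seg for seg in re.split(pattern, corpus) if seg]

def candidate_words(corpus, max_n):
    """Candidates of length 2..max_n; fragments never cross an endpoint."""
    cands = set()
    for seg in split_by_endpoints(corpus):
        for n in range(2, max_n + 1):
            for i in range(len(seg) - n + 1):
                cands.add(seg[i:i + n])
    return cands
```

Because the corpus is cut at the endpoints first, no candidate ever spans a punctuation mark or a stop word, which is what shrinks the candidate set compared with raw N-gram extraction.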
  • N needs to be set according to the specific text: some idioms run to seven or eight characters, and some company names are even longer, so a specific value of N must be chosen for each text corpus. Further, for the same text corpus, different values of N can be set and the recognition results compared; removing the results that the runs share isolates the long-grained new words, and taking those long-grained results into account gives better new word recognition.
  • The preset condition is the condition for identifying a candidate word as a candidate new word. If the candidate word meets the preset condition, it is identified as a word and determined to be a candidate new word; if it does not, it is identified as not a word and judged not to be a candidate new word.
  • Specifically, the candidate word satisfies first preset thresholds for word frequency, mutual information, and left-right information entropy respectively, or satisfies second preset thresholds for word frequency, mutual information, and sentence endpoints respectively.
  • the computer device divides the obtained text corpus to obtain text fragments as candidate words.
  • Some candidate words are words, and some are not words. Therefore, it is necessary to filter the candidate words using preset conditions.
  • Text fragments that cannot be words are filtered out, and those that can be words are retained for further recognition. Therefore, whether a candidate word satisfies the preset condition determines whether it can become a word and hence a candidate new word.
  • If the candidate word satisfies the preset condition, step S230 is entered. If the candidate word does not satisfy the preset condition, it cannot be a word, step S221 is entered, and the candidate word is filtered out and discarded, which further narrows the scope of new word recognition and improves its efficiency and accuracy.
  • the candidate new word refers to a candidate word recognized as a word.
  • the text corpus is divided into 2-N candidate words, some of which cannot be words.
  • Among the candidate words obtained by segmentation there are fragments that, by human judgment based on experience, obviously cannot become words. Therefore, the obtained candidate words need to be filtered according to the preset condition, and a candidate word that passes is identified as a candidate new word; a candidate new word is a candidate word recognized as a word. This filters out the text fragments that cannot become words and narrows the scope of new word recognition.
  • In specific implementation, the computer device filters the obtained candidate words: candidate words that cannot be words are removed through the preset conditions, and only candidate words that can be words are kept for further new word recognition. If a candidate word satisfies the preset condition, it is identified as a word and determined to be a candidate new word; if it does not, it is not identified as a word and is judged not to be a candidate new word. This further narrows the scope of new word recognition and improves its accuracy, efficiency, and recall.
  • For example, the preset threshold of the minimum left-right information entropy of a candidate word is set to 1, the preset threshold of the minimum mutual information to 1, and the preset threshold of the sentence endpoints to 3.
  • A minimum left-right information entropy threshold of 1 means that the smaller of the candidate word's left-neighbor and right-neighbor information entropies must be at least 1.
  • The sentence endpoint threshold refers to the number of occurrences of sentence endpoints at the left or right boundary of the candidate word. Please refer to Table 1: after recognizing a corpus, the results are as shown in Table 1. According to the above criteria, "Nanshan District", "Nanshan", and "Puhui" are recognized as candidate new words, while "Go to Nanshan" is not a word and is excluded from the candidate new words.
  • S240 Determine whether the candidate new word is included in the preset vocabulary.
  • the preset thesaurus can also be called an existing thesaurus, which refers to a set of known words that have been determined as words, and can be a preset dictionary.
  • In specific implementation, the computer device divides the text corpus into candidate words with a length of 2 to N through N-ary segmentation, a candidate word being a text segment obtained by segmenting the text corpus. If a candidate word satisfies the preset condition, it is determined to be a candidate new word; however, this only selects, from the candidate words, the text fragments that can become words.
  • the candidate new words include words that have been confirmed as words in the natural language processing technology and newly recognized words. Therefore, the words that have been confirmed as words need to be filtered out and screened out as recognized new words.
  • the preset thesaurus contains words that have been confirmed as words in natural language processing technology.
  • The preset thesaurus may be any of the various dictionaries existing in traditional technology, or a manually assembled thesaurus, for example the union of several existing dictionaries; the preset thesaurus may also contain new words that have been recognized in the past.
  • If the candidate new word can be matched in the preset thesaurus, it is judged to be included in the preset thesaurus; the candidate new word is then a known word, and step S241 is entered to filter out the existing word. If the candidate new word cannot be matched in the preset thesaurus, it is judged not to be included in the preset thesaurus; the candidate new word is then an unknown word, a recognized new word, and step S250 is entered.
  • A recognized new word is generally a word that has not been seen before, which completes the recognition of new words in the given text corpus. For example, referring to Table 1, if the identified candidate new word "Puhui" is not included in the preset lexicon, "Puhui" is judged to be a recognized new word.
  • The embodiments of the present application belong to natural language processing within speech and semantics technology. The text corpus is accurately segmented through N-ary segmentation combined with preset sentence endpoints to obtain candidate words with a length of 2 to N, without depending on any existing thesaurus: all possible text fragments are extracted from a large-scale corpus, based on the common characteristics of words, as candidate words. Using the preset sentence endpoints as independent features and as word boundaries for segmenting the text corpus reduces the number of candidate words and improves the accuracy and efficiency of segmentation; whether each candidate word meets the preset conditions is then identified.
  • If a candidate word meets the preset conditions, it is recognized as a candidate new word with independent semantics, which narrows the scope of new word recognition. All extracted candidate new words are then compared with the existing thesaurus, and those not included in the preset thesaurus are recognized as new words. Screening out the candidate new words not included in the thesaurus as the recognized new words can effectively improve the accuracy, efficiency, and recall of new word discovery.
  • the step of determining whether the candidate word meets a preset condition includes:
  • The left-right information entropy is the smaller of the left-neighbor information entropy and the right-neighbor information entropy of the candidate word.
  • If the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left-right information entropy, it is determined that the candidate word meets the preset condition.
  • mutual information refers to the internal cohesion of the candidate words, and may also be referred to as the degree of internal coagulation or the degree of cohesion of the candidate words.
  • The formula for mutual information is:

    MI(w) = log( p(w) / ( p(l) · p(r) ) )

    where w represents the candidate word, p(x) is the probability that a string x appears in the entire corpus, l represents the left character string that constitutes the candidate word, and r represents the right character string that constitutes the candidate word.
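A sketch of computing mutual information from this formula. The formula as stated uses a single left/right split; taking the minimum over all possible splits, and using a base-2 logarithm, are conventions we assume here, and the probability table in the comment is purely illustrative:

```python
import math

def mutual_information(word, prob):
    """Cohesion of `word`: log2(p(w) / (p(l) * p(r))), minimised over
    every way of splitting the word into a left string l and a right
    string r. `prob` maps strings to their corpus probabilities.
    The log base only rescales the threshold."""
    return min(
        math.log2(prob[word] / (prob[word[:i]] * prob[word[i:]]))
        for i in range(1, len(word))
    )

# Illustrative numbers: if p("南山") = 4e-4 and p("南") = p("山") = 1e-2,
# then MI = log2(4e-4 / 1e-4) = 2 bits.
```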
  • For example, the most cohesive candidate words are words such as "bat", "spider", "wandering", "uneasy", and "rose" (in the original Chinese, two-character words): almost every character in these words always appears together with the other character and is hardly ever used on other occasions.
  • Information entropy here measures the degree of freedom of a candidate word, that is, the richness of its left or right neighbors. The information entropy of a candidate word is proportional to the number of its left or right neighbor words: if the candidate word can be matched with more left or right neighbor words, its information entropy is larger; if it can be matched with fewer, its information entropy is smaller.
  • the information entropy of a candidate word can also be called left and right information entropy.
  • the left and right information entropy of a candidate word, that is, the degree of freedom of the candidate word is defined as the smaller value of the information entropy of its left neighbor word and right neighbor word.
  • The left-neighbor information entropy, also called the left information entropy, is the richness of the left neighbors of the candidate word, that is, of the characters that can appear on its left side. The formula for the left information entropy is:

    H_L(W) = − Σ_a p(a|W) · log p(a|W)

    where W represents the candidate word, a ranges over the characters appearing on the left of the candidate word, and p(a|W) is the conditional probability that character a appears on the left of W. A conditional probability p(A|B) is the probability that event A occurs given that another event B has occurred.
  • The right-neighbor information entropy, also called the right information entropy, is the richness of the right neighbors of the candidate word, that is, of the characters that can appear on its right side. The formula for the right information entropy is:

    H_R(W) = − Σ_b p(b|W) · log p(b|W)

    where W represents the candidate word, b ranges over the characters appearing on the right of the candidate word, and p(b|W) is the probability that character b appears on the right of W.
  • A word can appear in many environments and so has a very rich set of left and right neighbors; this degree of freedom is expressed by information entropy, which measures how much information, on average, learning the outcome of an event brings: if an outcome has probability p, the amount of information gained on learning it occurred is defined as −log(p). Information entropy thus measures how random the set of left or right neighbors of a candidate word is. For example, in "eating grapes without spitting grape skins, not eating grapes yet spitting grape skins" (吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮), the word "grape" (葡萄) appears four times.
  • Its left neighboring characters are {eat, spit, eat, spit}, and its right neighboring characters are {not, skin, instead, skin}.
  • The information entropy of the left neighbors of "grape" is −(1/2)·log(1/2) − (1/2)·log(1/2) ≈ 0.693,
  • and the information entropy of its right neighbors is −(1/2)·log(1/2) − (1/4)·log(1/4) − (1/4)·log(1/4) ≈ 1.04. It can be seen that, in this sentence, the right neighbors of "grape" are more varied.
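The grape example can be checked directly in code (a small sketch; the natural logarithm reproduces the ≈ 0.693 and ≈ 1.04 values above, and the neighbor multisets are transcribed from the sentence):

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of a multiset of neighbor characters."""
    total = len(neighbors)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(neighbors).values())

# Neighbors of "grape" across its four occurrences in the tongue-twister.
left = ["eat", "spit", "eat", "spit"]
right = ["not", "skin", "instead", "skin"]

left_h = neighbor_entropy(left)    # two outcomes at 1/2 each
right_h = neighbor_entropy(right)  # outcomes at 1/2, 1/4, 1/4
freedom = min(left_h, right_h)     # the word's degree of freedom
```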
  • Word frequency (Term Frequency, abbreviated TF) is the number of times a given word appears in the text corpus; the importance of a word increases proportionally with its frequency in the document.
  • In specific implementation, the number of occurrences of each candidate word is counted. For example, in a corpus of 24 million words, "Nanshan" appeared a total of 2,774 times, so the word frequency of "Nanshan" is 2,774; the word "region" appeared 4,797 times, so the word frequency of "region" is 4,797.
  • The parameters that reflect the word boundary information of a candidate word are its sentence endpoints and its left-right information entropy. Since both reflect word boundary information, they play the same role in recognizing candidate words, and in new word recognition it suffices for either one to satisfy its condition.
  • Take as an example the first preset condition under which a candidate word is determined to be a candidate new word: the word frequency of the candidate word satisfies the first preset threshold of word frequency, the mutual information satisfies the first preset threshold of mutual information, and the left-right information entropy satisfies the first preset threshold of left-right information entropy.
  • the first preset threshold value of the word frequency of the candidate word is 10
  • the first preset threshold value of the mutual information of the candidate word is 1
  • the first preset threshold value of the left and right information entropy of the candidate word is 1.
  • the divided candidate words satisfy the first preset thresholds of the word frequency, mutual information, and left and right information entropy, respectively, which means that the word frequency of the candidate word is greater than or equal to 10, the mutual information is greater than or equal to 1, and the left and right information entropy is greater than or equal to 1.
  • If these conditions are met, the candidate word is judged to be a candidate new word; alternatively, if the mutual information of a segmented candidate word is greater than 1 and its right information entropy is greater than 1, the candidate word can also be determined to be a candidate new word.
  • In Table 1, since the mutual information and the left-right information entropy of "Nanshan District" and "Nanshan" are both greater than 1, "Nanshan District" and "Nanshan" are identified as candidate new words.
  • The larger the first preset thresholds of word frequency, mutual information, and left-right information entropy are set, the more accurate the identified candidate new words; the smaller these thresholds are set, the more candidate new words are identified.
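The first preset condition can be sketched as a simple filter, using the example thresholds given above (10 for word frequency, 1 for mutual information, 1 for the left-right information entropy); the per-word statistics fed to it are hypothetical stand-ins for entries like those in Table 1:

```python
def meets_first_condition(freq, mi, left_h, right_h,
                          min_freq=10, min_mi=1.0, min_entropy=1.0):
    """First preset condition: word frequency, mutual information and
    the smaller of the left/right information entropies must each
    reach their threshold (defaults follow the example thresholds)."""
    return (freq >= min_freq and
            mi >= min_mi and
            min(left_h, right_h) >= min_entropy)
```

Raising the three thresholds makes the surviving candidates more accurate; lowering them lets more candidates through, exactly the trade-off described above.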
  • the step of determining whether the candidate word meets a preset condition includes:
  • the number of sentence endpoints refers to the number of left endpoints of the candidate words or the number of right endpoints of the candidate words
  • the number of left endpoints refers to the number of occurrences of the left endpoint of the candidate word
  • the number of right endpoints refers to the number of occurrences of the right endpoint of the candidate word
  • The endpoints of a candidate word are its left and right neighboring word boundaries, where a word boundary is the edge, or dividing line, of a word; through these dividing lines the text corpus is divided into different candidate words.
  • the left end point of the candidate word refers to the left neighbor word boundary of the candidate word
  • the right end point of the candidate word refers to the right neighbor word boundary of the candidate word.
  • the number of left and right end points refers to the number of occurrences of the left and right word boundaries of the candidate word respectively.
  • Word boundaries include punctuation marks as well as the spaces, carriage returns, pause words, and stop words included in the preset segmentation endpoints.
  • The preset sentence endpoints serving as word boundaries in the text corpus may be replaced with a unified identifier. If the word boundaries are replaced with a unified identifier, the number of left endpoints is the number of identifiers appearing immediately to the left of the candidate word, and the number of right endpoints is the number of identifiers appearing immediately to the right of the candidate word. For example, suppose the unified identifier is "*" and the corpus is: "The movie theater is a venue for showing movies to the audience. With the progress and development of movies, movie theaters built specifically for screening movies appeared. The shape, size, proportion and acoustic technology of the movie theater have changed a lot. The movie theater must meet the technical requirements for film screening." After every sentence endpoint is replaced with "*", counting the identifiers adjacent to each occurrence of the candidate word "movie theater" shows that its left endpoint appears 3 times and its right endpoint appears 4 times.
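The endpoint counting in this example can be sketched as follows (the function name is hypothetical; the boundary identifier "*" follows the example above):

```python
import re

def count_endpoints(candidate, text, boundary="*"):
    """Count how many occurrences of the candidate word are flanked by
    the unified boundary identifier on the left and on the right."""
    left = right = 0
    for m in re.finditer(re.escape(candidate), text):
        if m.start() > 0 and text[m.start() - 1] == boundary:
            left += 1
        if m.end() < len(text) and text[m.end()] == boundary:
            right += 1
    return left, right
```

In the toy string `"*cinema*show*cinema*"`, for instance, the candidate "cinema" has 2 left endpoints and 2 right endpoints.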
  • Stop words, spaces, carriage returns, and punctuation marks included in the preset segmentation endpoints are used as the left and right word boundaries of the candidate words.
  • Through word boundary statistics, the numbers of occurrences of the left and right endpoints of each candidate word are counted.
  • By refining the granularity of new word recognition through statistics on the endpoints of candidate words, low-frequency and long-granularity new words can be effectively found, which can effectively improve the efficiency and accuracy of new word recognition.
  • The sentence endpoints of a candidate word and its left and right information entropy both reflect the word boundary information of the candidate word.
  • The sentence endpoints and the left and right information entropy play the same role in the recognition of the candidate word, so it suffices to satisfy either one of the two conditions.
  • If the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints, it is determined that the candidate word meets the preset condition and the candidate word is determined as a candidate new word. Take the following as an example.
  • the second preset threshold value of the word frequency of the candidate word is 10
  • the second preset threshold value of the lowest left and right information entropy of the candidate word is set to 1
  • the first preset threshold value of the sentence endpoint is 3.
  • Setting the second preset threshold of the lowest left and right information entropy to 1 means that the smaller of the candidate word's left-neighbour information entropy and right-neighbour information entropy must reach 1.
  • The first preset threshold of the number of sentence endpoints is a threshold on the number of occurrences of sentence endpoints at the left or right boundary of the candidate word. After recognizing a corpus, the result is shown in Table 1.
  • That the segmented candidate words meet the second preset threshold of mutual information and the first preset threshold of the number of sentence endpoints respectively means that the mutual information of the candidate word is greater than 1 and the number of occurrences of its sentence endpoints is greater than 3. If the mutual information of a segmented candidate word is greater than 1 and the number of occurrences of its sentence endpoints is greater than 3, the candidate word is judged to be a candidate new word. Please continue to refer to Table 1. In Table 1, since the mutual information and the number of sentence endpoints of "Nanshan District", "Nanshan" and "Puhui" all meet the thresholds, "Nanshan District", "Nanshan" and "Puhui" are judged to be candidate new words.
  • FIG. 3 is a schematic flowchart of a new word recognition method according to another embodiment of the present application.
  • the step of determining whether the candidate word meets a preset condition includes:
  • S212: Determine whether the word frequency, mutual information, and left and right information entropy of the candidate word are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left and right information entropy, or whether the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints;
  • If the word frequency, mutual information, and left and right information entropy of the candidate word are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left and right information entropy, or if the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints, it is determined that the candidate word meets the preset condition.
  • The first preset threshold of word frequency and the second preset threshold of word frequency may be the same, and likewise the first preset threshold of mutual information and the second preset threshold of mutual information may be the same.
  • If the computer device determines that the candidate word satisfies the preset condition because its word frequency, mutual information, and left and right information entropy are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left and right information entropy, the candidate word is identified as a first candidate new word. Likewise, if the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints, the candidate word is identified as a second candidate new word. Taking the union of the first candidate new words and the second candidate new words as the final candidate new words can improve the accuracy of candidate new word recognition, and the process goes to step S230 to further identify the candidate new words. Otherwise, if the candidate word meets neither set of conditions, the process goes to step S221: the candidate word cannot form a word and is discarded.
  • For example, under the first set of conditions, "Nanshan District" and "Nanshan" are identified as candidate new words, while "Go to Nanshan" is not a word and is excluded; under the second set of conditions, "Nanshan District", "Nanshan" and "Puhui" are identified. Combining the two identifies "Nanshan District", "Nanshan" and "Puhui" as candidate new words, thus avoiding the failure to identify "Puhui" as a candidate new word and improving the accuracy of new word recognition.
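A sketch of combining the two recognition routes by taking their union. The statistics and threshold values below are hypothetical illustrations in the spirit of the example; they are not Table 1's actual figures, and the names are invented for the sketch.

```python
def final_candidates(stats, freq_t=10, mi_t=1.0, entropy_t=1.0, endpoint_t=3):
    """stats maps each candidate word to (freq, mi, lr_entropy, endpoints).
    Route 1 checks frequency, mutual information, and left/right entropy;
    route 2 checks frequency, mutual information, and sentence endpoints;
    the final candidate new words are the union of the two routes."""
    route1 = {w for w, (f, mi, h, _) in stats.items()
              if f >= freq_t and mi >= mi_t and h >= entropy_t}
    route2 = {w for w, (f, mi, _, e) in stats.items()
              if f >= freq_t and mi >= mi_t and e >= endpoint_t}
    return route1 | route2

# hypothetical statistics: (frequency, mutual information, entropy, endpoints)
stats = {
    "Nanshan District": (120, 2.3, 1.5, 5),
    "Nanshan":          (300, 1.8, 1.2, 7),
    "Puhui":            (40,  1.5, 0.5, 6),   # missed by route 1, caught by route 2
    "Go to Nanshan":    (15,  0.4, 0.2, 1),   # rejected by both routes
}
```

With these illustrative numbers, "Puhui" fails the entropy route but passes the endpoint route, so the union still recognizes it, mirroring the example above.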
  • The numbers of occurrences of the left and right endpoints of the candidate words are counted. Because this refines the granularity of new word recognition, statistics on the endpoints of candidate words can effectively find low-frequency and long-granularity new words, which can effectively improve the efficiency and accuracy of new word recognition.
  • The larger the preset threshold of word frequency, the preset threshold of mutual information, the preset threshold of left and right information entropy, and the preset threshold of the number of sentence endpoints are set, the more accurate the recognition of candidate new words is; the smaller these preset thresholds are set, the more candidate new words are identified.
  • The step of dividing the text corpus into candidate words with a length of 2-N by N-ary segmentation according to the preset sentence endpoints, where N is a natural number and N ≥ 2, further includes: replacing the preset sentence endpoints in the text corpus with a unified identifier.
  • Replacing the preset sentence endpoints in the text corpus with a unified identifier refers to replacing the punctuation marks included in the preset sentence endpoints and the preset segmentation endpoints, including stop words, carriage returns, and spaces, with the unified identifier. For example, taking stop words, spaces, and carriage returns as the preset segmentation endpoints, the spaces, carriage returns, stop words, and punctuation marks are all replaced with "*".
  • The replaced text is divided into candidate words with a length of 2-N by N-ary segmentation, and the number of occurrences of each candidate word is counted. For example, in a corpus of 24 million characters, "Nanshan" appears a total of 2774 times, and the word "District" appears 4797 times.
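The replacement and counting steps can be sketched together as below. The boundary pattern, the identifier "*", and the function name are illustrative assumptions, not the patent's prescribed implementation.

```python
import re
from collections import Counter

def ngram_counts(corpus, n_max=4, boundary=r"[，。！？、,.!?\s]+"):
    """Replace preset sentence endpoints with the unified identifier '*',
    then count every fragment of 2..n_max characters that does not cross
    a boundary."""
    text = re.sub(boundary, "*", corpus)
    counts = Counter()
    for segment in text.split("*"):
        for n in range(2, n_max + 1):
            for i in range(len(segment) - n + 1):
                counts[segment[i:i + n]] += 1
    return counts
```

Because candidates never cross the identifier, a fragment such as the one spanning a comma is never counted, while a recurring fragment such as "南山" accumulates a count across segments.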
  • the step of determining the candidate new word as a new word further includes:
  • S260 Obtain the length of the new word, and determine whether the length of the new word is greater than or equal to a preset length threshold;
  • the length of the new word refers to the number of characters contained in the new word.
  • For example, the Chinese word for "movie theater" contains three characters, so its length is 3.
  • the preset length threshold refers to a preset length threshold of words.
  • the preset length threshold can be set manually.
  • A long-granularity new word is a recognized new word whose number of characters is greater than or equal to the preset length threshold. For example, if the preset length threshold is 5 and a recognized new word contains five or more characters, the new word is identified as a long-granularity new word. For long-granularity new words, corresponding processing can be performed according to their attributes.
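A minimal sketch of the length check; the threshold value 5 follows the example above, and the function name is hypothetical.

```python
def is_long_granularity(new_word, length_threshold=5):
    """A new word is long-granularity when it contains at least
    length_threshold characters."""
    return len(new_word) >= length_threshold
```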
  • the step of determining the candidate new word as a new word further includes:
  • S270 Obtain the word frequency of the new word, and determine whether the word frequency of the new word is lower than a preset word frequency threshold;
  • the low-frequency new words refer to the recognized new words whose word frequency in the text corpus is lower than a preset word frequency threshold.
  • If the word frequency of the new word is lower than the preset word frequency threshold, the new word is a low-frequency new word. Since low-frequency new words are new words that are not commonly used, when recognizing new words one can, depending on the text corpus, choose whether to include low-frequency new words in the preset lexicon. Choosing not to include low-frequency new words in the preset lexicon reduces the size of the preset lexicon, improves the matching efficiency between the new word recognition process and the preset lexicon, and thus improves the efficiency of new word recognition.
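The optional filtering of low-frequency new words before lexicon inclusion might look like this (the threshold value and all names are illustrative assumptions):

```python
def words_for_lexicon(new_words, freqs, freq_threshold=3, keep_low_frequency=False):
    """Decide which recognized new words to add to the preset lexicon.
    Excluding low-frequency new words keeps the lexicon small and speeds
    up later matching against it."""
    low = [w for w in new_words if freqs[w] < freq_threshold]
    high = [w for w in new_words if freqs[w] >= freq_threshold]
    return high + low if keep_low_frequency else high
```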
  • FIG. 4 is a schematic block diagram of a new word recognition apparatus provided by an embodiment of the present application.
  • an embodiment of the present application further provides a new word recognition device.
  • the new word recognition device includes a unit for performing the above new word recognition method, and the device can be configured in a desktop computer or other computer equipment.
  • the new word recognition device 400 includes a segmentation unit 401, a judgment unit 402, a first recognition unit 403, a filtering unit 404 and a second recognition unit 405.
  • the segmentation unit 401 is used to obtain a text corpus, and divide the text corpus into candidate words with a length of 2-N by N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ⁇ 2 ,
  • the candidate word refers to a text segment obtained by segmenting the text corpus;
  • the judging unit 402 is used to judge whether the candidate word meets a preset condition
  • the first recognition unit 403 is configured to determine the candidate word as a candidate new word if the candidate word meets the preset condition
  • the filtering unit 404 is used to determine whether the candidate new word is included in the preset thesaurus.
  • the second recognition unit 405 is configured to determine the candidate new word as a new word if the candidate new word is not included in the preset thesaurus.
  • the preset sentence endpoint includes a punctuation mark and a preset segmentation endpoint
  • the preset segmentation endpoint refers to a component of the text corpus that is previously set as a sentence endpoint except for punctuation marks.
  • FIG. 5 is a schematic block diagram of a new word recognition apparatus provided by another embodiment of the present application.
  • the judgment unit 402 includes:
  • The first obtaining subunit 4021 is configured to obtain the mutual information and the left and right information entropy of the candidate word, and to obtain the word frequency of the candidate word, where the left and right information entropy refers to the smaller of the candidate word's left-neighbour information entropy and right-neighbour information entropy;
  • The first judgment subunit 4022 is used to judge whether the word frequency, mutual information, and left and right information entropy of the candidate word are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left and right information entropy;
  • The first determination subunit 4023 is used to determine that the candidate word meets the preset condition if the word frequency, mutual information, and left and right information entropy of the candidate word are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left and right information entropy.
  • the judgment unit 402 includes:
  • the second obtaining subunit is used to obtain mutual information of the candidate words, and obtain the word frequency of the candidate words and the number of sentence endpoints of the candidate words.
  • The number of sentence endpoints refers to the number of left endpoints or the number of right endpoints of the candidate word; the number of left endpoints is the number of times the left endpoint of the candidate word appears, and the number of right endpoints is the number of times the right endpoint of the candidate word appears;
  • The second judgment subunit is used to judge whether the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints;
  • The second determination subunit is configured to determine that the candidate word meets the preset condition if the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints.
  • the device 400 further includes:
  • the replacement unit 406 is configured to replace the preset sentence endpoint in the text corpus with a unified identifier.
  • the device 400 further includes:
  • the third obtaining unit 407 is configured to obtain the length of the new word and determine whether the length of the new word is greater than or equal to a preset length threshold;
  • the third recognition unit 408 is configured to recognize that the new word is a long-grained new word if the length of the new word is greater than or equal to the preset length threshold.
  • the device 400 further includes:
  • the fourth obtaining unit 409 is configured to obtain the word frequency of the new word and determine whether the word frequency of the new word is lower than a preset word frequency threshold;
  • the fourth recognition unit 410 is configured to recognize the new word as a low-frequency new word if the word frequency of the new word is lower than the preset word frequency threshold.
  • The division of the units in the above new word recognition device is for illustration only. In other embodiments, the new word recognition device may be divided into different units as needed, or the units of the new word recognition device may adopt different connection sequences and methods to complete all or part of the functions of the above new word recognition device.
  • the above new word recognition device may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 6.
  • FIG. 6 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 600 may be an electronic device such as a desktop computer or a tablet computer, or may be a component or part in other devices.
  • the computer device 600 includes a processor 602, a memory, and a network interface 605 connected through a system bus 601, where the memory may include a non-volatile storage medium 603 and an internal memory 604.
  • the non-volatile storage medium 603 can store an operating system 6031 and a computer program 6032.
  • When executed, the computer program 6032 may cause the processor 602 to execute the above new word recognition method.
  • the processor 602 is used to provide computing and control capabilities to support the operation of the entire computer device 600.
  • the internal memory 604 provides an environment for the operation of the computer program 6032 in the non-volatile storage medium 603.
  • the processor 602 can execute the above-mentioned new word recognition method.
  • the network interface 605 is used for network communication with other devices.
  • the structure shown in FIG. 6 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 600 to which the solution of the present application is applied.
  • the specific computer device 600 may include more or less components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are the same as those in the embodiment shown in FIG. 6, which will not be repeated here.
  • the processor 602 is used to run the computer program 6032 stored in the memory, so as to implement the new word recognition method of the embodiment of the present application.
  • The processor 602 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • The computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of the new word recognition method described in the foregoing embodiments.
  • The computer-readable storage medium may be any storage medium that can store a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.


Abstract

A new word recognition method and apparatus, a computer device, and a computer readable storage medium. The method comprises: obtaining a text corpus, and segmenting the text corpus into candidate words having a length of 2-N by means of N-ary segmentation according to a preset sentence endpoint, N being a natural number and N >= 2 (S210); determining whether a candidate word satisfies a preset condition (S220); determining the candidate word as a candidate new word if the candidate word satisfies the preset condition (S230); determining whether the candidate new word is comprised in a preset vocabulary (S240); and determining the candidate new word as a new word if the candidate new word is not comprised in the preset vocabulary (S250).

Description

New word recognition method, device, computer equipment and computer readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 12, 2018, with application number 201811191755.9 and entitled "New word recognition method, device, computer equipment and storage medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the technical field of natural language processing, and in particular to a new word recognition method, device, computer equipment, and computer-readable storage medium.
Background
Chinese word segmentation is a basic technology of current NLP (Natural Language Processing) projects, and its accuracy directly affects the final performance of an NLP project. New word discovery has a direct impact on the accuracy of a word segmentation system. In traditional new word discovery technology, the text is usually segmented first, and the remaining fragments that fail to match are then guessed to be new words; however, the accuracy of word segmentation depends on the completeness of the lexicon, so the effect of new word discovery is poor.
Summary of the invention
The embodiments of the present application provide a new word recognition method, device, computer equipment, and computer-readable storage medium, which can solve the problem in traditional technology that the effect of new word discovery is poor.
In a first aspect, an embodiment of the present application provides a new word recognition method, including: obtaining a text corpus, and dividing the text corpus into candidate words with a length of 2-N by N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2, and a candidate word refers to a text segment obtained by segmenting the text corpus; judging whether the candidate word meets a preset condition; if the candidate word meets the preset condition, determining the candidate word as a candidate new word; judging whether the candidate new word is included in a preset lexicon; and if the candidate new word is not included in the preset lexicon, determining the candidate new word as a new word.
In a second aspect, an embodiment of the present application further provides a new word recognition device, including: a segmentation unit, configured to obtain a text corpus and divide the text corpus into candidate words with a length of 2-N by N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2, and a candidate word refers to a text segment obtained by segmenting the text corpus; a judgment unit, configured to judge whether the candidate word meets a preset condition; a first recognition unit, configured to determine the candidate word as a candidate new word if the candidate word meets the preset condition; a filtering unit, configured to judge whether the candidate new word is included in a preset lexicon; and a second recognition unit, configured to determine the candidate new word as a new word if the candidate new word is not included in the preset lexicon.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where a computer program is stored in the memory, and the processor implements the new word recognition method when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium that stores a computer program which, when executed by a processor, causes the processor to perform the new word recognition method.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.
FIG. 1 is a schematic diagram of an application scenario of the new word recognition method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of the new word recognition method provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of the new word recognition method provided by another embodiment of the present application;

FIG. 4 is a schematic block diagram of the new word recognition device provided by an embodiment of the present application;

FIG. 5 is a schematic block diagram of the new word recognition device provided by another embodiment of the present application; and

FIG. 6 is a schematic block diagram of the computer device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the scope of protection of this application.
Please refer to FIG. 1, which is a schematic diagram of an application scenario of the new word recognition method provided by an embodiment of the present application. The application scenario includes:
(1) Computer equipment. The computer device shown in FIG. 1 is a device for recognizing new words, on which an application for new word recognition is installed, and the computer device is operated manually. The computer device may be an electronic device such as a notebook computer, a tablet computer, a desktop computer, or a server.
The working process of each subject in FIG. 1 is as follows: a user uses the computer device for new word recognition, and an application for new word recognition is installed on the computer device. The computer device obtains a text corpus, divides the text corpus into candidate words with a length of 2-N by N-ary segmentation according to preset sentence endpoints, judges whether each candidate word meets a preset condition, and, if the candidate word meets the preset condition, determines the candidate word as a candidate new word. The computer device then judges whether the candidate new word is included in a preset lexicon and, if not, determines the candidate new word as a new word. Finally, the computer device displays the recognition result to the user to complete the recognition of new words in the text corpus.
It should be noted that FIG. 1 only illustrates a desktop computer as the computer device. In actual operation, the type of computer device is not limited to that shown in FIG. 1; the computer device may also be an electronic device such as a notebook computer or a tablet computer. The above application scenario of the new word recognition method is only used to explain the technical solution of the present application and is not intended to limit it.
FIG. 2 is a schematic flowchart of the new word recognition method provided by an embodiment of the present application. The new word recognition method is applied to the terminal in FIG. 1 to complete all or part of the functions of the new word recognition method.
As shown in FIG. 2, the method includes the following steps S210-S250:
S210: Obtain a text corpus and, according to preset sentence endpoints, segment the corpus by N-gram segmentation into candidate words of length 2 to N, where N is a natural number and N ≥ 2, and a candidate word is a text fragment obtained by segmenting the corpus.
Here, a new word is a fragment taken from a given text that carries independent meaning and is not contained in any existing lexicon or dictionary, i.e., is not a known word. To judge whether a fragment of a text is a new word: if the fragment combines freely on both sides, that is, its left and right ends pair with many different characters or words to express complete meanings, while its internal composition is fixed, i.e., the fragment usually appears as a fixed whole, the fragment can be judged to be a word; if that word is absent from existing dictionaries, it is a new word. For example, if in a text the left and right ends of '普惠金融' (inclusive finance) each pair with various different characters or words in semantic descriptions, and '普惠金融' is not contained in existing dictionaries, '普惠金融' is judged to be a new word.
N-gram segmentation extracts every run of N adjacent Chinese characters from the text corpus as a text fragment. For example, 2-gram segmentation yields every fragment of 2 adjacent characters, 3-gram segmentation yields every fragment of 3 adjacent characters, and so on. Segmenting the corpus '我是一个人' ('I am one person') into 2-grams gives the fragments '我是', '是一', '一个', and '个人'; segmenting it into 3-grams gives '我是一', '是一个', and '一个人'.
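The plain N-gram step above can be sketched in a few lines of Python (a minimal illustration, not the patent's implementation; `ngrams` is a hypothetical helper name):

```python
def ngrams(text, n):
    """Return every contiguous fragment of exactly n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# 2-gram and 3-gram fragments of the example corpus "我是一个人"
print(ngrams("我是一个人", 2))  # ['我是', '是一', '一个', '个人']
print(ngrams("我是一个人", 3))  # ['我是一', '是一个', '一个人']
```

A text shorter than n simply yields no fragments, which is the natural boundary behavior for this step.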
The text corpus is the language material containing the text on which new word recognition is performed. It may be a passage, an article, a web page, or a book, and may be electronic text stored on removable storage, on a computer device, or on the Internet, for example a document saved in Word format or a web page of a designated website.
A candidate word is a text fragment obtained by segmenting the corpus. After the corpus is segmented according to the preset sentence endpoints, many text fragments are obtained; some are words and some are not, so they must be screened against preset conditions. A fragment that satisfies the preset conditions is judged to be a word; one that does not is judged not to be a word. Because these fragments are candidates for becoming words, they are called candidate words. "Candidate words of length 2 to N" means candidate words containing 2, 3, 4, …, N Chinese characters respectively; for example, candidate words of length 2 to 5 contain 2, 3, 4, and 5 Chinese characters respectively.
Further, the preset sentence endpoints are word boundaries of the candidate words set in advance; the corpus is segmented at these boundaries to obtain candidate words. The preset sentence endpoints include punctuation marks and preset segmentation endpoints. A preset segmentation endpoint is a corpus component other than punctuation that is set in advance as a sentence endpoint: relative to punctuation, it is a fixed component of the corpus with independent meaning that is manually designated as a word boundary for segmentation, i.e., an artificial sentence endpoint. A text corpus generally contains characters, punctuation, carriage returns, spaces, and similar components.
Punctuation marks are the sentence punctuation in the corpus that marks a pause after a complete meaning has been expressed, such as commas, semicolons, quotation marks, and periods, which are generally used to break sentences.
The preset segmentation endpoints include symbols other than punctuation that introduce a pause or break, such as spaces and carriage returns, as well as stop characters and stop words with independent semantics. A stop character is a single character used as a pause in the corpus, and a stop word is a word used as a pause; both generally carry independent meaning. For example, commonly used stop characters include '你', '我', '她', and '的', and stop words include '我们', '根据', '所述', and '作为'. Preset segmentation endpoints can be seen as an extension of punctuation: punctuation generally breaks text between sentences, whereas preset segmentation endpoints create pauses or breaks between components within a sentence and, like punctuation, identify word boundaries as sentence endpoints. These stop characters and stop words are used as independent features when segmenting the text, serving as the left or right boundaries of candidate words. Stop characters such as '在', '我', and '的', and stop words such as '我们', '你们', and '这些', are themselves indivisible fixed words with independent semantics; taking them as the left or right word boundaries of candidate words via preset segmentation endpoints effectively improves the accuracy of new word discovery and the efficiency of new word recognition.
When the corpus is segmented to obtain candidate words, the preset segmentation endpoints (artificial sentence endpoints) are added as independent features and combined with punctuation to form the preset sentence endpoints. Without relying on any existing lexicon, all text fragments in a large-scale corpus that could form words are extracted as candidate words based only on the common characteristics of words. The candidate words are then checked against the preset conditions to identify candidate new words with independent semantics, all extracted candidate new words are compared with the existing lexicon, and those not contained in it are screened out as the recognized new words. This effectively improves the precision and recall of new word discovery.
Specifically, when performing new word recognition, the computer device obtains the text corpus to be processed, which may be a passage, an article, a web page, or a book. It segments the corpus at the preset sentence endpoints, including punctuation, spaces, carriage returns, stop characters, and stop words, and by N-gram segmentation divides the corpus into candidate words of length 2 to N, where N is a natural number and N ≥ 2, obtaining the segmented candidate words. For example, if N is 5, the corpus is segmented into candidate words of length 2, 3, 4, and 5, i.e., fragments of two, three, four, and five characters respectively. As a worked example, take the corpus '我非常热衷于用统计的方法去分析汉语资料。' ('I am very keen on using statistical methods to analyze Chinese-language material.'), apply N-gram segmentation with N = 3, and use punctuation plus '我', '的', and '非常' as preset sentence endpoints. The candidate words obtained after segmentation include: '热衷', '热衷于', '衷于', '衷于用', '于用', '于用统', '用统', '用统计', '统计', '统计的', '计的', '的方', '的方法', '方法', '方法去', '法去', '法去分', '去分', '去分析', '分析', '分析汉', '析汉', '析汉语', '汉语', '汉语资', '语资', '语资料', '资料', '资料的', '料的'.
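Combining the preset sentence endpoints with N-gram extraction might look like the following sketch. The endpoint inventory (`PUNCT`, `STOP_TOKENS`) is an assumption taken from the worked example, and `split_on_endpoints` and `candidates` are hypothetical helper names; note that because '的' is removed strictly as an endpoint here, fragments spanning it (such as '统计的') do not appear:

```python
import re

# Hypothetical endpoint inventory: punctuation plus the stop tokens of the example.
PUNCT = "，。！？；：、 \n"
STOP_TOKENS = ["我", "的", "非常"]

def split_on_endpoints(corpus):
    """Cut the corpus at every preset sentence endpoint, keeping the pieces between them."""
    pattern = "[" + re.escape(PUNCT) + "]|" + "|".join(map(re.escape, STOP_TOKENS))
    return [seg for seg in re.split(pattern, corpus) if seg]

def candidates(corpus, n_max):
    """All fragments of length 2..n_max drawn from the endpoint-delimited segments."""
    out = []
    for seg in split_on_endpoints(corpus):
        for n in range(2, n_max + 1):
            out.extend(seg[i:i + n] for i in range(len(seg) - n + 1))
    return out

print(candidates("我非常热衷于用统计的方法去分析汉语资料。", 3))
```

The exact candidate set depends on which tokens are designated as endpoints, which is why the endpoint inventory is described as a per-corpus design choice.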
Further, N must be set according to the specific text. For example, '君子之交淡如水' is a seven-character expression and '百尺竿头更进一步' an eight-character one, while some company names are longer still, so a concrete value of N must be chosen for each corpus. Moreover, for the same corpus, different values of N can be set and their recognition results compared; removing the results they share filters out long-granularity new words, and based on those long-granularity results a more satisfactory new word recognition outcome is obtained.
S220: Determine whether the candidate word satisfies a preset condition.
Here, the preset condition is the condition for identifying a candidate word as a candidate new word. If the candidate word satisfies the preset condition, it is identified as a word and judged to be a candidate new word; if not, it is identified as not a word and judged not to be a candidate new word. The preset condition is that the candidate word satisfies respective first preset thresholds for word frequency, mutual information, and left/right information entropy, or that it satisfies respective second preset thresholds for word frequency, mutual information, and sentence endpoint count.
Specifically, the computer device segments the obtained corpus into text fragments that serve as candidate words. Some candidate words are words and some are not, so the preset condition is used to screen them: fragments that are not words are filtered out, and fragments that are words are retained for further identification. Whether a candidate word satisfies the preset condition therefore determines whether it becomes a candidate new word. If it does, the process proceeds to step S230; if it does not, indicating that the candidate cannot become a word, the process proceeds to step S221, where the candidate is filtered out and discarded. This further narrows the scope of new word recognition and improves its efficiency and accuracy.
S230: If the candidate word satisfies the preset condition, determine the candidate word to be a candidate new word.
A candidate new word is a candidate word that has been identified as a word. N-gram segmentation divides the corpus into candidate words of length 2 to N, some of which cannot be words. For example, segmenting '我非常热衷于用统计的方法去分析汉语资料。' with N = 3 and sentence endpoints consisting of punctuation plus '我', '的', and '非常' yields candidates such as '于用', '于用统', '用统', '计的', '的方', and '的方法', which, judged by experience, obviously cannot be words. The obtained candidate words must therefore be screened against the preset condition to identify the candidate new words, i.e., the candidates recognized as words, filtering out text fragments that cannot become words and narrowing the scope of new word recognition.
Specifically, the computer device screens the obtained candidate words: candidates that cannot be words are filtered out by the preset condition and removed, and only those that can be words are retained for further recognition. If a candidate word satisfies the preset condition, it is identified as a word and judged to be a candidate new word; if not, it is identified as not a word and judged not to be a candidate new word. This further narrows the scope of new word recognition and improves its precision, efficiency, and recall.
For example, see Table 1. Suppose the preset threshold for a candidate word's minimum left/right information entropy is 1, the preset threshold for mutual information is 1, and the preset threshold for sentence endpoints is 3. A minimum left/right information entropy threshold of 1 means the smaller of the candidate's left-neighbor entropy and right-neighbor entropy must reach 1; the sentence endpoint threshold is the number of times a sentence endpoint occurs at the candidate's left or right boundary. After a corpus is processed, the results shown in Table 1 are obtained. By the criteria above, '南山区', '南山', and '普惠' are identified as candidate new words, while '去南山' does not form a word and is excluded from the candidate new words.
Table 1

Candidate | Word frequency | Mutual information | Left/right entropy | Endpoint count | Forms a word?
南山区    | 175            | 5.7548             | 2.2881             | 8              | Yes
去南山    | 23             | 0.8256             | 3.3751             | 3              | No
南山      | 2774           | 9.6310             | 5.7200             | 28             | Yes
普惠      | 18             | 2.3811             | 0.8332             | 3              | Yes
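The screening in Table 1 can be reproduced with a small filter. `is_word` is a hypothetical helper; the example specifies thresholds only for mutual information, left/right entropy, and endpoint count, so the word frequency threshold is omitted here, and the two alternative preset conditions are read as "mutual information plus either entropy or endpoint count" (which matches '普惠' passing despite its entropy being below 1):

```python
# Table 1 rows: (candidate, frequency, mutual information, min left/right entropy, endpoints)
TABLE = [
    ("南山区", 175, 5.7548, 2.2881, 8),
    ("去南山", 23, 0.8256, 3.3751, 3),
    ("南山", 2774, 9.6310, 5.7200, 28),
    ("普惠", 18, 2.3811, 0.8332, 3),
]

MI_T, ENT_T, EP_T = 1.0, 1.0, 3  # thresholds from the example above

def is_word(mi, entropy, endpoints):
    """A candidate qualifies if it clears the mutual information threshold plus
    either the entropy threshold or the sentence endpoint threshold."""
    return mi >= MI_T and (entropy >= ENT_T or endpoints >= EP_T)

words = [c for c, freq, mi, ent, ep in TABLE if is_word(mi, ent, ep)]
print(words)  # ['南山区', '南山', '普惠']
```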
S240: Determine whether the candidate new word is contained in the preset lexicon.
The preset lexicon, which may also be called the existing lexicon, is a collection of known words that have already been confirmed as words; it may be a preset dictionary.
Specifically, the computer device segments the corpus by N-gram segmentation into candidate words of length 2 to N, i.e., text fragments obtained from the corpus. A candidate satisfying the preset condition is determined to be a candidate new word; this, however, only selects from the candidates the fragments capable of being words. The candidate new words include both words already confirmed as such in natural language processing and newly recognized words, so the already-confirmed words must be filtered out, leaving the recognized new words. The preset lexicon contains the words already confirmed in natural language processing; it may consist of the various existing dictionaries of conventional technology, or be a manually assembled lexicon, for example a collection of several existing dictionaries, and it may also contain new words recognized in the past. The candidate new words are filtered against the preset lexicon to determine whether each is contained in it, i.e., to detect whether the candidate new word is contained in the preset lexicon, which can be done by matching. If a candidate new word can be matched in the preset lexicon, it is judged to be contained in the lexicon, the process proceeds to step S241, and the candidate, being a known word, is filtered out as an existing word. If it cannot be matched, it is judged not to be contained in the lexicon; it is an unknown word, i.e., a recognized new word, and the process proceeds to step S250. For example, referring to Table 1, suppose '南山区', '南山', and '普惠' are identified as candidate new words and are filtered through the preset dictionary. If '南山区' and '南山' are found in the preset lexicon, they are known old words; if the candidate new word '普惠' is not contained in the preset lexicon, '普惠' is determined to be a recognized new word.
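Assuming the preset lexicon is available as an in-memory set (`PRESET_LEXICON` below is a hypothetical stand-in for a real dictionary), the matching step reduces to membership tests:

```python
# Hypothetical preset lexicon; in practice this would be an existing dictionary
# or a union of several dictionaries plus previously recognized new words.
PRESET_LEXICON = {"南山区", "南山", "深圳", "金融"}

candidate_new_words = ["南山区", "南山", "普惠"]

# A candidate that matches nothing in the preset lexicon is reported as a new word.
new_words = [w for w in candidate_new_words if w not in PRESET_LEXICON]
print(new_words)  # ['普惠']
```

Using a set keeps each lookup at constant average cost, which matters when the lexicon holds hundreds of thousands of entries.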
S250: If the candidate new word is not contained in the preset lexicon, determine the candidate new word to be a new word.
Specifically, if the candidate new word is not contained in the preset lexicon, it is considered not to be a known word; it is an unknown word, i.e., a recognized new word. Recognized new words are generally words not seen before, and identifying them completes new word recognition on the given corpus. For example, referring to Table 1, if the identified candidate new word '普惠' is not contained in the preset lexicon, '普惠' is judged to be a recognized new word.
This embodiment of the application is based on natural language processing of speech semantics. When segmenting the corpus for candidate words, N-gram segmentation combined with the preset sentence endpoints segments the corpus accurately into candidates of length 2 to N. Without relying on any existing lexicon, all text fragments in a large-scale corpus that could form words are extracted as candidate words based only on the common characteristics of words; using the preset sentence endpoints as independent features, i.e., as word boundaries for segmentation, reduces the number of candidates and improves segmentation accuracy and efficiency. The candidates are then checked against the preset conditions, and those satisfying them are recognized as candidate new words with independent semantics, narrowing the scope of new word recognition. Finally, all extracted candidate new words are compared with the existing lexicon, and those not contained in the preset lexicon are recognized as new words. Screening out the candidate new words absent from the existing lexicon as the recognized new words effectively improves the precision, efficiency, and recall of new word discovery.
In an embodiment, the step of determining whether the candidate word satisfies a preset condition includes:
obtaining the mutual information and left/right information entropy of the candidate word, and obtaining the word frequency of the candidate word, where the left/right information entropy is the smaller of the candidate's left-neighbor information entropy and right-neighbor information entropy;
determining whether the candidate's word frequency, mutual information, and left/right information entropy are respectively greater than or equal to a first preset word frequency threshold, a first preset mutual information threshold, and a first preset left/right information entropy threshold; and
if the candidate's word frequency, mutual information, and left/right information entropy are respectively greater than or equal to the first preset word frequency threshold, the first preset mutual information threshold, and the first preset left/right information entropy threshold, determining that the candidate word satisfies the preset condition.
Here, mutual information measures the internal cohesion of a candidate word, which may also be called its degree of internal solidification or coagulation. The mutual information formula is:
MI(w) = min over all splits (l, r) of w of  log( p(w) / ( p(l) · p(r) ) ),    formula (1)

where w denotes the candidate word, p(x) is the probability that fragment x occurs in the whole corpus, l denotes the left string of a split of the candidate, and r the corresponding right string.
For example, in a corpus containing the candidate word '南山区' (Nanshan District), suppose '南山区' is split into '南山' and '区': '南山' is the left string and '区' the right string. If '南山' and '区' each appeared independently and at random in the corpus, what would be the probability of them landing next to each other? Across the corpus's roughly 24 million characters, '南山' appears 2,774 times, a probability of about 0.000113, and '区' appears 4,797 times, a probability of about 0.0001969. If the two were unrelated, the probability of them happening to appear together would be 0.000113 × 0.0001969 ≈ 2.223 × 10⁻⁸. But '南山区' appears 175 times, a probability of about 7.183 × 10⁻⁶, more than 300 times the predicted value. Similarly, the character '去' ('go to') appears with probability about 0.0166, so the theoretical probability of '去' and '南山' combining at random is 0.0166 × 0.000113 ≈ 1.875 × 10⁻⁶, which is quite close to the true probability of '去南山', about 1.6 × 10⁻⁵, only 8.5 times the predicted value. The result shows that '南山区' is more likely a meaningful collocation, while '去南山' looks more like the components '去' and '南山' accidentally falling together. In new word recognition, however, one cannot know whether '南山区' arose as '南山' + '区' or as '南' + '山区', and a wrong split overestimates the fragment's cohesion: treating '南山区' as '南' + '山区' yields a higher cohesion value. Therefore, to compute a candidate's cohesion, all of its splits are enumerated, i.e., every pair of parts the candidate could be composed of. Letting p(x) be the probability of candidate x in the whole corpus, the cohesion of '南山区' is defined as the smaller of the ratio of p(南山区) to p(南) · p(山区) and the ratio of p(南山区) to p(南山) · p(区); the cohesion of '去南山' is the smaller of the mutual information values obtained by dividing p(去南山) by p(去) · p(南山) and by p(去南) · p(山). It turns out that the candidates with the highest cohesion are words such as '蝙蝠' (bat), '蜘蛛' (spider), '彷徨', '忐忑', and '玫瑰' (rose), in which each character almost always co-occurs with the other and is hardly ever used elsewhere.
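The worked example can be replayed numerically. This sketch assumes a natural logarithm in formula (1) and scores only the splits whose part probabilities are quoted above (`min_split_cohesion` is a hypothetical helper; Table 1's values come from a different run, so they will not match exactly):

```python
import math

def min_split_cohesion(word, p):
    """Formula (1): log p(word)/(p(l)*p(r)), minimized over the binary splits
    whose part probabilities are known. Assumes a natural logarithm."""
    return min(math.log(p[word] / (p[word[:i]] * p[word[i:]]))
               for i in range(1, len(word))
               if word[:i] in p and word[i:] in p)

# Probabilities quoted in the worked example (~24M-character corpus).
p = {"南山区": 7.183e-6, "南山": 0.000113, "区": 0.0001969,
     "去南山": 1.6e-5, "去": 0.0166}

# 南山区 occurs ~300x more often than a chance pairing of 南山 and 区 predicts,
# while 去南山 is only ~8.5x its chance rate: a weak, accidental collocation.
print(round(min_split_cohesion("南山区", p), 2))
print(round(min_split_cohesion("去南山", p), 2))
```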
Information entropy measures a candidate word's degree of freedom, i.e., the richness of its left or right neighboring characters. A candidate's information entropy grows with the number of distinct left or right neighbors: the more left or right neighbors the candidate can pair with, the larger the corresponding entropy; the fewer, the smaller. A candidate's information entropy is also called its left/right information entropy, and the left/right information entropy of a candidate, i.e., its degree of freedom, is defined as the smaller of its left-neighbor information entropy and right-neighbor information entropy.
Further, the left-neighbor information entropy, also called the left entropy, measures the richness of the candidate's left neighbors, i.e., how many different characters can appear immediately to its left. The formula for the left entropy is:
HL(W) = -Σ_a p(a|W) log p(a|W),    formula (2)
where W denotes the candidate word, a denotes a character to its left, and p(a|W) is the probability that character a appears to the left of the candidate; p(a|W) is a conditional probability. A conditional probability is the probability of event A given that another event B has occurred, written p(A|B) and read "the probability of A given B".
The right-neighbor information entropy, also called the right information entropy, measures the richness of the characters adjacent to the right of the candidate word, that is, the number of characters that can appear on the candidate word's right side. The formula for the right information entropy is:
H_R(W) = -∑_b p(b|W) log p(b|W),    Formula (3)
where W denotes the candidate word, b denotes a character appearing to the right of the candidate word, and p(b|W) is the probability that character b appears immediately to the right of the candidate word.
Whether a word can appear in many different contexts, that is, whether it has rich sets of left-adjacent and right-adjacent characters, determines its degree of freedom, which is expressed with information entropy. Information entropy reflects how much information, on average, learning the outcome of an event brings: if an outcome has probability p, the amount of information obtained upon learning that it actually occurred is defined as -log(p). Information entropy is thus used to measure how random a candidate word's sets of left-adjacent and right-adjacent characters are. For example, in the sentence "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮" ("eat grapes without spitting out the grape skins; don't eat grapes yet spit out grape skins"), the word "葡萄" (grape) appears four times, with left-adjacent characters {吃, 吐, 吃, 吐} and right-adjacent characters {不, 皮, 倒, 皮}. According to Formulas (2) and (3), the left-neighbor information entropy of "葡萄" is -(1/2)·log(1/2) - (1/2)·log(1/2) ≈ 0.693, and its right-neighbor information entropy is -(1/2)·log(1/2) - (1/4)·log(1/4) - (1/4)·log(1/4) ≈ 1.04. In this sentence, then, the right-adjacent characters of "葡萄" form the richer set.
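The grape example above can be verified with a short script. The following is only an illustrative sketch, not part of the patented embodiment: it collects the left- and right-adjacent characters of a candidate word and computes their information entropy using the natural logarithm, reproducing the values 0.693 and 1.04.

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    # H = -sum p(x) * ln(p(x)) over the distinct neighbor characters
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

sentence = "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"
word = "葡萄"

left, right = [], []
i = sentence.find(word)
while i != -1:
    if i > 0:
        left.append(sentence[i - 1])            # character just left of the word
    if i + len(word) < len(sentence):
        right.append(sentence[i + len(word)])   # character just right of the word
    i = sentence.find(word, i + 1)

print(left, right)                        # ['吃', '吐', '吃', '吐'] ['不', '皮', '倒', '皮']
print(round(neighbor_entropy(left), 3))   # 0.693
print(round(neighbor_entropy(right), 2))  # 1.04
```

The smaller of the two values, 0.693 here, would be taken as the candidate word's left-right information entropy in the sense defined above.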
Term frequency (TF) is the number of times a given word appears in a given text corpus; the importance of a word increases in proportion to the number of times it appears in the document. After the text corpus for new word recognition is obtained, the number of occurrences of each candidate word is counted. For example, in a corpus of 24 million characters, "南山" (Nanshan) appears 2,774 times, so the term frequency of "南山" is 2,774, while the character "区" (district) appears 4,797 times, so the term frequency of "区" is 4,797.
Specifically, the parameters that reflect the word boundary information of a candidate word include its sentence endpoints and its left-right information entropy. Because both sentence endpoints and left-right information entropy reflect the candidate word's boundary information, they play the same role in candidate word recognition; during new word recognition it is sufficient for either one of the two conditions to hold. This embodiment takes as an example the case where a candidate word is determined to satisfy the preset condition, and is thus determined to be a candidate new word, if its term frequency, mutual information, and left-right information entropy are respectively greater than or equal to the first preset term frequency threshold, the first preset mutual information threshold, and the first preset left-right information entropy threshold; that is, the candidate word's term frequency satisfies the first preset term frequency threshold, its mutual information satisfies the first preset mutual information threshold, and its left-right information entropy satisfies the first preset left-right information entropy threshold. For example, suppose the first preset term frequency threshold is 10, the first preset mutual information threshold is 1, and the first preset left-right information entropy threshold is 1. A segmented candidate word satisfies these first preset thresholds when its term frequency is greater than or equal to 10, its mutual information is greater than or equal to 1, and its left-right information entropy is greater than or equal to 1. Thus, if a segmented candidate word has term frequency greater than 10, mutual information greater than 1, and left information entropy greater than 1, it is judged to be a candidate new word; likewise, if its mutual information is greater than 1 and its right information entropy is greater than 1, it may also be judged to be a candidate new word. Referring again to Table 1: because the mutual information and left-right information entropy of "南山区" (Nanshan District) and "南山" (Nanshan) are both greater than 1, "南山区" and "南山" are recognized as candidate new words, whereas the mutual information of "去南山" (go to Nanshan) is less than 1 and the left-right information entropy of "普惠" (Puhui) is less than 1, so when new words are recognized by mutual information and left-right information entropy, "去南山" and "普惠" are not candidate new words.
In summary, in the new word recognition performed on the corpus of Table 1, "南山区" and "南山" are candidate new words, while "去南山" and "普惠" are not. In this embodiment of the application, the preset segmentation endpoints, which include stop words, stop characters, spaces, and carriage returns, together with punctuation marks are used as the left and right word boundaries of candidate words, and the left-right information entropy of each candidate word is computed from word boundary statistics. Because this refines the granularity of new word recognition, low-frequency and long-granularity new words can be effectively discovered from the entropy statistics of candidate words, which effectively improves the efficiency and accuracy of new word recognition.
Further, the larger the first preset term frequency threshold, first preset mutual information threshold, and first preset left-right information entropy threshold are set, the more accurate the recognized candidate new words are; the smaller these thresholds are set, the more candidate new words are recognized.
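The first preset condition described above can be sketched as a simple threshold check. This is only an illustration under the example thresholds (10, 1, 1) given earlier; the statistic values passed in for the Table 1 words are hypothetical, since the table itself is not reproduced here.

```python
def satisfies_first_condition(freq, mutual_info, lr_entropy,
                              freq_th=10, mi_th=1.0, ent_th=1.0):
    """First preset condition: term frequency, mutual information and
    left-right information entropy each reach their first preset threshold."""
    return freq >= freq_th and mutual_info >= mi_th and lr_entropy >= ent_th

# Hypothetical statistics in the spirit of Table 1:
print(satisfies_first_condition(2774, 1.5, 1.2))  # "南山": candidate new word -> True
print(satisfies_first_condition(15, 0.4, 1.1))    # "去南山": mutual information < 1 -> False
print(satisfies_first_condition(40, 1.3, 0.6))    # "普惠": entropy < 1 -> False
```

Raising the default thresholds makes the filter stricter and the surviving candidates more reliable; lowering them admits more candidates, matching the trade-off stated above.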
In one embodiment, the step of determining whether the candidate word satisfies the preset condition includes:
acquiring the mutual information of the candidate word, and acquiring the term frequency of the candidate word and the number of sentence endpoints of the candidate word, where the number of sentence endpoints refers to the number of left endpoints or the number of right endpoints of the candidate word, the number of left endpoints being the number of occurrences of the left endpoint of the candidate word and the number of right endpoints being the number of occurrences of the right endpoint of the candidate word;
determining whether the term frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to a second preset term frequency threshold, a second preset mutual information threshold, and a first preset sentence endpoint threshold; and
if the term frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset term frequency threshold, the second preset mutual information threshold, and the first preset sentence endpoint threshold, determining that the candidate word satisfies the preset condition.
Here, the endpoints of a candidate word are its left-adjacent and right-adjacent word boundaries, where a word boundary is the dividing edge of a word, that is, the dividing line by which the text corpus is partitioned into different candidate words. The left endpoint of a candidate word is its left-adjacent word boundary, and the right endpoint is its right-adjacent word boundary. The numbers of left and right endpoints are the respective numbers of occurrences of the candidate word's left-adjacent and right-adjacent word boundaries; word boundaries include punctuation marks as well as the spaces, carriage returns, stop words, and stop characters included among the preset segmentation endpoints. To make counting the left and right endpoints of candidate words easier, the preset sentence endpoints serving as word boundaries in the text corpus may be replaced with a unified identifier; after such replacement, the numbers of left and right endpoints are simply the numbers of identifiers appearing to the left and to the right of the candidate word. For example, with "*" as the unified identifier, consider the corpus: "电影院是为观众放映电影的场所。随着电影的进步与发展,出现了专门为放映电影而建造的电影院。电影的发展使电影院的形体、尺寸、比例和声学技术都发生了很大变化。电影院必须满足电影放映的工艺要求。" ("A movie theater is a place where films are shown to an audience. With the progress and development of film, movie theaters built specifically for film screening appeared. The development of film has greatly changed the shape, size, proportions, and acoustics of movie theaters. A movie theater must meet the technical requirements of film screening.") After the sentence endpoints are replaced with the unified identifier, the corpus becomes: "电影院*为观众放映电影*场所*随着电影的进步与发展*出现了专门为放映电影而建造*电影院*电影的发展*电影院*形体、尺寸、比例和声学技术都发生了很大变化*电影院*满足电影放映*工艺要求*", from which it can be seen that the left endpoint of the candidate word "电影院" (movie theater) occurs 3 times and its right endpoint occurs 4 times. The preset segmentation endpoints containing stop words and stop characters, together with spaces, carriage returns, and punctuation marks, are used as the left and right word boundaries of candidate words, and the numbers of occurrences of each candidate word's left and right endpoints are counted from the word boundary statistics. Because this refines the granularity of new word recognition, low-frequency and long-granularity new words can be effectively discovered from the endpoint statistics of candidate words, which effectively improves the efficiency and accuracy of new word recognition.
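The replacement-and-counting procedure can be sketched as below. This is only an illustration: the corpus and the tiny stop-character set are invented here so that the counts are easy to verify by eye; a real embodiment would use a full stop word list and the punctuation of the actual corpus.

```python
import re

def mark_endpoints(text, stop_chars, marker="*"):
    # Replace every run of preset sentence endpoints (punctuation, stop
    # characters, whitespace) with one unified identifier.
    return re.sub("[" + re.escape(stop_chars) + r"\s]+", marker, text)

def endpoint_counts(marked, word, marker="*"):
    # Left endpoints: identifier immediately before the word;
    # right endpoints: identifier immediately after it.
    return marked.count(marker + word), marked.count(word + marker)

text = "电影院是场所。我去电影院。电影院在变化。"
marked = mark_endpoints(text, "。是去在我")          # invented stop set for illustration
print(marked)                                        # 电影院*场所*电影院*电影院*变化*
print(endpoint_counts(marked, "电影院"))             # (2, 3)
```

Note that the first occurrence of "电影院" sits at the start of the text, so it contributes no left-endpoint identifier, which is why the left count is smaller than the right count here.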
Specifically, because both the sentence endpoints and the left-right information entropy of a candidate word reflect its word boundary information, the two play the same role in candidate word recognition, and during new word recognition it is sufficient for either of the two conditions to hold. This embodiment takes as an example the case where the candidate word is determined to satisfy the preset condition, and is thus determined to be a candidate new word, when its term frequency, mutual information, and number of sentence endpoints are respectively greater than or equal to the second preset term frequency threshold, the second preset mutual information threshold, and the first preset sentence endpoint threshold. For example, let the second preset term frequency threshold be 10, the second preset threshold for the minimum left-right information entropy be 1, and the first preset sentence endpoint threshold be 3. Here, a second preset threshold of 1 for the minimum left-right information entropy means that the smaller of the candidate word's left-neighbor and right-neighbor information entropies must reach 1, and the first preset sentence endpoint threshold refers to the number of times a sentence endpoint appears at the candidate word's left or right boundary. Referring to Table 1, which shows the results of recognition on a corpus: a segmented candidate word satisfies the second preset mutual information threshold and the first preset sentence endpoint threshold when its mutual information is greater than 1 and its sentence endpoints occur more than 3 times; such a candidate word is judged to be a candidate new word. In Table 1, the mutual information of "南山区", "南山", and "普惠" is greater than 1 and their numbers of sentence endpoints are all greater than 3, so "南山区", "南山", and "普惠" are judged to be candidate new words. By contrast, the mutual information of "去南山" is less than 1, and its number of sentence endpoints equals 3, which does not satisfy the greater-than-3 condition, so when new words are recognized by mutual information and sentence endpoint count, "去南山" is not a candidate new word. By these criteria, "南山区", "南山", and "普惠" are recognized as candidate new words, while "去南山" does not form a word and is excluded from the candidate new words. In summary, in the new word recognition performed on the corpus of Table 1, "南山区", "南山", and "普惠" are candidate new words, and "去南山" is not.
Referring to FIG. 3, FIG. 3 is a schematic flowchart of a new word recognition method provided by another embodiment of the present application. As shown in FIG. 3, in this embodiment, the step of determining whether the candidate word satisfies the preset condition includes:
S211: acquiring the mutual information and left-right information entropy of the candidate word, and acquiring the term frequency and the number of sentence endpoints of the candidate word, where the left-right information entropy refers to the smaller of the candidate word's left-neighbor information entropy and right-neighbor information entropy;
S212: determining whether the term frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset term frequency threshold, the first preset mutual information threshold, and the first preset left-right information entropy threshold, or whether the term frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset term frequency threshold, the second preset mutual information threshold, and the first preset sentence endpoint threshold; and
S213: if the term frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset term frequency threshold, the first preset mutual information threshold, and the first preset left-right information entropy threshold, or if the term frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset term frequency threshold, the second preset mutual information threshold, and the first preset sentence endpoint threshold, determining that the candidate word satisfies the preset condition.
The first preset term frequency threshold may be the same as the second preset term frequency threshold, and the first preset mutual information threshold may be the same as the second preset mutual information threshold.
Specifically, the computer device determines that a candidate word satisfies the preset condition, and recognizes it as a first candidate new word, when its term frequency, mutual information, and left-right information entropy are respectively greater than or equal to the first preset term frequency threshold, the first preset mutual information threshold, and the first preset left-right information entropy threshold; it likewise determines that a candidate word satisfies the preset condition, and recognizes it as a second candidate new word, when its term frequency, mutual information, and number of sentence endpoints are respectively greater than or equal to the second preset term frequency threshold, the second preset mutual information threshold, and the first preset sentence endpoint threshold. Taking the union of the first candidate new words and the second candidate new words as the final recognized candidate new words improves the accuracy of candidate new word recognition, and the process proceeds to step S230 for further identification of candidate new words; otherwise, if the candidate word satisfies neither set of conditions, the process proceeds to step S221, the candidate word cannot form a word, and it is discarded. Referring again to Table 1: under the condition that term frequency, mutual information, and left-right information entropy each reach their first preset thresholds, "南山区" and "南山" are recognized as candidate new words while "去南山" and "普惠" are not; under the condition that term frequency, mutual information, and number of sentence endpoints each reach their respective thresholds, "南山区", "南山", and "普惠" are recognized as candidate new words while "去南山" does not form a word and is excluded. Combining the two, "南山区", "南山", and "普惠" are recognized as candidate new words. This avoids rejecting "普惠", which the first condition alone would miss, and thus improves the accuracy of new word recognition.
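The union of the two candidate sets described above can be sketched as follows. The threshold values are the example values used earlier, and the per-word statistics are hypothetical stand-ins for Table 1, chosen only so that "普惠" passes the endpoint condition but fails the entropy condition, and "去南山" fails both.

```python
def final_candidate_new_words(stats, freq_th=10, mi_th=1.0,
                              ent_th=1.0, endpoint_th=3):
    # Condition 1: frequency, mutual information, left-right entropy.
    first = {w for w, s in stats.items()
             if s["freq"] >= freq_th and s["mi"] >= mi_th
             and s["entropy"] >= ent_th}
    # Condition 2: frequency, mutual information, sentence endpoint count.
    second = {w for w, s in stats.items()
              if s["freq"] >= freq_th and s["mi"] >= mi_th
              and s["endpoints"] > endpoint_th}
    return first | second  # union of the two candidate sets

# Hypothetical statistics in the spirit of Table 1:
stats = {
    "南山区": {"freq": 120,  "mi": 1.8, "entropy": 1.4, "endpoints": 9},
    "南山":   {"freq": 2774, "mi": 1.5, "entropy": 1.2, "endpoints": 12},
    "普惠":   {"freq": 40,   "mi": 1.3, "entropy": 0.6, "endpoints": 7},
    "去南山": {"freq": 15,   "mi": 0.4, "entropy": 1.1, "endpoints": 3},
}
print(final_candidate_new_words(stats))  # {"南山区", "南山", "普惠"} (set order varies)
```

With these inputs, "普惠" enters the final set only through the second condition, which is exactly the gain from taking the union rather than requiring both conditions.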
In this embodiment of the application, the preset segmentation endpoints, which include stop words, stop characters, spaces, and carriage returns, together with punctuation marks are used as the left and right word boundaries of candidate words, and the numbers of occurrences of each candidate word's left and right endpoints are counted from word boundary statistics. Because this refines the granularity of new word recognition, low-frequency and long-granularity new words can be effectively discovered from the endpoint statistics of candidate words, which effectively improves the efficiency and accuracy of new word recognition.
Further, the larger the preset thresholds for term frequency, mutual information, left-right information entropy, and sentence endpoint information are set, the more accurate the recognition of candidate new words; the smaller these thresholds are set, the more candidate new words are recognized.
In one embodiment, before the step of segmenting the text corpus, according to the preset sentence endpoints, into candidate words of length 2 to N by N-gram segmentation, where N is a natural number and N ≥ 2, the method further includes: replacing the preset sentence endpoints in the text corpus with a unified identifier.
Specifically, replacing the preset sentence endpoints in the text corpus with a unified identifier means substituting a single identifier for the punctuation marks included among the preset sentence endpoints and for the preset segmentation endpoints, which include stop characters, stop words, carriage returns, and spaces. For example, the unified identifier may be set to a symbol such as "*", "#", or "△"; replacing the punctuation marks and preset segmentation endpoints in the text with the unified identifier makes the subsequent counting of sentence endpoints convenient, which improves the segmentation efficiency of the text corpus and hence the efficiency of new word recognition on it. For example, using "*" as the unified identifier to replace the punctuation marks, stop characters, and stop words among the preset sentence endpoints, the text "我非常热衷于用统计的方法去分析汉语资料。" ("I am very keen on using statistical methods to analyze Chinese-language material.") becomes, after "我", "非常", "的", and "。" are replaced, "*热衷于用统计*方法去分析汉语资料*".
Taking stop characters, stop words, spaces, and carriage returns as the preset segmentation endpoints, the spaces, carriage returns, stop characters, stop words, and punctuation marks are replaced with "*"; after the preset sentence endpoints have been replaced, the text is cut by N-gram segmentation into candidate words of length 2 to N, and the number of occurrences of each candidate word is counted. For example, in a corpus of 24 million characters, "南山" appears a total of 2,774 times and the character "区" appears 4,797 times.
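The two steps above, replacing the preset sentence endpoints with the unified identifier and then cutting each remaining fragment into 2-to-N-character candidate words, can be sketched as below. The stop pattern is the illustrative one from the example sentence; a real embodiment would use the full stop word and punctuation lists.

```python
import re
from collections import Counter

def ngram_candidates(corpus, stop_pattern, n_max=4, marker="*"):
    # 1) Replace the preset sentence endpoints with the unified identifier.
    marked = re.sub(stop_pattern, marker, corpus)
    # 2) Cut each fragment between identifiers into candidate words of
    #    length 2..N and count their occurrences.
    counts = Counter()
    for fragment in marked.split(marker):
        for n in range(2, n_max + 1):
            for i in range(len(fragment) - n + 1):
                counts[fragment[i:i + n]] += 1
    return counts

corpus = "我非常热衷于用统计的方法去分析汉语资料。"
counts = ngram_candidates(corpus, "我|非常|的|。")  # illustrative stop pattern
print(counts["统计"])    # 1
print("热衷" in counts)  # True
```

Because candidates never straddle an identifier, a word boundary can never fall inside a generated candidate, which is what lets the later endpoint and entropy statistics be computed per candidate.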
Referring again to FIG. 3, in this embodiment, after the step of determining the candidate new word to be a new word if the candidate new word is not included in the preset lexicon, the method further includes:
S260: acquiring the length of the new word, and determining whether the length of the new word is greater than or equal to a preset length threshold;
S261: if the length of the new word is greater than or equal to the preset length threshold, recognizing the new word as a long-granularity new word;
S262: if the length of the new word is less than the preset length threshold, recognizing the new word as a non-long-granularity new word.
Here, the length of a new word is the number of characters it contains; for example, the word "电影院" (movie theater) contains three characters, so its length is 3.
The preset length threshold is a preset critical value for word length; it may be set manually.
Specifically, a long-granularity new word is a recognized new word whose number of characters is greater than or equal to the preset length threshold. For example, if the preset length threshold is 5, a recognized new word containing five or more characters is recognized as a long-granularity new word. Long-granularity new words can then be processed according to their attributes.
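Steps S260 to S262 amount to a single length comparison, sketched below with the example threshold of 5; the sample words are illustrative, not taken from the corpus of the embodiment.

```python
def classify_granularity(word, length_threshold=5):
    # Length = number of characters; at or above the preset length
    # threshold the new word counts as long-granularity.
    if len(word) >= length_threshold:
        return "long-granularity"
    return "non-long-granularity"

print(classify_granularity("计算机视觉技术"))  # 7 characters -> long-granularity
print(classify_granularity("电影院"))          # 3 characters -> non-long-granularity
```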
Referring again to FIG. 3, in this embodiment, after the step of determining the candidate new word to be a new word if the candidate new word is not included in the preset lexicon, the method further includes:
S270: acquiring the term frequency of the new word, and determining whether the term frequency of the new word is lower than a preset term frequency threshold;
S271: if the term frequency of the new word is lower than the preset term frequency threshold, recognizing the new word as a low-frequency new word;
S272: if the term frequency of the new word is greater than or equal to the preset term frequency threshold, recognizing the new word as a non-low-frequency new word.
Here, a low-frequency new word is a recognized new word whose term frequency in the text corpus is lower than the preset term frequency threshold.
Specifically, if the preset term frequency threshold is 10, then among the new words recognized by the computer device, any new word whose term frequency is less than 10 is a low-frequency new word. Because low-frequency new words are uncommon, when performing new word recognition one may choose, depending on the text corpus, whether or not to include low-frequency new words in the preset lexicon. Choosing not to include them reduces the size of the preset lexicon, improves the efficiency of matching against the preset lexicon during new word recognition, and thus improves the efficiency of new word recognition.
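Steps S270 to S272 are likewise a single frequency comparison, sketched below with the example threshold of 10; the frequency 2,774 is the corpus count for "南山" quoted earlier, while 3 is an invented low count.

```python
def classify_frequency(term_frequency, freq_threshold=10):
    # Below the preset term frequency threshold the new word is low-frequency.
    if term_frequency < freq_threshold:
        return "low-frequency"
    return "non-low-frequency"

print(classify_frequency(3))     # low-frequency
print(classify_frequency(2774))  # non-low-frequency
```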
It should be noted that, for the new word recognition methods described in the above embodiments, the technical features contained in different embodiments may be recombined as needed to obtain combined implementations, all of which fall within the scope of protection claimed by this application.
Referring to FIG. 4, FIG. 4 is a schematic block diagram of a new word recognition apparatus provided by an embodiment of the present application. Corresponding to the above new word recognition method, an embodiment of the present application further provides a new word recognition apparatus. The apparatus includes units for performing the above new word recognition method and may be configured in computer equipment such as a desktop computer. Specifically, referring to FIG. 4, the new word recognition apparatus 400 includes a segmentation unit 401, a judgment unit 402, a first recognition unit 403, a filtering unit 404, and a second recognition unit 405.
其中,切分单元401,用于获取文本语料,根据预设句子端点,通过N元切分将所述文本语料切分成长度为2-N的候选词,其中,N为自然数,且N≥2,所述候选词是指切分所述文本语料获取的文本片段;Among them, the segmentation unit 401 is used to obtain a text corpus, and divide the text corpus into candidate words with a length of 2-N by N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2 , The candidate word refers to a text segment obtained by segmenting the text corpus;
判断单元402,用于判断所述候选词是否满足预设条件;The judging unit 402 is used to judge whether the candidate word meets a preset condition;
第一识别单元403,用于若所述候选词满足所述预设条件,将所述候选词确定为候选新词;The first recognition unit 403 is configured to determine the candidate word as a candidate new word if the candidate word meets the preset condition;
过滤单元404,用于判断所述候选新词是否包含在所述预设词库中;以及The filtering unit 404 is used to determine whether the candidate new word is included in the preset thesaurus; and
第二识别单元405,用于若所述候选新词不包含在所述预设词库中,将所述候选新词确定为新词。The second recognition unit 405 is configured to determine the candidate new word as a new word if the candidate new word is not included in the preset thesaurus.
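The division of labor among units 401–405 can be summarized as a single pipeline. The sketch below is a minimal illustration under stated assumptions: `PRESET_LEXICON`, the endpoint character class, and the pluggable `meets_condition` callback are stand-ins, not part of the patent.

```python
import re

PRESET_LEXICON = {"我们", "今天", "天气"}   # assumed existing dictionary
SENTENCE_ENDPOINTS = r"[，。！？；\n]"      # assumed preset sentence endpoints


def ngram_candidates(corpus: str, n: int) -> list:
    """Unit 401: split at sentence endpoints, then emit all 2..N character grams."""
    candidates = []
    for sentence in re.split(SENTENCE_ENDPOINTS, corpus):
        for size in range(2, n + 1):
            for i in range(len(sentence) - size + 1):
                candidates.append(sentence[i:i + size])
    return candidates


def recognize_new_words(corpus: str, n: int, meets_condition) -> set:
    """Units 402-405: keep candidates that pass the preset condition and
    are absent from the preset lexicon."""
    new_words = set()
    for cand in ngram_candidates(corpus, n):   # unit 401
        if meets_condition(cand):              # units 402-403
            if cand not in PRESET_LEXICON:     # units 404-405
                new_words.add(cand)
    return new_words
```

With `meets_condition=lambda c: True`, `recognize_new_words("今天天气", 2, ...)` would yield `{"天天"}`, since "今天" and "天气" are already in the assumed lexicon.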
In one embodiment, the preset sentence endpoints include punctuation marks and preset segmentation endpoints, where a preset segmentation endpoint is a component of the text corpus, other than a punctuation mark, that is set in advance as a sentence endpoint.
Please refer to FIG. 5, which is a schematic block diagram of a new word recognition apparatus provided by another embodiment of this application. As shown in FIG. 5, in this embodiment the judgment unit 402 includes:
a first obtaining subunit 4021, configured to obtain the mutual information and the left-right information entropy of the candidate word, and to obtain the word frequency of the candidate word, where the left-right information entropy is the smaller of the left-neighbor-character information entropy and the right-neighbor-character information entropy of the candidate word;
a first judgment subunit 4022, configured to judge whether the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to a first preset word-frequency threshold, a first preset mutual-information threshold, and a first preset left-right-information-entropy threshold; and
a first determination subunit 4023, configured to determine that the candidate word satisfies the preset condition if its word frequency, mutual information, and left-right information entropy are respectively greater than or equal to the first preset word-frequency threshold, the first preset mutual-information threshold, and the first preset left-right-information-entropy threshold.
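The two statistics used by this first condition can be made concrete. The sketch below assumes the common pointwise-mutual-information formulation for a two-character candidate and Shannon entropy over neighbor-character counts; the patent names the quantities but does not fix exact formulas, so these definitions and all names are illustrative.

```python
import math
from collections import Counter


def mutual_information(word: str, char_freq: dict, word_freq: dict,
                       total: int) -> float:
    """PMI of a two-character candidate: log( p(xy) / (p(x) * p(y)) ).
    One common formulation; not specified exactly by the patent."""
    p_xy = word_freq[word] / total
    p_x = char_freq[word[0]] / total
    p_y = char_freq[word[1]] / total
    return math.log(p_xy / (p_x * p_y))


def neighbor_entropy(neighbors: Counter) -> float:
    """Shannon entropy of the distribution of neighboring characters."""
    total = sum(neighbors.values())
    return -sum((c / total) * math.log(c / total)
                for c in neighbors.values())


def satisfies_first_condition(word, stats, t_freq, t_mi, t_entropy) -> bool:
    """First judgment: frequency, mutual information, and the smaller of the
    left/right neighbor entropies must all reach their first preset
    thresholds (the `stats` layout is an assumption)."""
    lr_entropy = min(neighbor_entropy(stats["left"][word]),
                     neighbor_entropy(stats["right"][word]))
    mi = mutual_information(word, stats["char_freq"],
                            stats["word_freq"], stats["total"])
    return (stats["word_freq"][word] >= t_freq
            and mi >= t_mi and lr_entropy >= t_entropy)
```

Intuitively, high mutual information means the two characters co-occur far more often than chance, and high left-right entropy means the candidate appears in varied contexts — both hallmarks of a genuine word.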
In one embodiment, the judgment unit 402 includes:
a second obtaining subunit, configured to obtain the mutual information of the candidate word, and to obtain the word frequency and the sentence-endpoint count of the candidate word, where the sentence-endpoint count is the left-endpoint count or the right-endpoint count of the candidate word, the left-endpoint count being the number of times the left endpoint of the candidate word occurs, and the right-endpoint count being the number of times the right endpoint of the candidate word occurs;
a second judgment subunit, configured to judge whether the word frequency, mutual information, and sentence-endpoint count of the candidate word are respectively greater than or equal to a second preset word-frequency threshold, a second preset mutual-information threshold, and a first preset sentence-endpoint-count threshold; and
a second determination subunit, configured to determine that the candidate word satisfies the preset condition if its word frequency, mutual information, and sentence-endpoint count are respectively greater than or equal to the second preset word-frequency threshold, the second preset mutual-information threshold, and the first preset sentence-endpoint-count threshold.
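This alternative condition swaps the entropy statistic for sentence-endpoint counts. The patent does not spell out the exact counting rule, so the sketch below takes one plausible reading — counting how often the candidate occurs immediately after a sentence endpoint (left-endpoint count) or immediately before one (right-endpoint count); the endpoint set and all names are assumptions.

```python
import re

ENDPOINT = r"[，。！？；]"  # assumed preset sentence endpoints


def endpoint_counts(corpus: str, word: str) -> tuple:
    """Count occurrences of `word` right after an endpoint (left count)
    and right before one (right count). This interpretation is an
    assumption; the patent only names the two counts."""
    left = len(re.findall(ENDPOINT + re.escape(word), corpus))
    right = len(re.findall(re.escape(word) + ENDPOINT, corpus))
    return left, right


def satisfies_second_condition(freq, mi, endpoint_count,
                               t_freq, t_mi, t_endpoint) -> bool:
    """Second judgment: frequency, mutual information, and the
    sentence-endpoint count must all reach their preset thresholds."""
    return freq >= t_freq and mi >= t_mi and endpoint_count >= t_endpoint
```

A candidate that often abuts sentence boundaries is likely a free-standing unit rather than an arbitrary fragment, which is why the endpoint count can substitute for neighbor entropy here.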
Continuing with FIG. 5, in this embodiment the apparatus 400 further includes:
a replacement unit 406, configured to replace the preset sentence endpoints in the text corpus with a unified identifier.
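The replacement unit amounts to a single normalization pass. In the sketch below the marker "※" and the endpoint character class are illustrative choices, not specified by the patent.

```python
import re

ENDPOINTS = r"[，。！？；：\n]"  # assumed preset sentence endpoints
UNIFIED = "※"                   # assumed unified identifier


def unify_endpoints(corpus: str) -> str:
    """Replace every preset sentence endpoint with the unified marker,
    so later segmentation only needs to split on one character."""
    return re.sub(ENDPOINTS, UNIFIED, corpus)


print(unify_endpoints("今天下雨，明天放晴。"))  # → 今天下雨※明天放晴※
```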
Continuing with FIG. 5, in this embodiment the apparatus 400 further includes:
a third obtaining unit 407, configured to obtain the length of the new word and to judge whether the length of the new word is greater than or equal to a preset length threshold; and
a third recognition unit 408, configured to identify the new word as a long-granularity new word if its length is greater than or equal to the preset length threshold.
Continuing with FIG. 5, in this embodiment the apparatus 400 further includes:
a fourth obtaining unit 409, configured to obtain the word frequency of the new word and to judge whether the word frequency of the new word is lower than a preset word-frequency threshold; and
a fourth recognition unit 410, configured to identify the new word as a low-frequency new word if its word frequency is lower than the preset word-frequency threshold.
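Units 407–410 together amount to tagging each recognized new word by length and frequency. The sketch below is illustrative; the two threshold values are assumptions, not values fixed by the patent.

```python
LENGTH_THRESHOLD = 4   # assumed preset length threshold (units 407-408)
FREQ_THRESHOLD = 10    # assumed preset word-frequency threshold (units 409-410)


def classify_new_word(word: str, freq: int) -> set:
    """Tag a recognized new word as long-granularity and/or low-frequency."""
    tags = set()
    if len(word) >= LENGTH_THRESHOLD:
        tags.add("long-granularity")
    if freq < FREQ_THRESHOLD:
        tags.add("low-frequency")
    return tags


classify_new_word("人工智能", 3)   # → {"long-granularity", "low-frequency"}
classify_new_word("词库", 50)      # → set()
```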
It should be noted that, as those skilled in the art will clearly understand, for the specific implementation of the above new word recognition apparatus and its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and conciseness of description, details are not repeated here.
Meanwhile, the division and connection of the units in the above new word recognition apparatus are for illustration only. In other embodiments, the new word recognition apparatus may be divided into different units as needed, and the units may be connected in different orders and manners, so as to implement all or part of the functions of the apparatus.
The above new word recognition apparatus may be implemented in the form of a computer program, which may run on the computer device shown in FIG. 6.
Please refer to FIG. 6, which is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 600 may be an electronic device such as a desktop computer or a tablet computer, or a component or part of another device.
Referring to FIG. 6, the computer device 600 includes a processor 602, a memory, and a network interface 605 connected through a system bus 601, where the memory may include a non-volatile storage medium 603 and an internal memory 604.
The non-volatile storage medium 603 may store an operating system 6031 and a computer program 6032. When executed, the computer program 6032 causes the processor 602 to perform the new word recognition method described above.
The processor 602 provides computing and control capabilities to support the operation of the entire computer device 600.
The internal memory 604 provides an environment for running the computer program 6032 stored in the non-volatile storage medium 603. When the computer program 6032 is executed by the processor 602, the processor 602 performs the new word recognition method described above.
The network interface 605 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in FIG. 6 is merely a block diagram of part of the structure related to the solution of this application and does not limit the computer device 600 to which the solution is applied; a specific computer device 600 may include more or fewer components than shown, combine certain components, or arrange components differently. For example, in some embodiments the computer device may include only a memory and a processor, whose structures and functions are consistent with the embodiment shown in FIG. 6 and are not repeated here.
The processor 602 runs the computer program 6032 stored in the memory to implement the new word recognition method of the embodiments of this application.
It should be understood that in the embodiments of this application the processor 602 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the process steps of the above method embodiments.
Accordingly, this application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the steps of the new word recognition method described in the above embodiments.
The computer-readable storage medium may be any storage medium capable of storing a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application of the technical solution and its design constraints. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
The above are only specific implementations of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and all such modifications or replacements shall fall within the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims (20)

  1. A new word recognition method, comprising:
    obtaining a text corpus and, according to preset sentence endpoints, dividing the text corpus by N-gram segmentation into candidate words of length 2 to N, where N is a natural number and N ≥ 2, a candidate word being a text fragment obtained by segmenting the text corpus;
    judging whether the candidate word satisfies a preset condition;
    if the candidate word satisfies the preset condition, determining the candidate word as a candidate new word;
    judging whether the candidate new word is included in a preset lexicon; and
    if the candidate new word is not included in the preset lexicon, determining the candidate new word as a new word.
  2. The new word recognition method according to claim 1, wherein the preset sentence endpoints include punctuation marks and preset segmentation endpoints, a preset segmentation endpoint being a component of the text corpus, other than a punctuation mark, that is set in advance as a sentence endpoint.
  3. The new word recognition method according to claim 1, wherein the step of judging whether the candidate word satisfies a preset condition comprises:
    obtaining the mutual information and left-right information entropy of the candidate word, and obtaining the word frequency of the candidate word, where the left-right information entropy is the smaller of the left-neighbor-character information entropy and the right-neighbor-character information entropy of the candidate word;
    judging whether the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to a first preset word-frequency threshold, a first preset mutual-information threshold, and a first preset left-right-information-entropy threshold; and
    if the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset word-frequency threshold, the first preset mutual-information threshold, and the first preset left-right-information-entropy threshold, determining that the candidate word satisfies the preset condition.
  4. The new word recognition method according to claim 1, wherein the step of judging whether the candidate word satisfies a preset condition comprises:
    obtaining the mutual information of the candidate word, and obtaining the word frequency and the sentence-endpoint count of the candidate word, where the sentence-endpoint count is the left-endpoint count or the right-endpoint count of the candidate word, the left-endpoint count being the number of times the left endpoint of the candidate word occurs, and the right-endpoint count being the number of times the right endpoint of the candidate word occurs;
    judging whether the word frequency, mutual information, and sentence-endpoint count of the candidate word are respectively greater than or equal to a second preset word-frequency threshold, a second preset mutual-information threshold, and a first preset sentence-endpoint-count threshold; and
    if the word frequency, mutual information, and sentence-endpoint count of the candidate word are respectively greater than or equal to the second preset word-frequency threshold, the second preset mutual-information threshold, and the first preset sentence-endpoint-count threshold, determining that the candidate word satisfies the preset condition.
  5. The new word recognition method according to claim 1, wherein before the step of dividing the text corpus into candidate words of length 2 to N by N-gram segmentation according to the preset sentence endpoints, the method further comprises:
    replacing the preset sentence endpoints in the text corpus with a unified identifier.
  6. The new word recognition method according to claim 1, wherein after the step of determining the candidate new word as a new word if the candidate new word is not included in the preset lexicon, the method further comprises:
    obtaining the length of the new word, and judging whether the length of the new word is greater than or equal to a preset length threshold; and
    if the length of the new word is greater than or equal to the preset length threshold, identifying the new word as a long-granularity new word.
  7. The new word recognition method according to claim 1, wherein after the step of determining the candidate new word as a new word if the candidate new word is not included in the preset lexicon, the method further comprises:
    obtaining the word frequency of the new word, and judging whether the word frequency of the new word is lower than a preset word-frequency threshold; and
    if the word frequency of the new word is lower than the preset word-frequency threshold, identifying the new word as a low-frequency new word.
  8. A new word recognition apparatus, comprising:
    a segmentation unit, configured to obtain a text corpus and, according to preset sentence endpoints, divide the text corpus by N-gram segmentation into candidate words of length 2 to N, where N is a natural number and N ≥ 2, a candidate word being a text fragment obtained by segmenting the text corpus;
    a judgment unit, configured to judge whether the candidate word satisfies a preset condition;
    a first recognition unit, configured to determine the candidate word as a candidate new word if the candidate word satisfies the preset condition;
    a filtering unit, configured to judge whether the candidate new word is included in a preset lexicon; and
    a second recognition unit, configured to determine the candidate new word as a new word if the candidate new word is not included in the preset lexicon.
  9. The new word recognition apparatus according to claim 8, wherein the preset sentence endpoints include punctuation marks and preset segmentation endpoints, a preset segmentation endpoint being a component of the text corpus, other than a punctuation mark, that is set in advance as a sentence endpoint.
  10. The new word recognition apparatus according to claim 8, wherein the judgment unit comprises:
    a first obtaining subunit, configured to obtain the mutual information and left-right information entropy of the candidate word, and to obtain the word frequency of the candidate word, where the left-right information entropy is the smaller of the left-neighbor-character information entropy and the right-neighbor-character information entropy of the candidate word;
    a first judgment subunit, configured to judge whether the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to a first preset word-frequency threshold, a first preset mutual-information threshold, and a first preset left-right-information-entropy threshold; and
    a first determination subunit, configured to determine that the candidate word satisfies the preset condition if its word frequency, mutual information, and left-right information entropy are respectively greater than or equal to the first preset word-frequency threshold, the first preset mutual-information threshold, and the first preset left-right-information-entropy threshold.
  11. A computer device, comprising a memory and a processor connected to the memory, the memory being configured to store a computer program and the processor being configured to run the computer program stored in the memory, wherein the processor, when executing the computer program, implements the following steps:
    obtaining a text corpus and, according to preset sentence endpoints, dividing the text corpus by N-gram segmentation into candidate words of length 2 to N, where N is a natural number and N ≥ 2, a candidate word being a text fragment obtained by segmenting the text corpus;
    judging whether the candidate word satisfies a preset condition;
    if the candidate word satisfies the preset condition, determining the candidate word as a candidate new word;
    judging whether the candidate new word is included in a preset lexicon; and
    if the candidate new word is not included in the preset lexicon, determining the candidate new word as a new word.
  12. The computer device according to claim 11, wherein the preset sentence endpoints include punctuation marks and preset segmentation endpoints, a preset segmentation endpoint being a component of the text corpus, other than a punctuation mark, that is set in advance as a sentence endpoint.
  13. The computer device according to claim 11, wherein the step of judging whether the candidate word satisfies a preset condition comprises:
    obtaining the mutual information and left-right information entropy of the candidate word, and obtaining the word frequency of the candidate word, where the left-right information entropy is the smaller of the left-neighbor-character information entropy and the right-neighbor-character information entropy of the candidate word;
    judging whether the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to a first preset word-frequency threshold, a first preset mutual-information threshold, and a first preset left-right-information-entropy threshold; and
    if the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset word-frequency threshold, the first preset mutual-information threshold, and the first preset left-right-information-entropy threshold, determining that the candidate word satisfies the preset condition.
  14. The computer device according to claim 11, wherein the step of judging whether the candidate word satisfies a preset condition comprises:
    obtaining the mutual information of the candidate word, and obtaining the word frequency and the sentence-endpoint count of the candidate word, where the sentence-endpoint count is the left-endpoint count or the right-endpoint count of the candidate word, the left-endpoint count being the number of times the left endpoint of the candidate word occurs, and the right-endpoint count being the number of times the right endpoint of the candidate word occurs;
    judging whether the word frequency, mutual information, and sentence-endpoint count of the candidate word are respectively greater than or equal to a second preset word-frequency threshold, a second preset mutual-information threshold, and a first preset sentence-endpoint-count threshold; and
    if the word frequency, mutual information, and sentence-endpoint count of the candidate word are respectively greater than or equal to the second preset word-frequency threshold, the second preset mutual-information threshold, and the first preset sentence-endpoint-count threshold, determining that the candidate word satisfies the preset condition.
  15. The computer device according to claim 11, wherein before the step of dividing the text corpus into candidate words of length 2 to N by N-gram segmentation according to the preset sentence endpoints, the following step is further implemented:
    replacing the preset sentence endpoints in the text corpus with a unified identifier.
  16. The computer device according to claim 11, wherein after the step of determining the candidate new word as a new word if the candidate new word is not included in the preset lexicon, the following steps are further implemented:
    obtaining the length of the new word, and judging whether the length of the new word is greater than or equal to a preset length threshold; and
    if the length of the new word is greater than or equal to the preset length threshold, identifying the new word as a long-granularity new word.
  17. The computer device according to claim 11, wherein after the step of determining the candidate new word as a new word if the candidate new word is not included in the preset lexicon, the following steps are further implemented:
    obtaining the word frequency of the new word, and judging whether the word frequency of the new word is lower than a preset word-frequency threshold; and
    if the word frequency of the new word is lower than the preset word-frequency threshold, identifying the new word as a low-frequency new word.
  18. A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the following steps:
    obtaining a text corpus and, according to preset sentence endpoints, dividing the text corpus by N-gram segmentation into candidate words of length 2 to N, where N is a natural number and N ≥ 2, a candidate word being a text fragment obtained by segmenting the text corpus;
    judging whether the candidate word satisfies a preset condition;
    if the candidate word satisfies the preset condition, determining the candidate word as a candidate new word;
    judging whether the candidate new word is included in a preset lexicon; and
    if the candidate new word is not included in the preset lexicon, determining the candidate new word as a new word.
  19. The computer-readable storage medium according to claim 18, wherein the preset sentence endpoints include punctuation marks and preset segmentation endpoints, a preset segmentation endpoint being a component of the text corpus, other than a punctuation mark, that is set in advance as a sentence endpoint.
  20. The computer-readable storage medium according to claim 18, wherein the step of determining whether the candidate word satisfies a preset condition includes:
    acquiring the mutual information and the left-right information entropy of the candidate word, and acquiring the word frequency of the candidate word, wherein the left-right information entropy refers to the smaller of the left-neighbor-character information entropy and the right-neighbor-character information entropy of the candidate word;
    determining whether the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to a first preset word-frequency threshold, a first preset mutual-information threshold, and a first preset left-right information entropy threshold; and
    if the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset word-frequency threshold, the first preset mutual-information threshold, and the first preset left-right information entropy threshold, determining that the candidate word satisfies the preset condition.
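The two statistics named in claim 20 can be sketched as follows. This is a simplified illustration under stated assumptions: mutual information is taken as the minimum pointwise MI over all binary splits of the candidate, probabilities are estimated by naive substring counts over the raw corpus, and natural logarithms are used; the patent does not prescribe these choices.

```python
import math
from collections import Counter

def left_right_entropy(word: str, corpus: str) -> float:
    """Smaller of the left-neighbor and right-neighbor character
    entropies of `word` over all its occurrences in `corpus`."""
    left, right = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(word)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(word, start + 1)

    def entropy(c: Counter) -> float:
        n = sum(c.values())
        return -sum(v / n * math.log(v / n) for v in c.values()) if n else 0.0

    return min(entropy(left), entropy(right))

def mutual_information(word: str, corpus: str) -> float:
    """Minimum pointwise mutual information over every binary split
    of the candidate word (higher means tighter internal cohesion)."""
    total = max(len(corpus), 1)

    def p(s: str) -> float:
        return max(corpus.count(s), 1) / total

    return min(
        math.log(p(word) / (p(word[:i]) * p(word[i:])))
        for i in range(1, len(word))
    )
```

A candidate would then satisfy the preset condition when its frequency, `mutual_information`, and `left_right_entropy` each meet their respective first preset thresholds, whose values the claim leaves open.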
PCT/CN2018/124797 2018-10-12 2018-12-28 New word recognition method and apparatus, computer device, and computer readable storage medium WO2020073523A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811191755.9 2018-10-12
CN201811191755.9A CN109408818B (en) 2018-10-12 2018-10-12 New word recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020073523A1 true WO2020073523A1 (en) 2020-04-16

Family

ID=65467079

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124797 WO2020073523A1 (en) 2018-10-12 2018-12-28 New word recognition method and apparatus, computer device, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN109408818B (en)
WO (1) WO2020073523A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914554A (en) * 2020-08-19 2020-11-10 网易(杭州)网络有限公司 Training method of field new word recognition model, field new word recognition method and field new word recognition equipment
CN111931491A (en) * 2020-08-14 2020-11-13 工银科技有限公司 Domain dictionary construction method and device
CN113033183A (en) * 2021-03-03 2021-06-25 西北大学 Network new word discovery method and system based on statistics and similarity
CN113609844A (en) * 2021-07-30 2021-11-05 国网山西省电力公司晋城供电公司 Electric power professional word bank construction method based on hybrid model and clustering algorithm

Families Citing this family (25)

Publication number Priority date Publication date Assignee Title
CN110096591A (en) * 2019-04-04 2019-08-06 平安科技(深圳)有限公司 Long text classification method, device, computer equipment and storage medium based on bag of words
CN110222157A (en) * 2019-06-20 2019-09-10 贵州电网有限责任公司 A kind of new word discovery method based on mass text
CN110457595B (en) * 2019-08-01 2023-07-04 腾讯科技(深圳)有限公司 Emergency alarm method, device, system, electronic equipment and storage medium
CN110569504B (en) * 2019-09-04 2022-11-15 北京明略软件系统有限公司 Relation word determining method and device
CN112487132A (en) * 2019-09-12 2021-03-12 北京国双科技有限公司 Keyword determination method and related equipment
CN110807322B (en) * 2019-09-19 2024-03-01 平安科技(深圳)有限公司 Method, device, server and storage medium for identifying new words based on information entropy
CN110866400B (en) * 2019-11-01 2023-08-04 中电科大数据研究院有限公司 Automatic change lexical analysis system of update
CN110825840B (en) * 2019-11-08 2023-02-17 北京声智科技有限公司 Word bank expansion method, device, equipment and storage medium
CN110991173B (en) * 2019-11-29 2023-09-29 支付宝(杭州)信息技术有限公司 Word segmentation method and system
CN110990571B (en) * 2019-12-02 2024-04-02 北京秒针人工智能科技有限公司 Method and device for acquiring discussion duty ratio, storage medium and electronic equipment
CN111061924A (en) * 2019-12-11 2020-04-24 北京明略软件系统有限公司 Phrase extraction method, device, equipment and storage medium
CN111125327A (en) * 2019-12-11 2020-05-08 中国建设银行股份有限公司 Short-session-based new word discovery method, storage medium and electronic device
CN111274361A (en) * 2020-01-21 2020-06-12 北京明略软件系统有限公司 Industry new word discovery method and device, storage medium and electronic equipment
CN111460170B (en) * 2020-03-27 2024-02-13 深圳价值在线信息科技股份有限公司 Word recognition method, device, terminal equipment and storage medium
CN111626053B (en) * 2020-05-21 2024-04-09 北京明亿科技有限公司 New scheme means descriptor recognition method and device, electronic equipment and storage medium
CN112329458B (en) * 2020-05-21 2024-05-10 北京明亿科技有限公司 New organization descriptor recognition method and device, electronic equipment and storage medium
CN111626054B (en) * 2020-05-21 2023-12-19 北京明亿科技有限公司 Novel illegal action descriptor recognition method and device, electronic equipment and storage medium
CN111966791B (en) * 2020-09-03 2024-04-19 深圳市小满科技有限公司 Method for extracting and retrieving customs data product words
CN112380856B (en) * 2020-10-20 2023-09-29 湖南大学 Automatic extraction method, system, terminal and readable storage medium for component naming in patent text
CN112463969B (en) * 2020-12-08 2022-09-20 上海烟草集团有限责任公司 Method, system, equipment and medium for detecting new words of cigarette brand and product rule words
CN113468879A (en) * 2021-07-16 2021-10-01 上海明略人工智能(集团)有限公司 Method, system, electronic device and medium for judging unknown words
CN113449082A (en) * 2021-07-16 2021-09-28 上海明略人工智能(集团)有限公司 New word discovery method, system, electronic device and medium
CN113779200A (en) * 2021-09-14 2021-12-10 中国电信集团系统集成有限责任公司 Target industry word stock generation method, processor and device
CN114822527A (en) * 2021-10-11 2022-07-29 北京中电慧声科技有限公司 Error correction method and device for converting voice into text, electronic equipment and storage medium
CN114218938A (en) * 2021-12-13 2022-03-22 北京智齿众服技术咨询有限公司 Word segmentation method and device, electronic equipment and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106970919A (en) * 2016-01-14 2017-07-21 北京国双科技有限公司 The method and device that new phrase is found
CN107180025A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of recognition methods of neologisms and device
CN107291684A (en) * 2016-04-12 2017-10-24 华为技术有限公司 The segmenting method and system of language text
CN108536667A (en) * 2017-03-06 2018-09-14 中国移动通信集团广东有限公司 Chinese text recognition methods and device

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US7917355B2 (en) * 2007-08-23 2011-03-29 Google Inc. Word detection
CN108875040B (en) * 2015-10-27 2020-08-18 上海智臻智能网络科技股份有限公司 Dictionary updating method and computer-readable storage medium
CN107092588B (en) * 2016-02-18 2022-09-09 腾讯科技(深圳)有限公司 Text information processing method, device and system
CN108595433A (en) * 2018-05-02 2018-09-28 北京中电普华信息技术有限公司 A kind of new word discovery method and device

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN106970919A (en) * 2016-01-14 2017-07-21 北京国双科技有限公司 The method and device that new phrase is found
CN107291684A (en) * 2016-04-12 2017-10-24 华为技术有限公司 The segmenting method and system of language text
CN108536667A (en) * 2017-03-06 2018-09-14 中国移动通信集团广东有限公司 Chinese text recognition methods and device
CN107180025A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of recognition methods of neologisms and device

Non-Patent Citations (1)

Title
ZHOU, QING: "Research on Network New Word Discovery Algorithm", CHINESE MASTER'S THESES FULL-TEXT DATABASE (ELECTRONIC JOURNAL), INFORMATION SCIENCE & TECHNOLOGY, 15 August 2017 (2017-08-15), ISSN: 1674-0246 *

Cited By (7)

Publication number Priority date Publication date Assignee Title
CN111931491A (en) * 2020-08-14 2020-11-13 工银科技有限公司 Domain dictionary construction method and device
CN111931491B (en) * 2020-08-14 2023-11-14 中国工商银行股份有限公司 Domain dictionary construction method and device
CN111914554A (en) * 2020-08-19 2020-11-10 网易(杭州)网络有限公司 Training method of field new word recognition model, field new word recognition method and field new word recognition equipment
CN113033183A (en) * 2021-03-03 2021-06-25 西北大学 Network new word discovery method and system based on statistics and similarity
CN113033183B (en) * 2021-03-03 2023-10-27 西北大学 Network new word discovery method and system based on statistics and similarity
CN113609844A (en) * 2021-07-30 2021-11-05 国网山西省电力公司晋城供电公司 Electric power professional word bank construction method based on hybrid model and clustering algorithm
CN113609844B (en) * 2021-07-30 2024-03-08 国网山西省电力公司晋城供电公司 Electric power professional word stock construction method based on hybrid model and clustering algorithm

Also Published As

Publication number Publication date
CN109408818B (en) 2023-04-07
CN109408818A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
WO2020073523A1 (en) New word recognition method and apparatus, computer device, and computer readable storage medium
US11544459B2 (en) Method and apparatus for determining feature words and server
WO2018205389A1 (en) Voice recognition method and system, electronic apparatus and medium
WO2019153551A1 (en) Article classification method and apparatus, computer device and storage medium
WO2020151164A1 (en) Message pushing method and apparatus, computer device and storage medium
TWI654530B (en) Method and device for screening and promoting keywords
US20220318275A1 (en) Search method, electronic device and storage medium
CN111626064B (en) Training method, training device and storage medium for neural machine translation model
WO2017091985A1 (en) Method and device for recognizing stop word
CN109063184B (en) Multi-language news text clustering method, storage medium and terminal device
US20160188569A1 (en) Generating a Table of Contents for Unformatted Text
WO2020056979A1 (en) Knowledge base search method and apparatus, and computer-readable storage medium
CN107861948B (en) Label extraction method, device, equipment and medium
CN108959259B (en) New word discovery method and system
CN112989235B (en) Knowledge base-based inner link construction method, device, equipment and storage medium
WO2023060633A1 (en) Relationship extraction method and apparatus for enhancing semantics, and computer device and storage medium
CN113408660B (en) Book clustering method, device, equipment and storage medium
CN117743577A (en) Text classification method, device, electronic equipment and storage medium
CN112528640A (en) Automatic domain term extraction method based on abnormal subgraph detection
CN111930949A (en) Search string processing method and device, computer readable medium and electronic equipment
CN111492364B (en) Data labeling method and device and storage medium
WO2018041036A1 (en) Keyword searching method, apparatus and terminal
WO2020170804A1 (en) Synonym extraction device, synonym extraction method, and synonym extraction program
JP6680472B2 (en) Information processing apparatus, information processing method, and information processing program
CN113868508A (en) Writing material query method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.08.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18936573

Country of ref document: EP

Kind code of ref document: A1