WO2020073523A1 - New word recognition method and apparatus, computer device, and computer readable storage medium - Google Patents


Info

Publication number
WO2020073523A1
WO2020073523A1 · PCT/CN2018/124797 · CN2018124797W
Authority
WO
WIPO (PCT)
Prior art keywords
word
candidate
preset
new
words
Prior art date
Application number
PCT/CN2018/124797
Other languages
French (fr)
Chinese (zh)
Inventor
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020073523A1 publication Critical patent/WO2020073523A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to the technical field of natural language processing, and in particular, to a new word recognition method, device, computer equipment, and computer-readable storage medium.
  • Chinese word segmentation is a foundational technology of current NLP (Natural Language Processing) projects, and its accuracy directly affects a project's final performance.
  • New word discovery has a direct impact on the accuracy of a word segmentation system. In traditional new word discovery, the text is segmented first, and the leftover fragments that fail to match the lexicon are guessed to be new words; but segmentation accuracy depends on the completeness of the lexicon, so this approach to new word discovery performs poorly.
  • The embodiments of the present application provide a new word recognition method, device, computer equipment, and computer-readable storage medium, which can solve the problem of the poor performance of new word discovery in traditional technologies.
  • An embodiment of the present application provides a new word recognition method, including: obtaining a text corpus, and dividing the text corpus into candidate words with a length of 2 to N through N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2, and a candidate word is a text segment obtained by dividing the text corpus; determining whether the candidate word meets a preset condition; if the candidate word meets the preset condition, determining the candidate word as a candidate new word; determining whether the candidate new word is included in a preset thesaurus; and, if the candidate new word is not included in the preset thesaurus, determining the candidate new word as a new word.
  • An embodiment of the present application further provides a new word recognition device, including: a segmentation unit, configured to obtain a text corpus and segment it into candidate words with a length of 2 to N through N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2, a candidate word being a text segment obtained by segmenting the text corpus; a judgment unit, configured to judge whether a candidate word meets a preset condition; a first recognition unit, configured to determine the candidate word as a candidate new word if it meets the preset condition; a filtering unit, configured to determine whether the candidate new word is included in a preset thesaurus; and a second recognition unit, configured to determine the candidate new word as a new word if it is not included in the preset thesaurus.
  • An embodiment of the present application further provides a computer device, which includes a memory and a processor; a computer program is stored in the memory, and the processor implements the new word recognition method when executing the computer program.
  • An embodiment of the present application further provides a computer-readable storage medium that stores a computer program which, when executed by a processor, causes the processor to execute the new word recognition method.
  • FIG. 1 is a schematic diagram of an application scenario of a new word recognition method provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a new word recognition method provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a new word recognition method provided by another embodiment of this application.
  • FIG. 4 is a schematic block diagram of a new word recognition device provided by an embodiment of this application.
  • FIG. 5 is a schematic block diagram of a new word recognition device provided by another embodiment of this application.
  • FIG. 6 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of an application scenario of a new word recognition method provided by an embodiment of the present application.
  • the application scenarios include:
  • The computer device shown in FIG. 1 is a device for recognizing new words; an application for recognizing new words is installed on it, and a user operates the computer device.
  • the computer device may be an electronic device such as a notebook computer, a tablet computer, a desktop computer, or a server.
  • The working process of the subjects in FIG. 1 is as follows: a user uses the computer device, on which a new word recognition application is installed, for new word recognition. The computer device obtains a text corpus and divides it into candidate words with a length of 2 to N through N-ary segmentation according to preset sentence endpoints; it determines whether each candidate word meets a preset condition and, if so, determines the candidate word as a candidate new word; it then determines whether the candidate new word is included in a preset thesaurus and, if not, determines it as a new word. The computer device displays the recognition result to the user, completing the recognition of new words in the text corpus.
  • FIG. 1 only illustrates a desktop computer as a computer device.
  • the type of computer device is not limited to that shown in FIG. 1.
  • the computer device may also be an electronic device such as a notebook computer or tablet computer.
  • the application scenario of the above new word recognition method is only used to explain the technical solution of the present application, and is not used to limit the technical solution of the present application.
  • FIG. 2 is a schematic flowchart of a new word recognition method provided by an embodiment of the present application.
  • the new word recognition method is applied to the terminal in FIG. 1 to complete all or part of the functions of the new word recognition method.
  • As shown in FIG. 2, the method includes the following steps S210-S250:
  • S210: Obtain a text corpus, and divide the text corpus into candidate words with a length of 2 to N through N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2, and a candidate word is a text segment obtained by segmenting the text corpus.
  • A new word is defined as follows: given a piece of text, take any segment at random; if the segment has independent meaning and is not included in an existing thesaurus or dictionary, that is, it is not a known word, the segment is judged to be a new word.
  • Whether a segment of the text is a new word can be judged as follows: if the segment expresses a complete meaning and its internal composition is very fixed, that is, the segment often appears as a fixed whole, the segment can be judged to be a word; if that word does not exist in the existing dictionary, the segment is a new word.
  • N-ary segmentation refers to sequentially dividing the text corpus into segments of N adjacent Chinese characters, each resulting text segment containing N characters. 2-ary segmentation sequentially divides the corpus into segments of two adjacent characters, 3-ary segmentation into segments of three adjacent characters, and so on.
  • For example, 2-ary segmentation of the text corpus "我是一个人" ("I am a person") yields the fragments "我是", "是一", "一个" and "个人", while 3-ary segmentation yields "我是一", "是一个" and "一个人".
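The 2-ary and 3-ary segmentation described above can be sketched in a few lines (a minimal illustration under our own function names, not the patent's actual implementation):

```python
def ngrams(text, n):
    """All contiguous fragments of exactly n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def n_ary_segmentation(text, max_n):
    """N-ary segmentation: every fragment of length 2..max_n."""
    out = []
    for n in range(2, max_n + 1):
        out.extend(ngrams(text, n))
    return out

# 2-ary segmentation of "我是一个人" gives 我是 / 是一 / 一个 / 个人;
# 3-ary segmentation gives 我是一 / 是一个 / 一个人.
```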
  • A text corpus is the language material on which new word recognition is performed.
  • the text corpus may be a piece of text, an article, a web page of a website, or a book.
  • the text corpus may be an electronic book or text stored in a mobile memory, a computer device, or the Internet, for example, text saved in Word format, or a web page of a designated website.
  • the candidate word refers to a text segment obtained by segmenting the text corpus. After segmenting the text corpus according to the preset sentence endpoints, multiple text fragments may be obtained.
  • the text fragments may be words or may not be words, and need to be filtered according to preset conditions to determine whether they are words.
  • If a text segment satisfies the preset condition, it is determined to be a word; if it does not, it is determined not to be a word. Since a text segment is in a candidate state for becoming a word, it is called a candidate word.
  • Candidate words of length 2 to N are candidate words containing 2, 3, 4, ..., N Chinese characters. For example, candidate words of length 2 to 5 have lengths of 2, 3, 4, and 5 respectively, that is, they contain 2, 3, 4, or 5 Chinese characters.
  • the preset sentence endpoint refers to setting the word boundaries of the candidate words in advance, and using these word boundaries as endpoints to segment the text corpus to obtain candidate words.
  • the preset sentence endpoint includes a punctuation mark and a preset segmentation endpoint.
  • the preset segmentation endpoint refers to a component of the text corpus that is previously set as a sentence endpoint except for punctuation.
  • The fixed components with independent meaning in the corpus are used as word boundaries for segmenting the text corpus, and serve as artificially designated sentence endpoints.
  • the text corpus generally includes text, punctuation marks, carriage returns, spaces and other components.
  • Punctuation marks are the symbols that punctuate sentences in the text corpus once a complete meaning has been expressed; they act as pauses that break sentences, such as the comma, semicolon, double quotation mark, and period.
  • The preset segmentation endpoints include, besides punctuation, symbols in the text corpus that have a pausing or sentence-breaking function, such as space characters and carriage returns, as well as pause words and stop words with independent semantics.
  • A pause word is a word used for a pause in the text corpus, and a stop word is a word in the text that can likewise be filtered out as a boundary.
  • Pause words and stop words generally have independent meanings. For example, commonly used pause words include words such as "you", "me", "her" and "de", while stop words include words such as "we", "based", "said" and "act".
  • the preset segmentation endpoints can be regarded as an extension of punctuation marks. Punctuation symbols are generally used to form sentence breaks between sentences.
  • the preset segmentation endpoints can be regarded as the formation of pauses or sentence breaks between sentence components within sentences.
  • Word boundaries can be identified as endpoints of sentences like punctuation marks.
  • In specific implementation, common words and common phrases are used as independent features in segmenting the text: pause words and stop words are used to split the text corpus and obtain the left and right boundaries of candidate words. Function words such as "我" ("I"), and fixed, inseparable words with independent semantics such as "we", "you" and "these", serve as the left or right word boundaries of candidate words. Segmenting through preset segmentation endpoints in this way can effectively improve the accuracy of new word discovery and the efficiency of new word recognition.
  • In specific implementation, the text corpus in which new words are to be recognized is obtained; the text corpus may be a piece of text, an article, a web page of a website, or a book.
  • The text corpus is then segmented according to preset sentence endpoints, including punctuation marks, spaces, carriage returns, pause words and stop words, and is divided by N-ary segmentation into candidate words with a length of 2 to N, where N is a natural number and N ≥ 2. For example, if N is 5, the text corpus is divided into candidate words of lengths 2, 3, 4, and 5, that is, candidate words of two, three, four, and five characters.
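Step S210 can be sketched as follows. The tiny endpoint inventory and the sample strings are our own illustrative assumptions; a real system would use a full list of punctuation marks, pause words, and stop words:

```python
import re

# Hypothetical, tiny endpoint inventory for illustration only.
PUNCTUATION = "，。；：！？、\n "
STOP_WORDS = ["我们", "的"]

def split_by_endpoints(corpus):
    """Cut the corpus at every preset sentence endpoint."""
    pattern = "|".join([re.escape(w) for w in STOP_WORDS] +
                       ["[" + re.escape(PUNCTUATION) + "]"])
    return [seg for seg in re.split(pattern, corpus) if seg]

def candidate_words(corpus, max_n):
    """Candidates of length 2..max_n; fragments never cross an endpoint."""
    cands = set()
    for seg in split_by_endpoints(corpus):
        for n in range(2, max_n + 1):
            for i in range(len(seg) - n + 1):
                cands.add(seg[i:i + n])
    return cands
```

Because the corpus is cut at the endpoints first, no candidate ever spans a punctuation mark or a stop word, which is what shrinks the candidate set compared with raw N-gram extraction.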
  • N needs to be set according to the specific text: some idioms run to seven or eight characters, and some company names are even longer, so a specific value of N must be chosen for each text corpus. Further, for the same text corpus, different values of N can be set and the recognition results compared; removing the results that the runs share isolates the long-grained new words, and taking those long-grained results into account gives better new word recognition.
  • The preset condition is the condition for identifying a candidate word as a candidate new word. If the candidate word meets the preset condition, it is identified as a word and determined to be a candidate new word; if it does not, it is identified as not a word and judged not to be a candidate new word.
  • Specifically, the candidate word satisfies first preset thresholds for word frequency, mutual information, and left-right information entropy respectively, or satisfies second preset thresholds for word frequency, mutual information, and sentence endpoints respectively.
  • the computer device divides the obtained text corpus to obtain text fragments as candidate words.
  • Some candidate words are words, and some are not words. Therefore, it is necessary to filter the candidate words using preset conditions.
  • Text fragments that cannot be words are filtered out, and those that can be words are retained for further recognition. Therefore, whether a candidate word satisfies the preset condition determines whether it can become a word and hence a candidate new word.
  • If the candidate word satisfies the preset condition, step S230 is entered. If the candidate word does not satisfy the preset condition, it cannot be a word, step S221 is entered, and the candidate word is filtered out and discarded, which further narrows the scope of new word recognition and improves its efficiency and accuracy.
  • the candidate new word refers to a candidate word recognized as a word.
  • the text corpus is divided into 2-N candidate words, some of which cannot be words.
  • Among the candidate words obtained by segmentation there are fragments that, by human judgment based on experience, obviously cannot become words. Therefore, the obtained candidate words need to be filtered according to the preset condition, and a candidate word that passes is identified as a candidate new word; a candidate new word is a candidate word recognized as a word. This filters out the text fragments that cannot become words and narrows the scope of new word recognition.
  • In specific implementation, the computer device filters the obtained candidate words: candidate words that cannot be words are removed through the preset conditions, and only candidate words that can be words are kept for further new word recognition. If a candidate word satisfies the preset condition, it is identified as a word and determined to be a candidate new word; if it does not, it is not identified as a word and is judged not to be a candidate new word. This further narrows the scope of new word recognition and improves its accuracy, efficiency, and recall.
  • For example, the preset threshold of the minimum left-right information entropy of a candidate word is set to 1, the preset threshold of the minimum mutual information to 1, and the preset threshold of the sentence endpoints to 3.
  • A minimum left-right information entropy threshold of 1 means that the smaller of the candidate word's left-neighbor and right-neighbor information entropies must be at least 1.
  • The sentence endpoint threshold refers to the number of occurrences of sentence endpoints at the left or right boundary of the candidate word. Please refer to Table 1: after recognizing a corpus, the results are as shown in Table 1. According to the above criteria, "Nanshan District", "Nanshan", and "Puhui" are recognized as candidate new words, while "Go to Nanshan" is not a word and is excluded from the candidate new words.
  • S240 Determine whether the candidate new word is included in the preset vocabulary.
  • the preset thesaurus can also be called an existing thesaurus, which refers to a set of known words that have been determined as words, and can be a preset dictionary.
  • In specific implementation, the computer device divides the text corpus into candidate words with a length of 2 to N through N-ary segmentation, a candidate word being a text segment obtained by segmenting the text corpus. If a candidate word satisfies the preset condition, it is determined to be a candidate new word; however, this only selects, from the candidate words, the text fragments that can become words.
  • the candidate new words include words that have been confirmed as words in the natural language processing technology and newly recognized words. Therefore, the words that have been confirmed as words need to be filtered out and screened out as recognized new words.
  • the preset thesaurus contains words that have been confirmed as words in natural language processing technology.
  • The preset thesaurus may be any of the various dictionaries existing in traditional technology, or a manually assembled thesaurus, for example the union of several existing dictionaries; the preset thesaurus may also contain new words that have been recognized in the past.
  • If the candidate new word can be matched in the preset thesaurus, it is judged to be included in the preset thesaurus; the candidate new word is then a known word, and step S241 is entered to filter out the existing word. If the candidate new word cannot be matched in the preset thesaurus, it is judged not to be included in the preset thesaurus; the candidate new word is then an unknown word, a recognized new word, and step S250 is entered.
  • A recognized new word is generally a word that has not been seen before, which completes the recognition of new words in the given text corpus. For example, referring to Table 1, if the identified candidate new word "Puhui" is not included in the preset lexicon, "Puhui" is judged to be a recognized new word.
  • The embodiments of the present application belong to natural language processing within speech and semantics technology. The text corpus is accurately segmented through N-ary segmentation combined with preset sentence endpoints to obtain candidate words with a length of 2 to N, without depending on any existing thesaurus: all possible text fragments are extracted from a large-scale corpus, based on the common characteristics of words, as candidate words. Using the preset sentence endpoints as independent features and as word boundaries for segmenting the text corpus reduces the number of candidate words and improves the accuracy and efficiency of segmentation; whether each candidate word meets the preset conditions is then identified.
  • If a candidate word meets the preset conditions, it is recognized as a candidate new word with independent semantics, which narrows the scope of new word recognition. All extracted candidate new words are then compared with the existing thesaurus, and those not included in the preset thesaurus are recognized as new words. Screening out the candidate new words not included in the thesaurus as the recognized new words can effectively improve the accuracy, efficiency, and recall of new word discovery.
  • the step of determining whether the candidate word meets a preset condition includes:
  • The left-right information entropy is the smaller of the left-neighbor information entropy and the right-neighbor information entropy of the candidate word.
  • If the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left-right information entropy, it is determined that the candidate word meets the preset condition.
  • mutual information refers to the internal cohesion of the candidate words, and may also be referred to as the degree of internal coagulation or the degree of cohesion of the candidate words.
  • The formula for mutual information is:

    MI(w) = log( p(w) / ( p(l) · p(r) ) )

    where w represents the candidate word, p(x) is the probability that a string x appears in the entire corpus, l represents the left character string that constitutes the candidate word, and r represents the right character string that constitutes the candidate word.
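A sketch of computing mutual information from this formula. The formula as stated uses a single left/right split; taking the minimum over all possible splits, and using a base-2 logarithm, are conventions we assume here, and the probability table in the comment is purely illustrative:

```python
import math

def mutual_information(word, prob):
    """Cohesion of `word`: log2(p(w) / (p(l) * p(r))), minimised over
    every way of splitting the word into a left string l and a right
    string r. `prob` maps strings to their corpus probabilities.
    The log base only rescales the threshold."""
    return min(
        math.log2(prob[word] / (prob[word[:i]] * prob[word[i:]]))
        for i in range(1, len(word))
    )

# Illustrative numbers: if p("南山") = 4e-4 and p("南") = p("山") = 1e-2,
# then MI = log2(4e-4 / 1e-4) = 2 bits.
```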
  • For example, the most cohesive candidate words are words such as "bat", "spider", "wandering", "uneasy", and "rose" (in the original Chinese, two-character words): almost every character in these words always appears together with the other character and is hardly ever used on other occasions.
  • Information entropy here measures the degree of freedom of a candidate word, that is, the richness of its left or right neighbors. The information entropy of a candidate word is proportional to the number of its left or right neighbor words: if the candidate word can be matched with more left or right neighbor words, its information entropy is larger; if it can be matched with fewer, its information entropy is smaller.
  • the information entropy of a candidate word can also be called left and right information entropy.
  • the left and right information entropy of a candidate word, that is, the degree of freedom of the candidate word is defined as the smaller value of the information entropy of its left neighbor word and right neighbor word.
  • The left-neighbor information entropy, also called the left information entropy, is the richness of the left neighbors of the candidate word, that is, of the characters that can appear on its left side. The formula for the left information entropy is:

    H_L(W) = − Σ_a p(a|W) · log p(a|W)

    where W represents the candidate word, a ranges over the characters appearing on the left of the candidate word, and p(a|W) is the conditional probability that character a appears on the left of W. A conditional probability p(A|B) is the probability that event A occurs given that another event B has occurred.
  • The right-neighbor information entropy, also called the right information entropy, is the richness of the right neighbors of the candidate word, that is, of the characters that can appear on its right side. The formula for the right information entropy is:

    H_R(W) = − Σ_b p(b|W) · log p(b|W)

    where W represents the candidate word, b ranges over the characters appearing on the right of the candidate word, and p(b|W) is the probability that character b appears on the right of W.
  • A word can appear in many environments and so has a very rich set of left and right neighbors; this degree of freedom is expressed by information entropy, which measures how much information, on average, learning the outcome of an event brings: if an outcome has probability p, the amount of information gained on learning it occurred is defined as −log(p). Information entropy thus measures how random the set of left or right neighbors of a candidate word is. For example, in "eating grapes without spitting grape skins, not eating grapes yet spitting grape skins" (吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮), the word "grape" (葡萄) appears four times.
  • Its left neighboring characters are {eat, spit, eat, spit}, and its right neighboring characters are {not, skin, instead, skin}.
  • The information entropy of the left neighbors of "grape" is −(1/2)·log(1/2) − (1/2)·log(1/2) ≈ 0.693,
  • and the information entropy of its right neighbors is −(1/2)·log(1/2) − (1/4)·log(1/4) − (1/4)·log(1/4) ≈ 1.04. It can be seen that, in this sentence, the right neighbors of "grape" are more varied.
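The grape example can be checked directly in code (a small sketch; the natural logarithm reproduces the ≈ 0.693 and ≈ 1.04 values above, and the neighbor multisets are transcribed from the sentence):

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of a multiset of neighbor characters."""
    total = len(neighbors)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(neighbors).values())

# Neighbors of "grape" across its four occurrences in the tongue-twister.
left = ["eat", "spit", "eat", "spit"]
right = ["not", "skin", "instead", "skin"]

left_h = neighbor_entropy(left)    # two outcomes at 1/2 each
right_h = neighbor_entropy(right)  # outcomes at 1/2, 1/4, 1/4
freedom = min(left_h, right_h)     # the word's degree of freedom
```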
  • Word frequency (Term Frequency, abbreviated TF) is the number of times a given word appears in the text corpus; the importance of a word increases proportionally with its frequency in the document.
  • In specific implementation, the number of occurrences of each candidate word is counted. For example, in a corpus of 24 million words, "Nanshan" appeared a total of 2,774 times, so the word frequency of "Nanshan" is 2,774; the word "region" appeared 4,797 times, so the word frequency of "region" is 4,797.
  • The parameters that reflect the word boundary information of a candidate word are its sentence endpoints and its left-right information entropy. Since both reflect word boundary information, they play the same role in recognizing candidate words, and in new word recognition it suffices for either one to satisfy its condition.
  • Take as an example the first preset condition under which a candidate word is determined to be a candidate new word: the word frequency of the candidate word satisfies the first preset threshold of word frequency, the mutual information satisfies the first preset threshold of mutual information, and the left-right information entropy satisfies the first preset threshold of left-right information entropy.
  • the first preset threshold value of the word frequency of the candidate word is 10
  • the first preset threshold value of the mutual information of the candidate word is 1
  • the first preset threshold value of the left and right information entropy of the candidate word is 1.
  • the divided candidate words satisfy the first preset thresholds of the word frequency, mutual information, and left and right information entropy, respectively, which means that the word frequency of the candidate word is greater than or equal to 10, the mutual information is greater than or equal to 1, and the left and right information entropy is greater than or equal to 1.
  • If these conditions are met, the candidate word is judged to be a candidate new word; alternatively, if the mutual information of a segmented candidate word is greater than 1 and its right information entropy is greater than 1, the candidate word can also be determined to be a candidate new word.
  • In Table 1, since the mutual information and the left-right information entropy of "Nanshan District" and "Nanshan" are both greater than 1, "Nanshan District" and "Nanshan" are identified as candidate new words.
  • The larger the first preset thresholds of word frequency, mutual information, and left-right information entropy are set, the more accurate the identified candidate new words; the smaller these thresholds are set, the more candidate new words are identified.
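The first preset condition can be sketched as a simple filter, using the example thresholds given above (10 for word frequency, 1 for mutual information, 1 for the left-right information entropy); the per-word statistics fed to it are hypothetical stand-ins for entries like those in Table 1:

```python
def meets_first_condition(freq, mi, left_h, right_h,
                          min_freq=10, min_mi=1.0, min_entropy=1.0):
    """First preset condition: word frequency, mutual information and
    the smaller of the left/right information entropies must each
    reach their threshold (defaults follow the example thresholds)."""
    return (freq >= min_freq and
            mi >= min_mi and
            min(left_h, right_h) >= min_entropy)
```

Raising the three thresholds makes the surviving candidates more accurate; lowering them lets more candidates through, exactly the trade-off described above.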
  • the step of determining whether the candidate word meets a preset condition includes:
  • the number of sentence endpoints refers to the number of left endpoints of the candidate words or the number of right endpoints of the candidate words
  • the number of left endpoints refers to the number of occurrences of the left endpoint of the candidate word
  • the number of right endpoints refers to the number of occurrences of the right endpoint of the candidate word
  • The endpoints of a candidate word are its left and right neighboring word boundaries, where a word boundary is the edge, or dividing line, of a word; through these dividing lines the text corpus is divided into different candidate words.
  • the left end point of the candidate word refers to the left neighbor word boundary of the candidate word
  • the right end point of the candidate word refers to the right neighbor word boundary of the candidate word.
  • the number of left and right end points refers to the number of occurrences of the left and right word boundaries of the candidate word respectively.
  • Word boundaries include punctuation marks as well as the spaces, carriage returns, pause words, and stop words included in the preset segmentation endpoints.
  • The preset sentence endpoints serving as word boundaries in the text corpus may be replaced with a unified identifier. If the word boundaries are replaced with a unified identifier, the number of left endpoints is the number of identifiers appearing immediately to the left of the candidate word, and the number of right endpoints is the number of identifiers appearing immediately to the right of the candidate word. For example, suppose the unified identifier is "*" and the corpus is: "The movie theater is a venue for showing movies to the audience. With the progress and development of movies, movie theaters built specifically for screening movies appeared. The shape, size, proportion and acoustic technology of the movie theater have changed a lot. The movie theater must meet the technical requirements for film screening." After every sentence endpoint is replaced with "*", counting the identifiers adjacent to each occurrence of the candidate word "movie theater" shows that its left endpoint appears 3 times and its right endpoint appears 4 times.
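The endpoint counting in this example can be sketched as follows (the function name is hypothetical; the boundary identifier "*" follows the example above):

```python
import re

def count_endpoints(candidate, text, boundary="*"):
    """Count how many occurrences of the candidate word are flanked by
    the unified boundary identifier on the left and on the right."""
    left = right = 0
    for m in re.finditer(re.escape(candidate), text):
        if m.start() > 0 and text[m.start() - 1] == boundary:
            left += 1
        if m.end() < len(text) and text[m.end()] == boundary:
            right += 1
    return left, right
```

In the toy string `"*cinema*show*cinema*"`, for instance, the candidate "cinema" has 2 left endpoints and 2 right endpoints.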
  • Stop words, spaces, carriage returns, and punctuation marks included in the preset segmentation endpoints are used as the left and right word boundaries of the candidate words.
  • Through word boundary statistics, the numbers of occurrences of the left and right endpoints of each candidate word are counted.
  • By refining the granularity of new word recognition through statistics on the endpoints of candidate words, low-frequency and long-granularity new words can be effectively found, which can effectively improve the efficiency and accuracy of new word recognition.
  • The sentence endpoints of a candidate word and its left and right information entropy both reflect the word boundary information of the candidate word.
  • The sentence endpoints and the left and right information entropy play the same role in the recognition of the candidate word, so it suffices to satisfy either one of the two conditions.
  • If the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints, it is determined that the candidate word meets the preset condition and the candidate word is determined as a candidate new word. Take the following as an example.
  • the second preset threshold value of the word frequency of the candidate word is 10
  • the second preset threshold value of the lowest left and right information entropy of the candidate word is set to 1
  • the first preset threshold value of the sentence endpoint is 3.
  • Setting the second preset threshold of the lowest left and right information entropy to 1 means that the smaller of the candidate word's left-neighbour information entropy and right-neighbour information entropy must reach 1.
  • The first preset threshold of the number of sentence endpoints is a threshold on the number of occurrences of sentence endpoints at the left or right boundary of the candidate word. After recognizing a corpus, the result is shown in Table 1.
  • That the segmented candidate words meet the second preset threshold of mutual information and the first preset threshold of the number of sentence endpoints respectively means that the mutual information of the candidate word is greater than 1 and the number of occurrences of its sentence endpoints is greater than 3. If the mutual information of a segmented candidate word is greater than 1 and the number of occurrences of its sentence endpoints is greater than 3, the candidate word is judged to be a candidate new word. Please continue to refer to Table 1. In Table 1, since the mutual information and the number of sentence endpoints of "Nanshan District", "Nanshan" and "Puhui" all meet the thresholds, "Nanshan District", "Nanshan" and "Puhui" are judged to be candidate new words.
  • FIG. 3 is a schematic flowchart of a new word recognition method according to another embodiment of the present application.
  • the step of determining whether the candidate word meets a preset condition includes:
  • S212: Determine whether the word frequency, mutual information, and left and right information entropy of the candidate word are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left and right information entropy, or whether the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints;
  • If the word frequency, mutual information, and left and right information entropy of the candidate word are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left and right information entropy, or if the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints, it is determined that the candidate word meets the preset condition.
  • The first preset threshold of word frequency and the second preset threshold of word frequency may be the same, and likewise the first preset threshold of mutual information and the second preset threshold of mutual information may be the same.
  • If the computer device determines that the candidate word satisfies the preset condition because its word frequency, mutual information, and left and right information entropy are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left and right information entropy, the candidate word is identified as a first candidate new word. Likewise, if the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints, the candidate word is identified as a second candidate new word. Taking the union of the first candidate new words and the second candidate new words as the final candidate new words can improve the accuracy of candidate new word recognition, and the process goes to step S230 to further identify the candidate new words. Otherwise, if the candidate word meets neither set of conditions, the process goes to step S221: the candidate word cannot form a word and is discarded.
  • For example, under the first set of conditions, "Nanshan District" and "Nanshan" are identified as candidate new words, while "Go to Nanshan" is not a word and is excluded; under the second set of conditions, "Nanshan District", "Nanshan" and "Puhui" are identified. Combining the two identifies "Nanshan District", "Nanshan" and "Puhui" as candidate new words, thus avoiding the failure to identify "Puhui" as a candidate new word and improving the accuracy of new word recognition.
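A sketch of combining the two recognition routes by taking their union. The statistics and threshold values below are hypothetical illustrations in the spirit of the example; they are not Table 1's actual figures, and the names are invented for the sketch.

```python
def final_candidates(stats, freq_t=10, mi_t=1.0, entropy_t=1.0, endpoint_t=3):
    """stats maps each candidate word to (freq, mi, lr_entropy, endpoints).
    Route 1 checks frequency, mutual information, and left/right entropy;
    route 2 checks frequency, mutual information, and sentence endpoints;
    the final candidate new words are the union of the two routes."""
    route1 = {w for w, (f, mi, h, _) in stats.items()
              if f >= freq_t and mi >= mi_t and h >= entropy_t}
    route2 = {w for w, (f, mi, _, e) in stats.items()
              if f >= freq_t and mi >= mi_t and e >= endpoint_t}
    return route1 | route2

# hypothetical statistics: (frequency, mutual information, entropy, endpoints)
stats = {
    "Nanshan District": (120, 2.3, 1.5, 5),
    "Nanshan":          (300, 1.8, 1.2, 7),
    "Puhui":            (40,  1.5, 0.5, 6),   # missed by route 1, caught by route 2
    "Go to Nanshan":    (15,  0.4, 0.2, 1),   # rejected by both routes
}
```

With these illustrative numbers, "Puhui" fails the entropy route but passes the endpoint route, so the union still recognizes it, mirroring the example above.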
  • The numbers of occurrences of the left and right endpoints of the candidate words are counted. Because this refines the granularity of new word recognition, statistics on the endpoints of candidate words can effectively find low-frequency and long-granularity new words, which can effectively improve the efficiency and accuracy of new word recognition.
  • The larger the preset threshold of word frequency, the preset threshold of mutual information, the preset threshold of left and right information entropy, and the preset threshold of the number of sentence endpoints are set, the more accurate the recognition of candidate new words is; the smaller these preset thresholds are set, the more candidate new words are identified.
  • The step of dividing the text corpus into candidate words with a length of 2-N by N-ary segmentation according to the preset sentence endpoints, where N is a natural number and N ≥ 2, further includes: replacing the preset sentence endpoints in the text corpus with a unified identifier.
  • Replacing the preset sentence endpoints in the text corpus with a unified identifier refers to replacing the punctuation marks included in the preset sentence endpoints and the preset segmentation endpoints, including stop words, carriage returns, and spaces, with the unified identifier. For example, taking stop words, spaces, and carriage returns as the preset segmentation endpoints, the spaces, carriage returns, stop words, and punctuation marks are all replaced with "*".
  • The replaced text is divided into candidate words with a length of 2-N by N-ary segmentation, and the number of occurrences of each candidate word is counted. For example, in a corpus of 24 million characters, "Nanshan" appears a total of 2774 times, and the word "District" appears 4797 times.
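The replacement and counting steps can be sketched together as below. The boundary pattern, the identifier "*", and the function name are illustrative assumptions, not the patent's prescribed implementation.

```python
import re
from collections import Counter

def ngram_counts(corpus, n_max=4, boundary=r"[，。！？、,.!?\s]+"):
    """Replace preset sentence endpoints with the unified identifier '*',
    then count every fragment of 2..n_max characters that does not cross
    a boundary."""
    text = re.sub(boundary, "*", corpus)
    counts = Counter()
    for segment in text.split("*"):
        for n in range(2, n_max + 1):
            for i in range(len(segment) - n + 1):
                counts[segment[i:i + n]] += 1
    return counts
```

Because candidates never cross the identifier, a fragment such as the one spanning a comma is never counted, while a recurring fragment such as "南山" accumulates a count across segments.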
  • the step of determining the candidate new word as a new word further includes:
  • S260 Obtain the length of the new word, and determine whether the length of the new word is greater than or equal to a preset length threshold;
  • the length of the new word refers to the number of characters contained in the new word.
  • For example, the Chinese word for "movie theater" contains three characters, so its length is 3.
  • the preset length threshold refers to a preset length threshold of words.
  • the preset length threshold can be set manually.
  • A long-granularity new word is a recognized new word whose number of characters is greater than or equal to the preset length threshold. For example, if the preset length threshold is 5 and a recognized new word contains five or more characters, the new word is identified as a long-granularity new word. For long-granularity new words, corresponding processing can be performed according to their attributes.
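A minimal sketch of the length check; the threshold value 5 follows the example above, and the function name is hypothetical.

```python
def is_long_granularity(new_word, length_threshold=5):
    """A new word is long-granularity when it contains at least
    length_threshold characters."""
    return len(new_word) >= length_threshold
```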
  • the step of determining the candidate new word as a new word further includes:
  • S270 Obtain the word frequency of the new word, and determine whether the word frequency of the new word is lower than a preset word frequency threshold;
  • the low-frequency new words refer to the recognized new words whose word frequency in the text corpus is lower than a preset word frequency threshold.
  • If the word frequency of the new word is lower than the preset word frequency threshold, the new word is a low-frequency new word. Since low-frequency new words are new words that are not commonly used, when recognizing new words one can, depending on the text corpus, choose whether to include low-frequency new words in the preset lexicon. Choosing not to include low-frequency new words in the preset lexicon reduces the size of the preset lexicon, improves the matching efficiency between the new word recognition process and the preset lexicon, and thus improves the efficiency of new word recognition.
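The optional filtering of low-frequency new words before lexicon inclusion might look like this (the threshold value and all names are illustrative assumptions):

```python
def words_for_lexicon(new_words, freqs, freq_threshold=3, keep_low_frequency=False):
    """Decide which recognized new words to add to the preset lexicon.
    Excluding low-frequency new words keeps the lexicon small and speeds
    up later matching against it."""
    low = [w for w in new_words if freqs[w] < freq_threshold]
    high = [w for w in new_words if freqs[w] >= freq_threshold]
    return high + low if keep_low_frequency else high
```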
  • FIG. 4 is a schematic block diagram of a new word recognition apparatus provided by an embodiment of the present application.
  • an embodiment of the present application further provides a new word recognition device.
  • the new word recognition device includes a unit for performing the above new word recognition method, and the device can be configured in a desktop computer or other computer equipment.
  • the new word recognition device 400 includes a segmentation unit 401, a judgment unit 402, a first recognition unit 403, a filtering unit 404 and a second recognition unit 405.
  • the segmentation unit 401 is used to obtain a text corpus, and divide the text corpus into candidate words with a length of 2-N by N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ⁇ 2 ,
  • the candidate word refers to a text segment obtained by segmenting the text corpus;
  • the judging unit 402 is used to judge whether the candidate word meets a preset condition
  • the first recognition unit 403 is configured to determine the candidate word as a candidate new word if the candidate word meets the preset condition
  • the filtering unit 404 is used to determine whether the candidate new word is included in the preset thesaurus.
  • the second recognition unit 405 is configured to determine the candidate new word as a new word if the candidate new word is not included in the preset thesaurus.
  • the preset sentence endpoint includes a punctuation mark and a preset segmentation endpoint
  • the preset segmentation endpoint refers to a component of the text corpus that is previously set as a sentence endpoint except for punctuation marks.
  • FIG. 5 is a schematic block diagram of a new word recognition apparatus provided by another embodiment of the present application.
  • the judgment unit 402 includes:
  • The first obtaining subunit 4021 is configured to obtain the mutual information and the left and right information entropy of the candidate word, and to obtain the word frequency of the candidate word, where the left and right information entropy refers to the smaller of the candidate word's left-neighbour information entropy and right-neighbour information entropy;
  • The first judgment subunit 4022 is used to judge whether the word frequency, mutual information, and left and right information entropy of the candidate word are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left and right information entropy;
  • The first determination subunit 4023 is used to determine that the candidate word meets the preset condition if the word frequency, mutual information, and left and right information entropy of the candidate word are respectively greater than or equal to the first preset threshold of word frequency, the first preset threshold of mutual information, and the first preset threshold of left and right information entropy.
  • the judgment unit 402 includes:
  • the second obtaining subunit is used to obtain mutual information of the candidate words, and obtain the word frequency of the candidate words and the number of sentence endpoints of the candidate words.
  • The number of sentence endpoints refers to the number of left endpoints or the number of right endpoints of the candidate word; the number of left endpoints is the number of times the left endpoint of the candidate word appears, and the number of right endpoints is the number of times the right endpoint of the candidate word appears;
  • The second judgment subunit is used to judge whether the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints;
  • The second determination subunit is configured to determine that the candidate word meets the preset condition if the word frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset threshold of word frequency, the second preset threshold of mutual information, and the first preset threshold of the number of sentence endpoints.
  • the device 400 further includes:
  • the replacement unit 406 is configured to replace the preset sentence endpoint in the text corpus with a unified identifier.
  • the device 400 further includes:
  • the third obtaining unit 407 is configured to obtain the length of the new word and determine whether the length of the new word is greater than or equal to a preset length threshold;
  • the third recognition unit 408 is configured to recognize that the new word is a long-grained new word if the length of the new word is greater than or equal to the preset length threshold.
  • the device 400 further includes:
  • the fourth obtaining unit 409 is configured to obtain the word frequency of the new word and determine whether the word frequency of the new word is lower than a preset word frequency threshold;
  • the fourth recognition unit 410 is configured to recognize the new word as a low-frequency new word if the word frequency of the new word is lower than the preset word frequency threshold.
  • The division of the units in the above new word recognition device is for illustration only. In other embodiments, the new word recognition device may be divided into different units as needed, or the units of the new word recognition device may adopt different connection sequences and methods to complete all or part of the functions of the above new word recognition device.
  • the above new word recognition device may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 6.
  • FIG. 6 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 600 may be an electronic device such as a desktop computer or a tablet computer, or may be a component or part in other devices.
  • the computer device 600 includes a processor 602, a memory, and a network interface 605 connected through a system bus 601, where the memory may include a non-volatile storage medium 603 and an internal memory 604.
  • the non-volatile storage medium 603 can store an operating system 6031 and a computer program 6032.
  • When executed, the computer program 6032 may cause the processor 602 to execute the above new word recognition method.
  • the processor 602 is used to provide computing and control capabilities to support the operation of the entire computer device 600.
  • the internal memory 604 provides an environment for the operation of the computer program 6032 in the non-volatile storage medium 603.
  • the processor 602 can execute the above-mentioned new word recognition method.
  • the network interface 605 is used for network communication with other devices.
  • the structure shown in FIG. 6 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 600 to which the solution of the present application is applied.
  • the specific computer device 600 may include more or less components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are the same as those in the embodiment shown in FIG. 6, which will not be repeated here.
  • the processor 602 is used to run the computer program 6032 stored in the memory, so as to implement the new word recognition method of the embodiment of the present application.
  • The processor 602 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • The computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of the new word recognition method described in the foregoing embodiments.
  • The computer-readable storage medium may be any storage medium that can store a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.


Abstract

A new word recognition method and apparatus, a computer device, and a computer readable storage medium. The method comprises: obtaining a text corpus, and segmenting the text corpus into candidate words having a length of 2-N by means of N-ary segmentation according to a preset sentence endpoint, N being a natural number and N >= 2 (S210); determining whether a candidate word satisfies a preset condition (S220); determining the candidate word as a candidate new word if the candidate word satisfies the preset condition (S230); determining whether the candidate new word is comprised in a preset vocabulary (S240); and determining the candidate new word as a new word if the candidate new word is not comprised in the preset vocabulary (S250).

Description

New word recognition method, device, computer equipment and computer readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 12, 2018, with application number 201811191755.9 and entitled "New word recognition method, device, computer equipment and storage medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the technical field of natural language processing, and in particular to a new word recognition method, device, computer equipment, and computer-readable storage medium.
Background
Chinese word segmentation is a basic technology of current NLP (Natural Language Processing) projects, and its accuracy directly affects the final performance of an NLP project. New word discovery has a direct impact on the accuracy of a word segmentation system. In traditional new word discovery technology, the text is usually segmented first, and the remaining fragments that fail to match are then guessed to be new words; however, the accuracy of word segmentation depends on the completeness of the lexicon, so the effect of new word discovery is poor.
Summary of the invention
The embodiments of the present application provide a new word recognition method, device, computer equipment, and computer-readable storage medium, which can solve the problem in traditional technology that the effect of new word discovery is poor.
In a first aspect, an embodiment of the present application provides a new word recognition method, including: obtaining a text corpus, and dividing the text corpus into candidate words with a length of 2-N by N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2, and a candidate word refers to a text segment obtained by segmenting the text corpus; judging whether the candidate word meets a preset condition; if the candidate word meets the preset condition, determining the candidate word as a candidate new word; judging whether the candidate new word is included in a preset lexicon; and if the candidate new word is not included in the preset lexicon, determining the candidate new word as a new word.
In a second aspect, an embodiment of the present application further provides a new word recognition device, including: a segmentation unit, configured to obtain a text corpus and divide the text corpus into candidate words with a length of 2-N by N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2, and a candidate word refers to a text segment obtained by segmenting the text corpus; a judgment unit, configured to judge whether the candidate word meets a preset condition; a first recognition unit, configured to determine the candidate word as a candidate new word if the candidate word meets the preset condition; a filtering unit, configured to judge whether the candidate new word is included in a preset lexicon; and a second recognition unit, configured to determine the candidate new word as a new word if the candidate new word is not included in the preset lexicon.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where a computer program is stored in the memory, and the processor implements the new word recognition method when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium that stores a computer program which, when executed by a processor, causes the processor to perform the new word recognition method.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.
FIG. 1 is a schematic diagram of an application scenario of the new word recognition method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of the new word recognition method provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of the new word recognition method provided by another embodiment of the present application;

FIG. 4 is a schematic block diagram of the new word recognition device provided by an embodiment of the present application;

FIG. 5 is a schematic block diagram of the new word recognition device provided by another embodiment of the present application; and

FIG. 6 is a schematic block diagram of the computer device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the scope of protection of this application.
Please refer to FIG. 1, which is a schematic diagram of an application scenario of the new word recognition method provided by an embodiment of the present application. The application scenario includes:
(1) Computer equipment. The computer device shown in FIG. 1 is a device for recognizing new words, on which an application for new word recognition is installed, and the computer device is operated manually. The computer device may be an electronic device such as a notebook computer, a tablet computer, a desktop computer, or a server.
The working process of each subject in FIG. 1 is as follows: a user uses the computer device for new word recognition, and an application for new word recognition is installed on the computer device. The computer device obtains a text corpus, divides the text corpus into candidate words with a length of 2-N by N-ary segmentation according to preset sentence endpoints, judges whether each candidate word meets a preset condition, and, if the candidate word meets the preset condition, determines the candidate word as a candidate new word. The computer device then judges whether the candidate new word is included in a preset lexicon and, if not, determines the candidate new word as a new word. Finally, the computer device displays the recognition result to the user to complete the recognition of new words in the text corpus.
It should be noted that FIG. 1 only illustrates a desktop computer as the computer device. In actual operation, the type of computer device is not limited to that shown in FIG. 1; the computer device may also be an electronic device such as a notebook computer or a tablet computer. The above application scenario of the new word recognition method is only used to explain the technical solution of the present application and is not intended to limit it.
FIG. 2 is a schematic flowchart of the new word recognition method provided by an embodiment of the present application. The new word recognition method is applied to the terminal in FIG. 1 to complete all or part of the functions of the new word recognition method.
As shown in FIG. 2, the method includes the following steps S210-S250:
S210: Obtain a text corpus and, according to preset sentence endpoints, segment the corpus by N-gram segmentation into candidate words of length 2 to N, where N is a natural number and N ≥ 2, and a candidate word is a text fragment obtained by segmenting the corpus.
Here, a new word is a fragment taken from a given text that carries independent meaning and is not contained in any existing lexicon or dictionary, i.e., is not a known word. To judge whether a fragment of a text is a new word: if the fragment combines freely on both sides, that is, its left and right ends pair with many different characters or words to express complete meanings, while its internal composition is fixed, i.e., the fragment usually appears as a fixed whole, the fragment can be judged to be a word; if that word is absent from existing dictionaries, it is a new word. For example, if in a text the left and right ends of '普惠金融' (inclusive finance) each pair with various different characters or words in semantic descriptions, and '普惠金融' is not contained in existing dictionaries, '普惠金融' is judged to be a new word.
N-gram segmentation extracts every run of N adjacent Chinese characters from the text corpus as a text fragment. For example, 2-gram segmentation yields every fragment of 2 adjacent characters, 3-gram segmentation yields every fragment of 3 adjacent characters, and so on. Segmenting the corpus '我是一个人' ('I am one person') into 2-grams gives the fragments '我是', '是一', '一个', and '个人'; segmenting it into 3-grams gives '我是一', '是一个', and '一个人'.
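The plain N-gram step above can be sketched in a few lines of Python (a minimal illustration, not the patent's implementation; `ngrams` is a hypothetical helper name):

```python
def ngrams(text, n):
    """Return every contiguous fragment of exactly n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# 2-gram and 3-gram fragments of the example corpus "我是一个人"
print(ngrams("我是一个人", 2))  # ['我是', '是一', '一个', '个人']
print(ngrams("我是一个人", 3))  # ['我是一', '是一个', '一个人']
```

A text shorter than n simply yields no fragments, which is the natural boundary behavior for this step.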
The text corpus is the language material containing the text on which new word recognition is performed. It may be a passage, an article, a web page, or a book, and may be electronic text stored on removable storage, on a computer device, or on the Internet, for example a document saved in Word format or a web page of a designated website.
A candidate word is a text fragment obtained by segmenting the corpus. After the corpus is segmented according to the preset sentence endpoints, many text fragments are obtained; some are words and some are not, so they must be screened against preset conditions. A fragment that satisfies the preset conditions is judged to be a word; one that does not is judged not to be a word. Because these fragments are candidates for becoming words, they are called candidate words. "Candidate words of length 2 to N" means candidate words containing 2, 3, 4, …, N Chinese characters respectively; for example, candidate words of length 2 to 5 contain 2, 3, 4, and 5 Chinese characters respectively.
Further, the preset sentence endpoints are word boundaries of the candidate words set in advance; the corpus is segmented at these boundaries to obtain candidate words. The preset sentence endpoints include punctuation marks and preset segmentation endpoints. A preset segmentation endpoint is a corpus component other than punctuation that is set in advance as a sentence endpoint: relative to punctuation, it is a fixed component of the corpus with independent meaning that is manually designated as a word boundary for segmentation, i.e., an artificial sentence endpoint. A text corpus generally contains characters, punctuation, carriage returns, spaces, and similar components.
Punctuation marks are the sentence punctuation in the corpus that marks a pause after a complete meaning has been expressed, such as commas, semicolons, quotation marks, and periods, which are generally used to break sentences.
The preset segmentation endpoints include symbols other than punctuation that introduce a pause or break, such as spaces and carriage returns, as well as stop characters and stop words with independent semantics. A stop character is a single character used as a pause in the corpus, and a stop word is a word used as a pause; both generally carry independent meaning. For example, commonly used stop characters include '你', '我', '她', and '的', and stop words include '我们', '根据', '所述', and '作为'. Preset segmentation endpoints can be seen as an extension of punctuation: punctuation generally breaks text between sentences, whereas preset segmentation endpoints create pauses or breaks between components within a sentence and, like punctuation, identify word boundaries as sentence endpoints. These stop characters and stop words are used as independent features when segmenting the text, serving as the left or right boundaries of candidate words. Stop characters such as '在', '我', and '的', and stop words such as '我们', '你们', and '这些', are themselves indivisible fixed words with independent semantics; taking them as the left or right word boundaries of candidate words via preset segmentation endpoints effectively improves the accuracy of new word discovery and the efficiency of new word recognition.
When the corpus is segmented to obtain candidate words, the preset segmentation endpoints (artificial sentence endpoints) are added as independent features and combined with punctuation to form the preset sentence endpoints. Without relying on any existing lexicon, all text fragments in a large-scale corpus that could form words are extracted as candidate words based only on the common characteristics of words. The candidate words are then checked against the preset conditions to identify candidate new words with independent semantics, all extracted candidate new words are compared with the existing lexicon, and those not contained in it are screened out as the recognized new words. This effectively improves the precision and recall of new word discovery.
Specifically, when performing new word recognition, the computer device obtains the text corpus to be processed, which may be a passage, an article, a web page, or a book. It segments the corpus at the preset sentence endpoints, including punctuation, spaces, carriage returns, stop characters, and stop words, and by N-gram segmentation divides the corpus into candidate words of length 2 to N, where N is a natural number and N ≥ 2, obtaining the segmented candidate words. For example, if N is 5, the corpus is segmented into candidate words of length 2, 3, 4, and 5, i.e., fragments of two, three, four, and five characters respectively. As a worked example, take the corpus '我非常热衷于用统计的方法去分析汉语资料。' ('I am very keen on using statistical methods to analyze Chinese-language material.'), apply N-gram segmentation with N = 3, and use punctuation plus '我', '的', and '非常' as preset sentence endpoints. The candidate words obtained after segmentation include: '热衷', '热衷于', '衷于', '衷于用', '于用', '于用统', '用统', '用统计', '统计', '统计的', '计的', '的方', '的方法', '方法', '方法去', '法去', '法去分', '去分', '去分析', '分析', '分析汉', '析汉', '析汉语', '汉语', '汉语资', '语资', '语资料', '资料', '资料的', '料的'.
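Combining the preset sentence endpoints with N-gram extraction might look like the following sketch. The endpoint inventory (`PUNCT`, `STOP_TOKENS`) is an assumption taken from the worked example, and `split_on_endpoints` and `candidates` are hypothetical helper names; note that because '的' is removed strictly as an endpoint here, fragments spanning it (such as '统计的') do not appear:

```python
import re

# Hypothetical endpoint inventory: punctuation plus the stop tokens of the example.
PUNCT = "，。！？；：、 \n"
STOP_TOKENS = ["我", "的", "非常"]

def split_on_endpoints(corpus):
    """Cut the corpus at every preset sentence endpoint, keeping the pieces between them."""
    pattern = "[" + re.escape(PUNCT) + "]|" + "|".join(map(re.escape, STOP_TOKENS))
    return [seg for seg in re.split(pattern, corpus) if seg]

def candidates(corpus, n_max):
    """All fragments of length 2..n_max drawn from the endpoint-delimited segments."""
    out = []
    for seg in split_on_endpoints(corpus):
        for n in range(2, n_max + 1):
            out.extend(seg[i:i + n] for i in range(len(seg) - n + 1))
    return out

print(candidates("我非常热衷于用统计的方法去分析汉语资料。", 3))
```

The exact candidate set depends on which tokens are designated as endpoints, which is why the endpoint inventory is described as a per-corpus design choice.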
Further, N must be set according to the specific text. For example, '君子之交淡如水' is a seven-character expression and '百尺竿头更进一步' an eight-character one, while some company names are longer still, so a concrete value of N must be chosen for each corpus. Moreover, for the same corpus, different values of N can be set and their recognition results compared; removing the results they share filters out long-granularity new words, and based on those long-granularity results a more satisfactory new word recognition outcome is obtained.
S220: Determine whether the candidate word satisfies a preset condition.
Here, the preset condition is the condition for identifying a candidate word as a candidate new word. If the candidate word satisfies the preset condition, it is identified as a word and judged to be a candidate new word; if not, it is identified as not a word and judged not to be a candidate new word. The preset condition is that the candidate word satisfies respective first preset thresholds for word frequency, mutual information, and left/right information entropy, or that it satisfies respective second preset thresholds for word frequency, mutual information, and sentence endpoint count.
Specifically, the computer device segments the obtained corpus into text fragments that serve as candidate words. Some candidate words are words and some are not, so the preset condition is used to screen them: fragments that are not words are filtered out, and fragments that are words are retained for further identification. Whether a candidate word satisfies the preset condition therefore determines whether it becomes a candidate new word. If it does, the process proceeds to step S230; if it does not, indicating that the candidate cannot become a word, the process proceeds to step S221, where the candidate is filtered out and discarded. This further narrows the scope of new word recognition and improves its efficiency and accuracy.
S230: If the candidate word satisfies the preset condition, determine the candidate word to be a candidate new word.
A candidate new word is a candidate word that has been identified as a word. N-gram segmentation divides the corpus into candidate words of length 2 to N, some of which cannot be words. For example, segmenting '我非常热衷于用统计的方法去分析汉语资料。' with N = 3 and sentence endpoints consisting of punctuation plus '我', '的', and '非常' yields candidates such as '于用', '于用统', '用统', '计的', '的方', and '的方法', which, judged by experience, obviously cannot be words. The obtained candidate words must therefore be screened against the preset condition to identify the candidate new words, i.e., the candidates recognized as words, filtering out text fragments that cannot become words and narrowing the scope of new word recognition.
Specifically, the computer device screens the obtained candidate words: candidates that cannot be words are filtered out by the preset condition and removed, and only those that can be words are retained for further recognition. If a candidate word satisfies the preset condition, it is identified as a word and judged to be a candidate new word; if not, it is identified as not a word and judged not to be a candidate new word. This further narrows the scope of new word recognition and improves its precision, efficiency, and recall.
For example, see Table 1. Suppose the preset threshold for a candidate word's minimum left/right information entropy is 1, the preset threshold for mutual information is 1, and the preset threshold for sentence endpoints is 3. A minimum left/right information entropy threshold of 1 means the smaller of the candidate's left-neighbor entropy and right-neighbor entropy must reach 1; the sentence endpoint threshold is the number of times a sentence endpoint occurs at the candidate's left or right boundary. After a corpus is processed, the results shown in Table 1 are obtained. By the criteria above, '南山区', '南山', and '普惠' are identified as candidate new words, while '去南山' does not form a word and is excluded from the candidate new words.
Table 1

Candidate | Word frequency | Mutual information | Left/right entropy | Endpoint count | Forms a word?
南山区    | 175            | 5.7548             | 2.2881             | 8              | Yes
去南山    | 23             | 0.8256             | 3.3751             | 3              | No
南山      | 2774           | 9.6310             | 5.7200             | 28             | Yes
普惠      | 18             | 2.3811             | 0.8332             | 3              | Yes
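The screening in Table 1 can be reproduced with a small filter. `is_word` is a hypothetical helper; the example specifies thresholds only for mutual information, left/right entropy, and endpoint count, so the word frequency threshold is omitted here, and the two alternative preset conditions are read as "mutual information plus either entropy or endpoint count" (which matches '普惠' passing despite its entropy being below 1):

```python
# Table 1 rows: (candidate, frequency, mutual information, min left/right entropy, endpoints)
TABLE = [
    ("南山区", 175, 5.7548, 2.2881, 8),
    ("去南山", 23, 0.8256, 3.3751, 3),
    ("南山", 2774, 9.6310, 5.7200, 28),
    ("普惠", 18, 2.3811, 0.8332, 3),
]

MI_T, ENT_T, EP_T = 1.0, 1.0, 3  # thresholds from the example above

def is_word(mi, entropy, endpoints):
    """A candidate qualifies if it clears the mutual information threshold plus
    either the entropy threshold or the sentence endpoint threshold."""
    return mi >= MI_T and (entropy >= ENT_T or endpoints >= EP_T)

words = [c for c, freq, mi, ent, ep in TABLE if is_word(mi, ent, ep)]
print(words)  # ['南山区', '南山', '普惠']
```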
S240: Determine whether the candidate new word is contained in the preset lexicon.
The preset lexicon, which may also be called the existing lexicon, is a collection of known words that have already been confirmed as words; it may be a preset dictionary.
Specifically, the computer device segments the corpus by N-gram segmentation into candidate words of length 2 to N, i.e., text fragments obtained from the corpus. A candidate satisfying the preset condition is determined to be a candidate new word; this, however, only selects from the candidates the fragments capable of being words. The candidate new words include both words already confirmed as such in natural language processing and newly recognized words, so the already-confirmed words must be filtered out, leaving the recognized new words. The preset lexicon contains the words already confirmed in natural language processing; it may consist of the various existing dictionaries of conventional technology, or be a manually assembled lexicon, for example a collection of several existing dictionaries, and it may also contain new words recognized in the past. The candidate new words are filtered against the preset lexicon to determine whether each is contained in it, i.e., to detect whether the candidate new word is contained in the preset lexicon, which can be done by matching. If a candidate new word can be matched in the preset lexicon, it is judged to be contained in the lexicon, the process proceeds to step S241, and the candidate, being a known word, is filtered out as an existing word. If it cannot be matched, it is judged not to be contained in the lexicon; it is an unknown word, i.e., a recognized new word, and the process proceeds to step S250. For example, referring to Table 1, suppose '南山区', '南山', and '普惠' are identified as candidate new words and are filtered through the preset dictionary. If '南山区' and '南山' are found in the preset lexicon, they are known old words; if the candidate new word '普惠' is not contained in the preset lexicon, '普惠' is determined to be a recognized new word.
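Assuming the preset lexicon is available as an in-memory set (`PRESET_LEXICON` below is a hypothetical stand-in for a real dictionary), the matching step reduces to membership tests:

```python
# Hypothetical preset lexicon; in practice this would be an existing dictionary
# or a union of several dictionaries plus previously recognized new words.
PRESET_LEXICON = {"南山区", "南山", "深圳", "金融"}

candidate_new_words = ["南山区", "南山", "普惠"]

# A candidate that matches nothing in the preset lexicon is reported as a new word.
new_words = [w for w in candidate_new_words if w not in PRESET_LEXICON]
print(new_words)  # ['普惠']
```

Using a set keeps each lookup at constant average cost, which matters when the lexicon holds hundreds of thousands of entries.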
S250: If the candidate new word is not contained in the preset lexicon, determine the candidate new word to be a new word.
Specifically, if the candidate new word is not contained in the preset lexicon, it is considered not to be a known word; it is an unknown word, i.e., a recognized new word. Recognized new words are generally words not seen before, and identifying them completes new word recognition on the given corpus. For example, referring to Table 1, if the identified candidate new word '普惠' is not contained in the preset lexicon, '普惠' is judged to be a recognized new word.
This embodiment of the application is based on natural language processing of speech semantics. When segmenting the corpus for candidate words, N-gram segmentation combined with the preset sentence endpoints segments the corpus accurately into candidates of length 2 to N. Without relying on any existing lexicon, all text fragments in a large-scale corpus that could form words are extracted as candidate words based only on the common characteristics of words; using the preset sentence endpoints as independent features, i.e., as word boundaries for segmentation, reduces the number of candidates and improves segmentation accuracy and efficiency. The candidates are then checked against the preset conditions, and those satisfying them are recognized as candidate new words with independent semantics, narrowing the scope of new word recognition. Finally, all extracted candidate new words are compared with the existing lexicon, and those not contained in the preset lexicon are recognized as new words. Screening out the candidate new words absent from the existing lexicon as the recognized new words effectively improves the precision, efficiency, and recall of new word discovery.
In an embodiment, the step of determining whether the candidate word satisfies a preset condition includes:
obtaining the mutual information and left/right information entropy of the candidate word, and obtaining the word frequency of the candidate word, where the left/right information entropy is the smaller of the candidate's left-neighbor information entropy and right-neighbor information entropy;
determining whether the candidate's word frequency, mutual information, and left/right information entropy are respectively greater than or equal to a first preset word frequency threshold, a first preset mutual information threshold, and a first preset left/right information entropy threshold; and
if the candidate's word frequency, mutual information, and left/right information entropy are respectively greater than or equal to the first preset word frequency threshold, the first preset mutual information threshold, and the first preset left/right information entropy threshold, determining that the candidate word satisfies the preset condition.
Here, mutual information measures the internal cohesion of a candidate word, which may also be called its degree of internal solidification or coagulation. The mutual information formula is:
MI(w) = min over all splits (l, r) of w of  log( p(w) / ( p(l) · p(r) ) ),    formula (1)

where w denotes the candidate word, p(x) is the probability that fragment x occurs in the whole corpus, l denotes the left string of a split of the candidate, and r the corresponding right string.
For example, in a corpus containing the candidate word '南山区' (Nanshan District), suppose '南山区' is split into '南山' and '区': '南山' is the left string and '区' the right string. If '南山' and '区' each appeared independently and at random in the corpus, what would be the probability of them landing next to each other? Across the corpus's roughly 24 million characters, '南山' appears 2,774 times, a probability of about 0.000113, and '区' appears 4,797 times, a probability of about 0.0001969. If the two were unrelated, the probability of them happening to appear together would be 0.000113 × 0.0001969 ≈ 2.223 × 10⁻⁸. But '南山区' appears 175 times, a probability of about 7.183 × 10⁻⁶, more than 300 times the predicted value. Similarly, the character '去' ('go to') appears with probability about 0.0166, so the theoretical probability of '去' and '南山' combining at random is 0.0166 × 0.000113 ≈ 1.875 × 10⁻⁶, which is quite close to the true probability of '去南山', about 1.6 × 10⁻⁵, only 8.5 times the predicted value. The result shows that '南山区' is more likely a meaningful collocation, while '去南山' looks more like the components '去' and '南山' accidentally falling together. In new word recognition, however, one cannot know whether '南山区' arose as '南山' + '区' or as '南' + '山区', and a wrong split overestimates the fragment's cohesion: treating '南山区' as '南' + '山区' yields a higher cohesion value. Therefore, to compute a candidate's cohesion, all of its splits are enumerated, i.e., every pair of parts the candidate could be composed of. Letting p(x) be the probability of candidate x in the whole corpus, the cohesion of '南山区' is defined as the smaller of the ratio of p(南山区) to p(南) · p(山区) and the ratio of p(南山区) to p(南山) · p(区); the cohesion of '去南山' is the smaller of the mutual information values obtained by dividing p(去南山) by p(去) · p(南山) and by p(去南) · p(山). It turns out that the candidates with the highest cohesion are words such as '蝙蝠' (bat), '蜘蛛' (spider), '彷徨', '忐忑', and '玫瑰' (rose), in which each character almost always co-occurs with the other and is hardly ever used elsewhere.
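The worked example can be replayed numerically. This sketch assumes a natural logarithm in formula (1) and scores only the splits whose part probabilities are quoted above (`min_split_cohesion` is a hypothetical helper; Table 1's values come from a different run, so they will not match exactly):

```python
import math

def min_split_cohesion(word, p):
    """Formula (1): log p(word)/(p(l)*p(r)), minimized over the binary splits
    whose part probabilities are known. Assumes a natural logarithm."""
    return min(math.log(p[word] / (p[word[:i]] * p[word[i:]]))
               for i in range(1, len(word))
               if word[:i] in p and word[i:] in p)

# Probabilities quoted in the worked example (~24M-character corpus).
p = {"南山区": 7.183e-6, "南山": 0.000113, "区": 0.0001969,
     "去南山": 1.6e-5, "去": 0.0166}

# 南山区 occurs ~300x more often than a chance pairing of 南山 and 区 predicts,
# while 去南山 is only ~8.5x its chance rate: a weak, accidental collocation.
print(round(min_split_cohesion("南山区", p), 2))
print(round(min_split_cohesion("去南山", p), 2))
```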
Information entropy measures a candidate word's degree of freedom, i.e., the richness of its left or right neighboring characters. A candidate's information entropy grows with the number of distinct left or right neighbors: the more left or right neighbors the candidate can pair with, the larger the corresponding entropy; the fewer, the smaller. A candidate's information entropy is also called its left/right information entropy, and the left/right information entropy of a candidate, i.e., its degree of freedom, is defined as the smaller of its left-neighbor information entropy and right-neighbor information entropy.
Further, the left-neighbor information entropy, also called the left entropy, measures the richness of the candidate's left neighbors, i.e., how many different characters can appear immediately to its left. The formula for the left entropy is:
HL(W) = -Σ_a p(a|W) log p(a|W),    formula (2)
where W denotes the candidate word, a denotes a character to its left, and p(a|W) is the probability that character a appears to the left of the candidate; p(a|W) is a conditional probability. A conditional probability is the probability of event A given that another event B has occurred, written p(A|B) and read "the probability of A given B".
The right-neighbor information entropy, also called the right information entropy, measures the richness of the characters adjacent to the right of the candidate word, that is, the number of characters that can appear on the candidate word's right side. The formula for the right information entropy is:
H_R(W) = -∑_b p(b|W) log p(b|W),    Formula (3)
where W denotes the candidate word, b denotes a character appearing to the right of the candidate word, and p(b|W) is the probability that character b appears immediately to the right of the candidate word.
Whether a word can appear in many different contexts, that is, whether it has rich sets of left-adjacent and right-adjacent characters, determines its degree of freedom, which is expressed with information entropy. Information entropy reflects how much information, on average, learning the outcome of an event brings: if an outcome has probability p, the amount of information obtained upon learning that it actually occurred is defined as -log(p). Information entropy is thus used to measure how random a candidate word's sets of left-adjacent and right-adjacent characters are. For example, in the sentence "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮" ("eat grapes without spitting out the grape skins; don't eat grapes yet spit out grape skins"), the word "葡萄" (grape) appears four times, with left-adjacent characters {吃, 吐, 吃, 吐} and right-adjacent characters {不, 皮, 倒, 皮}. According to Formulas (2) and (3), the left-neighbor information entropy of "葡萄" is -(1/2)·log(1/2) - (1/2)·log(1/2) ≈ 0.693, and its right-neighbor information entropy is -(1/2)·log(1/2) - (1/4)·log(1/4) - (1/4)·log(1/4) ≈ 1.04. In this sentence, then, the right-adjacent characters of "葡萄" form the richer set.
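The grape example above can be verified with a short script. The following is only an illustrative sketch, not part of the patented embodiment: it collects the left- and right-adjacent characters of a candidate word and computes their information entropy using the natural logarithm, reproducing the values 0.693 and 1.04.

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    # H = -sum p(x) * ln(p(x)) over the distinct neighbor characters
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

sentence = "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"
word = "葡萄"

left, right = [], []
i = sentence.find(word)
while i != -1:
    if i > 0:
        left.append(sentence[i - 1])            # character just left of the word
    if i + len(word) < len(sentence):
        right.append(sentence[i + len(word)])   # character just right of the word
    i = sentence.find(word, i + 1)

print(left, right)                        # ['吃', '吐', '吃', '吐'] ['不', '皮', '倒', '皮']
print(round(neighbor_entropy(left), 3))   # 0.693
print(round(neighbor_entropy(right), 2))  # 1.04
```

The smaller of the two values, 0.693 here, would be taken as the candidate word's left-right information entropy in the sense defined above.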
Term frequency (TF) is the number of times a given word appears in a given text corpus; the importance of a word increases in proportion to the number of times it appears in the document. After the text corpus for new word recognition is obtained, the number of occurrences of each candidate word is counted. For example, in a corpus of 24 million characters, "南山" (Nanshan) appears 2,774 times, so the term frequency of "南山" is 2,774, while the character "区" (district) appears 4,797 times, so the term frequency of "区" is 4,797.
Specifically, the parameters that reflect the word boundary information of a candidate word include its sentence endpoints and its left-right information entropy. Because both sentence endpoints and left-right information entropy reflect the candidate word's boundary information, they play the same role in candidate word recognition; during new word recognition it is sufficient for either one of the two conditions to hold. This embodiment takes as an example the case where a candidate word is determined to satisfy the preset condition, and is thus determined to be a candidate new word, if its term frequency, mutual information, and left-right information entropy are respectively greater than or equal to the first preset term frequency threshold, the first preset mutual information threshold, and the first preset left-right information entropy threshold; that is, the candidate word's term frequency satisfies the first preset term frequency threshold, its mutual information satisfies the first preset mutual information threshold, and its left-right information entropy satisfies the first preset left-right information entropy threshold. For example, suppose the first preset term frequency threshold is 10, the first preset mutual information threshold is 1, and the first preset left-right information entropy threshold is 1. A segmented candidate word satisfies these first preset thresholds when its term frequency is greater than or equal to 10, its mutual information is greater than or equal to 1, and its left-right information entropy is greater than or equal to 1. Thus, if a segmented candidate word has term frequency greater than 10, mutual information greater than 1, and left information entropy greater than 1, it is judged to be a candidate new word; likewise, if its mutual information is greater than 1 and its right information entropy is greater than 1, it may also be judged to be a candidate new word. Referring again to Table 1: because the mutual information and left-right information entropy of "南山区" (Nanshan District) and "南山" (Nanshan) are both greater than 1, "南山区" and "南山" are recognized as candidate new words, whereas the mutual information of "去南山" (go to Nanshan) is less than 1 and the left-right information entropy of "普惠" (Puhui) is less than 1, so when new words are recognized by mutual information and left-right information entropy, "去南山" and "普惠" are not candidate new words.
In summary, in the new word recognition performed on the corpus of Table 1, "南山区" and "南山" are candidate new words, while "去南山" and "普惠" are not. In this embodiment of the application, the preset segmentation endpoints, which include stop words, stop characters, spaces, and carriage returns, together with punctuation marks are used as the left and right word boundaries of candidate words, and the left-right information entropy of each candidate word is computed from word boundary statistics. Because this refines the granularity of new word recognition, low-frequency and long-granularity new words can be effectively discovered from the entropy statistics of candidate words, which effectively improves the efficiency and accuracy of new word recognition.
Further, the larger the first preset term frequency threshold, first preset mutual information threshold, and first preset left-right information entropy threshold are set, the more accurate the recognized candidate new words are; the smaller these thresholds are set, the more candidate new words are recognized.
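The first preset condition described above can be sketched as a simple threshold check. This is only an illustration under the example thresholds (10, 1, 1) given earlier; the statistic values passed in for the Table 1 words are hypothetical, since the table itself is not reproduced here.

```python
def satisfies_first_condition(freq, mutual_info, lr_entropy,
                              freq_th=10, mi_th=1.0, ent_th=1.0):
    """First preset condition: term frequency, mutual information and
    left-right information entropy each reach their first preset threshold."""
    return freq >= freq_th and mutual_info >= mi_th and lr_entropy >= ent_th

# Hypothetical statistics in the spirit of Table 1:
print(satisfies_first_condition(2774, 1.5, 1.2))  # "南山": candidate new word -> True
print(satisfies_first_condition(15, 0.4, 1.1))    # "去南山": mutual information < 1 -> False
print(satisfies_first_condition(40, 1.3, 0.6))    # "普惠": entropy < 1 -> False
```

Raising the default thresholds makes the filter stricter and the surviving candidates more reliable; lowering them admits more candidates, matching the trade-off stated above.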
In one embodiment, the step of determining whether the candidate word satisfies the preset condition includes:
acquiring the mutual information of the candidate word, and acquiring the term frequency of the candidate word and the number of sentence endpoints of the candidate word, where the number of sentence endpoints refers to the number of left endpoints or the number of right endpoints of the candidate word, the number of left endpoints being the number of occurrences of the left endpoint of the candidate word and the number of right endpoints being the number of occurrences of the right endpoint of the candidate word;
determining whether the term frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to a second preset term frequency threshold, a second preset mutual information threshold, and a first preset sentence endpoint threshold; and
if the term frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset term frequency threshold, the second preset mutual information threshold, and the first preset sentence endpoint threshold, determining that the candidate word satisfies the preset condition.
Here, the endpoints of a candidate word are its left-adjacent and right-adjacent word boundaries, where a word boundary is the dividing edge of a word, that is, the dividing line by which the text corpus is partitioned into different candidate words. The left endpoint of a candidate word is its left-adjacent word boundary, and the right endpoint is its right-adjacent word boundary. The numbers of left and right endpoints are the respective numbers of occurrences of the candidate word's left-adjacent and right-adjacent word boundaries; word boundaries include punctuation marks as well as the spaces, carriage returns, stop words, and stop characters included among the preset segmentation endpoints. To make counting the left and right endpoints of candidate words easier, the preset sentence endpoints serving as word boundaries in the text corpus may be replaced with a unified identifier; after such replacement, the numbers of left and right endpoints are simply the numbers of identifiers appearing to the left and to the right of the candidate word. For example, with "*" as the unified identifier, consider the corpus: "电影院是为观众放映电影的场所。随着电影的进步与发展,出现了专门为放映电影而建造的电影院。电影的发展使电影院的形体、尺寸、比例和声学技术都发生了很大变化。电影院必须满足电影放映的工艺要求。" ("A movie theater is a place where films are shown to an audience. With the progress and development of film, movie theaters built specifically for film screening appeared. The development of film has greatly changed the shape, size, proportions, and acoustics of movie theaters. A movie theater must meet the technical requirements of film screening.") After the sentence endpoints are replaced with the unified identifier, the corpus becomes: "电影院*为观众放映电影*场所*随着电影的进步与发展*出现了专门为放映电影而建造*电影院*电影的发展*电影院*形体、尺寸、比例和声学技术都发生了很大变化*电影院*满足电影放映*工艺要求*", from which it can be seen that the left endpoint of the candidate word "电影院" (movie theater) occurs 3 times and its right endpoint occurs 4 times. The preset segmentation endpoints containing stop words and stop characters, together with spaces, carriage returns, and punctuation marks, are used as the left and right word boundaries of candidate words, and the numbers of occurrences of each candidate word's left and right endpoints are counted from the word boundary statistics. Because this refines the granularity of new word recognition, low-frequency and long-granularity new words can be effectively discovered from the endpoint statistics of candidate words, which effectively improves the efficiency and accuracy of new word recognition.
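The replacement-and-counting procedure can be sketched as below. This is only an illustration: the corpus and the tiny stop-character set are invented here so that the counts are easy to verify by eye; a real embodiment would use a full stop word list and the punctuation of the actual corpus.

```python
import re

def mark_endpoints(text, stop_chars, marker="*"):
    # Replace every run of preset sentence endpoints (punctuation, stop
    # characters, whitespace) with one unified identifier.
    return re.sub("[" + re.escape(stop_chars) + r"\s]+", marker, text)

def endpoint_counts(marked, word, marker="*"):
    # Left endpoints: identifier immediately before the word;
    # right endpoints: identifier immediately after it.
    return marked.count(marker + word), marked.count(word + marker)

text = "电影院是场所。我去电影院。电影院在变化。"
marked = mark_endpoints(text, "。是去在我")          # invented stop set for illustration
print(marked)                                        # 电影院*场所*电影院*电影院*变化*
print(endpoint_counts(marked, "电影院"))             # (2, 3)
```

Note that the first occurrence of "电影院" sits at the start of the text, so it contributes no left-endpoint identifier, which is why the left count is smaller than the right count here.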
Specifically, because both the sentence endpoints and the left-right information entropy of a candidate word reflect its word boundary information, the two play the same role in candidate word recognition, and during new word recognition it is sufficient for either of the two conditions to hold. This embodiment takes as an example the case where the candidate word is determined to satisfy the preset condition, and is thus determined to be a candidate new word, when its term frequency, mutual information, and number of sentence endpoints are respectively greater than or equal to the second preset term frequency threshold, the second preset mutual information threshold, and the first preset sentence endpoint threshold. For example, let the second preset term frequency threshold be 10, the second preset threshold for the minimum left-right information entropy be 1, and the first preset sentence endpoint threshold be 3. Here, a second preset threshold of 1 for the minimum left-right information entropy means that the smaller of the candidate word's left-neighbor and right-neighbor information entropies must reach 1, and the first preset sentence endpoint threshold refers to the number of times a sentence endpoint appears at the candidate word's left or right boundary. Referring to Table 1, which shows the results of recognition on a corpus: a segmented candidate word satisfies the second preset mutual information threshold and the first preset sentence endpoint threshold when its mutual information is greater than 1 and its sentence endpoints occur more than 3 times; such a candidate word is judged to be a candidate new word. In Table 1, the mutual information of "南山区", "南山", and "普惠" is greater than 1 and their numbers of sentence endpoints are all greater than 3, so "南山区", "南山", and "普惠" are judged to be candidate new words. By contrast, the mutual information of "去南山" is less than 1, and its number of sentence endpoints equals 3, which does not satisfy the greater-than-3 condition, so when new words are recognized by mutual information and sentence endpoint count, "去南山" is not a candidate new word. By these criteria, "南山区", "南山", and "普惠" are recognized as candidate new words, while "去南山" does not form a word and is excluded from the candidate new words. In summary, in the new word recognition performed on the corpus of Table 1, "南山区", "南山", and "普惠" are candidate new words, and "去南山" is not.
Referring to FIG. 3, FIG. 3 is a schematic flowchart of a new word recognition method provided by another embodiment of the present application. As shown in FIG. 3, in this embodiment, the step of determining whether the candidate word satisfies the preset condition includes:
S211: acquiring the mutual information and left-right information entropy of the candidate word, and acquiring the term frequency and the number of sentence endpoints of the candidate word, where the left-right information entropy refers to the smaller of the candidate word's left-neighbor information entropy and right-neighbor information entropy;
S212: determining whether the term frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset term frequency threshold, the first preset mutual information threshold, and the first preset left-right information entropy threshold, or whether the term frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset term frequency threshold, the second preset mutual information threshold, and the first preset sentence endpoint threshold; and
S213: if the term frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset term frequency threshold, the first preset mutual information threshold, and the first preset left-right information entropy threshold, or if the term frequency, mutual information, and number of sentence endpoints of the candidate word are respectively greater than or equal to the second preset term frequency threshold, the second preset mutual information threshold, and the first preset sentence endpoint threshold, determining that the candidate word satisfies the preset condition.
The first preset term frequency threshold may be the same as the second preset term frequency threshold, and the first preset mutual information threshold may be the same as the second preset mutual information threshold.
Specifically, the computer device determines that a candidate word satisfies the preset condition, and recognizes it as a first candidate new word, when its term frequency, mutual information, and left-right information entropy are respectively greater than or equal to the first preset term frequency threshold, the first preset mutual information threshold, and the first preset left-right information entropy threshold; it likewise determines that a candidate word satisfies the preset condition, and recognizes it as a second candidate new word, when its term frequency, mutual information, and number of sentence endpoints are respectively greater than or equal to the second preset term frequency threshold, the second preset mutual information threshold, and the first preset sentence endpoint threshold. Taking the union of the first candidate new words and the second candidate new words as the final recognized candidate new words improves the accuracy of candidate new word recognition, and the process proceeds to step S230 for further identification of candidate new words; otherwise, if the candidate word satisfies neither set of conditions, the process proceeds to step S221, the candidate word cannot form a word, and it is discarded. Referring again to Table 1: under the condition that term frequency, mutual information, and left-right information entropy each reach their first preset thresholds, "南山区" and "南山" are recognized as candidate new words while "去南山" and "普惠" are not; under the condition that term frequency, mutual information, and number of sentence endpoints each reach their respective thresholds, "南山区", "南山", and "普惠" are recognized as candidate new words while "去南山" does not form a word and is excluded. Combining the two, "南山区", "南山", and "普惠" are recognized as candidate new words. This avoids rejecting "普惠", which the first condition alone would miss, and thus improves the accuracy of new word recognition.
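The union of the two candidate sets described above can be sketched as follows. The threshold values are the example values used earlier, and the per-word statistics are hypothetical stand-ins for Table 1, chosen only so that "普惠" passes the endpoint condition but fails the entropy condition, and "去南山" fails both.

```python
def final_candidate_new_words(stats, freq_th=10, mi_th=1.0,
                              ent_th=1.0, endpoint_th=3):
    # Condition 1: frequency, mutual information, left-right entropy.
    first = {w for w, s in stats.items()
             if s["freq"] >= freq_th and s["mi"] >= mi_th
             and s["entropy"] >= ent_th}
    # Condition 2: frequency, mutual information, sentence endpoint count.
    second = {w for w, s in stats.items()
              if s["freq"] >= freq_th and s["mi"] >= mi_th
              and s["endpoints"] > endpoint_th}
    return first | second  # union of the two candidate sets

# Hypothetical statistics in the spirit of Table 1:
stats = {
    "南山区": {"freq": 120,  "mi": 1.8, "entropy": 1.4, "endpoints": 9},
    "南山":   {"freq": 2774, "mi": 1.5, "entropy": 1.2, "endpoints": 12},
    "普惠":   {"freq": 40,   "mi": 1.3, "entropy": 0.6, "endpoints": 7},
    "去南山": {"freq": 15,   "mi": 0.4, "entropy": 1.1, "endpoints": 3},
}
print(final_candidate_new_words(stats))  # {"南山区", "南山", "普惠"} (set order varies)
```

With these inputs, "普惠" enters the final set only through the second condition, which is exactly the gain from taking the union rather than requiring both conditions.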
In this embodiment of the application, the preset segmentation endpoints, which include stop words, stop characters, spaces, and carriage returns, together with punctuation marks are used as the left and right word boundaries of candidate words, and the numbers of occurrences of each candidate word's left and right endpoints are counted from word boundary statistics. Because this refines the granularity of new word recognition, low-frequency and long-granularity new words can be effectively discovered from the endpoint statistics of candidate words, which effectively improves the efficiency and accuracy of new word recognition.
Further, the larger the preset thresholds for term frequency, mutual information, left-right information entropy, and sentence endpoint information are set, the more accurate the recognition of candidate new words; the smaller these thresholds are set, the more candidate new words are recognized.
In one embodiment, before the step of segmenting the text corpus, according to the preset sentence endpoints, into candidate words of length 2 to N by N-gram segmentation, where N is a natural number and N ≥ 2, the method further includes: replacing the preset sentence endpoints in the text corpus with a unified identifier.
Specifically, replacing the preset sentence endpoints in the text corpus with a unified identifier means substituting a single identifier for the punctuation marks included among the preset sentence endpoints and for the preset segmentation endpoints, which include stop characters, stop words, carriage returns, and spaces. For example, the unified identifier may be set to a symbol such as "*", "#", or "△"; replacing the punctuation marks and preset segmentation endpoints in the text with the unified identifier makes the subsequent counting of sentence endpoints convenient, which improves the segmentation efficiency of the text corpus and hence the efficiency of new word recognition on it. For example, using "*" as the unified identifier to replace the punctuation marks, stop characters, and stop words among the preset sentence endpoints, the text "我非常热衷于用统计的方法去分析汉语资料。" ("I am very keen on using statistical methods to analyze Chinese-language material.") becomes, after "我", "非常", "的", and "。" are replaced, "*热衷于用统计*方法去分析汉语资料*".
Taking stop characters, stop words, spaces, and carriage returns as the preset segmentation endpoints, the spaces, carriage returns, stop characters, stop words, and punctuation marks are replaced with "*"; after the preset sentence endpoints have been replaced, the text is cut by N-gram segmentation into candidate words of length 2 to N, and the number of occurrences of each candidate word is counted. For example, in a corpus of 24 million characters, "南山" appears a total of 2,774 times and the character "区" appears 4,797 times.
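The two steps above, replacing the preset sentence endpoints with the unified identifier and then cutting each remaining fragment into 2-to-N-character candidate words, can be sketched as below. The stop pattern is the illustrative one from the example sentence; a real embodiment would use the full stop word and punctuation lists.

```python
import re
from collections import Counter

def ngram_candidates(corpus, stop_pattern, n_max=4, marker="*"):
    # 1) Replace the preset sentence endpoints with the unified identifier.
    marked = re.sub(stop_pattern, marker, corpus)
    # 2) Cut each fragment between identifiers into candidate words of
    #    length 2..N and count their occurrences.
    counts = Counter()
    for fragment in marked.split(marker):
        for n in range(2, n_max + 1):
            for i in range(len(fragment) - n + 1):
                counts[fragment[i:i + n]] += 1
    return counts

corpus = "我非常热衷于用统计的方法去分析汉语资料。"
counts = ngram_candidates(corpus, "我|非常|的|。")  # illustrative stop pattern
print(counts["统计"])    # 1
print("热衷" in counts)  # True
```

Because candidates never straddle an identifier, a word boundary can never fall inside a generated candidate, which is what lets the later endpoint and entropy statistics be computed per candidate.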
Referring again to FIG. 3, in this embodiment, after the step of determining the candidate new word to be a new word if the candidate new word is not included in the preset lexicon, the method further includes:
S260: acquiring the length of the new word, and determining whether the length of the new word is greater than or equal to a preset length threshold;
S261: if the length of the new word is greater than or equal to the preset length threshold, recognizing the new word as a long-granularity new word;
S262: if the length of the new word is less than the preset length threshold, recognizing the new word as a non-long-granularity new word.
Here, the length of a new word is the number of characters it contains; for example, the word "电影院" (movie theater) contains three characters, so its length is 3.
The preset length threshold is a preset critical value for word length; it may be set manually.
Specifically, a long-granularity new word is a recognized new word whose number of characters is greater than or equal to the preset length threshold. For example, if the preset length threshold is 5, a recognized new word containing five or more characters is recognized as a long-granularity new word. Long-granularity new words can then be processed according to their attributes.
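Steps S260 to S262 amount to a single length comparison, sketched below with the example threshold of 5; the sample words are illustrative, not taken from the corpus of the embodiment.

```python
def classify_granularity(word, length_threshold=5):
    # Length = number of characters; at or above the preset length
    # threshold the new word counts as long-granularity.
    if len(word) >= length_threshold:
        return "long-granularity"
    return "non-long-granularity"

print(classify_granularity("计算机视觉技术"))  # 7 characters -> long-granularity
print(classify_granularity("电影院"))          # 3 characters -> non-long-granularity
```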
Referring again to FIG. 3, in this embodiment, after the step of determining the candidate new word to be a new word if the candidate new word is not included in the preset lexicon, the method further includes:
S270: acquiring the term frequency of the new word, and determining whether the term frequency of the new word is lower than a preset term frequency threshold;
S271: if the term frequency of the new word is lower than the preset term frequency threshold, recognizing the new word as a low-frequency new word;
S272: if the term frequency of the new word is greater than or equal to the preset term frequency threshold, recognizing the new word as a non-low-frequency new word.
Here, a low-frequency new word is a recognized new word whose term frequency in the text corpus is lower than the preset term frequency threshold.
Specifically, if the preset term frequency threshold is 10, then among the new words recognized by the computer device, any new word whose term frequency is less than 10 is a low-frequency new word. Because low-frequency new words are uncommon, when performing new word recognition one may choose, depending on the text corpus, whether or not to include low-frequency new words in the preset lexicon. Choosing not to include them reduces the size of the preset lexicon, improves the efficiency of matching against the preset lexicon during new word recognition, and thus improves the efficiency of new word recognition.
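Steps S270 to S272 are likewise a single frequency comparison, sketched below with the example threshold of 10; the frequency 2,774 is the corpus count for "南山" quoted earlier, while 3 is an invented low count.

```python
def classify_frequency(term_frequency, freq_threshold=10):
    # Below the preset term frequency threshold the new word is low-frequency.
    if term_frequency < freq_threshold:
        return "low-frequency"
    return "non-low-frequency"

print(classify_frequency(3))     # low-frequency
print(classify_frequency(2774))  # non-low-frequency
```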
It should be noted that, for the new word recognition methods described in the above embodiments, the technical features contained in different embodiments may be recombined as needed to obtain combined implementations, all of which fall within the scope of protection claimed by this application.
Referring to FIG. 4, FIG. 4 is a schematic block diagram of a new word recognition apparatus provided by an embodiment of the present application. Corresponding to the above new word recognition method, an embodiment of the present application further provides a new word recognition apparatus. The apparatus includes units for performing the above new word recognition method and may be configured in computer equipment such as a desktop computer. Specifically, referring to FIG. 4, the new word recognition apparatus 400 includes a segmentation unit 401, a judgment unit 402, a first recognition unit 403, a filtering unit 404, and a second recognition unit 405.
其中,切分单元401,用于获取文本语料,根据预设句子端点,通过N元切分将所述文本语料切分成长度为2-N的候选词,其中,N为自然数,且N≥2,所述候选词是指切分所述文本语料获取的文本片段;Among them, the segmentation unit 401 is used to obtain a text corpus, and divide the text corpus into candidate words with a length of 2-N by N-ary segmentation according to preset sentence endpoints, where N is a natural number and N ≥ 2 , The candidate word refers to a text segment obtained by segmenting the text corpus;
判断单元402,用于判断所述候选词是否满足预设条件;The judging unit 402 is used to judge whether the candidate word meets a preset condition;
第一识别单元403,用于若所述候选词满足所述预设条件,将所述候选词确定为候选新词;The first recognition unit 403 is configured to determine the candidate word as a candidate new word if the candidate word meets the preset condition;
过滤单元404,用于判断所述候选新词是否包含在所述预设词库中;以及The filtering unit 404 is used to determine whether the candidate new word is included in the preset thesaurus; and
第二识别单元405,用于若所述候选新词不包含在所述预设词库中,将所述候选新词确定为新词。The second recognition unit 405 is configured to determine the candidate new word as a new word if the candidate new word is not included in the preset thesaurus.
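The division of labor among units 401–405 can be summarized as a single pipeline. The sketch below is a minimal illustration under stated assumptions: `PRESET_LEXICON`, the endpoint character class, and the pluggable `meets_condition` callback are stand-ins, not part of the patent.

```python
import re

PRESET_LEXICON = {"我们", "今天", "天气"}   # assumed existing dictionary
SENTENCE_ENDPOINTS = r"[，。！？；\n]"      # assumed preset sentence endpoints


def ngram_candidates(corpus: str, n: int) -> list:
    """Unit 401: split at sentence endpoints, then emit all 2..N character grams."""
    candidates = []
    for sentence in re.split(SENTENCE_ENDPOINTS, corpus):
        for size in range(2, n + 1):
            for i in range(len(sentence) - size + 1):
                candidates.append(sentence[i:i + size])
    return candidates


def recognize_new_words(corpus: str, n: int, meets_condition) -> set:
    """Units 402-405: keep candidates that pass the preset condition and
    are absent from the preset lexicon."""
    new_words = set()
    for cand in ngram_candidates(corpus, n):   # unit 401
        if meets_condition(cand):              # units 402-403
            if cand not in PRESET_LEXICON:     # units 404-405
                new_words.add(cand)
    return new_words
```

With `meets_condition=lambda c: True`, `recognize_new_words("今天天气", 2, ...)` would yield `{"天天"}`, since "今天" and "天气" are already in the assumed lexicon.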
In one embodiment, the preset sentence endpoints include punctuation marks and preset segmentation endpoints, where a preset segmentation endpoint is a component of the text corpus, other than a punctuation mark, that is set in advance as a sentence endpoint.
Please refer to FIG. 5, which is a schematic block diagram of a new word recognition apparatus provided by another embodiment of this application. As shown in FIG. 5, in this embodiment the judgment unit 402 includes:
a first obtaining subunit 4021, configured to obtain the mutual information and the left-right information entropy of the candidate word, and to obtain the word frequency of the candidate word, where the left-right information entropy is the smaller of the left-neighbor-character information entropy and the right-neighbor-character information entropy of the candidate word;
a first judgment subunit 4022, configured to judge whether the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to a first preset word-frequency threshold, a first preset mutual-information threshold, and a first preset left-right-information-entropy threshold; and
a first determination subunit 4023, configured to determine that the candidate word satisfies the preset condition if its word frequency, mutual information, and left-right information entropy are respectively greater than or equal to the first preset word-frequency threshold, the first preset mutual-information threshold, and the first preset left-right-information-entropy threshold.
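The two statistics used by this first condition can be made concrete. The sketch below assumes the common pointwise-mutual-information formulation for a two-character candidate and Shannon entropy over neighbor-character counts; the patent names the quantities but does not fix exact formulas, so these definitions and all names are illustrative.

```python
import math
from collections import Counter


def mutual_information(word: str, char_freq: dict, word_freq: dict,
                       total: int) -> float:
    """PMI of a two-character candidate: log( p(xy) / (p(x) * p(y)) ).
    One common formulation; not specified exactly by the patent."""
    p_xy = word_freq[word] / total
    p_x = char_freq[word[0]] / total
    p_y = char_freq[word[1]] / total
    return math.log(p_xy / (p_x * p_y))


def neighbor_entropy(neighbors: Counter) -> float:
    """Shannon entropy of the distribution of neighboring characters."""
    total = sum(neighbors.values())
    return -sum((c / total) * math.log(c / total)
                for c in neighbors.values())


def satisfies_first_condition(word, stats, t_freq, t_mi, t_entropy) -> bool:
    """First judgment: frequency, mutual information, and the smaller of the
    left/right neighbor entropies must all reach their first preset
    thresholds (the `stats` layout is an assumption)."""
    lr_entropy = min(neighbor_entropy(stats["left"][word]),
                     neighbor_entropy(stats["right"][word]))
    mi = mutual_information(word, stats["char_freq"],
                            stats["word_freq"], stats["total"])
    return (stats["word_freq"][word] >= t_freq
            and mi >= t_mi and lr_entropy >= t_entropy)
```

Intuitively, high mutual information means the two characters co-occur far more often than chance, and high left-right entropy means the candidate appears in varied contexts — both hallmarks of a genuine word.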
In one embodiment, the judgment unit 402 includes:
a second obtaining subunit, configured to obtain the mutual information of the candidate word, and to obtain the word frequency and the sentence-endpoint count of the candidate word, where the sentence-endpoint count is the left-endpoint count or the right-endpoint count of the candidate word, the left-endpoint count being the number of times the left endpoint of the candidate word occurs, and the right-endpoint count being the number of times the right endpoint of the candidate word occurs;
a second judgment subunit, configured to judge whether the word frequency, mutual information, and sentence-endpoint count of the candidate word are respectively greater than or equal to a second preset word-frequency threshold, a second preset mutual-information threshold, and a first preset sentence-endpoint-count threshold; and
a second determination subunit, configured to determine that the candidate word satisfies the preset condition if its word frequency, mutual information, and sentence-endpoint count are respectively greater than or equal to the second preset word-frequency threshold, the second preset mutual-information threshold, and the first preset sentence-endpoint-count threshold.
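This alternative condition swaps the entropy statistic for sentence-endpoint counts. The patent does not spell out the exact counting rule, so the sketch below takes one plausible reading — counting how often the candidate occurs immediately after a sentence endpoint (left-endpoint count) or immediately before one (right-endpoint count); the endpoint set and all names are assumptions.

```python
import re

ENDPOINT = r"[，。！？；]"  # assumed preset sentence endpoints


def endpoint_counts(corpus: str, word: str) -> tuple:
    """Count occurrences of `word` right after an endpoint (left count)
    and right before one (right count). This interpretation is an
    assumption; the patent only names the two counts."""
    left = len(re.findall(ENDPOINT + re.escape(word), corpus))
    right = len(re.findall(re.escape(word) + ENDPOINT, corpus))
    return left, right


def satisfies_second_condition(freq, mi, endpoint_count,
                               t_freq, t_mi, t_endpoint) -> bool:
    """Second judgment: frequency, mutual information, and the
    sentence-endpoint count must all reach their preset thresholds."""
    return freq >= t_freq and mi >= t_mi and endpoint_count >= t_endpoint
```

A candidate that often abuts sentence boundaries is likely a free-standing unit rather than an arbitrary fragment, which is why the endpoint count can substitute for neighbor entropy here.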
Continuing with FIG. 5, in this embodiment the apparatus 400 further includes:
a replacement unit 406, configured to replace the preset sentence endpoints in the text corpus with a unified identifier.
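The replacement unit amounts to a single normalization pass. In the sketch below the marker "※" and the endpoint character class are illustrative choices, not specified by the patent.

```python
import re

ENDPOINTS = r"[，。！？；：\n]"  # assumed preset sentence endpoints
UNIFIED = "※"                   # assumed unified identifier


def unify_endpoints(corpus: str) -> str:
    """Replace every preset sentence endpoint with the unified marker,
    so later segmentation only needs to split on one character."""
    return re.sub(ENDPOINTS, UNIFIED, corpus)


print(unify_endpoints("今天下雨，明天放晴。"))  # → 今天下雨※明天放晴※
```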
Continuing with FIG. 5, in this embodiment the apparatus 400 further includes:
a third obtaining unit 407, configured to obtain the length of the new word and to judge whether the length of the new word is greater than or equal to a preset length threshold; and
a third recognition unit 408, configured to identify the new word as a long-granularity new word if its length is greater than or equal to the preset length threshold.
Continuing with FIG. 5, in this embodiment the apparatus 400 further includes:
a fourth obtaining unit 409, configured to obtain the word frequency of the new word and to judge whether the word frequency of the new word is lower than a preset word-frequency threshold; and
a fourth recognition unit 410, configured to identify the new word as a low-frequency new word if its word frequency is lower than the preset word-frequency threshold.
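Units 407–410 together amount to tagging each recognized new word by length and frequency. The sketch below is illustrative; the two threshold values are assumptions, not values fixed by the patent.

```python
LENGTH_THRESHOLD = 4   # assumed preset length threshold (units 407-408)
FREQ_THRESHOLD = 10    # assumed preset word-frequency threshold (units 409-410)


def classify_new_word(word: str, freq: int) -> set:
    """Tag a recognized new word as long-granularity and/or low-frequency."""
    tags = set()
    if len(word) >= LENGTH_THRESHOLD:
        tags.add("long-granularity")
    if freq < FREQ_THRESHOLD:
        tags.add("low-frequency")
    return tags


classify_new_word("人工智能", 3)   # → {"long-granularity", "low-frequency"}
classify_new_word("词库", 50)      # → set()
```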
It should be noted that, as those skilled in the art will clearly understand, for the specific implementation of the above new word recognition apparatus and its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and conciseness of description, details are not repeated here.
Meanwhile, the division and connection of the units in the above new word recognition apparatus are for illustration only. In other embodiments, the new word recognition apparatus may be divided into different units as needed, and the units may be connected in different orders and manners, so as to implement all or part of the functions of the apparatus.
The above new word recognition apparatus may be implemented in the form of a computer program, which may run on the computer device shown in FIG. 6.
Please refer to FIG. 6, which is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 600 may be an electronic device such as a desktop computer or a tablet computer, or a component or part of another device.
Referring to FIG. 6, the computer device 600 includes a processor 602, a memory, and a network interface 605 connected through a system bus 601, where the memory may include a non-volatile storage medium 603 and an internal memory 604.
The non-volatile storage medium 603 may store an operating system 6031 and a computer program 6032. When executed, the computer program 6032 causes the processor 602 to perform the new word recognition method described above.
The processor 602 provides computing and control capabilities to support the operation of the entire computer device 600.
The internal memory 604 provides an environment for running the computer program 6032 stored in the non-volatile storage medium 603. When the computer program 6032 is executed by the processor 602, the processor 602 performs the new word recognition method described above.
The network interface 605 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in FIG. 6 is merely a block diagram of part of the structure related to the solution of this application and does not limit the computer device 600 to which the solution is applied; a specific computer device 600 may include more or fewer components than shown, combine certain components, or arrange components differently. For example, in some embodiments the computer device may include only a memory and a processor, whose structures and functions are consistent with the embodiment shown in FIG. 6 and are not repeated here.
The processor 602 runs the computer program 6032 stored in the memory to implement the new word recognition method of the embodiments of this application.
It should be understood that in the embodiments of this application the processor 602 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the process steps of the above method embodiments.
Accordingly, this application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the steps of the new word recognition method described in the above embodiments.
The computer-readable storage medium may be any storage medium capable of storing a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application of the technical solution and its design constraints. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
The above are only specific implementations of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and all such modifications or replacements shall fall within the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims (20)

  1. A new word recognition method, comprising:
    obtaining a text corpus and, according to preset sentence endpoints, dividing the text corpus by N-gram segmentation into candidate words of length 2 to N, where N is a natural number and N ≥ 2, a candidate word being a text fragment obtained by segmenting the text corpus;
    judging whether the candidate word satisfies a preset condition;
    if the candidate word satisfies the preset condition, determining the candidate word as a candidate new word;
    judging whether the candidate new word is included in a preset lexicon; and
    if the candidate new word is not included in the preset lexicon, determining the candidate new word as a new word.
  2. The new word recognition method according to claim 1, wherein the preset sentence endpoints include punctuation marks and preset segmentation endpoints, a preset segmentation endpoint being a component of the text corpus, other than a punctuation mark, that is set in advance as a sentence endpoint.
  3. The new word recognition method according to claim 1, wherein the step of judging whether the candidate word satisfies a preset condition comprises:
    obtaining the mutual information and left-right information entropy of the candidate word, and obtaining the word frequency of the candidate word, where the left-right information entropy is the smaller of the left-neighbor-character information entropy and the right-neighbor-character information entropy of the candidate word;
    judging whether the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to a first preset word-frequency threshold, a first preset mutual-information threshold, and a first preset left-right-information-entropy threshold; and
    if the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset word-frequency threshold, the first preset mutual-information threshold, and the first preset left-right-information-entropy threshold, determining that the candidate word satisfies the preset condition.
  4. The new word recognition method according to claim 1, wherein the step of judging whether the candidate word satisfies a preset condition comprises:
    obtaining the mutual information of the candidate word, and obtaining the word frequency and the sentence-endpoint count of the candidate word, where the sentence-endpoint count is the left-endpoint count or the right-endpoint count of the candidate word, the left-endpoint count being the number of times the left endpoint of the candidate word occurs, and the right-endpoint count being the number of times the right endpoint of the candidate word occurs;
    judging whether the word frequency, mutual information, and sentence-endpoint count of the candidate word are respectively greater than or equal to a second preset word-frequency threshold, a second preset mutual-information threshold, and a first preset sentence-endpoint-count threshold; and
    if the word frequency, mutual information, and sentence-endpoint count of the candidate word are respectively greater than or equal to the second preset word-frequency threshold, the second preset mutual-information threshold, and the first preset sentence-endpoint-count threshold, determining that the candidate word satisfies the preset condition.
  5. The new word recognition method according to claim 1, wherein before the step of dividing the text corpus into candidate words of length 2 to N by N-gram segmentation according to the preset sentence endpoints, the method further comprises:
    replacing the preset sentence endpoints in the text corpus with a unified identifier.
  6. The new word recognition method according to claim 1, wherein after the step of determining the candidate new word as a new word if the candidate new word is not included in the preset lexicon, the method further comprises:
    obtaining the length of the new word, and judging whether the length of the new word is greater than or equal to a preset length threshold; and
    if the length of the new word is greater than or equal to the preset length threshold, identifying the new word as a long-granularity new word.
  7. The new word recognition method according to claim 1, wherein after the step of determining the candidate new word as a new word if the candidate new word is not included in the preset lexicon, the method further comprises:
    obtaining the word frequency of the new word, and judging whether the word frequency of the new word is lower than a preset word-frequency threshold; and
    if the word frequency of the new word is lower than the preset word-frequency threshold, identifying the new word as a low-frequency new word.
  8. A new word recognition apparatus, comprising:
    a segmentation unit, configured to obtain a text corpus and, according to preset sentence endpoints, divide the text corpus by N-gram segmentation into candidate words of length 2 to N, where N is a natural number and N ≥ 2, a candidate word being a text fragment obtained by segmenting the text corpus;
    a judgment unit, configured to judge whether the candidate word satisfies a preset condition;
    a first recognition unit, configured to determine the candidate word as a candidate new word if the candidate word satisfies the preset condition;
    a filtering unit, configured to judge whether the candidate new word is included in a preset lexicon; and
    a second recognition unit, configured to determine the candidate new word as a new word if the candidate new word is not included in the preset lexicon.
  9. The new word recognition apparatus according to claim 8, wherein the preset sentence endpoints include punctuation marks and preset segmentation endpoints, a preset segmentation endpoint being a component of the text corpus, other than a punctuation mark, that is set in advance as a sentence endpoint.
  10. The new word recognition apparatus according to claim 8, wherein the judgment unit comprises:
    a first obtaining subunit, configured to obtain the mutual information and left-right information entropy of the candidate word, and to obtain the word frequency of the candidate word, where the left-right information entropy is the smaller of the left-neighbor-character information entropy and the right-neighbor-character information entropy of the candidate word;
    a first judgment subunit, configured to judge whether the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to a first preset word-frequency threshold, a first preset mutual-information threshold, and a first preset left-right-information-entropy threshold; and
    a first determination subunit, configured to determine that the candidate word satisfies the preset condition if its word frequency, mutual information, and left-right information entropy are respectively greater than or equal to the first preset word-frequency threshold, the first preset mutual-information threshold, and the first preset left-right-information-entropy threshold.
  11. A computer device, comprising a memory and a processor connected to the memory, the memory being configured to store a computer program and the processor being configured to run the computer program stored in the memory, wherein the processor, when executing the computer program, implements the following steps:
    obtaining a text corpus and, according to preset sentence endpoints, dividing the text corpus by N-gram segmentation into candidate words of length 2 to N, where N is a natural number and N ≥ 2, a candidate word being a text fragment obtained by segmenting the text corpus;
    judging whether the candidate word satisfies a preset condition;
    if the candidate word satisfies the preset condition, determining the candidate word as a candidate new word;
    judging whether the candidate new word is included in a preset lexicon; and
    if the candidate new word is not included in the preset lexicon, determining the candidate new word as a new word.
  12. The computer device according to claim 11, wherein the preset sentence endpoints include punctuation marks and preset segmentation endpoints, a preset segmentation endpoint being a component of the text corpus, other than a punctuation mark, that is set in advance as a sentence endpoint.
  13. The computer device according to claim 11, wherein the step of judging whether the candidate word satisfies a preset condition comprises:
    obtaining the mutual information and left-right information entropy of the candidate word, and obtaining the word frequency of the candidate word, where the left-right information entropy is the smaller of the left-neighbor-character information entropy and the right-neighbor-character information entropy of the candidate word;
    judging whether the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to a first preset word-frequency threshold, a first preset mutual-information threshold, and a first preset left-right-information-entropy threshold; and
    if the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset word-frequency threshold, the first preset mutual-information threshold, and the first preset left-right-information-entropy threshold, determining that the candidate word satisfies the preset condition.
  14. The computer device according to claim 11, wherein the step of judging whether the candidate word satisfies a preset condition comprises:
    obtaining the mutual information of the candidate word, and obtaining the word frequency and the sentence-endpoint count of the candidate word, where the sentence-endpoint count is the left-endpoint count or the right-endpoint count of the candidate word, the left-endpoint count being the number of times the left endpoint of the candidate word occurs, and the right-endpoint count being the number of times the right endpoint of the candidate word occurs;
    judging whether the word frequency, mutual information, and sentence-endpoint count of the candidate word are respectively greater than or equal to a second preset word-frequency threshold, a second preset mutual-information threshold, and a first preset sentence-endpoint-count threshold; and
    if the word frequency, mutual information, and sentence-endpoint count of the candidate word are respectively greater than or equal to the second preset word-frequency threshold, the second preset mutual-information threshold, and the first preset sentence-endpoint-count threshold, determining that the candidate word satisfies the preset condition.
  15. The computer device according to claim 11, wherein before the step of dividing the text corpus into candidate words of length 2 to N by N-gram segmentation according to the preset sentence endpoints, the following step is further implemented:
    replacing the preset sentence endpoints in the text corpus with a unified identifier.
  16. The computer device according to claim 11, wherein after the step of determining the candidate new word as a new word if the candidate new word is not included in the preset lexicon, the following steps are further implemented:
    obtaining the length of the new word, and judging whether the length of the new word is greater than or equal to a preset length threshold; and
    if the length of the new word is greater than or equal to the preset length threshold, identifying the new word as a long-granularity new word.
  17. The computer device according to claim 11, wherein after the step of determining the candidate new word as a new word if the candidate new word is not included in the preset lexicon, the following steps are further implemented:
    obtaining the word frequency of the new word, and judging whether the word frequency of the new word is lower than a preset word-frequency threshold; and
    if the word frequency of the new word is lower than the preset word-frequency threshold, identifying the new word as a low-frequency new word.
  18. A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the following steps:
    obtaining a text corpus and, according to preset sentence endpoints, dividing the text corpus by N-gram segmentation into candidate words of length 2 to N, where N is a natural number and N ≥ 2, a candidate word being a text fragment obtained by segmenting the text corpus;
    judging whether the candidate word satisfies a preset condition;
    if the candidate word satisfies the preset condition, determining the candidate word as a candidate new word;
    judging whether the candidate new word is included in a preset lexicon; and
    if the candidate new word is not included in the preset lexicon, determining the candidate new word as a new word.
  19. The computer-readable storage medium according to claim 18, wherein the preset sentence endpoints include punctuation marks and preset segmentation endpoints, a preset segmentation endpoint being a component of the text corpus, other than a punctuation mark, that is set in advance as a sentence endpoint.
  20. The computer-readable storage medium according to claim 18, wherein the step of determining whether the candidate word satisfies a preset condition includes:
    acquiring the mutual information and the left-right information entropy of the candidate word, and acquiring the word frequency of the candidate word, wherein the left-right information entropy refers to the smaller of the left-neighbor-character information entropy and the right-neighbor-character information entropy of the candidate word;
    determining whether the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to a first preset word-frequency threshold, a first preset mutual-information threshold, and a first preset left-right information entropy threshold; and
    if the word frequency, mutual information, and left-right information entropy of the candidate word are respectively greater than or equal to the first preset word-frequency threshold, the first preset mutual-information threshold, and the first preset left-right information entropy threshold, determining that the candidate word satisfies the preset condition.
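The two statistics named in claim 20 can be sketched as follows. This is a simplified illustration under stated assumptions: mutual information is taken as the minimum pointwise MI over all binary splits of the candidate, probabilities are estimated by naive substring counts over the raw corpus, and natural logarithms are used; the patent does not prescribe these choices.

```python
import math
from collections import Counter

def left_right_entropy(word: str, corpus: str) -> float:
    """Smaller of the left-neighbor and right-neighbor character
    entropies of `word` over all its occurrences in `corpus`."""
    left, right = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(word)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(word, start + 1)

    def entropy(c: Counter) -> float:
        n = sum(c.values())
        return -sum(v / n * math.log(v / n) for v in c.values()) if n else 0.0

    return min(entropy(left), entropy(right))

def mutual_information(word: str, corpus: str) -> float:
    """Minimum pointwise mutual information over every binary split
    of the candidate word (higher means tighter internal cohesion)."""
    total = max(len(corpus), 1)

    def p(s: str) -> float:
        return max(corpus.count(s), 1) / total

    return min(
        math.log(p(word) / (p(word[:i]) * p(word[i:])))
        for i in range(1, len(word))
    )
```

A candidate would then satisfy the preset condition when its frequency, `mutual_information`, and `left_right_entropy` each meet their respective first preset thresholds, whose values the claim leaves open.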
PCT/CN2018/124797 2018-10-12 2018-12-28 New word recognition method and apparatus, computer device, and computer readable storage medium WO2020073523A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811191755.9 2018-10-12
CN201811191755.9A CN109408818B (en) 2018-10-12 2018-10-12 New word recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020073523A1 true WO2020073523A1 (en) 2020-04-16

Family

ID=65467079

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124797 WO2020073523A1 (en) 2018-10-12 2018-12-28 New word recognition method and apparatus, computer device, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN109408818B (en)
WO (1) WO2020073523A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914554A (en) * 2020-08-19 2020-11-10 网易(杭州)网络有限公司 Training method of field new word recognition model, field new word recognition method and field new word recognition equipment
CN111931491A (en) * 2020-08-14 2020-11-13 工银科技有限公司 Domain dictionary construction method and device
CN113033183A (en) * 2021-03-03 2021-06-25 西北大学 Network new word discovery method and system based on statistics and similarity
CN113609844A (en) * 2021-07-30 2021-11-05 国网山西省电力公司晋城供电公司 Electric power professional word bank construction method based on hybrid model and clustering algorithm

Families Citing this family (25)

Publication number Priority date Publication date Assignee Title
CN110096591A (en) * 2019-04-04 2019-08-06 平安科技(深圳)有限公司 Long text classification method, device, computer equipment and storage medium based on bag of words
CN110222157A (en) * 2019-06-20 2019-09-10 贵州电网有限责任公司 A kind of new word discovery method based on mass text
CN110457595B (en) * 2019-08-01 2023-07-04 腾讯科技(深圳)有限公司 Emergency alarm method, device, system, electronic equipment and storage medium
CN110569504B (en) * 2019-09-04 2022-11-15 北京明略软件系统有限公司 Relation word determining method and device
CN112487132A (en) * 2019-09-12 2021-03-12 北京国双科技有限公司 Keyword determination method and related equipment
CN110807322B (en) * 2019-09-19 2024-03-01 平安科技(深圳)有限公司 Method, device, server and storage medium for identifying new words based on information entropy
CN110866400B (en) * 2019-11-01 2023-08-04 中电科大数据研究院有限公司 Automatic change lexical analysis system of update
CN110825840B (en) * 2019-11-08 2023-02-17 北京声智科技有限公司 Word bank expansion method, device, equipment and storage medium
CN110991173B (en) * 2019-11-29 2023-09-29 支付宝(杭州)信息技术有限公司 Word segmentation method and system
CN110990571B (en) * 2019-12-02 2024-04-02 北京秒针人工智能科技有限公司 Method and device for acquiring discussion duty ratio, storage medium and electronic equipment
CN111061924A (en) * 2019-12-11 2020-04-24 北京明略软件系统有限公司 Phrase extraction method, device, equipment and storage medium
CN111125327A (en) * 2019-12-11 2020-05-08 中国建设银行股份有限公司 Short-session-based new word discovery method, storage medium and electronic device
CN111274361A (en) * 2020-01-21 2020-06-12 北京明略软件系统有限公司 Industry new word discovery method and device, storage medium and electronic equipment
CN111460170B (en) * 2020-03-27 2024-02-13 深圳价值在线信息科技股份有限公司 Word recognition method, device, terminal equipment and storage medium
CN111626053B (en) * 2020-05-21 2024-04-09 北京明亿科技有限公司 New scheme means descriptor recognition method and device, electronic equipment and storage medium
CN112329458B (en) * 2020-05-21 2024-05-10 北京明亿科技有限公司 New organization descriptor recognition method and device, electronic equipment and storage medium
CN111626054B (en) * 2020-05-21 2023-12-19 北京明亿科技有限公司 Novel illegal action descriptor recognition method and device, electronic equipment and storage medium
CN111966791B (en) * 2020-09-03 2024-04-19 深圳市小满科技有限公司 Method for extracting and retrieving customs data product words
CN112380856B (en) * 2020-10-20 2023-09-29 湖南大学 Automatic extraction method, system, terminal and readable storage medium for component naming in patent text
CN112463969B (en) * 2020-12-08 2022-09-20 上海烟草集团有限责任公司 Method, system, equipment and medium for detecting new words of cigarette brand and product rule words
CN113468879A (en) * 2021-07-16 2021-10-01 上海明略人工智能(集团)有限公司 Method, system, electronic device and medium for judging unknown words
CN113449082A (en) * 2021-07-16 2021-09-28 上海明略人工智能(集团)有限公司 New word discovery method, system, electronic device and medium
CN113779200A (en) * 2021-09-14 2021-12-10 中国电信集团系统集成有限责任公司 Target industry word stock generation method, processor and device
CN114822527A (en) * 2021-10-11 2022-07-29 北京中电慧声科技有限公司 Error correction method and device for converting voice into text, electronic equipment and storage medium
CN114218938A (en) * 2021-12-13 2022-03-22 北京智齿众服技术咨询有限公司 Word segmentation method and device, electronic equipment and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106970919A (en) * 2016-01-14 2017-07-21 北京国双科技有限公司 The method and device that new phrase is found
CN107180025A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of recognition methods of neologisms and device
CN107291684A (en) * 2016-04-12 2017-10-24 华为技术有限公司 The segmenting method and system of language text
CN108536667A (en) * 2017-03-06 2018-09-14 中国移动通信集团广东有限公司 Chinese text recognition methods and device

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US7917355B2 (en) * 2007-08-23 2011-03-29 Google Inc. Word detection
CN108875040B (en) * 2015-10-27 2020-08-18 上海智臻智能网络科技股份有限公司 Dictionary updating method and computer-readable storage medium
CN107092588B (en) * 2016-02-18 2022-09-09 腾讯科技(深圳)有限公司 Text information processing method, device and system
CN108595433A (en) * 2018-05-02 2018-09-28 北京中电普华信息技术有限公司 A kind of new word discovery method and device

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN106970919A (en) * 2016-01-14 2017-07-21 北京国双科技有限公司 The method and device that new phrase is found
CN107291684A (en) * 2016-04-12 2017-10-24 华为技术有限公司 The segmenting method and system of language text
CN108536667A (en) * 2017-03-06 2018-09-14 中国移动通信集团广东有限公司 Chinese text recognition methods and device
CN107180025A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of recognition methods of neologisms and device

Non-Patent Citations (1)

Title
ZHOU, QING: "Research on Network New Word Discovery Algorithm", CHINESE MASTER'S THESES FULL-TEXT DATABASE (ELECTRONIC JOURNAL), INFORMATION SCIENCE & TECHNOLOGY, 15 August 2017 (2017-08-15), ISSN: 1674-0246 *

Cited By (7)

Publication number Priority date Publication date Assignee Title
CN111931491A (en) * 2020-08-14 2020-11-13 工银科技有限公司 Domain dictionary construction method and device
CN111931491B (en) * 2020-08-14 2023-11-14 中国工商银行股份有限公司 Domain dictionary construction method and device
CN111914554A (en) * 2020-08-19 2020-11-10 网易(杭州)网络有限公司 Training method of field new word recognition model, field new word recognition method and field new word recognition equipment
CN113033183A (en) * 2021-03-03 2021-06-25 西北大学 Network new word discovery method and system based on statistics and similarity
CN113033183B (en) * 2021-03-03 2023-10-27 西北大学 Network new word discovery method and system based on statistics and similarity
CN113609844A (en) * 2021-07-30 2021-11-05 国网山西省电力公司晋城供电公司 Electric power professional word bank construction method based on hybrid model and clustering algorithm
CN113609844B (en) * 2021-07-30 2024-03-08 国网山西省电力公司晋城供电公司 Electric power professional word stock construction method based on hybrid model and clustering algorithm

Also Published As

Publication number Publication date
CN109408818B (en) 2023-04-07
CN109408818A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
WO2020073523A1 (en) New word recognition method and apparatus, computer device, and computer readable storage medium
US11544459B2 (en) Method and apparatus for determining feature words and server
WO2018205389A1 (en) Voice recognition method and system, electronic apparatus and medium
WO2019153551A1 (en) Article classification method and apparatus, computer device and storage medium
WO2020151164A1 (en) Message pushing method and apparatus, computer device and storage medium
TWI654530B (en) Method and device for screening and promoting keywords
US20220318275A1 (en) Search method, electronic device and storage medium
CN111626064B (en) Training method, training device and storage medium for neural machine translation model
WO2017091985A1 (en) Method and device for recognizing stop word
CN109063184B (en) Multi-language news text clustering method, storage medium and terminal device
US20160188569A1 (en) Generating a Table of Contents for Unformatted Text
WO2020056979A1 (en) Knowledge base search method and apparatus, and computer-readable storage medium
CN107861948B (en) Label extraction method, device, equipment and medium
CN108959259B (en) New word discovery method and system
CN112989235B (en) Knowledge base-based inner link construction method, device, equipment and storage medium
WO2023060633A1 (en) Relationship extraction method and apparatus for enhancing semantics, and computer device and storage medium
CN113408660B (en) Book clustering method, device, equipment and storage medium
CN117743577A (en) Text classification method, device, electronic equipment and storage medium
CN112528640A (en) Automatic domain term extraction method based on abnormal subgraph detection
CN111930949A (en) Search string processing method and device, computer readable medium and electronic equipment
CN111492364B (en) Data labeling method and device and storage medium
WO2018041036A1 (en) Keyword searching method, apparatus and terminal
WO2020170804A1 (en) Synonym extraction device, synonym extraction method, and synonym extraction program
JP6680472B2 (en) Information processing apparatus, information processing method, and information processing program
CN113868508A (en) Writing material query method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.08.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18936573

Country of ref document: EP

Kind code of ref document: A1