WO2021051600A1 - Method, apparatus, device and storage medium for identifying new words based on information entropy - Google Patents

Method, apparatus, device and storage medium for identifying new words based on information entropy

Info

Publication number
WO2021051600A1
WO2021051600A1 · PCT/CN2019/118276 · CN2019118276W
Authority
WO
WIPO (PCT)
Prior art keywords
word
participle
collocation
information entropy
word segmentation
Application number
PCT/CN2019/118276
Other languages
English (en)
French (fr)
Inventor
陈婷婷
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2021051600A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • This application relates to the field of big data technology, and in particular to a method, device, server, and computer-readable storage medium for identifying new words based on information entropy.
  • New words are difficult for a word segmentation system to identify during initial segmentation, so they are split apart, for example "machine/learning".
  • Such splitting causes ambiguity or semantic incompleteness, which affects the accuracy of keyword extraction results for news articles.
  • Even though popular Chinese word segmentation systems, such as the nlpir word segmentation system, add a new word recognition function based on information entropy, they still cannot achieve a good segmentation result when multiple participles and a segmentation phrase or new word appear at the same time.
  • For example, "machine learning" in a text may be a new word or key phrase, while multiple independent occurrences of the participles "machine" and "learning" appear in the same text.
  • The inventor realizes that current word segmentation systems cannot perform segmentation recognition on phrases, so that the multiple participles and new words within a segmentation phrase cannot be recognized, resulting in a low recognition rate.
  • The main purpose of this application is to provide a method for identifying new words based on information entropy, which aims to solve the technical problem in the prior art that the current word segmentation system cannot segment phrases, so that multiple participles and new words in segmentation phrases cannot be identified, resulting in a low recognition rate.
  • the present application provides a method for identifying new words based on information entropy.
  • the method for identifying new words based on information entropy includes:
  • through a preset probability formula, the co-occurrence word frequency of the first participle and the right collocation word, and the co-occurrence word frequency of the second participle and the left collocation word, the co-occurrence probability value of the first participle and the right collocation word and the co-occurrence probability value of the second participle and the left collocation word are calculated;
  • through a preset information entropy formula, the co-occurrence probability value of the first participle and the right collocation word, and the co-occurrence probability value of the second participle and the left collocation word, the right information entropy of the first participle and the left information entropy of the second participle are calculated;
  • the present application also provides a device for identifying new words based on information entropy, and the device for identifying new words based on information entropy includes:
  • a reading unit configured to obtain a target phrase in the text to be processed, divide the target phrase into a first participle and a second participle, and read the information of the first participle and the information of the second participle respectively;
  • the statistics unit is configured to obtain the right collocation word of the first participle and the left collocation word of the second participle based on the information of the first participle and the information of the second participle, and to count the co-occurrence word frequency of the first participle and the right collocation word and the co-occurrence word frequency of the second participle and the left collocation word;
  • the first calculation unit is configured to calculate, through a preset probability formula, the co-occurrence word frequency of the first participle and the right collocation word, and the co-occurrence word frequency of the second participle and the left collocation word, the co-occurrence probability value of the first participle and the right collocation word and the co-occurrence probability value of the second participle and the left collocation word;
  • the second calculation unit is configured to calculate, through a preset information entropy formula, the co-occurrence probability value of the first participle and the right collocation word, and the co-occurrence probability value of the second participle and the left collocation word, the right information entropy of the first participle and the left information entropy of the second participle;
  • the first determining unit is configured to determine that the target phrase is a new word when the right information entropy of the first participle and the left information entropy of the second participle are both less than a first preset threshold.
  • the present application also provides a server, the server including: a memory, a processor, and a program for identifying new words based on information entropy that is stored in the memory and executable on the processor; when the program for identifying new words based on information entropy is executed by the processor, the steps of the method for identifying new words based on information entropy described above are realized.
  • the present application also provides a computer-readable storage medium in which computer instructions are stored.
  • when the computer instructions are run on a computer, the computer is caused to execute the above-mentioned method for identifying new words based on information entropy.
  • The method, device, server, and computer-readable storage medium for identifying new words based on information entropy proposed in the embodiments of the present application obtain a target phrase in a text to be processed, divide the target phrase into a first participle and a second participle, and read the information of the first participle and the information of the second participle respectively.
  • Based on the information of the first participle and the information of the second participle, the right collocation word of the first participle and the left collocation word of the second participle are obtained, and the co-occurrence word frequency of the first participle and the right collocation word and the co-occurrence word frequency of the second participle and the left collocation word are counted; through a preset probability formula and these co-occurrence word frequencies, the co-occurrence probability value of the first participle and the right collocation word and the co-occurrence probability value of the second participle and the left collocation word are calculated; through a preset information entropy formula and these co-occurrence probability values, the right information entropy of the first participle and the left information entropy of the second participle are calculated; when both are less than a first preset threshold, the target phrase is determined to be a new word.
  • FIG. 1 is a schematic diagram of a server structure of a hardware operating environment involved in a solution of an embodiment of the application
  • FIG. 2 is a schematic flowchart of a first embodiment of a method for identifying new words based on information entropy according to this application;
  • FIG. 3 is a schematic flowchart of a second embodiment of a method for identifying new words based on information entropy according to this application;
  • FIG. 4 is a schematic flowchart of a third embodiment of a method for identifying new words based on information entropy according to this application;
  • FIG. 5 is a schematic flowchart of a fourth embodiment of a method for identifying new words based on information entropy according to this application;
  • FIG. 6 is a schematic flowchart of a fifth embodiment of a method for identifying new words based on information entropy according to this application;
  • FIG. 7 is a schematic flowchart of a sixth embodiment of a method for identifying new words based on information entropy according to this application;
  • FIG. 8 is a schematic flowchart of a seventh embodiment of a method for identifying new words based on information entropy according to this application.
  • The main solution of the embodiments of the present application is: obtain the target phrase in the text to be processed, divide the target phrase into a first participle and a second participle, and read the information of the first participle and the information of the second participle respectively;
  • based on the information of the first participle and the second participle, obtain the right collocation word of the first participle and the left collocation word of the second participle, and count the co-occurrence word frequency of the first participle and the right collocation word and the co-occurrence word frequency of the second participle and the left collocation word;
  • through the preset probability formula and these co-occurrence word frequencies, calculate the co-occurrence probability value of the first participle and the right collocation word and the co-occurrence probability value of the second participle and the left collocation word; through the preset information entropy formula and these co-occurrence probability values, calculate the right information entropy of the first participle and the left information entropy of the second participle, and determine that the target phrase is a new word when both are less than the first preset threshold.
  • The present application provides a solution that segments the phrase and calculates the information entropy values of the participles to determine the uncertainty of the phrase in the text to be processed, thereby identifying new words and improving the recognition rate.
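The intuition behind the solution above can be illustrated with a short sketch (illustrative Python, not part of the patent; the `entropy` function and the toy neighbor lists are hypothetical): a participle whose adjacent words are nearly always the same has low information entropy, suggesting that it binds with that neighbor into a single new word, whereas diverse neighbors yield high entropy.

```python
import math
from collections import Counter

def entropy(neighbors):
    """Shannon entropy (in bits) of the empirical neighbor distribution."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Right neighbors of "machine" observed in a hypothetical corpus: if
# "machine" is almost always followed by "learning", its right entropy
# is low, hinting that "machine learning" behaves as one unit.
fixed = ["learning", "learning", "learning", "learning"]
diverse = ["learning", "room", "gun", "tool"]

print(entropy(fixed))    # 0.0 -> low uncertainty
print(entropy(diverse))  # 2.0 -> high uncertainty
```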
  • FIG. 1 is a schematic diagram of the server structure of the hardware operating environment involved in the solution of the embodiment of the application.
  • the terminal in the embodiment of this application is a server.
  • the terminal may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory, or a non-volatile memory, such as a magnetic disk memory.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
  • the terminal structure shown in FIG. 1 does not constitute a limitation on the terminal, which may include more or fewer components than shown in the figure, combine some components, or have a different component arrangement.
  • the memory 1005 as a computer storage medium may include an operating system, a network communication module, a user interface module, and a program for identifying new words based on information entropy.
  • the network interface 1004 is mainly used to connect to a back-end server and communicate with the back-end server;
  • the user interface 1003 is mainly used to connect to a client (user side) to communicate with the client;
  • the processor 1001 can be used to call a program for identifying new words based on information entropy stored in the memory 1005, and perform the following operations:
  • through the preset probability formula, the co-occurrence word frequency of the first participle and the right collocation word, and the co-occurrence word frequency of the second participle and the left collocation word, the co-occurrence probability value of the first participle and the right collocation word and the co-occurrence probability value of the second participle and the left collocation word are calculated.
  • the processor 1001 may call a program for identifying new words based on information entropy stored in the memory 1005, and also perform the following operations:
  • the Chinese word segmentation sequence in the Chinese word segmentation system is invoked to divide the phrase into the first participle and the second participle, and the name information of the first participle and the name information of the second participle are obtained.
  • the first participle and the second participle are combined to form the target phrase.
  • the processor 1001 may call a program for identifying new words based on information entropy stored in the memory 1005, and also perform the following operations:
  • the right collocation words of the first participle and the left collocation words of the second participle are obtained, and the first participle and the right collocation words are counted separately The frequency of co-occurring words and the frequency of co-occurring words of the second participle and the left collocation word.
  • the processor 1001 may call a program for identifying new words based on information entropy stored in the memory 1005, and also perform the following operations:
  • the position of the first participle in the text to be processed and the first word frequency are obtained.
  • the co-occurrence probability value of the first participle and the right collocation word and the co-occurrence probability value of the second participle and the left collocation word are obtained.
  • the processor 1001 may call a program for identifying new words based on information entropy stored in the memory 1005, and also perform the following operations:
  • the right information entropy of the first participle and the left information entropy of the second participle are calculated.
  • the processor 1001 may call a program for identifying new words based on information entropy stored in the memory 1005, and also perform the following operations:
  • Referring to FIG. 2, a first embodiment of the method for identifying new words based on information entropy of this application is shown, and the method for identifying new words based on information entropy includes:
  • Step S10: Obtain a target phrase in the text to be processed, divide the target phrase into a first participle and a second participle, and read the information of the first participle and the information of the second participle respectively;
  • The server obtains the target phrase in the text to be processed; for example, the server calls preset character recognition software to recognize the text to be processed and obtains all the characters in the text to be processed.
  • The characters include numbers, letters, Chinese characters, English words, and the like, from which the target phrase is formed.
  • The server then calls the Chinese word segmentation system to divide the phrase into a first participle and a second participle.
  • The Chinese word segmentation system is not limited to dividing the phrase into a first participle and a second participle; it can also divide it into a first participle, a second participle, a third participle, and so on, and it obtains the part of speech, attribute, name, word frequency, and other information of the first participle.
  • Step S20: Based on the information of the first participle and the information of the second participle, obtain the right collocation word of the first participle and the left collocation word of the second participle, and count the co-occurrence word frequency of the first participle and the right collocation word and the co-occurrence word frequency of the second participle and the left collocation word;
  • When the server has obtained the information of the first participle and the information of the second participle respectively, it obtains the right collocation word of the first participle and the left collocation word of the second participle.
  • Specifically, the server uses the name information of the first participle as the search condition, obtains the position of the first participle in the text to be processed, and takes the word on the right side of the first participle as the right collocation word.
  • The right collocation word can be a punctuation mark, a space, a preposition, a verb, etc.; however, when the first participle is a verb, the right collocation word cannot be a noun. When the co-occurrence word frequency of the first participle and the right collocation word
  • is greater than 1, there is at least one right collocation word; when the name information of the second participle is obtained, it is likewise used as the search condition to obtain the position of the second participle in the text to be processed and the corresponding left collocation word.
  • Step S30: Through the preset probability formula, the co-occurrence word frequency of the first participle and the right collocation word, and the co-occurrence word frequency of the second participle and the left collocation word, calculate the co-occurrence probability value of the first participle and the right collocation word and the co-occurrence probability value of the second participle and the left collocation word;
  • When the server obtains the co-occurrence word frequency of the first participle and the right collocation word and the co-occurrence word frequency of the second participle and the left collocation word, it calls the preset probability calculation formula and calculates, respectively, the co-occurrence probability value of the first participle and the right collocation word
  • and the co-occurrence probability value of the second participle and the left collocation word. Specifically, the obtained word frequency of the first participle (and likewise of the second participle) is taken as the total count of the whole event,
  • and the co-occurrence word frequency of the first participle and the right collocation word, as the numerator, is divided by the word frequency of the first participle, where the word frequency of the first participle includes the co-occurrence word frequency of the first participle and the right collocation word and the word frequency of phrases that contain the first participle.
  • Step S40: Through the preset information entropy formula, the co-occurrence probability value of the first participle and the right collocation word, and the co-occurrence probability value of the second participle and the left collocation word, calculate the right information entropy of the first participle and the left information entropy of the second participle;
  • When the server obtains the co-occurrence probability value of the first participle and the right collocation word and the co-occurrence probability value of the second participle and the left collocation word, it calls the preset information entropy formula.
  • The co-occurrence probability value of the first participle and the right collocation word is substituted into the preset information entropy formula, and the right information entropy of the first participle is obtained by calculation; likewise, the co-occurrence probability value of the second participle and the left collocation word is substituted into the preset information entropy formula, and the left information entropy of the second participle is obtained by calculation.
  • Step S50: When the right information entropy of the first participle and the left information entropy of the second participle are both less than a first preset threshold, determine that the target phrase is a new word.
  • The server judges whether the right information entropy of the first participle and the left information entropy of the second participle are less than the first preset threshold.
  • When the right information entropy of the first participle and the left information entropy of the second participle are both less than the first preset threshold, it is determined that the target phrase is a new word.
  • Specifically, when the right information entropy of the first participle or the left information entropy of the second participle is obtained, a database of values less than the first preset threshold is retrieved, and the right information entropy of the first participle or the left information entropy of the second participle is matched against the data in the database.
  • If the matching is successful, the right information entropy of the first participle or the left information entropy of the second participle is less than the first preset threshold, and the target phrase is determined to be a new word.
  • In this embodiment, the target phrase is divided into a first participle and a second participle, and the information of the first participle and the information of the second participle in the text are read respectively; the right collocation word of the first participle and the left collocation word of the second participle are obtained,
  • and the co-occurrence word frequency of the first participle and the right collocation word and the co-occurrence word frequency of the second participle and the left collocation word are counted; through the preset probability formula, the co-occurrence probability value of the first participle and the right collocation word
  • and the co-occurrence probability value of the second participle and the left collocation word are obtained; through the preset information entropy formula, the right information entropy of the first participle and the left information entropy of the second participle are calculated; and when the right information entropy of the first participle and
  • the left information entropy of the second participle are both less than the first preset threshold, the phrase is determined to be a new word. By segmenting the phrase and calculating the information entropy values of the participles, the uncertainty of the phrase in the text to be processed is determined, thereby identifying new words and improving the recognition rate.
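Steps S10 through S50 above can be sketched end to end as follows (an illustrative Python approximation, not the patent's implementation; `is_new_word`, the helper names, the toy token list, and the default threshold of 1.0 bit are all hypothetical):

```python
import math
from collections import Counter

def right_neighbors(tokens, word):
    # Token immediately to the right of each occurrence of `word` (step S20).
    return [tokens[i + 1] for i in range(len(tokens) - 1) if tokens[i] == word]

def left_neighbors(tokens, word):
    # Token immediately to the left of each occurrence of `word` (step S20).
    return [tokens[i - 1] for i in range(1, len(tokens)) if tokens[i] == word]

def entropy(neighbors):
    # Steps S30-S40: co-occurrence probabilities, then Shannon entropy (bits).
    counts = Counter(neighbors)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

def is_new_word(tokens, first, second, threshold=1.0):
    # Step S50: both entropies below the threshold -> treat the pair as a new word.
    h_right = entropy(right_neighbors(tokens, first))
    h_left = entropy(left_neighbors(tokens, second))
    return h_right < threshold and h_left < threshold

tokens = ["machine", "learning", "is", "hot", ",",
          "machine", "learning", "models", "need", "data", ",",
          "we", "study", "machine", "learning"]
print(is_new_word(tokens, "machine", "learning"))  # True
```

Here "machine" is always followed by "learning" and "learning" always preceded by "machine", so both entropies are 0 and the pair is reported as a new word.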
  • FIG. 3 is a second embodiment provided by the method for identifying new words based on information entropy in this application. Based on the embodiment shown in FIG. 2, step S10 includes:
  • Step S11: Obtain the target phrase in the text to be processed, and call the word segmentation attribute in the Chinese word segmentation system to judge whether the target phrase is a new word;
  • Step S12: When it is determined that the phrase is not a new word, call the Chinese word segmentation sequence in the Chinese word segmentation system to divide the phrase into a first participle and a second participle, and obtain the name information of the first participle and the name information of the second participle, where the first participle and the second participle are combined into the target phrase.
  • When the server obtains the target phrase in the text to be processed, it calls the word segmentation attribute in the Chinese word segmentation system to judge whether the phrase is a new word.
  • Chinese word segmentation refers to dividing a sequence of Chinese characters into individual words, and it is the basis of text mining. For a piece of Chinese input, successful word segmentation makes it possible to automatically identify the meaning of the sentence. All words are stored in the Chinese word segmentation system; the text to be processed is scanned, all possible words are found, and the system then decides which words to output. For example, for the text to be processed "I am a student", the output words are "I / am / student".
  • The Chinese word segmentation system records the attributes of different phrases, and when the server does not find the attributes of the phrase in the Chinese word segmentation system, it determines that the phrase is not a new word. When the server determines that the phrase is not a new word, it starts the Chinese word segmentation system to segment the phrase.
  • The Chinese word segmentation system divides the phrase into the first participle and the second participle based on the Chinese word segmentation sequence, and obtains the name information of the first participle and the second participle respectively.
  • The phrase can also be divided into multiple participles, not limited to the first participle and the second participle; here, the first participle and the second participle form the phrase, with no other characters and no punctuation between them.
  • In this embodiment, when the server obtains the target phrase in the text to be processed, it calls the word segmentation attribute in the Chinese word segmentation system to judge whether the phrase is a new word; when it determines that the phrase is not a new word, it calls
  • the Chinese word segmentation sequence in the Chinese word segmentation system to divide the target phrase into the first participle and the second participle, and reads the name information of the first participle and the name information of the second participle respectively. The target phrase is first judged through the word segmentation attributes of the Chinese word segmentation system,
  • and then segmented through the Chinese word segmentation sequence of the Chinese word segmentation system, improving the efficiency of identifying new words.
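The dictionary-scan segmentation described above ("find all possible words, then decide which to output") can be sketched minimally as follows. The patent does not name a specific scanning strategy, so greedy forward maximum matching is assumed here; `segment` and `vocab` are hypothetical names:

```python
def segment(text, vocab, max_len=4):
    """Greedy forward maximum matching: scan left to right and always take
    the longest dictionary word that starts at the current position; fall
    back to a single character when nothing in the dictionary matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words

vocab = {"我", "是", "学生"}
print(segment("我是学生", vocab))  # ['我', '是', '学生']
```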
  • FIG. 4 is a third embodiment provided by the method for identifying new words based on information entropy in this application. Based on the embodiment shown in FIG. 2, step S20 includes:
  • Step S21: Use the name information of the first participle and the name information of the second participle as index conditions to obtain the position of the first participle in the text to be processed, the first word frequency, the position of the second participle in the text to be processed, and the second word frequency;
  • Step S22: Based on the position of the first participle in the text to be processed and the position of the second participle in the text to be processed, obtain the right collocation word of the first participle and the left collocation word of the second participle, and respectively count
  • the co-occurrence word frequency of the first participle and the right collocation word and the co-occurrence word frequency of the second participle and the left collocation word.
  • When the server obtains the name information of the first participle and the second participle, it uses the name information of the first participle and the name information of the second participle as search conditions to search in the text to be processed.
  • Where the text matches the name information of the first participle, the position of the first participle in the text to be processed is obtained, for example by displaying it in the text to be processed; the display mode can be marked with brightness, color, etc. Based on the marked positions,
  • the first word to the right of the first participle is taken as the right collocation word of the first participle, and the co-occurrence word frequency of the first participle and the right collocation word in the text to be processed is recorded.
  • The position of the second participle is marked in the same way (with brightness, color, etc.), the left collocation word of the second participle is obtained, and the co-occurrence word frequency of the second participle and the left collocation word in the text to be processed is recorded.
  • In this embodiment, when the server obtains the name information of the first participle and the name information of the second participle, it uses them as an index to obtain the position of the first participle in the text to be processed
  • and the first word frequency, as well as the position of the second participle in the text to be processed and the second word frequency, providing the basis for the subsequent collocation statistics.
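The collocation statistics of steps S21 and S22 can be sketched as follows (illustrative Python; `collocation_stats` is a hypothetical name, and token positions stand in for the brightness/color position marking described above):

```python
from collections import Counter

def collocation_stats(tokens, word, side="right"):
    """Count each collocation word adjacent to `word` on the given side,
    i.e. the co-occurrence word frequency of every (word, collocation) pair."""
    step = 1 if side == "right" else -1
    pairs = Counter()
    for i, tok in enumerate(tokens):
        j = i + step
        if tok == word and 0 <= j < len(tokens):
            pairs[tokens[j]] += 1
    return pairs

tokens = ["machine", "learning", "and", "machine", "learning",
          "and", "machine", "tools"]
print(collocation_stats(tokens, "machine", side="right"))
# Counter({'learning': 2, 'tools': 1})
print(collocation_stats(tokens, "learning", side="left"))
# Counter({'machine': 2})
```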
  • FIG. 5 is a fourth embodiment provided by the method for identifying new words based on information entropy in this application. Based on the embodiment shown in FIG. 2, step S30 includes:
  • Step S31: After obtaining the co-occurrence word frequency of the first participle and the right collocation word and the co-occurrence word frequency of the second participle and the left collocation word, call the preset probability formula;
  • Step S32: By respectively substituting the co-occurrence word frequency of the first participle and the right collocation word together with the first word frequency, and the co-occurrence word frequency of the second participle and the left collocation word together with the second word frequency, into the preset probability calculation formula, obtain the co-occurrence probability value of the first participle and the right collocation word and the co-occurrence probability value of the second participle and the left collocation word.
  • The server obtains the co-occurrence word frequency of the first participle and the right collocation word and the co-occurrence word frequency of the second participle and the left collocation word, and calls the preset probability calculation formula; the obtained co-occurrence
  • word frequency of the first participle and the right collocation word and the word frequency of the first participle are substituted into the preset probability calculation formula, and the co-occurrence probability value of the first participle and the right collocation word is obtained by calculation.
  • In the formula, N is the first word frequency of the first participle;
  • the first word frequency of the first participle includes the word frequency of the target phrase, the co-occurrence word frequency of the first participle and the right collocation word, and the word frequencies of the first participle with other right collocation words.
  • Likewise, the obtained co-occurrence word frequency of the second participle and the left collocation word and the word frequency of the second participle are substituted into the preset probability calculation formula, and the co-occurrence probability value of the second participle and the left collocation word is obtained by calculation.
  • In the formula, M is the co-occurrence word frequency of the second participle and the left collocation word, and N is the second word frequency of the second participle;
  • the second word frequency of the second participle includes the word frequency of the target phrase, the co-occurrence word frequency of the second participle and the left collocation word, and the word frequencies of the second participle with other left collocation words.
  • In this embodiment, having obtained the co-occurrence word frequencies, the server obtains through the preset probability formula the probability value of the first participle with the right collocation word and the probability value of the second participle with the left collocation word, that is, the probability that the first participle and the right collocation word appear together in the text to be processed and the probability that the second participle and the left collocation word appear together in the text to be processed.
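The probability step above can be sketched in code. This is a hedged illustration rather than the patent's actual formula (the original formula images are not reproduced in this text): it assumes the common form p = n / N described in words above, where n is a co-occurrence frequency and N is the participle's total word frequency; the words and counts are invented for the example.

```python
from collections import Counter

def cooccurrence_probabilities(pair_counts, total_count):
    """For one participle, map each neighboring collocation word to its
    co-occurrence probability p = n / N, where n is the co-occurrence
    frequency with that collocation word and N is the participle's
    total word frequency in the text to be processed."""
    return {word: n / total_count for word, n in pair_counts.items()}

# Toy counts for a first participle "machine": it is followed by
# "learning" 6 times and by "tool" 2 times, out of N = 10 occurrences
# (the remaining 2 occurrences have other right neighbors).
right_counts = Counter({"learning": 6, "tool": 2})
probs = cooccurrence_probabilities(right_counts, 10)
print(probs["learning"])  # 0.6
print(probs["tool"])      # 0.2
```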
  • FIG. 6 illustrates a fifth embodiment of the method for identifying new words based on information entropy of this application. Based on the embodiment shown in FIG. 2, step S40 includes:
  • Step S41: after obtaining the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word, call the preset information entropy formula;
  • Step S42: substitute the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word into the preset information entropy formula to calculate the right information entropy of the first participle or the left information entropy of the second participle.
  • When the server obtains the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word, it calls the preset information entropy formula (in which the index i starts from 1). The co-occurrence probability value of the first participle with the right collocation word is substituted into the preset information entropy formula to calculate the information entropy value over the right collocation words of the first participle; the obtained co-occurrence probability value of the second participle with the left collocation word is substituted into the preset information entropy formula to calculate the information entropy value over the left collocation words of the second participle.
  • In this embodiment, the server substitutes the obtained co-occurrence probability value of the first participle with the right collocation word into the preset information entropy formula and calculates the information entropy value of the right collocation words of the first participle, and substitutes the obtained co-occurrence probability value of the second participle with the left collocation word into the formula and calculates the information entropy value of the left collocation words of the second participle; both the right information entropy of the first participle and the left information entropy of the second participle are thus calculated through the preset information entropy formula.
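The entropy step can likewise be sketched. This is a hedged illustration assuming the standard Shannon form H = -Σ p_i log p_i over one participle's collocation-word probabilities; the text does not state the logarithm base, so the natural logarithm is used here.

```python
import math

def information_entropy(probabilities):
    """H = -sum(p_i * log(p_i)) over the collocation-word probabilities
    of one participle (natural log; the base is an assumption)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# If a participle is almost always followed by the same word, the right
# information entropy is low (little uncertainty about what follows),
# suggesting the participle and its neighbor form one word.
print(information_entropy([0.9, 0.1]))   # ≈ 0.325
print(information_entropy([0.25] * 4))   # ≈ 1.386, maximal for 4 outcomes
```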
  • FIG. 7 illustrates a sixth embodiment of the method for identifying new words based on information entropy of this application. Based on the embodiment shown in FIG. 2, step S50 includes:
  • Step S51: when the retrieved first preset threshold is 0.9, judge whether the right information entropy of the first word segmentation and the left information entropy of the second word segmentation are less than 0.9;
  • Step S52: when the right information entropy of the first word segmentation and the left information entropy of the second word segmentation are both less than the first preset threshold of 0.9, determine that the target phrase is a new word.
  • After the server calculates the right information entropy of the first word segmentation and the left information entropy of the second word segmentation, it obtains the first preset threshold of 0.9 and judges whether the right information entropy of the first word segmentation and the left information entropy of the second word segmentation are less than the first preset threshold of 0.9.
  • When both are less than 0.9, the server determines that the target phrase is a new word. For example, when the obtained right information entropy of the first word segmentation is 0.81 and the left information entropy of the second word segmentation is 0.82, both values are less than 0.9, so the server determines that the target phrase is a new word.
  • In this embodiment, when the server obtains the right information entropy of the first word segmentation and the left information entropy of the second word segmentation, it judges whether they are less than the first preset threshold of 0.9. When both are less than the first preset threshold of 0.9, the target phrase is determined to be a new word; based on the uncertainty measured by information entropy, the entropy values are used to confirm that the corresponding phrase is a new word.
  • FIG. 8 illustrates a seventh embodiment of the method for identifying new words based on information entropy of this application. Based on the embodiment shown in FIG. 7, after step S51, the method further includes:
  • Step S60: when the right information entropy of the first word segmentation and/or the left information entropy of the second word segmentation is greater than or equal to the first preset threshold of 0.9, obtain the co-occurrence word frequency of the target phrase in the text to be processed;
  • Step S70: when the co-occurrence word frequency of the target phrase in the text to be processed is greater than the second preset threshold of 5, determine that the target phrase is a new word.
  • In this case the server obtains the co-occurrence word frequency of the first participle and the second participle in the text to be processed, that is, the frequency with which the first participle and the second participle appear adjacent in the text with no characters or punctuation between them, which is exactly the target phrase.
  • For example, when the obtained right information entropy of the first participle is 0.91 and the left information entropy of the second participle is 0.8, the server does not determine the target phrase to be a new word on the basis of the information entropy alone, and instead falls back to the co-occurrence word frequency of the target phrase.
  • There are many ways for the server to obtain the co-occurrence word frequency of the first and second participles in the text to be processed: for example, by displaying the first and second participles in the text and recording the frequency with which they appear adjacent to each other, or by using the phrase as a search condition to search the text and obtain the word frequency of the phrase, where the word frequency of the phrase is the co-occurrence word frequency of the first participle and the second participle.
  • In this embodiment, when the server determines that the right information entropy of the first participle or the left information entropy of the second participle is greater than or equal to the first preset threshold of 0.9, it obtains the co-occurrence word frequency of the first participle and the second participle; when this co-occurrence word frequency is greater than the second preset threshold of 5, the server determines that the target phrase is a new word. Judging the target phrase by the number of times the first participle and the second participle appear together in the text to be processed avoids missing new words in the text.
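Taken together, the embodiments above describe a two-stage decision rule: accept the phrase when both boundary entropies fall below the first preset threshold, otherwise fall back to the phrase's co-occurrence frequency against the second preset threshold. The sketch below is an assumed reading of that rule with the thresholds 0.9 and 5 stated above, not the patent's reference implementation.

```python
def is_new_word(right_entropy, left_entropy, phrase_freq,
                entropy_threshold=0.9, freq_threshold=5):
    """Two-stage rule: entropy test first, frequency fallback second.
    Thresholds default to the values 0.9 and 5 given in the text."""
    if right_entropy < entropy_threshold and left_entropy < entropy_threshold:
        return True                      # both entropies below threshold
    return phrase_freq > freq_threshold  # fallback: phrase frequency

print(is_new_word(0.81, 0.82, phrase_freq=3))  # True: both entropies < 0.9
print(is_new_word(0.91, 0.80, phrase_freq=7))  # True: entropy test fails, 7 > 5
print(is_new_word(0.95, 0.92, phrase_freq=2))  # False: both tests fail
```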
  • In addition, an embodiment of the present application further proposes a device for identifying new words based on information entropy, the device including:
  • a reading unit, used to obtain the target phrase in the text to be processed, divide the target phrase into a first participle and a second participle, and separately read the information of the first participle and the information of the second participle;
  • a statistical unit, used to obtain the right collocation word of the first participle and the left collocation word of the second participle based on the information of the first participle and the information of the second participle, and to count the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word;
  • a first calculation unit, used to calculate the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word through the preset probability formula, the co-occurrence word frequency of the first participle with the right collocation word, and the co-occurrence word frequency of the second participle with the left collocation word;
  • a second calculation unit, used to calculate the right information entropy of the first participle and the left information entropy of the second participle through the preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word;
  • a first determining unit, which determines that the target phrase is a new word when the right information entropy of the first participle and the left information entropy of the second participle are both less than the first preset threshold.
  • The reading unit is specifically used to: obtain the target phrase in the text to be processed, and call the word segmentation attributes in the Chinese word segmentation system to judge whether the target phrase is a new word; when it is judged that the phrase is not a new word, start the Chinese word segmentation sequence in the Chinese word segmentation system to divide the phrase into the first participle and the second participle, and obtain the name information of the first participle and the name information of the second participle, where the first participle and the second participle combine to form the target phrase.
  • The statistical unit is specifically used to: obtain the right collocation word of the first participle and the left collocation word of the second participle, and separately count the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word.
  • The first calculation unit is specifically configured to: call the preset probability formula after obtaining the co-occurrence word frequencies, and substitute the co-occurrence word frequency of the first participle with the right collocation word, the first word frequency, the co-occurrence word frequency of the second participle with the left collocation word, and the second word frequency into the preset probability calculation formula to obtain the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word.
  • The second calculation unit is specifically configured to: substitute the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word into the preset information entropy formula to calculate the right information entropy of the first participle or the left information entropy of the second participle.
  • The first determining unit is specifically configured to: judge, when the retrieved first preset threshold is 0.9, whether the right information entropy of the first word segmentation and the left information entropy of the second word segmentation are less than 0.9, and determine the target phrase to be a new word when both are less than the first preset threshold of 0.9.
  • Further, the device for identifying new words based on information entropy further includes:
  • an acquiring unit, configured to acquire the co-occurrence word frequency of the target phrase in the text to be processed when the right information entropy of the first word segmentation and/or the left information entropy of the second word segmentation is greater than or equal to the first preset threshold of 0.9;
  • a second determining unit, configured to determine that the target phrase is a new word when the co-occurrence word frequency of the target phrase in the text to be processed is greater than the second preset threshold of 5.
  • The functions and implementation processes of each unit in the device for identifying new words based on information entropy correspond to the steps in the embodiments of the method for identifying new words based on information entropy, and are not repeated here.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions, and when the computer instructions are executed on the computer, the computer executes the following steps:
  • Through the preset probability formula, the co-occurrence word frequency of the first participle with the right collocation word, and the co-occurrence word frequency of the second participle with the left collocation word, the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word are calculated.
  • The technical solution of this application, in essence or in the part contributing to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium as described above (such as ROM/RAM, magnetic disks, or optical disks) and includes several instructions to make a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

This application relates to the field of big data technology and provides a method, device, server, and storage medium for identifying new words based on information entropy, including: obtaining a target phrase in a text to be processed, dividing the target phrase into a first participle and a second participle, and separately reading the information of the first participle and the information of the second participle; obtaining the right collocation word of the first participle and the left collocation word of the second participle, and counting the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word; calculating the right information entropy of the first participle and the left information entropy of the second participle through a preset probability formula and a preset information entropy formula; and determining the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than a first preset threshold. By segmenting the phrase and calculating information entropy values, the uncertainty of new words in the text to be processed is obtained, so that new words are identified and the recognition rate is improved.

Description

Method, device, equipment, and storage medium for identifying new words based on information entropy
This application claims priority to the Chinese patent application filed with the China Patent Office on September 19, 2019, with application number 201910885192.1 and invention title "Method, device, equipment, and storage medium for identifying new words based on information entropy", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of big data technology, and in particular to a method, device, server, and computer-readable storage medium for identifying new words based on information entropy.
Background
For new words and key phrases appearing in an article, the word segmentation system has difficulty recognizing them during initial word segmentation and therefore splits them apart, for example "machine/learning". However, since such a phrase acts as a whole in the article, splitting it causes ambiguity or incomplete semantics, which affects the accuracy of keyword extraction results for news and other articles.
Currently popular Chinese word segmentation systems, such as the nlpir word segmentation system, have added a new-word recognition function based on information entropy, but they still cannot achieve a good segmentation effect when multiple participles co-occur with participle phrases or new words. For example, "machine learning" may be a new word or key phrase in a text in which the two independent participles "machine" and "learning" also appear many times. The inventor realized that current word segmentation systems cannot perform segmentation-based recognition on phrases and thus cannot identify new words among multiple participles and participle phrases, resulting in a low recognition rate.
Summary
The main purpose of this application is to provide a method for identifying new words based on information entropy, aiming to solve the technical problem in the prior art that current word segmentation systems cannot segment phrases and thus cannot identify new words among multiple participles and participle phrases, resulting in a low recognition rate.
This application provides a method for identifying new words based on information entropy, the method including:
obtaining a target phrase in a text to be processed, dividing the target phrase into a first participle and a second participle, and separately reading the information of the first participle and the information of the second participle;
based on the information of the first participle and the information of the second participle, obtaining the right collocation word of the first participle and the left collocation word of the second participle, and counting the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word;
calculating the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word through a preset probability formula, the co-occurrence word frequency of the first participle with the right collocation word, and the co-occurrence word frequency of the second participle with the left collocation word;
calculating the right information entropy of the first participle and the left information entropy of the second participle through a preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word;
determining the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than a first preset threshold.
This application further provides a device for identifying new words based on information entropy, the device including:
a reading unit, used to obtain a target phrase in a text to be processed, divide the target phrase into a first participle and a second participle, and separately read the information of the first participle and the information of the second participle;
a statistical unit, used to obtain, based on the information of the first participle and the information of the second participle, the right collocation word of the first participle and the left collocation word of the second participle, and to count the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word;
a first calculation unit, used to calculate the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word through a preset probability formula, the co-occurrence word frequency of the first participle with the right collocation word, and the co-occurrence word frequency of the second participle with the left collocation word;
a second calculation unit, used to calculate the right information entropy of the first participle and the left information entropy of the second participle through a preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word;
a first determining unit, which determines the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than a first preset threshold.
In addition, to achieve the above purpose, this application further provides a server, the server including a memory, a processor, and a program for identifying new words based on information entropy that is stored in the memory and executable on the processor, where the program, when executed by the processor, implements the steps of the method for identifying new words based on information entropy described above.
In addition, to achieve the above purpose, this application further provides a computer-readable storage medium storing computer instructions that, when run on a computer, cause the computer to execute the method for identifying new words based on information entropy described above.
The method, device, server, and computer-readable storage medium for identifying new words based on information entropy proposed by the embodiments of this application obtain a target phrase in a text to be processed, divide the target phrase into a first participle and a second participle, and separately read the information of each; based on this information, they obtain the right collocation word of the first participle and the left collocation word of the second participle and count the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word; through the preset probability formula and these co-occurrence word frequencies, they calculate the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word; through the preset information entropy formula and these probability values, they calculate the right information entropy of the first participle and the left information entropy of the second participle; and they determine the target phrase to be a new word when both entropies are less than the first preset threshold. By segmenting the phrase and calculating information entropy values, the uncertainty of new words in the text to be processed is obtained, so that new words are identified and the recognition rate is improved.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the server in the hardware operating environment involved in the embodiments of this application;
FIG. 2 is a schematic flowchart of a first embodiment of the method for identifying new words based on information entropy of this application;
FIG. 3 is a schematic flowchart of a second embodiment of the method for identifying new words based on information entropy of this application;
FIG. 4 is a schematic flowchart of a third embodiment of the method for identifying new words based on information entropy of this application;
FIG. 5 is a schematic flowchart of a fourth embodiment of the method for identifying new words based on information entropy of this application;
FIG. 6 is a schematic flowchart of a fifth embodiment of the method for identifying new words based on information entropy of this application;
FIG. 7 is a schematic flowchart of a sixth embodiment of the method for identifying new words based on information entropy of this application;
FIG. 8 is a schematic flowchart of a seventh embodiment of the method for identifying new words based on information entropy of this application.
Detailed Description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
The main solution of the embodiments of this application is: obtaining a target phrase in a text to be processed, dividing the target phrase into a first participle and a second participle, and separately reading the information of the first participle and the information of the second participle; based on this information, obtaining the right collocation word of the first participle and the left collocation word of the second participle, and counting the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word; calculating the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word through a preset probability formula and the two co-occurrence word frequencies; calculating the right information entropy of the first participle and the left information entropy of the second participle through a preset information entropy formula and the two co-occurrence probability values; and determining the target phrase to be a new word when both entropies are less than a first preset threshold.
In the prior art, current word segmentation systems cannot segment phrases and thus cannot identify new words among multiple participles and participle phrases, resulting in the technical problem of a low recognition rate.
This application provides a solution: by splitting a phrase and calculating the information entropy values of its participles, the uncertainty of the phrase in the text to be processed is judged, so that new words are identified and the recognition rate is improved.
As shown in FIG. 1, FIG. 1 is a schematic structural diagram of the server in the hardware operating environment involved in the embodiments of this application.
The terminal in the embodiments of this application is a server.
As shown in FIG. 1, the terminal may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to implement connection and communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a stable non-volatile memory such as a magnetic disk memory, and may optionally also be a storage device independent of the aforementioned processor 1001.
Those skilled in the art will understand that the terminal structure shown in FIG. 1 does not constitute a limitation on the terminal, which may include more or fewer components than shown, combine certain components, or arrange components differently.
As shown in FIG. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a program for identifying new words based on information entropy.
In the terminal shown in FIG. 1, the network interface 1004 is mainly used to connect to a backend server for data communication; the user interface 1003 is mainly used to connect to a client for data communication; and the processor 1001 may be used to call the program for identifying new words based on information entropy stored in the memory 1005 and perform the following operations:
obtaining a target phrase in a text to be processed, dividing the target phrase into a first participle and a second participle, and separately reading the information of the first participle and the information of the second participle;
based on the information of the first participle and the information of the second participle, obtaining the right collocation word of the first participle and the left collocation word of the second participle, and counting the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word;
calculating the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word through a preset probability formula, the co-occurrence word frequency of the first participle with the right collocation word, and the co-occurrence word frequency of the second participle with the left collocation word;
calculating the right information entropy of the first participle and the left information entropy of the second participle through a preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word;
determining the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than a first preset threshold.
Further, the processor 1001 may call the program for identifying new words based on information entropy stored in the memory 1005 and also perform the following operations:
obtaining the target phrase in the text to be processed, and calling the word segmentation attributes in the Chinese word segmentation system to judge whether the target phrase is a new word;
when it is judged that the target phrase is not a new word, starting the Chinese word segmentation sequence in the Chinese word segmentation system to divide the phrase into the first participle and the second participle, and obtaining the name information of the first participle and the name information of the second participle, where the first participle and the second participle combine to form the target phrase.
Further, the processor 1001 may call the program for identifying new words based on information entropy stored in the memory 1005 and also perform the following operations:
using the name information of the first participle and the name information of the second participle as index conditions, obtaining the position and first word frequency of the first participle in the text to be processed and the position and second word frequency of the second participle in the text to be processed;
based on the position of the first participle and the position of the second participle in the text to be processed, obtaining the right collocation word of the first participle and the left collocation word of the second participle, and separately counting the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word.
Further, the processor 1001 may call the program for identifying new words based on information entropy stored in the memory 1005 and also perform the following operations:
after obtaining the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word, calling the preset probability formula;
substituting the co-occurrence word frequency of the first participle with the right collocation word, the first word frequency, the co-occurrence word frequency of the second participle with the left collocation word, and the second word frequency into the preset probability calculation formula to obtain the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word.
Further, the processor 1001 may call the program for identifying new words based on information entropy stored in the memory 1005 and also perform the following operations:
after obtaining the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word, calling the preset information entropy formula;
substituting the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word into the preset information entropy formula to calculate the right information entropy of the first participle or the left information entropy of the second participle.
Further, the processor 1001 may call the program for identifying new words based on information entropy stored in the memory 1005 and also perform the following operations:
when the retrieved first preset threshold is 0.9, judging whether the right information entropy of the first participle and the left information entropy of the second participle are less than 0.9;
determining the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than the first preset threshold of 0.9.
Further, the processor 1001 may call the program for identifying new words based on information entropy stored in the memory 1005 and also perform the following operations:
when the right information entropy of the first participle and/or the left information entropy of the second participle is greater than or equal to the first preset threshold of 0.9, obtaining the co-occurrence word frequency of the target phrase in the text to be processed;
determining the target phrase to be a new word when the co-occurrence word frequency of the target phrase in the text to be processed is greater than a second preset threshold of 5.
Referring to FIG. 2, a first embodiment of the method for identifying new words based on information entropy of this application includes:
Step S10: obtaining a target phrase in a text to be processed, dividing the target phrase into a first participle and a second participle, and separately reading the information of the first participle and the information of the second participle.
The server obtains the target phrase in the text to be processed; for example, the server calls preset character recognition software to recognize the text to be processed and obtain all characters in it, where the characters include digits, letters, and phrases composed of Chinese and English. When the server obtains the target phrase, it calls a Chinese word segmentation system to divide the phrase into the first participle and the second participle (the system is not limited to dividing a phrase into two participles and may also divide it into a first participle, a second participle, a third participle, and so on), and obtains information such as the part of speech, attributes, name, and word frequency of the first participle.
Step S20: based on the information of the first participle and the information of the second participle, obtaining the right collocation word of the first participle and the left collocation word of the second participle, and counting the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word.
When the server has separately obtained the information of the first participle and of the second participle, it obtains the right collocation word of the first participle and the left collocation word of the second participle. Specifically, when the server obtains the name information of the first participle, it uses that name information as a search condition to find the positions where the first participle appears in the text to be processed, and takes the first word to the right of the first participle as its right collocation word. The right collocation word may be a punctuation mark, a space, a preposition, a verb, and so on, but when the first participle is a verb, the right collocation word cannot be a noun; the co-occurrence word frequency of the first participle with the right collocation word must be greater than 1, and there is at least one right collocation word. Likewise, when the server obtains the name information of the second participle, it uses it as a search condition to find the positions where the second participle appears in the text, and takes the first word to the left of the second participle as its left collocation word. The left collocation word may be a punctuation mark, a space, a preposition, a noun, and so on, but when the left collocation word is a noun, the second participle cannot be a verb; the co-occurrence word frequency of the second participle with the left collocation word must be greater than 1, and there is at least one left collocation word. The server then separately obtains the frequency with which the first participle and the right collocation word appear together and the frequency with which the second participle and the left collocation word appear together, with no other participle appearing between them.
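A minimal sketch of the collocation-counting step described above, assuming the text has already been segmented into a token list. The part-of-speech and punctuation constraints described in this paragraph are omitted for brevity, and the tokens are invented for the example.

```python
from collections import Counter

def collocation_counts(tokens, first, second):
    """Count the right neighbors of `first` and the left neighbors of
    `second` in a segmented text, i.e. the co-occurrence word frequencies
    with each candidate collocation word."""
    right = Counter(tokens[i + 1] for i, t in enumerate(tokens[:-1]) if t == first)
    left = Counter(tokens[i - 1] for i, t in enumerate(tokens) if i > 0 and t == second)
    return right, left

tokens = ["machine", "learning", "is", "hot", ",",
          "machine", "learning", "and", "machine", "tools"]
right, left = collocation_counts(tokens, "machine", "learning")
print(right)  # Counter({'learning': 2, 'tools': 1})
print(left)   # Counter({'machine': 2})
```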
Step S30: calculating the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word through a preset probability formula, the co-occurrence word frequency of the first participle with the right collocation word, and the co-occurrence word frequency of the second participle with the left collocation word.
When the server has obtained the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word, it calls the preset probability calculation formula and separately calculates the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word. In a specific implementation, the word frequency of the first participle and the word frequency of the second participle are each taken as the total value of the whole event. The co-occurrence word frequency of the first participle with the right collocation word is taken as the numerator and divided by the word frequency of the first participle, where the word frequency of the first participle includes the co-occurrence word frequency of the first participle with the right collocation word, the word frequency of the phrase (the phrase includes the first participle), and the frequency of the first participle with other participles; this yields the co-occurrence probability value of the first participle with the right collocation word. The co-occurrence word frequency of the second participle with the left collocation word is taken as the numerator and divided by the word frequency of the second participle, where the word frequency of the second participle includes the co-occurrence word frequency of the second participle with the left collocation word, the word frequency of the phrase (the phrase includes the second participle), and the frequency of the second participle with other participles; this yields the co-occurrence probability value of the second participle with the left collocation word.
Step S40: calculating the right information entropy of the first participle and the left information entropy of the second participle through a preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word.
When the server has obtained the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word, it calls the preset information entropy formula. The co-occurrence probability value of the first participle with the right collocation word is substituted into the preset information entropy formula to calculate the right information entropy of the first participle; at the same time, the co-occurrence probability value of the second participle with the left collocation word is substituted into the preset information entropy formula to calculate the left information entropy of the second participle.
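The probability and information entropy formulas referred to above are image-based in the original publication and are not reproduced in this text. The following LaTeX is an assumed reconstruction consistent with the surrounding description (probability = co-occurrence frequency divided by the participle's word frequency; entropy summed over collocation words), not the patent's verbatim formulas:

```latex
% Assumed reconstruction of the omitted formulas.
% n(w_1, r_i): co-occurrence word frequency of the first participle w_1
% with its i-th right collocation word r_i; N_1: first word frequency.
p(r_i \mid w_1) = \frac{n(w_1, r_i)}{N_1}, \qquad
p(l_j \mid w_2) = \frac{n(l_j, w_2)}{N_2}

% Right information entropy of w_1 and left information entropy of w_2,
% summed over the collocation words (logarithm base unspecified here).
H_r(w_1) = -\sum_i p(r_i \mid w_1)\,\log p(r_i \mid w_1), \qquad
H_l(w_2) = -\sum_j p(l_j \mid w_2)\,\log p(l_j \mid w_2)
```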
Step S50: determining the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than a first preset threshold.
When the server has obtained the right information entropy of the first participle and the left information entropy of the second participle, it judges whether they are less than the first preset threshold. When both the right information entropy of the first participle and the left information entropy of the second participle are less than the first preset threshold, the target phrase is determined to be a new word. In a specific implementation, upon obtaining the right information entropy of the first participle or the left information entropy of the second participle, the server retrieves a database of values less than the first preset threshold and matches the entropy against the data in that database; when the match succeeds, the entropy is less than the first preset threshold, and the target phrase is determined to be a new word.
In this embodiment, the target phrase is divided into a first participle and a second participle, the information of each in the text is read, the right collocation word of the first participle and the left collocation word of the second participle are obtained, and the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word are counted; the co-occurrence probability values are obtained through the preset probability formula, the right information entropy of the first participle and the left information entropy of the second participle are calculated through the preset information entropy formula, and the phrase is determined to be a new word when both entropies are less than the first preset threshold. By splitting the phrase and calculating the information entropy values of its participles, the uncertainty of the phrase in the text to be processed is judged, so that new words are identified and the recognition rate is improved.
Further, referring to FIG. 3, FIG. 3 shows a second embodiment of the method for identifying new words based on information entropy of this application. Based on the embodiment shown in FIG. 2, step S10 includes:
Step S11: obtaining the target phrase in the text to be processed, and calling the word segmentation attributes in the Chinese word segmentation system to judge whether the target phrase is a new word;
Step S12: when it is judged that the phrase is not a new word, calling the Chinese word segmentation sequence in the Chinese word segmentation system to divide the phrase into a first participle and a second participle, and obtaining the name information of the first participle and the name information of the second participle, where the first participle and the second participle combine to form the target phrase.
When the server obtains the target phrase in the text to be processed, it calls the word segmentation attributes in the Chinese word segmentation system to judge whether the phrase is a new word. Chinese word segmentation refers to cutting a sequence of Chinese characters into individual words. Chinese word segmentation is the basis of text mining: for an input passage of Chinese, successful segmentation can achieve the effect of automatically recognizing the meaning of the sentence. All words are stored in the Chinese word segmentation system, the text to be processed is scanned, all possible words are looked up, and it is then decided which words can be output. For example, for the text to be processed 我是学生 ("I am a student"), the words are 我/是/学生. The Chinese word segmentation system records the attributes of different phrases; when the server does not find the attributes of the phrase in the system, it judges that the phrase is not a new word. When the server judges that the phrase is not a new word, it starts the Chinese word segmentation system to segment the phrase: based on the Chinese word segmentation sequence, the system divides the phrase into a first participle and a second participle and obtains the name information of each (a phrase may also be divided into more than two participles, not limited to a first and a second), where the first participle and the second participle compose the phrase with no other characters, let alone punctuation, between them.
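As a hedged illustration of why a dictionary-driven segmenter splits an unseen phrase, the sketch below implements forward maximum matching, a common dictionary-lookup baseline; it is a simplified stand-in for the Chinese word segmentation system described here, not the algorithm of any particular system such as nlpir mentioned in the background.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward-maximum-matching segmentation: at each position,
    take the longest dictionary entry, falling back to one character."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + j] in dictionary or j == 1:
                tokens.append(text[i:i + j])
                i += j
                break
    return tokens

dictionary = {"机器", "学习", "我", "在"}
print(forward_max_match("我在机器学习", dictionary))
# ['我', '在', '机器', '学习']: "机器学习" is split apart because the
# dictionary has no entry for the whole phrase.
```

Adding the whole phrase to the dictionary makes it come out as one token, which is exactly the effect new-word identification aims to enable.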
In this embodiment, when the server obtains the target phrase in the text to be processed, it calls the word segmentation attributes in the Chinese word segmentation system to judge whether the phrase is a new word; when it judges that the phrase is not a new word, it calls the Chinese word segmentation sequence in the system to divide the target phrase into a first participle and a second participle and separately reads the name information of the first participle and the name information of the second participle. The target phrase is first judged through the word segmentation attributes of the Chinese word segmentation system and then segmented through its Chinese word segmentation sequence, which improves the efficiency of identifying new words.
Referring to FIG. 4, FIG. 4 shows a third embodiment of the method for identifying new words based on information entropy of this application. Based on the embodiment shown in FIG. 2, step S20 includes:
Step S21: using the name information of the first participle and the name information of the second participle as index conditions, obtaining the position and first word frequency of the first participle in the text to be processed and the position and second word frequency of the second participle in the text to be processed;
Step S22: based on the position of the first participle and the position of the second participle in the text to be processed, obtaining the right collocation word of the first participle and the left collocation word of the second participle, and separately counting the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word.
When the server obtains the name information of the first participle and of the second participle, it searches the text to be processed using each name as a search condition. When a participle identical to the name information of the first participle is found, the server obtains the position of the first participle in the text, for example by displaying it in the text, where the display may be marked by brightness, color, and so on; after recognizing these marks and obtaining the position of the first participle in the text, the server takes the first word to its right as the right collocation word of the first participle and records the co-occurrence word frequency of the first participle with the right collocation word in the text. When a participle identical to the name information of the second participle is found, it is likewise displayed in the text, marked by brightness, color, and so on; the server obtains the left collocation word of the second participle and records the co-occurrence word frequency of the second participle with the left collocation word in the text.
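The indexing step above (name information used as an index condition, yielding both positions and word frequencies) can be sketched as a simple position index; the token names are invented for the example.

```python
from collections import defaultdict

def build_position_index(tokens):
    """Map every token name to the list of positions where it occurs;
    positions and word frequencies then come from a single lookup."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index

tokens = ["machine", "learning", "is", "machine", "learning"]
index = build_position_index(tokens)
print(index["machine"])        # [0, 3]: positions of the first participle
print(len(index["machine"]))   # 2: its word frequency
```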
In this embodiment, when the server obtains the name information of the first participle and the name information of the second participle, it uses them as indexes to obtain the position and word frequency of the first participle in the text to be processed and the position and word frequency of the second participle in the text to be processed; based on these positions, it obtains the right collocation word of the first participle and the left collocation word of the second participle and separately counts the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word, thereby quickly obtaining the right collocation word and word frequency of the first participle and the left collocation word and word frequency of the second participle.
Referring to FIG. 5, FIG. 5 shows a fourth embodiment of the method for identifying new words based on information entropy of this application. Based on the embodiment shown in FIG. 2, step S30 includes:
Step S31: after obtaining the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word, calling the preset probability formula;
Step S32: substituting the co-occurrence word frequency of the first participle with the right collocation word, the first word frequency, the co-occurrence word frequency of the second participle with the left collocation word, and the second word frequency into the preset probability calculation formula to obtain the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word.
When the server has obtained the co-occurrence word frequencies, it calls the preset probability calculation formula, substitutes the co-occurrence word frequency of the first participle with the right collocation word and the word frequency of the first participle into the formula, and calculates the co-occurrence probability value of the first participle with the right collocation word; in the formula, the numerator is the co-occurrence word frequency of the first participle with the right collocation word and N is the first word frequency of the first participle, which includes the word frequency of the target phrase, the word frequency of the first participle with the right collocation word, and the word frequency of the first participle with other right collocation words. The obtained co-occurrence word frequency of the second participle with the left collocation word and the word frequency of the second participle are likewise substituted into the preset probability calculation formula to calculate the co-occurrence probability value of the second participle with the left collocation word; here the numerator is the co-occurrence word frequency of the second participle with the left collocation word and N is the second word frequency of the second participle, which includes the word frequency of the target phrase, the word frequency of the second participle with the left collocation word, and the word frequency of the second participle with other left collocation words.
In this embodiment, having obtained the co-occurrence word frequencies, the server obtains through the preset probability formula the probability value of the first participle with the right collocation word and the probability value of the second participle with the left collocation word, that is, the probability that the first participle and the right collocation word appear together in the text to be processed and the probability that the second participle and the left collocation word appear together in the text to be processed.
Referring to FIG. 6, FIG. 6 shows a fifth embodiment of the method for identifying new words based on information entropy of this application. Based on the embodiment shown in FIG. 2, step S40 includes:
Step S41: after obtaining the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word, calling the preset information entropy formula;
Step S42: substituting the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word into the preset information entropy formula to calculate the right information entropy of the first participle or the left information entropy of the second participle.
When the server has obtained the co-occurrence probability values, it calls the preset information entropy formula (in which the index i starts from 1), substitutes the co-occurrence probability value of the first participle with the right collocation word into the formula, and calculates the information entropy value over the right collocation words of the first participle; the obtained co-occurrence probability value of the second participle with the left collocation word is likewise substituted into the preset information entropy formula to calculate the information entropy value over the left collocation words of the second participle.
In this embodiment, the server substitutes the obtained co-occurrence probability value of the first participle with the right collocation word into the preset information entropy formula and calculates the information entropy value of the right collocation words of the first participle, and substitutes the obtained co-occurrence probability value of the second participle with the left collocation word into the formula and calculates the information entropy value of the left collocation words of the second participle; both entropy values are thus calculated through the preset information entropy formula.
Referring to FIG. 7, FIG. 7 shows a sixth embodiment of the method for identifying new words based on information entropy of this application. Based on the embodiment shown in FIG. 2, step S50 includes:
Step S51: when the retrieved first preset threshold is 0.9, judging whether the right information entropy of the first participle and the left information entropy of the second participle are less than 0.9;
Step S52: determining the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than the first preset threshold of 0.9.
When the server has calculated the right information entropy of the first participle and the left information entropy of the second participle, it obtains the first preset threshold of 0.9 and judges whether the right information entropy of the first participle and the left information entropy of the second participle are less than the first preset threshold of 0.9; when both are less than 0.9, it determines the target phrase to be a new word. For example, when the obtained right information entropy of the first participle is 0.81 and the left information entropy of the second participle is 0.82, both values are less than 0.9, and the server determines the target phrase to be a new word.
In this embodiment, when the server obtains the right information entropy of the first participle and the left information entropy of the second participle, it judges whether they are less than the first preset threshold of 0.9; when both are less than the first preset threshold of 0.9, the target phrase is determined to be a new word. Based on the uncertainty measured by information entropy, the entropy values are used to determine that the corresponding phrase is a new word.
Referring to FIG. 8, FIG. 8 shows a seventh embodiment of the method for identifying new words based on information entropy of this application. Based on the embodiment shown in FIG. 7, after step S51, the method further includes:
Step S60: when the right information entropy of the first participle and/or the left information entropy of the second participle is greater than or equal to the first preset threshold of 0.9, obtaining the co-occurrence word frequency of the target phrase in the text to be processed;
Step S70: determining the target phrase to be a new word when the co-occurrence word frequency of the target phrase in the text to be processed is greater than a second preset threshold of 5.
When the right information entropy of the first participle and/or the left information entropy of the second participle is greater than or equal to the first preset threshold of 0.9, the server obtains the co-occurrence word frequency of the first participle and the second participle in the text to be processed, that is, the frequency with which the first participle and the second participle appear adjacent in the text with no characters or punctuation between them, which is exactly the target phrase. For example, when the obtained right information entropy of the first participle is 0.91 and the left information entropy of the second participle is 0.8, the server does not determine the target phrase to be a new word on the basis of the information entropy alone. There are many ways for the server to obtain the co-occurrence word frequency of the first participle and the second participle in the text to be processed: for example, by displaying the first participle and the second participle in the text and recording the frequency with which they appear adjacent, or by using the phrase as a search condition to search the text and obtain the word frequency of the phrase, where the word frequency of the phrase is the co-occurrence word frequency of the first participle and the second participle. When the obtained co-occurrence word frequency of the first participle and the second participle is greater than the second preset threshold of 5, the phrase is determined to be a new word.
In this embodiment, when the server judges that the right information entropy of the first participle or the left information entropy of the second participle is greater than or equal to the first preset threshold of 0.9, it obtains the co-occurrence word frequency of the first participle and the second participle; when this co-occurrence word frequency is greater than the second preset threshold of 5, the server determines the target phrase to be a new word. Judging the target phrase by the number of times the first participle and the second participle appear together in the text to be processed avoids missing new words in the text.
In addition, an embodiment of this application further proposes a device for identifying new words based on information entropy, the device including:
a reading unit, used to obtain a target phrase in a text to be processed, divide the target phrase into a first participle and a second participle, and separately read the information of the first participle and the information of the second participle;
a statistical unit, used to obtain, based on the information of the first participle and the information of the second participle, the right collocation word of the first participle and the left collocation word of the second participle, and to count the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word;
a first calculation unit, used to calculate the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word through a preset probability formula, the co-occurrence word frequency of the first participle with the right collocation word, and the co-occurrence word frequency of the second participle with the left collocation word;
a second calculation unit, used to calculate the right information entropy of the first participle and the left information entropy of the second participle through a preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word;
a first determining unit, which determines the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than a first preset threshold.
Further, the reading unit is specifically used to:
obtain the target phrase in the text to be processed, and call the word segmentation attributes in the Chinese word segmentation system to judge whether the target phrase is a new word;
when it is judged that the phrase is not a new word, start the Chinese word segmentation sequence in the Chinese word segmentation system to divide the phrase into the first participle and the second participle, and obtain the name information of the first participle and the name information of the second participle, where the first participle and the second participle combine to form the target phrase.
Further, the statistical unit is specifically used to:
use the name information of the first participle and the name information of the second participle as index conditions to obtain the position and first word frequency of the first participle in the text to be processed and the position and second word frequency of the second participle in the text to be processed;
based on the position of the first participle and the position of the second participle in the text to be processed, obtain the right collocation word of the first participle and the left collocation word of the second participle, and separately count the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word.
Further, the first calculation unit is specifically used to:
call the preset probability formula after obtaining the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word;
substitute the co-occurrence word frequency of the first participle with the right collocation word, the first word frequency, the co-occurrence word frequency of the second participle with the left collocation word, and the second word frequency into the preset probability calculation formula to obtain the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word.
Further, the second calculation unit is specifically used to:
call the preset information entropy formula after obtaining the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word;
substitute the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word into the preset information entropy formula to calculate the right information entropy of the first participle or the left information entropy of the second participle.
Further, the first determining unit is specifically used to:
when the retrieved first preset threshold is 0.9, judge whether the right information entropy of the first participle and the left information entropy of the second participle are less than 0.9;
determine the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than the first preset threshold of 0.9.
Further, the device for identifying new words based on information entropy further includes:
an acquiring unit, used to obtain the co-occurrence word frequency of the target phrase in the text to be processed when the right information entropy of the first participle and/or the left information entropy of the second participle is greater than or equal to the first preset threshold of 0.9;
a second determining unit, used to determine the target phrase to be a new word when the co-occurrence word frequency of the target phrase in the text to be processed is greater than a second preset threshold of 5.
The function implementation of each unit in the above device for identifying new words based on information entropy corresponds to the steps in the embodiments of the method for identifying new words based on information entropy; their functions and implementation processes are not repeated one by one here.
In addition, this application further provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions that, when run on a computer, cause the computer to execute the following steps:
obtaining a target phrase in a text to be processed, dividing the target phrase into a first participle and a second participle, and separately reading the information of the first participle and the information of the second participle;
based on the information of the first participle and the information of the second participle, obtaining the right collocation word of the first participle and the left collocation word of the second participle, and counting the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word;
calculating the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word through a preset probability formula, the co-occurrence word frequency of the first participle with the right collocation word, and the co-occurrence word frequency of the second participle with the left collocation word;
calculating the right information entropy of the first participle and the left information entropy of the second participle through a preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word;
determining the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than a first preset threshold.
It should be noted that, as used herein, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes that element.
The serial numbers of the above embodiments of this application are for description only and do not represent the superiority or inferiority of the embodiments.
Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disk), including several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit its patent scope; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. A method for identifying new words based on information entropy, the method comprising:
    obtaining a target phrase in a text to be processed, dividing the target phrase into a first participle and a second participle, and separately reading the information of the first participle and the information of the second participle;
    based on the information of the first participle and the information of the second participle, obtaining the right collocation word of the first participle and the left collocation word of the second participle, and counting the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word;
    calculating the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word through a preset probability formula, the co-occurrence word frequency of the first participle with the right collocation word, and the co-occurrence word frequency of the second participle with the left collocation word;
    calculating the right information entropy of the first participle and the left information entropy of the second participle through a preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word;
    determining the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than a first preset threshold.
  2. The method for identifying new words based on information entropy according to claim 1, wherein obtaining the target phrase in the text to be processed, dividing the target phrase into the first participle and the second participle, and separately reading the information of the first participle and the information of the second participle comprises:
    obtaining the target phrase in the text to be processed, and calling the word segmentation attributes in a Chinese word segmentation system to judge whether the target phrase is a new word;
    when it is judged that the phrase is not a new word, calling the Chinese word segmentation sequence in the Chinese word segmentation system to divide the target phrase into the first participle and the second participle, and obtaining the name information of the first participle and the name information of the second participle, wherein the first participle and the second participle combine to form the target phrase.
  3. The method for identifying new words based on information entropy according to claim 2, wherein obtaining, based on the information of the first participle and the information of the second participle, the right collocation word of the first participle and the left collocation word of the second participle, and counting the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word comprises:
    using the name information of the first participle and the name information of the second participle as index conditions, obtaining the position and first word frequency of the first participle in the text to be processed and the position and second word frequency of the second participle in the text to be processed;
    based on the position of the first participle and the position of the second participle in the text to be processed, obtaining the right collocation word of the first participle and the left collocation word of the second participle, and separately counting the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word.
  4. The method for identifying new words based on information entropy according to claim 3, wherein calculating, through the preset probability formula, the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word comprises:
    after obtaining the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word, calling the preset probability formula;
    substituting the co-occurrence word frequency of the first participle with the right collocation word, the first word frequency, the co-occurrence word frequency of the second participle with the left collocation word, and the second word frequency into the preset probability calculation formula to obtain the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word.
  5. The method for identifying new words based on information entropy according to claim 4, wherein calculating the right information entropy of the first participle or the left information entropy of the second participle through the preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word comprises:
    after obtaining the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word, calling the preset information entropy formula;
    substituting the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word into the preset information entropy formula to calculate the right information entropy of the first participle or the left information entropy of the second participle.
  6. The method for identifying new words based on information entropy according to any one of claims 1-5, wherein determining the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than the first preset threshold comprises:
    when the retrieved first preset threshold is 0.9, judging whether the right information entropy of the first participle and the left information entropy of the second participle are less than 0.9;
    determining the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than the first preset threshold of 0.9.
  7. The method for identifying new words based on information entropy according to claim 6, wherein after determining the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than or equal to the first preset threshold, the method further comprises:
    when the right information entropy of the first participle and/or the left information entropy of the second participle is greater than or equal to the first preset threshold of 0.9, obtaining the co-occurrence word frequency of the target phrase in the text to be processed;
    determining the target phrase to be a new word when the co-occurrence word frequency of the target phrase in the text to be processed is greater than a second preset threshold of 5.
  8. A device for identifying new words based on information entropy, the device comprising:
    a reading unit, configured to obtain a target phrase in a text to be processed, divide the target phrase into a first participle and a second participle, and separately read the information of the first participle and the information of the second participle;
    a statistical unit, configured to obtain, based on the information of the first participle and the information of the second participle, the right collocation word of the first participle and the left collocation word of the second participle, and to count the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word;
    a first calculation unit, configured to calculate the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word through a preset probability formula, the co-occurrence word frequency of the first participle with the right collocation word, and the co-occurrence word frequency of the second participle with the left collocation word;
    a second calculation unit, configured to calculate the right information entropy of the first participle and the left information entropy of the second participle through a preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word;
    a first determining unit, which determines the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than a first preset threshold.
  9. The device for identifying new words based on information entropy according to claim 8, wherein the reading unit is specifically configured to:
    obtain the target phrase in the text to be processed, and call the word segmentation attributes in a Chinese word segmentation system to judge whether the target phrase is a new word;
    when it is judged that the phrase is not a new word, start the Chinese word segmentation sequence in the Chinese word segmentation system to divide the phrase into the first participle and the second participle, and obtain the name information of the first participle and the name information of the second participle, wherein the first participle and the second participle combine to form the target phrase.
  10. The device for identifying new words based on information entropy according to claim 9, wherein the statistical unit is specifically configured to:
    use the name information of the first participle and the name information of the second participle as index conditions to obtain the position and first word frequency of the first participle in the text to be processed and the position and second word frequency of the second participle in the text to be processed;
    based on the position of the first participle and the position of the second participle in the text to be processed, obtain the right collocation word of the first participle and the left collocation word of the second participle, and separately count the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word.
  11. The device for identifying new words based on information entropy according to claim 10, wherein the first calculation unit is specifically configured to:
    call the preset probability formula after obtaining the co-occurrence word frequency of the first participle with the right collocation word and the co-occurrence word frequency of the second participle with the left collocation word;
    substitute the co-occurrence word frequency of the first participle with the right collocation word, the first word frequency, the co-occurrence word frequency of the second participle with the left collocation word, and the second word frequency into the preset probability calculation formula to obtain the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word.
  12. The device for identifying new words based on information entropy according to claim 11, wherein the second calculation unit is specifically configured to:
    call the preset information entropy formula after obtaining the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word;
    substitute the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word into the preset information entropy formula to calculate the right information entropy of the first participle or the left information entropy of the second participle.
  13. The device for identifying new words based on information entropy according to any one of claims 8-12, wherein the first determining unit is specifically configured to:
    when the retrieved first preset threshold is 0.9, judge whether the right information entropy of the first participle and the left information entropy of the second participle are less than 0.9;
    determine the target phrase to be a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than the first preset threshold of 0.9.
  14. The device for identifying new words based on information entropy according to claim 13, further comprising:
    an acquiring unit, configured to obtain the co-occurrence word frequency of the target phrase in the text to be processed when the right information entropy of the first participle and/or the left information entropy of the second participle is greater than or equal to the first preset threshold of 0.9;
    a second determining unit, configured to determine the target phrase to be a new word when the co-occurrence word frequency of the target phrase in the text to be processed is greater than a second preset threshold of 5.
  15. 一种基于信息熵识别新词的设备,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如下步骤:
    获取待处理文本中的目标短语,将所述目标短语划分为第一分词和第二分词,并分别读取所述第一分词的信息和所述第二分词的信息;
    基于所述第一分词的信息和所述第二分词的信息,获取所述第一分词的右搭配词和所述第二分词的左搭配词,并统计所述第一分词与所述右搭配词的共现词频以及所述第二分词与所述左搭配词的共现词频;
    通过预置概率公式、所述第一分词与所述右搭配词的共现词频以及所述第二分词与所述左搭配词的共现词频,计算出所述第一分词与所述右搭配词的共现概率值和所述第二分词与所述左搭配词的共现概率值;
    通过预置信息熵公式、所述第一分词与所述右搭配词的共现概率值和所述第二分词与所述左搭配词的共现概率值,计算出所述第一分词的右信息熵和所述第二分词的左信息熵;
    在所述第一分词的右信息熵和所述第二分词的左信息熵均小于第一预置阈值时,确定所述目标短语为新词。
  16. The device for identifying new words based on information entropy according to claim 15, wherein when the processor executes the computer program to implement the acquiring of a target phrase from text to be processed, the dividing of the target phrase into a first participle and a second participle, and the separate reading of information of the first participle and information of the second participle, the following steps are included:
    acquiring a target phrase from the text to be processed, and invoking a word segmentation attribute of a Chinese word segmentation system to determine whether the target phrase is a new word;
    when it is determined that the phrase is not a new word, invoking a Chinese word segmentation sequence of the Chinese word segmentation system to divide the target phrase into a first participle and a second participle, and acquiring name information of the first participle and name information of the second participle, wherein the first participle and the second participle combine to form the target phrase.
  17. The device for identifying new words based on information entropy according to claim 16, wherein when the processor executes the computer program to implement the acquiring, based on the information of the first participle and the information of the second participle, of the right collocation word of the first participle and the left collocation word of the second participle, and the counting of the co-occurrence frequency of the first participle with the right collocation word and the co-occurrence frequency of the second participle with the left collocation word, the following steps are included:
    using the name information of the first participle and the name information of the second participle as index conditions to acquire the position and a first word frequency of the first participle in the text to be processed, and the position and a second word frequency of the second participle in the text to be processed;
    based on the position of the first participle in the text to be processed and the position of the second participle in the text to be processed, acquiring the right collocation word of the first participle and the left collocation word of the second participle, and separately counting the co-occurrence frequency of the first participle with the right collocation word and the co-occurrence frequency of the second participle with the left collocation word.
  18. The device for identifying new words based on information entropy according to claim 17, wherein when the processor executes the computer program to implement the calculating, according to the preset probability formula, of the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word, the following steps are included:
    after the co-occurrence frequency of the first participle with the right collocation word and the co-occurrence frequency of the second participle with the left collocation word are acquired, invoking the preset probability formula;
    substituting, respectively, the co-occurrence frequency of the first participle with the right collocation word together with the first word frequency, and the co-occurrence frequency of the second participle with the left collocation word together with the second word frequency, into the preset probability formula, to obtain the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word.
  19. The device for identifying new words based on information entropy according to claim 18, wherein when the processor executes the computer program to implement the calculating, according to the preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word, of the right information entropy of the first participle and the left information entropy of the second participle, the following steps are included:
    after the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word are acquired, invoking the preset information entropy formula;
    substituting, respectively, the co-occurrence probability value of the first participle with the right collocation word and the co-occurrence probability value of the second participle with the left collocation word into the preset information entropy formula, to calculate the right information entropy of the first participle and the left information entropy of the second participle.
  20. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    acquiring a target phrase from text to be processed, dividing the target phrase into a first participle and a second participle, and separately reading information of the first participle and information of the second participle;
    based on the information of the first participle and the information of the second participle, acquiring a right collocation word of the first participle and a left collocation word of the second participle, and counting a co-occurrence frequency of the first participle with the right collocation word and a co-occurrence frequency of the second participle with the left collocation word;
    calculating a co-occurrence probability value of the first participle with the right collocation word and a co-occurrence probability value of the second participle with the left collocation word according to a preset probability formula, the co-occurrence frequency of the first participle with the right collocation word, and the co-occurrence frequency of the second participle with the left collocation word;
    calculating a right information entropy of the first participle and a left information entropy of the second participle according to a preset information entropy formula, the co-occurrence probability value of the first participle with the right collocation word, and the co-occurrence probability value of the second participle with the left collocation word;
    determining that the target phrase is a new word when both the right information entropy of the first participle and the left information entropy of the second participle are less than a first preset threshold.
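The device and storage-medium claims repeat the full pipeline of claims 15 and 20. A self-contained end-to-end sketch, under the same assumptions as above (tokenized input, neighbor-token collocates, frequency-ratio probabilities, base-2 entropy, and the 0.9 / 5 thresholds named in claims 13-14):

```python
import math
from collections import Counter

def detect_new_word(tokens, first, second,
                    entropy_threshold=0.9, freq_threshold=5):
    """End-to-end sketch: count collocates, form co-occurrence
    probabilities, compute boundary entropies, apply the thresholds."""
    right, left = Counter(), Counter()
    first_freq = second_freq = phrase_freq = 0
    for i, tok in enumerate(tokens):
        if tok == first:
            first_freq += 1
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1          # right collocate
                if tokens[i + 1] == second:
                    phrase_freq += 1               # whole-phrase co-occurrence
        if tok == second:
            second_freq += 1
            if i > 0:
                left[tokens[i - 1]] += 1           # left collocate

    def entropy(counts, total):
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    right_h = entropy(right, first_freq) if first_freq else 0.0
    left_h = entropy(left, second_freq) if second_freq else 0.0
    if right_h < entropy_threshold and left_h < entropy_threshold:
        return True                                # claims 13 / 15 path
    return phrase_freq > freq_threshold            # claim 14 fallback
```

For example, a corpus in which "machine" is almost always followed by "learning" gives both boundary entropies near 0, so "machine learning" is flagged as a new word even though the segmenter split it.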
PCT/CN2019/118276 2019-09-19 2019-11-14 Method, apparatus, device and storage medium for identifying new words based on information entropy WO2021051600A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910885192.1 2019-09-19
CN201910885192.1A CN110807322B (zh) 2019-09-19 2019-09-19 Method, apparatus, server and storage medium for identifying new words based on information entropy

Publications (1)

Publication Number Publication Date
WO2021051600A1 true WO2021051600A1 (zh) 2021-03-25

Family

ID=69487658

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118276 WO2021051600A1 (zh) 2019-09-19 2019-11-14 Method, apparatus, device and storage medium for identifying new words based on information entropy

Country Status (2)

Country Link
CN (1) CN110807322B (zh)
WO (1) WO2021051600A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765975B (zh) * 2020-12-25 2023-08-04 北京百度网讯科技有限公司 Word segmentation ambiguity processing method, apparatus, device and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050015251A1 (en) * 2001-05-08 2005-01-20 Xiaobo Pi High-order entropy error functions for neural classifiers
JP2008234026A (ja) * 2007-03-16 2008-10-02 Nippon Hoso Kyokai <Nhk> Word classification device and word classification program
CN103970733A (zh) * 2014-04-10 2014-08-06 北京大学 Graph-structure-based Chinese new word recognition method
CN106776543A (zh) * 2016-11-23 2017-05-31 上海智臻智能网络科技股份有限公司 New word discovery method, apparatus, terminal and server
CN107807918A (zh) * 2017-10-20 2018-03-16 传神联合(北京)信息技术有限公司 Thai word recognition method and apparatus
CN108021558A (zh) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Keyword recognition method, apparatus, electronic device and storage medium
CN108038119A (zh) * 2017-11-01 2018-05-15 平安科技(深圳)有限公司 Method, apparatus and storage medium for discovering investment targets using new words
CN109408818A (zh) * 2018-10-12 2019-03-01 平安科技(深圳)有限公司 New word recognition method, apparatus, computer device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955950A (zh) * 2016-04-29 2016-09-21 乐视控股(北京)有限公司 New word discovery method and apparatus
CN107180025B (zh) * 2017-03-31 2020-05-29 北京奇艺世纪科技有限公司 New word recognition method and apparatus
CN109614499B (zh) * 2018-11-22 2023-02-17 创新先进技术有限公司 Dictionary generation method, new word discovery method, apparatus and electronic device
CN110110322A (zh) * 2019-03-29 2019-08-09 泰康保险集团股份有限公司 Internet new word discovery method, apparatus, electronic device and storage medium


Also Published As

Publication number Publication date
CN110807322A (zh) 2020-02-18
CN110807322B (zh) 2024-03-01

Similar Documents

Publication Publication Date Title
WO2019184217A1 (zh) Hot event classification method, apparatus and storage medium
US8577882B2 (en) Method and system for searching multilingual documents
EP3819785A1 (en) Feature word determining method, apparatus, and server
WO2019114430A1 (zh) Natural language question understanding method, apparatus and electronic device
CN111459967A (zh) Structured query statement generation method, apparatus, electronic device and medium
CN110297880B (zh) Corpus product recommendation method, apparatus, device and storage medium
US9317608B2 (en) Systems and methods for parsing search queries
JP4237813B2 (ja) Structured document management system
WO2021051600A1 (zh) Method, apparatus, device and storage medium for identifying new words based on information entropy
CN111950267B (zh) Text triple extraction method and apparatus, electronic device and storage medium
CN110765767B (zh) Locally optimized keyword extraction method, apparatus, server and storage medium
CN110909532B (zh) User name matching method, apparatus, computer device and storage medium
CN109977397B (zh) Part-of-speech-combination-based news hotspot extraction method, system and storage medium
CN109783612B (zh) Report data locating method and apparatus, storage medium and terminal
CN110795617A (zh) Search term error correction method and related apparatus
CN114547059A (zh) Platform data update processing method, apparatus and computer device
CN110909128B (zh) Method, device and storage medium for data query using a word root table
JP4091586B2 (ja) Structured document management system, index construction method and program
JP7216241B1 (ja) Chunking execution system, chunking execution method, and program
JP4304226B2 (ja) Structured document management system, structured document management method and program
CN108197151B (zh) Grammar library update method and apparatus
WO2017126057A1 (ja) Information retrieval method
JP3241854B2 (ja) Automatic word spelling correction device
KR100525617B1 (ko) Associated search query extraction method and system
CN118152541A (zh) Model-based information question answering method, system and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945520

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945520

Country of ref document: EP

Kind code of ref document: A1