WO2016127459A1 - Method and device for recognizing unlogged word in intelligent interaction system

Info

Publication number
WO2016127459A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
user
dictionary
input
unregistered
Prior art date
Application number
PCT/CN2015/073842
Other languages
French (fr)
Chinese (zh)
Inventor
张贯京
陈兴明
葛新科
张少鹏
方静芳
高伟明
梁艳妮
周荣
梁昊原
周亮
Original Assignee
深圳市前海安测信息技术有限公司
深圳市易特科信息技术有限公司
深圳市贝沃德克生物技术研究院有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市前海安测信息技术有限公司, 深圳市易特科信息技术有限公司, 深圳市贝沃德克生物技术研究院有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis

Definitions

  • the invention relates to the technical field of computer science, in particular to a method and a device for identifying unregistered words in an intelligent interactive system.
  • The sentence needs to be segmented first, but existing word segmentation gives poor results when a sentence contains unregistered words. This in turn degrades the subsequent calculation of sentence similarity and reduces the intelligence of the intelligent interactive system.
  • the effect of word segmentation depends on the word segmentation algorithm and the word segmentation dictionary.
  • Word segmentation algorithms already achieve good results and are unlikely to improve much further, whereas the completeness of the word segmentation dictionary directly affects the segmentation result. If the word segmentation dictionary does not contain a word, that word is an unregistered word and is difficult to segment correctly.
  • In the intelligent interactive system, some users, as when using a search engine, deliberately perform keyword queries, that is, they query with special characters such as spaces,
  • The main object of the present invention is to provide a method for identifying unregistered words in an intelligent interactive system that enriches the user dictionary, so that when sentences input by the user must be segmented based on the user dictionary, the word segmentation effect and the intelligence level of the intelligent interactive system are improved.
  • the present invention provides a method for identifying an unregistered word in an intelligent interactive system, and the method for identifying an unregistered word in the intelligent interactive system includes the following steps:
  • S60: determining whether the word input by the user is a word in a network entry; if yes, adding the word input by the user to the user dictionary as an unregistered word and deleting it from the user input word dictionary; otherwise, ignoring the word input by the user.
  • the method for identifying an unregistered word in the intelligent interaction system further includes the following steps:
  • S90 Establish a user dictionary in which commonly used words of a user-specific application domain are stored.
  • the step S10 includes:
  • the present invention also provides an apparatus for identifying an unregistered word in an intelligent interactive system, wherein the apparatus for identifying an unregistered word in the intelligent interactive system includes:
  • a first-level identification module configured to determine whether the length of the word input by the user is equal to 1 or greater than 4, and if yes, ignore the word input by the user;
  • a secondary identification module configured to determine, when the length of the word input by the user is greater than 1 and less than or equal to 4, whether the word input by the user already exists in a preset word segmentation dictionary or a user dictionary, and if so, ignore the word input by the user;
  • a three-level identification module configured to determine, when the word input by the user is not a word in the word segmentation dictionary or the user dictionary, whether the word input by the user is contained in a word of the word segmentation dictionary or the user dictionary, and if yes, ignore the word input by the user;
  • a user input word dictionary update module configured to add the word input by the user to the user input word dictionary as a possible unregistered word when the word input by the user is not contained in any word of the word segmentation dictionary or the user dictionary;
  • a four-level identification module configured to determine whether the word input by the user is a word in a network entry, and if yes, add the word input by the user to the user dictionary as an unregistered word and delete it from the user input word dictionary; otherwise, ignore the word input by the user.
  • the obtaining module is specifically configured to:
  • the device for identifying the unregistered word in the intelligent interaction system further includes:
  • a user input word dictionary word frequency statistics module configured to count the word frequency of each word in the user input word dictionary;
  • a user dictionary update module configured to, if the word frequency of a word in the user input word dictionary is greater than a preset value, add the word to the user dictionary as an unregistered word and delete it from the user input word dictionary.
  • the device for identifying the unregistered word in the intelligent interaction system further includes:
  • a user dictionary building module is configured to establish a user dictionary in which commonly used words of the user-specific application domain and the unregistered words are stored.
  • the device for identifying the unregistered word in the intelligent interaction system further includes:
  • a user input word dictionary building module configured to establish a user input word dictionary and store possible unregistered words in the user input word dictionary.
  • The technical solution of the present invention recognizes, level by level, whether the length of the word input by the user is equal to 1 or greater than 4, whether the word already exists in a preset word segmentation dictionary or the user dictionary, and whether it is contained in a word of the word segmentation dictionary or the user dictionary. Possible unregistered words that pass these filters are added to the user input word dictionary for temporary recording, and when a word input by the user is further recognized as a word in a network entry, it is added to the user dictionary and deleted from the user input word dictionary.
  • The embodiment of the present invention enriches the user dictionary by identifying the words input by the user level by level and adding the unregistered words to the user dictionary. When sentences input by the user need to be segmented based on the user dictionary, the word segmentation effect and the intelligence level of the intelligent interactive system can be improved.
  • FIG. 1 is a schematic flow chart of a first preferred embodiment of an unregistered word recognition method in an intelligent interactive system according to the present invention
  • FIG. 2 is a schematic flow chart of a second preferred embodiment of an unregistered word recognition method in the intelligent interactive system of the present invention
  • FIG. 3 is a schematic structural diagram of a first preferred embodiment of an unregistered word recognition apparatus in the intelligent interactive system of the present invention
  • FIG. 4 is a schematic structural diagram of a second preferred embodiment of an unregistered word recognition apparatus in the intelligent interactive system of the present invention.
  • Natural language processing is an important direction in the field of computer science and artificial intelligence.
  • words are the smallest language unit. Chinese does not have a specific mark between words, so it is necessary to perform Chinese word segmentation in advance when performing automatic processing.
  • the large number of unregistered words has become a technical bottleneck affecting the effect of Chinese word segmentation.
  • Unregistered Word Recognition is a process of automatically detecting and identifying words that have not appeared in the dictionary from the corpus. It is an important basic technology in the field of natural language processing, in Chinese automatic word segmentation, dictionary compilation, information extraction, information. There are a wide range of application requirements in the fields of search and machine translation.
  • The main object of the present invention is to provide a method for identifying unregistered words in an intelligent interactive system that enriches the user dictionary, so that when sentences input by the user must be segmented based on the user dictionary, the word segmentation effect and the intelligence level of the intelligent interactive system are improved.
  • the present invention provides a method for identifying an unregistered word in an intelligent interactive system.
  • FIG. 1 is a schematic flowchart diagram of a first preferred embodiment of an unregistered word recognition method in an intelligent interactive system according to the present invention.
  • the intelligent interaction system in the embodiment of the present invention includes a client and a server.
  • The client obtains the content input by the user, and the server processes that content and feeds back the results.
  • the method for identifying an unregistered word in the intelligent interactive system includes the following steps:
  • the embodiment of the present invention acquires a word input by a user through a client.
  • When the user inputs content from the input terminal, the user is accustomed to inputting the sentence word by word, because commonly used input methods, such as the Sogou Pinyin input method and the Baidu Pinyin input method, mostly have a memory function.
  • the word input by the user can be obtained by asynchronous transmission.
  • Asynchronous transmission means that whenever the user inputs a word, that word is transmitted to the server end of the intelligent interactive system as a word input by the user.
  • The statement is also transmitted to the server as a whole. That is, both the words and the statements entered by the user are transmitted asynchronously to the server.
  • Second-level recognition is performed on the word input by the user: it is determined whether the word already exists in the preset word segmentation dictionary or the user dictionary.
  • The preset word segmentation dictionary in the embodiment of the present invention is a Chinese word segmentation dictionary in the prior art; the user dictionary is a pre-established set of words specific to a certain application field, for example, in the health management application field, words such as watching movies, diet, and physiotherapy.
  • the user dictionary described in the embodiment of the present invention may also be empty, and added and enriched in the process of subsequent user input.
  • A person skilled in the art may use various methods for this check, for example, traversing the word segmentation dictionary entry by entry to search for the word input by the user, or pre-establishing an index and performing search matching based on the index. The method is not limited here, as long as it can be determined whether the word input by the user already exists in the preset word segmentation dictionary or the user dictionary. When the word input by the user already exists in the preset word segmentation dictionary or the user dictionary, the word is ignored; otherwise, third-level recognition is performed.
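  • As an illustration of the index-based lookup mentioned above, the following is a minimal Python sketch that treats each dictionary as a hash set, so that membership is a constant-time lookup on average. The function name `is_known_word` and the sample entries are assumptions made for this example; the application does not prescribe a data structure or dictionary contents.

```python
# Hypothetical sample dictionaries; real contents are not given in the application.
segmentation_dict = {"你好吗", "科技", "怎么"}
user_dict = {"食疗", "理疗"}


def is_known_word(word, segmentation_dict, user_dict):
    """Second-level check: True if the word already exists in either dictionary."""
    # A Python set acts as a hash index, so no entry-by-entry traversal is needed.
    return word in segmentation_dict or word in user_dict
```

  • A word for which this check returns True would be ignored; any other word would go on to third-level recognition.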
  • When the second-level recognition determines that the word input by the user does not exist in the preset word segmentation dictionary or the user dictionary, it is further determined whether the word input by the user is contained in a word of the word segmentation dictionary or the user dictionary.
  • Containment here means that the word input by the user is contained in its entirety within a word of the word segmentation dictionary or the user dictionary.
  • For example, if the word input by the user is “Hello” and a word of the word segmentation dictionary or the user dictionary is “Hello”, then “Hello” is contained in “Hello”.
  • If the word input by the user is “you are beautiful” and the word of the word segmentation dictionary or the user dictionary is “hello”, the word input by the user is considered not to be contained in a word of the word segmentation dictionary or the user dictionary.
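  • The containment test can be read as a substring check against every dictionary entry. The sketch below is one possible reading, not the applicants' implementation; the function name and the Chinese sample words (a two-character greeting contained in a three-character dictionary entry) are assumptions chosen for illustration.

```python
def is_contained_in_dictionary_word(word, *dictionaries):
    """Third-level check: True if `word` appears in its entirety inside some dictionary entry."""
    # For Python strings, `a in b` is a substring test.
    return any(word in entry for dictionary in dictionaries for entry in dictionary)


# Hypothetical sample data.
segmentation_dict = {"你好吗"}
user_dict = {"理疗"}
```

  • With this sample data, "你好" is contained in the entry "你好吗" and would be ignored, while "你真美" is not contained in any entry and would be kept as a possible unregistered word. A linear scan like this costs time proportional to the dictionary size; a production system would likely use a substring index, which the application leaves open.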
  • the word input by the user is added as a possible unregistered word to the user input word dictionary.
  • The user input word dictionary temporarily stores possible unregistered words, that is, words input by the user while entering a sentence that have passed the step-by-step identification above.
  • S60: determining whether the word input by the user is a word in a network entry; if yes, adding the word input by the user to the user dictionary as an unregistered word and deleting it from the user input word dictionary; otherwise, ignoring the word input by the user.
  • After the word input by the user is added to the user input word dictionary as a possible unregistered word, it is determined whether the word is a word in a network entry.
  • The network entry preferentially refers to an entry currently available in Baidu Encyclopedia. Baidu Encyclopedia adheres to the spirit of equality, collaboration, sharing and freedom. It advocates equality before the network. All people work together to write an encyclopedia, so that knowledge can be continuously combined and expanded under certain technical rules and cultural contexts.
  • Baidu Encyclopedia entries include the most popular new words at present, so unregistered words can be identified to the greatest extent. If the word is in a network entry, the word input by the user is added to the user dictionary as an unregistered word and deleted from the user input word dictionary; otherwise, the word input by the user is ignored.
  • Steps S10 to S60 are sequentially used to identify all words input by the user.
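  • The level-by-level flow of steps S10 to S60 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name, the return labels, the use of a count dictionary for the user input word dictionary, and the `is_network_entry` callback are all hypothetical. In particular, the application does not say how Baidu Encyclopedia is queried, so the lookup is passed in as a caller-supplied function rather than a real API call.

```python
from typing import Callable, Dict, Set


def identify_word(word: str,
                  segmentation_dict: Set[str],
                  user_dict: Set[str],
                  pending: Dict[str, int],
                  is_network_entry: Callable[[str], bool]) -> str:
    """Run one user-input word through the four recognition levels."""
    # Level 1: ignore single characters and words longer than 4 characters
    # (an empty string is also ignored here, which is an assumption).
    if len(word) <= 1 or len(word) > 4:
        return "ignored"
    # Level 2: ignore words already present in either dictionary.
    if word in segmentation_dict or word in user_dict:
        return "ignored"
    # Level 3: ignore words wholly contained in some dictionary entry.
    if any(word in entry for entry in segmentation_dict | user_dict):
        return "ignored"
    # Record as a possible unregistered word in the user input word dictionary.
    pending[word] = pending.get(word, 0) + 1
    # Level 4: promote the word if it is a network entry.
    if is_network_entry(word):
        user_dict.add(word)
        del pending[word]
        return "unregistered"
    # Not a network entry: no further action now; the word stays recorded.
    return "pending"
```

  • In this sketch a word that fails the network-entry check remains in `pending` with its count, which matches the later frequency-based step that operates on words that are often input but are not network entries.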
  • The server side matches the content to be returned from the preset database, using existing word segmentation, similarity calculation, and matching algorithms based on the word segmentation dictionary and the user dictionary.
  • The embodiment of the present invention recognizes, level by level, whether the length of the word input by the user is equal to 1 or greater than 4, whether the word already exists in the preset word segmentation dictionary or the user dictionary, and whether it is contained in a word of the word segmentation dictionary or the user dictionary.
  • Possible unregistered words that pass these filters are added to the user input word dictionary for temporary recording, and when a word input by the user is further recognized as a word in a network entry, it is added to the user dictionary and at the same time deleted from the user input word dictionary.
  • The embodiment of the present invention enriches the user dictionary by identifying the words input by the user level by level and adding the unregistered words to the user dictionary. When sentences input by the user need to be segmented based on the user dictionary, the word segmentation effect and the intelligence level of the intelligent interactive system can be improved.
  • the step S10 includes:
  • The content change of the text box is obtained as the user inputs content.
  • The user habitually enters a sentence word by word.
  • For example, the user enters “Excuse me” (or “Please”, “Ask”), “Technology Park”, “How”, “Go”. The client obtains each content change of the text box, for example, obtaining “Excuse me” first and taking it as the word input by the user; unregistered word identification is then performed on “Excuse me” according to the flow of the first preferred embodiment of the unregistered word recognition method in the intelligent interactive system of the present invention.
  • The server matches the content to be returned from the preset database, using existing word segmentation, similarity calculation, and matching algorithms based on the preset word segmentation dictionary and the updated user dictionary.
  • Because the user dictionary is updated while the sentence is being input, the preset word segmentation dictionary and the updated user dictionary can be used for segmentation after the user inputs the sentence, which further improves the word segmentation effect and the intelligence level of the intelligent interactive system.
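  • One way to derive the word input by the user from text-box changes is to compare successive snapshots of the box content and take the newly appended text. The Python sketch below assumes the change is a pure append at the end of the box; the function name and the snapshot sequence are hypothetical, and the application does not specify how the client detects changes.

```python
def appended_text(previous: str, current: str) -> str:
    """Return the text appended since the previous snapshot, or "" if the change was not a pure append."""
    if current.startswith(previous):
        return current[len(previous):]
    return ""


# Hypothetical snapshots of the text box as a user commits one word at a time.
snapshots = ["", "请问", "请问科技园", "请问科技园怎么", "请问科技园怎么走"]
words = [appended_text(a, b) for a, b in zip(snapshots, snapshots[1:])]
```

  • Each element of `words` would then be sent to the server as a word input by the user. Edits in the middle of the box or deletions are not handled by this sketch.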
  • FIG. 2 is a schematic flowchart diagram of a second preferred embodiment of an unregistered word recognition method in an intelligent interactive system according to the present invention.
  • the method for identifying an unregistered word in the intelligent interactive system further includes the following steps:
  • the user input word dictionary is used to temporarily store possible unregistered words that the user recognizes step by step through the above steps in the process of inputting a sentence.
  • The word frequency refers to the frequency at which a word appears in the user input word dictionary. These words are words that the user often inputs but that do not exist in a network entry.
  • The word frequency of each word in the user input word dictionary is counted, and words whose frequency is greater than the preset value (which may be common words not included in the preset word segmentation dictionary or the user dictionary) are added to the user dictionary, further enriching it, and are deleted from the user input word dictionary.
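  • The frequency-based promotion can be sketched as follows. The sketch assumes the user input word dictionary is kept as a mapping from word to count; the function name, the sample counts, and the threshold value are hypothetical, since the application only says the preset value is configurable.

```python
def promote_frequent_words(pending: dict, user_dict: set, threshold: int) -> list:
    """Move words whose count exceeds `threshold` from the user input word dictionary to the user dictionary."""
    # Collect first, then mutate, to avoid changing the dict while iterating over it.
    promoted = [word for word, count in pending.items() if count > threshold]
    for word in promoted:
        user_dict.add(word)
        del pending[word]
    return promoted


# Hypothetical counts accumulated during earlier input.
pending = {"怎么走": 5, "测一测": 1}
user_dict = {"理疗"}
promoted = promote_frequent_words(pending, user_dict, threshold=3)
```

  • The comparison is strictly greater than the threshold, following the wording "greater than a preset value".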
  • the method for identifying an unregistered word in the intelligent interaction system further includes the following steps:
  • S90 Establish a user dictionary in which commonly used words of a user-specific application domain are stored.
  • the user dictionary in the embodiment of the present invention is a set of words unique to the field that are pre-established in a certain application field, such as a health management application field, such as watching movies, diet therapy, physical therapy, and the like. After pre-establishment, the user dictionary can be added and enriched during subsequent user input.
  • the present invention also provides an apparatus for identifying an unregistered word in an intelligent interactive system.
  • FIG. 3 is a schematic structural diagram of a first preferred embodiment of an unregistered word recognition apparatus in an intelligent interactive system according to the present invention.
  • the device for identifying an unregistered word in the intelligent interaction system includes:
  • An obtaining module 10 configured to acquire a word input by a user
  • the obtaining module 10 acquires a word input by a user through a client.
  • When the user inputs content from the input terminal, the user is accustomed to inputting the sentence word by word, because commonly used input methods, such as the Sogou Pinyin input method and the Baidu Pinyin input method, mostly have a memory function.
  • the word input by the user can be obtained by asynchronous transmission.
  • Asynchronous transmission means that whenever the user inputs a word, that word is transmitted to the server end of the intelligent interactive system as a word input by the user.
  • The statement is also transmitted to the server as a whole. That is, the words entered by the user are asynchronously transmitted to the server.
  • the first-level identification module 20 is configured to determine whether the length of the word input by the user is equal to 1 or greater than 4, and if yes, ignore the word input by the user;
  • The first-level identification module 20 first performs first-level recognition: by calculating the length of the word input by the user, it determines whether the length is equal to 1 or greater than 4, that is, whether the input is a single character or a word of more than 4 characters. If so, the word input by the user is ignored, that is, single characters and words of more than 4 characters are filtered out; otherwise, second-level recognition is performed.
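  • The length test relies on counting characters rather than bytes. In Python, `len` on a `str` counts Unicode code points, so each Chinese character counts as 1. The helper name below is an assumption for illustration.

```python
def passes_length_filter(word: str) -> bool:
    """First-level check: keep only words whose length is greater than 1 and at most 4 characters."""
    return 1 < len(word) <= 4
```

  • Counting the UTF-8 encoded bytes instead would give 3 per common Chinese character and would wrongly reject most two-character words, so the check should be applied to decoded text.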
  • the secondary identification module 30 is configured to determine, when the length of the word input by the user is greater than 1 and less than or equal to 4, whether the word input by the user already exists in the preset word segmentation dictionary or the user dictionary, and if so, ignore the word input by the user;
  • The secondary identification module 30 performs second-level recognition on the word input by the user: it determines whether the word is a word in the preset word segmentation dictionary or the user dictionary.
  • The preset word segmentation dictionary in the embodiment of the present invention is a Chinese word segmentation dictionary in the prior art; the user dictionary is a pre-established set of words specific to a certain application field, for example, in the health management application field, words such as watching movies, diet, and physiotherapy.
  • the user dictionary described in the embodiment of the present invention may also be empty, and added and enriched in the process of subsequent user input.
  • A person skilled in the art may use various methods for this check, for example, traversing the word segmentation dictionary entry by entry to search for the word input by the user, or pre-establishing an index and performing search matching based on the index. The method is not limited here, as long as it can be determined whether the word input by the user already exists in the preset word segmentation dictionary or the user dictionary. When the word input by the user already exists in the preset word segmentation dictionary or the user dictionary, the word is ignored; otherwise, third-level recognition is performed.
  • the three-level identification module 40 is configured to determine, when the word input by the user is not a word in the word segmentation dictionary or the user dictionary, whether the word input by the user is contained in a word of the word segmentation dictionary or the user dictionary, and if yes, ignore the word input by the user;
  • The three-level identification module 40 further determines whether the word input by the user is contained in a word of the word segmentation dictionary or the user dictionary. Containment here means that the word input by the user is contained in its entirety within a word of the word segmentation dictionary or the user dictionary.
  • For example, if the word input by the user is “Hello” and a word of the word segmentation dictionary or the user dictionary is “Hello”, then “Hello” is contained in “Hello”.
  • If the word input by the user is “you are beautiful” and the word of the word segmentation dictionary or the user dictionary is “hello”, the word input by the user is considered not to be contained in a word of the word segmentation dictionary or the user dictionary.
  • the user input word dictionary update module 50 is configured to: when the three-level identification module 40 determines that the word input by the user is not contained in any word of the word segmentation dictionary or the user dictionary, add the word input by the user to the user input word dictionary as a possible unregistered word;
  • The user input word dictionary update module 50 adds the word input by the user to the user input word dictionary as a possible unregistered word.
  • The user input word dictionary temporarily stores possible unregistered words, that is, words input by the user while entering a sentence that have passed the step-by-step identification above.
  • a four-level identification module 60 configured to, when the word input by the user is a word in a network entry, add the word input by the user to the user dictionary as an unregistered word and delete it from the user input word dictionary; otherwise, the word input by the user is ignored.
  • After the user input word dictionary update module 50 adds the word input by the user to the user input word dictionary as a possible unregistered word, the four-level identification module 60 determines whether the word input by the user is a word in a network entry. The network entry preferentially refers to an entry currently available in Baidu Encyclopedia.
  • Baidu Encyclopedia adheres to the spirit of equality, collaboration, sharing and freedom. It advocates equality before the network. All people work together to write an encyclopedia, so that knowledge can be continuously combined and expanded under certain technical rules and cultural contexts.
  • Baidu Encyclopedia entries include the most popular new words at present, so unregistered words can be identified to the greatest extent. If the word is in a network entry, the word input by the user is added to the user dictionary and deleted from the user input word dictionary; otherwise, the word input by the user is ignored.
  • The server side matches the content to be returned from the preset database, using existing word segmentation, similarity calculation, and matching algorithms based on the word segmentation dictionary and the user dictionary. Since the unregistered words are added to the user dictionary, when sentences input by the user need to be segmented based on the user dictionary, the word segmentation effect and the intelligence level of the intelligent interactive system can be improved.
  • The embodiment of the present invention recognizes, level by level, whether the length of the word input by the user is equal to 1 or greater than 4, whether the word already exists in the preset word segmentation dictionary or the user dictionary, and whether it is contained in a word of the word segmentation dictionary or the user dictionary.
  • Possible unregistered words that pass these filters are added to the user input word dictionary for temporary recording, and when a word input by the user is further recognized as a word in a network entry, it is added to the user dictionary and at the same time deleted from the user input word dictionary.
  • The embodiment of the present invention enriches the user dictionary by identifying the words input by the user level by level and adding the unregistered words to the user dictionary. When sentences input by the user need to be segmented based on the user dictionary, the word segmentation effect and the intelligence level of the intelligent interactive system can be improved.
  • the acquiring module is specifically configured to:
  • The content change of the text box is obtained as the user inputs content.
  • The user habitually enters a sentence word by word.
  • For example, the user enters “Excuse me” (or “Please”, “Ask”), “Technology Park”, “How”, “Go”. The client obtains each content change of the text box, for example, obtaining “Excuse me” first and taking it as the word input by the user; unregistered word identification is then performed on “Excuse me” according to the flow of the first preferred embodiment of the unregistered word recognition method in the intelligent interactive system of the present invention.
  • The server matches the content to be returned from the preset database, using existing word segmentation, similarity calculation, and matching algorithms based on the preset word segmentation dictionary and the updated user dictionary.
  • Because the user dictionary is updated while the sentence is being input, the preset word segmentation dictionary and the updated user dictionary can be used for segmentation after the user inputs the sentence, which further improves the word segmentation effect and the intelligence level of the intelligent interactive system.
  • FIG. 4 is a schematic structural diagram of a second preferred embodiment of an unregistered word recognition apparatus in an intelligent interactive system according to the present invention.
  • the device for identifying the unregistered word in the intelligent interactive system further includes:
  • the user input word dictionary word frequency statistics module 70 is configured to count the word frequency of each word in the user input word dictionary;
  • the user dictionary update module 80 is configured to, if the word frequency of a word in the user input word dictionary is greater than a preset value, add the word to the user dictionary as an unregistered word and delete it from the user input word dictionary.
  • The user input word dictionary temporarily stores possible unregistered words, that is, words input by the user while entering a sentence that have passed the step-by-step identification above. These words are words that the user often inputs but that do not exist in a network entry.
  • The word frequency of each word in the user input word dictionary is counted, and words whose frequency is greater than the preset value (which may be common words not included in the preset word segmentation dictionary or the user dictionary) are added to the user dictionary, further enriching it, and are deleted from the user input word dictionary.
  • the device for identifying an unregistered word in the intelligent interactive system further includes:
  • a user dictionary building module configured to establish a user dictionary in which commonly used words of a user-specific application domain are stored.
  • the user dictionary in the embodiment of the present invention is a set of words unique to the field that are pre-established in a certain application field, such as a health management application field, such as watching movies, diet therapy, physical therapy, and the like. After pre-establishment, the user dictionary can be added and enriched during subsequent user input.
  • the device for identifying an unregistered word in the intelligent interactive system further includes:
  • a user input word dictionary building module configured to establish a user input word dictionary and store possible unregistered words input by the user while inputting sentences.
  • The content it mainly stores is described in connection with the user input word dictionary update module.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A method for recognizing an unlogged word in an intelligent interaction system. The method comprises: recognizing, level by level, whether the length of a word input by a user is equal to 1 or greater than 4, whether the word exists in a pre-set word segmentation dictionary or a user dictionary, and whether the word is contained in a word of the word segmentation dictionary or the user dictionary, so as to screen out a possible unlogged word; adding the possible unlogged word to a user input word dictionary as a temporary record; and, when the word is further recognized as a word in a network entry, adding it to the user dictionary and simultaneously deleting it from the user input word dictionary. By recognizing the words input by the user level by level, possible unlogged words are added to the user dictionary, which enriches the user dictionary; when a sentence input by the user is segmented on the basis of the user dictionary, the word segmentation effect and the intelligence level of the intelligent interaction system can be improved.

Description

Method and device for identifying unregistered words in an intelligent interactive system
Technical Field
The present invention relates to the technical field of computer science, and in particular to a method and a device for identifying unregistered words in an intelligent interactive system.
Background
In an intelligent interactive system, both indexing questions and calculating the similarity between a user's question and the questions in a question-and-answer library require that the sentence first be segmented into words. However, because some sentences contain unregistered words, existing word segmentation performs poorly, which in turn degrades the subsequent sentence similarity calculation and makes the intelligent interactive system less intelligent.
In the prior art, the quality of word segmentation depends on the word segmentation algorithm and the word segmentation dictionary. Word segmentation algorithms already achieve good results and are hard to improve substantially, whereas the completeness of the word segmentation dictionary directly affects segmentation quality: if the dictionary does not contain a word, that is, if an unregistered word appears, the word is difficult to segment correctly.
In intelligent interactive systems, some users deliberately perform keyword queries when using a search engine, separating terms with special characters such as spaces, |, and quotation marks. The search engine can identify new words from users' query records and thereby expand the user dictionary, enabling faster and more accurate queries. In a question-and-answer system, however, users are accustomed to querying with continuous sentences, so the same method cannot be used to identify unregistered words.
It is therefore necessary to provide a method and a device for identifying unregistered words in an intelligent interactive system, so as to enrich the user dictionary; when sentences input by the user need to be segmented based on the user dictionary, the segmentation quality can be improved and the intelligent interactive system made more intelligent.
Summary of the Invention
The main object of the present invention is to provide a method for identifying unregistered words in an intelligent interactive system, so as to enrich the user dictionary; when sentences input by the user need to be segmented based on the user dictionary, the segmentation quality can be improved and the intelligent interactive system made more intelligent.
To achieve the above object, the present invention provides a method for identifying unregistered words in an intelligent interactive system, the method comprising the following steps:
S10: obtain a word input by the user;
S20: determine whether the length of the word input by the user is equal to 1 or greater than 4; if so, ignore the word; otherwise, execute S30;
S30: determine whether the word input by the user exists in a preset word segmentation dictionary or in the user dictionary; if so, ignore the word; otherwise, execute S40;
S40: determine whether the word input by the user is contained in some word of the word segmentation dictionary or the user dictionary; if so, ignore the word; otherwise, execute S50;
S50: add the word input by the user to the user input word dictionary as a possible unregistered word;
S60: determine whether the word input by the user is a word in the network entries; if so, add the word to the user dictionary as an unregistered word and delete it from the user input word dictionary; otherwise, ignore the word.
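Steps S10 to S60 can be sketched in Python as a single screening function. This is a minimal sketch, not the patent's implementation: the names `process_word` and `contains_as_substring` and the returned labels are hypothetical, the dictionaries are modeled as in-memory sets, and the network-entry check is a caller-supplied function because the source does not specify how that lookup is performed.

```python
from typing import Callable, Set


def contains_as_substring(word: str, dictionary: Set[str]) -> bool:
    """S40: True if `word` occurs whole (contiguously) inside some entry."""
    return any(word in entry for entry in dictionary)


def process_word(
    word: str,
    seg_dict: Set[str],
    user_dict: Set[str],
    candidate_dict: Set[str],
    is_network_entry: Callable[[str], bool],
) -> str:
    """Screen one input word through S20-S60; return a hypothetical outcome label."""
    # S20: ignore single-character words and words longer than four characters.
    if len(word) == 1 or len(word) > 4:
        return "ignored:length"
    # S30: ignore words already present in either dictionary.
    if word in seg_dict or word in user_dict:
        return "ignored:known"
    # S40: ignore words contained in a longer dictionary entry.
    if contains_as_substring(word, seg_dict | user_dict):
        return "ignored:contained"
    # S50: record the word as a possible unregistered word.
    candidate_dict.add(word)
    # S60: promote it if it is a network entry (the lookup is caller-supplied).
    if is_network_entry(word):
        user_dict.add(word)
        candidate_dict.discard(word)
        return "added:user_dict"
    return "kept:candidate"
```

In this sketch a word that fails the S60 check stays in the candidate set, so that a later frequency count can still promote it.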
Preferably, the method for identifying unregistered words in the intelligent interactive system further includes the following step:
S90: establish a user dictionary, and store in the user dictionary common words of the user's specific application domain.
Preferably, the method for identifying unregistered words in the intelligent interactive system further includes the following step:
S100: establish a user input word dictionary, and store possible unregistered words in the user input word dictionary.
Preferably, step S10 includes:
S11: obtain the changed content of the text box as the user types;
S12: take the changed content of the text box as the word input by the user.
Preferably, the method for identifying unregistered words in the intelligent interactive system further includes the following steps:
S70: count the word frequency of each word in the user input word dictionary;
S80: if the word frequency of a word in the user input word dictionary is greater than a preset value, add the word to the user dictionary as an unregistered word and delete the word from the user input word dictionary.
In addition, the present invention provides a device for identifying unregistered words in an intelligent interactive system, the device comprising:
an acquisition module, configured to obtain a word input by the user;
a first-level identification module, configured to determine whether the length of the word input by the user is equal to 1 or greater than 4 and, if so, to ignore the word;
a second-level identification module, configured to determine, when the length of the word input by the user is greater than 1 and less than or equal to 4, whether the word exists in a preset word segmentation dictionary or in the user dictionary and, if so, to ignore the word;
a third-level identification module, configured to determine, when the word input by the user does not exist in the word segmentation dictionary or the user dictionary, whether the word is contained in some word of the word segmentation dictionary or the user dictionary and, if so, to ignore the word;
a user input word dictionary update module, configured to add the word input by the user to the user input word dictionary as a possible unregistered word when the word is not contained in any word of the word segmentation dictionary or the user dictionary;
a fourth-level identification module, configured to, when the word input by the user is a word in the network entries, add the word to the user dictionary as an unregistered word and delete it from the user input word dictionary, and otherwise to ignore the word.
Preferably, the acquisition module is specifically configured to:
obtain the changed content of the text box as the user types, and take the changed content of the text box as the word input by the user.
Preferably, the device for identifying unregistered words in the intelligent interactive system further includes:
a user input word dictionary word frequency statistics module, configured to count the word frequency of each word in the user input word dictionary;
a user dictionary update module, configured to, if the word frequency of a word in the user input word dictionary is greater than a preset value, add the word to the user dictionary as an unregistered word and delete the word from the user input word dictionary.
Preferably, the device for identifying unregistered words in the intelligent interactive system further includes:
a user dictionary building module, configured to establish a user dictionary and to store in it common words of the user's specific application domain and the unregistered words.
Preferably, the device for identifying unregistered words in the intelligent interactive system further includes:
a user input word dictionary building module, configured to establish a user input word dictionary and to store possible unregistered words in it.
With the above technical solution, the present invention achieves the following technical effect: by identifying level by level whether the length of a word input by the user is equal to 1 or greater than 4, whether the word exists in the preset word segmentation dictionary or the user dictionary, and whether the word is contained in some word of the word segmentation dictionary or the user dictionary, possible unregistered words are screened out and added to the user input word dictionary as temporary records; when a word input by the user is further identified as a word in the network entries, it is added to the user dictionary and deleted from the user input word dictionary. By identifying the words input by the user level by level, the embodiments of the present invention add possible unregistered words to the user dictionary and thereby enrich it; when sentences input by the user need to be segmented based on the user dictionary, the segmentation quality can be improved and the intelligent interactive system made more intelligent.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a first preferred embodiment of the method for identifying unregistered words in an intelligent interactive system according to the present invention;
FIG. 2 is a schematic flowchart of a second preferred embodiment of the method for identifying unregistered words in an intelligent interactive system according to the present invention;
FIG. 3 is a schematic structural diagram of a first preferred embodiment of the device for identifying unregistered words in an intelligent interactive system according to the present invention;
FIG. 4 is a schematic structural diagram of a second preferred embodiment of the device for identifying unregistered words in an intelligent interactive system according to the present invention.
The realization of the object, the functional features, and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are intended only to explain the present invention and not to limit it.
Natural language processing is an important direction in computer science and artificial intelligence. In natural language processing, the word is the smallest unit of language. Chinese has no explicit marker between words, so Chinese word segmentation must be performed before automatic processing. The large number of unregistered words has become a technical bottleneck for Chinese word segmentation. Unregistered word identification (UWI) is the process of automatically detecting and identifying, from a corpus, words that do not appear in the dictionary. It is an important basic technology in natural language processing and is widely needed in Chinese automatic word segmentation, dictionary compilation, information extraction, information retrieval, machine translation, and other fields.
The main object of the present invention is to provide a method for identifying unregistered words in an intelligent interactive system, so as to enrich the user dictionary; when sentences input by the user need to be segmented based on the user dictionary, the segmentation quality can be improved and the intelligent interactive system made more intelligent.
To achieve the above object, the present invention provides a method for identifying unregistered words in an intelligent interactive system.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a first preferred embodiment of the method for identifying unregistered words in an intelligent interactive system according to the present invention.
To better explain the embodiments of the present invention, an intelligent interactive system is taken as an example. The intelligent interactive system in the embodiments includes a client and a server: the client obtains the content input by the user, and the server processes that content and returns the result.
In one embodiment, as shown in FIG. 1, the method for identifying unregistered words in the intelligent interactive system includes the following steps:
S10: obtain a word input by the user;
In the embodiments of the present invention, the word input by the user is obtained through the client. Because most common input methods, such as Sogou Pinyin and Baidu Pinyin, have a memory function, users are accustomed to entering a sentence word by word. While the user is entering the sentence, the words input by the user can be obtained by asynchronous transmission. Asynchronous transmission here means that as soon as the user finishes entering a word, that word is transmitted to the server of the intelligent interactive system as the word input by the user, and when the user finishes entering the sentence, the sentence is transmitted to the server as a whole. In other words, the words and the sentence input by the user are transmitted to the server asynchronously.
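The word-then-sentence transmission described for S10 can be illustrated with a small in-process simulation. This is a sketch under stated assumptions: the source does not name a transport or message format, so the `asyncio.Queue`, the `("word", ...)` and `("sentence", ...)` message tuples, the `"done"` sentinel, and the function names are all hypothetical stand-ins for a real client and server.

```python
import asyncio
from typing import List, Tuple

Message = Tuple[str, str]  # (kind, text); kind is "word", "sentence", or "done"


async def client(words: List[str], outbox: asyncio.Queue) -> None:
    """Send each word as soon as it is entered, then the whole sentence."""
    for word in words:
        await outbox.put(("word", word))
    await outbox.put(("sentence", "".join(words)))
    await outbox.put(("done", ""))  # hypothetical end-of-session marker


async def server(inbox: asyncio.Queue, received: List[Message]) -> None:
    """Collect messages in arrival order until the end marker is seen."""
    while True:
        kind, text = await inbox.get()
        if kind == "done":
            break
        received.append((kind, text))


def simulate(words: List[str]) -> List[Message]:
    """Run client and server concurrently and return what the server received."""
    async def main() -> List[Message]:
        queue: asyncio.Queue = asyncio.Queue()
        received: List[Message] = []
        await asyncio.gather(client(words, queue), server(queue, received))
        return received

    return asyncio.run(main())
```

The server sees every word before the sentence, which is the property the later steps rely on: each word can be screened while the user is still typing.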
S20: determine whether the length of the word input by the user is equal to 1 or greater than 4; if so, ignore the word; otherwise, execute S30;
After the word input by the user is obtained, first-level identification is performed first: the length of the word is calculated to determine whether it is equal to 1 or greater than 4, that is, whether the word is a single-character word or a word of more than four characters. If so, the word is ignored, that is, single-character words and words of more than four characters are filtered out; otherwise, second-level identification is performed.
S30: determine whether the word input by the user exists in a preset word segmentation dictionary or in the user dictionary; if so, ignore the word; otherwise, execute S40;
When the length of the word input by the user is determined to be greater than 1 and less than or equal to 4, second-level identification is performed on the word to determine whether it exists in the preset word segmentation dictionary or in the user dictionary. The preset word segmentation dictionary in the embodiments of the present invention is a Chinese word segmentation dictionary from the prior art. The user dictionary is a pre-established collection of words specific to an application domain, for example, in the health management domain, 看片 (film reading), 食疗 (diet therapy), and 理疗 (physiotherapy). The user dictionary may also start empty and be added to and enriched as the user subsequently enters input. Those skilled in the art may use various methods, for example traversing the word segmentation dictionary word by word to find a match for the word input by the user, or building in advance an index keyed on the word input by the user and then searching that index; no limitation is imposed here, provided that the method can determine whether the word exists in the preset word segmentation dictionary or the user dictionary. If the word already exists in either dictionary, it is ignored; otherwise, third-level identification is performed.
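The two lookup strategies mentioned for second-level identification, word-by-word traversal and a prebuilt index, can be compared in a short sketch. The function names are hypothetical, and the index is assumed here to be a hash set built over both dictionaries; the source leaves the index structure open.

```python
from typing import Iterable, Set


def found_by_traversal(word: str, entries: Iterable[str]) -> bool:
    """Compare the word with every entry in turn (linear in dictionary size)."""
    for entry in entries:
        if entry == word:
            return True
    return False


def build_index(*dictionaries: Iterable[str]) -> Set[str]:
    """Merge the given dictionaries into one hash-based index (assumed design)."""
    index: Set[str] = set()
    for dictionary in dictionaries:
        index.update(dictionary)
    return index


def found_by_index(word: str, index: Set[str]) -> bool:
    """Constant-time average lookup once the index has been built."""
    return word in index
```

Both functions give the same answer; the index trades a one-time build cost for faster lookups, which matters because S30 runs for every word the user types.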
S40: determine whether the word input by the user is contained in some word of the word segmentation dictionary or the user dictionary; if so, ignore the word; otherwise, execute S50;
When second-level identification determines that the word input by the user does not exist in the preset word segmentation dictionary or the user dictionary, it is further determined whether the word is contained in some word of the word segmentation dictionary or the user dictionary. Containment here means that the word input by the user is contained as a whole in some word of either dictionary. For example, if the word input by the user is "你好" and a word in the word segmentation dictionary or the user dictionary is "你好美", then "你好" is contained in "你好美"; a word similar to "你好" is considered to exist already in the dictionary, and the word input by the user is ignored. If the word input by the user is "你美" and the dictionary word is "你好美", the word input by the user is considered not to be contained in any word of the word segmentation dictionary or the user dictionary.
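Because only words of two to four characters reach third-level identification, the containment test can be answered from a precomputed set of all contiguous substrings of those lengths. The sketch below assumes this design for illustration only; the source does not prescribe how containment is checked, and `build_substring_index` is a hypothetical name.

```python
from typing import Iterable, Set


def build_substring_index(
    dictionary: Iterable[str], min_len: int = 2, max_len: int = 4
) -> Set[str]:
    """Collect every contiguous substring of length min_len..max_len of each entry."""
    index: Set[str] = set()
    for entry in dictionary:
        for length in range(min_len, max_len + 1):
            for start in range(len(entry) - length + 1):
                index.add(entry[start:start + length])
    return index


def is_contained(word: str, substring_index: Set[str]) -> bool:
    """True if the word occurs whole inside some dictionary entry."""
    return word in substring_index
```

With the single entry "你好美", the index holds "你好", "好美", and "你好美", so "你好" is reported as contained while "你美", whose characters are not adjacent in the entry, is not.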
S50: add the word input by the user to the user input word dictionary as a possible unregistered word;
When third-level identification determines that the word input by the user is not contained in any word of the word segmentation dictionary or the user dictionary, the word is added to the user input word dictionary as a possible unregistered word. The user input word dictionary is used to temporarily store words that the user entered but then deleted while typing a sentence and that have been identified, level by level through the above steps, as possible unregistered words.
S60: determine whether the word input by the user is a word in the network entries; if so, add the word to the user dictionary as an unregistered word and delete it from the user input word dictionary; otherwise, ignore the word.
When third-level identification determines that the word input by the user is not contained in any word of the word segmentation dictionary or the user dictionary, the word is added to the user input word dictionary as a possible unregistered word, and at the same time it is determined whether the word is a word in the network entries. The network entries preferably refer to the entries currently provided by Baidu Baike. Baidu Baike, in the Internet spirit of equality, collaboration, sharing, and freedom, advocates that everyone is equal online and that everyone collaborates in writing the encyclopedia, so that knowledge is continually combined and extended under certain technical rules and cultural contexts. It provides users with a creative online platform, emphasizes user participation and contribution, draws on the efforts of all Internet users and the collective knowledge of hundreds of millions of users for active exchange and sharing, and integrates with search engines to meet users' information needs at different levels. The Baidu Baike entries therefore include the currently most popular new words, and using them as the basis for identification allows unregistered words to be identified to the greatest extent. If the word is a word in the network entries, it is added to the user dictionary as an unregistered word and deleted from the user input word dictionary; otherwise, the word is ignored.
All words input by the user are identified in turn through steps S10 to S60 above. After the user finishes entering the sentence, the server, based on the word segmentation dictionary and the user dictionary, uses existing word segmentation, similarity calculation, and matching algorithms to match the content to be returned from a preset database. Because unregistered words have been added to the user dictionary, the segmentation quality improves when sentences input by the user are segmented based on the user dictionary, making the intelligent interactive system more intelligent.
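To see why an enriched user dictionary improves segmentation, consider a toy segmenter. The source only refers to existing word segmentation algorithms without naming one, so forward maximum matching is used here purely as a well-known stand-in; `forward_max_match` and the sample dictionary are assumptions for illustration.

```python
from typing import List, Set


def forward_max_match(sentence: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    tokens: List[str] = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; a single character is always accepted.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens


base_dictionary = {"请问", "科技", "怎么"}
sentence = "请问科技园怎么走"

# Without the unregistered word, "科技园" is split incorrectly.
before = forward_max_match(sentence, base_dictionary)
# After "科技园" has been added to the user dictionary, it is kept whole.
after = forward_max_match(sentence, base_dictionary | {"科技园"})
```

Here `before` is `["请问", "科技", "园", "怎么", "走"]` and `after` is `["请问", "科技园", "怎么", "走"]`.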
By identifying level by level whether the length of a word input by the user is equal to 1 or greater than 4, whether the word exists in the preset word segmentation dictionary or the user dictionary, and whether the word is contained in some word of the word segmentation dictionary or the user dictionary, the embodiments of the present invention screen out possible unregistered words and add them to the user input word dictionary as temporary records; when a word input by the user is further identified as a word in the network entries, it is added to the user dictionary and deleted from the user input word dictionary. By identifying the words input by the user level by level, the embodiments add possible unregistered words to the user dictionary and thereby enrich it; when sentences input by the user need to be segmented based on the user dictionary, the segmentation quality can be improved and the intelligent interactive system made more intelligent.
As a preferred implementation, step S10 includes:
S11: obtain the changed content of the text box as the user types;
S12: take the changed content of the text box as the word input by the user.
When the user enters a sentence in the client's text box, the changed content of the text box is obtained as the user types, following the user's input habits. For example, when entering the sentence "请问科技园怎么走?" ("Excuse me, how do I get to the Science Park?"), the user will habitually enter into the text box, one piece at a time, "请问" (or "请" and then "问"), "科技园", "怎么", and "走". The client obtains each change to the text box content; for example, it first obtains "请问", takes "请问" as the word input by the user, and identifies level by level, as in the first preferred embodiment of the method, whether "请问" is an unregistered word. Once the user has finished entering the sentence, the sentence is transmitted to the server, which, based on the preset word segmentation dictionary and the updated user dictionary, uses existing word segmentation, similarity calculation, and matching algorithms to match the content to be returned from a preset database. Taking the changed content of the text box as the word input by the user achieves asynchronous transmission of the user's words and sentences, so that after the user finishes the sentence it can be segmented based on the preset word segmentation dictionary and the updated user dictionary, further improving segmentation quality and making the intelligent interactive system more intelligent.
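Steps S11 and S12 amount to diffing successive text box values. The sketch below is one way to do that, assuming the client can observe a snapshot of the text box after each input event; `changed_content` and `words_from_snapshots` are hypothetical names, and the common-prefix and common-suffix comparison is an illustrative choice rather than the method stated in the source.

```python
from typing import Iterable, Iterator


def changed_content(previous: str, current: str) -> str:
    """Return the text inserted between two snapshots (empty if nothing was added)."""
    limit = min(len(previous), len(current))
    # Length of the common prefix.
    prefix = 0
    while prefix < limit and previous[prefix] == current[prefix]:
        prefix += 1
    # Length of the common suffix, not overlapping the prefix.
    suffix = 0
    while suffix < limit - prefix and previous[-1 - suffix] == current[-1 - suffix]:
        suffix += 1
    return current[prefix:len(current) - suffix]


def words_from_snapshots(snapshots: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty change between consecutive text box snapshots."""
    previous = ""
    for current in snapshots:
        change = changed_content(previous, current)
        if change:
            yield change
        previous = current
```

Feeding in the snapshots "请问", "请问科技园", "请问科技园怎么", and "请问科技园怎么走" yields "请问", "科技园", "怎么", and "走" in turn; a snapshot that only deletes text yields nothing.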
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a second preferred embodiment of the unregistered word recognition method in the intelligent interactive system of the present invention.
In an embodiment, as shown in FIG. 2, based on the first preferred embodiment of the unregistered word recognition method shown in FIG. 1, the method for identifying unregistered words in the intelligent interactive system further includes the following steps:
S70: count the word frequency of each word in the user input word dictionary;
S80: if the word frequency of a word in the user input word dictionary is greater than a preset value, add that word to the user dictionary as an unregistered word and delete it from the user input word dictionary.
The user input word dictionary temporarily stores the possible unregistered words that are identified level by level through the above steps while the user enters sentences. Word frequency refers to how often a word appears in the user input word dictionary. The words in question are ones that the user enters often but that do not appear among the network entries. The word frequency of each word in the user input word dictionary is counted, and any word whose frequency is greater than the preset value (likely a common word that is in neither the preset word segmentation dictionary nor the user dictionary) is added to the user dictionary, further enriching it, and is deleted from the user input word dictionary.
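Steps S70 and S80 can be sketched as follows, under the assumption (not stated in the source) that the user input word dictionary is held as a `collections.Counter` mapping each candidate word to the number of times it has been recorded. The function name `promote_frequent_words` and the threshold value are illustrative only.

```python
from collections import Counter


def promote_frequent_words(candidates: Counter, user_dict: set, threshold: int) -> list:
    """Move candidates whose count exceeds the threshold into the user dictionary.

    S70: the counts held in `candidates` are the word frequencies.
    S80: a word with count > threshold is added to `user_dict` and
    removed from `candidates`.
    """
    promoted = [word for word, count in candidates.items() if count > threshold]
    for word in promoted:
        user_dict.add(word)
        del candidates[word]
    return promoted


# Hypothetical state: two candidate words with different frequencies.
candidates = Counter({"食疗": 5, "看片": 2})
user_dict = set()
promoted = promote_frequent_words(candidates, user_dict, threshold=3)
```

The list of promoted words is built before anything is deleted so that the counter is not modified while it is being iterated.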
In an embodiment, the method for identifying unregistered words in the intelligent interactive system further includes the following step:
S90: establish a user dictionary, and store in the user dictionary the common words of the user's specific application field.
In the embodiments of the present invention, the user dictionary is a pre-established set of words specific to a given application field, for example the health management field, such as “看片” (reading medical films), “食疗” (diet therapy), and “理疗” (physiotherapy). After it has been established, the user dictionary can be extended and enriched during subsequent user input.
To achieve the above object, the present invention further provides a device for identifying unregistered words in an intelligent interactive system.
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of a first preferred embodiment of the unregistered word recognition device in the intelligent interactive system of the present invention.
In an embodiment, as shown in FIG. 3, the device for identifying unregistered words in the intelligent interactive system includes:
an acquisition module 10, configured to obtain a word input by the user;
Specifically, the acquisition module 10 obtains the word input by the user through the client. Because most commonly used input methods, such as Sogou Pinyin and Baidu Pinyin, have a memory function, users are accustomed to entering a sentence word by word. While the user is entering a sentence, the words entered can be obtained through asynchronous transmission. Asynchronous transmission here means that as soon as the user finishes entering a word, that word is transmitted to the server of the intelligent interactive system as a word input by the user; when the user submits the question, the sentence is transmitted to the server as a whole. In other words, the words input by the user are transmitted to the server asynchronously.
a level-one recognition module 20, configured to determine whether the length of the word input by the user is equal to 1 or greater than 4 and, if so, to ignore the word;
Specifically, after the acquisition module 10 has obtained the word input by the user, the level-one recognition module 20 performs the first level of recognition. It computes the length of the word and determines whether that length is equal to 1 or greater than 4, that is, whether the word is a single-character word or a word of more than four characters. If so, the word is ignored, which filters out single-character words and words longer than four characters; otherwise the second level of recognition is performed.
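The first-level filter amounts to a single length test. The sketch below is illustrative; the function name `passes_length_filter` is hypothetical, and it assumes that word length is measured in characters and that each Chinese character corresponds to one Python `str` element (true for characters in the Basic Multilingual Plane).

```python
def passes_length_filter(word: str) -> bool:
    """Return True when the word should go on to second-level recognition.

    Words of length 1 or of length greater than 4 are ignored, so only
    lengths 2, 3, and 4 pass. An empty string is also rejected.
    """
    return 1 < len(word) <= 4


results = {w: passes_length_filter(w) for w in ["走", "你好", "科技园", "智能交互系统"]}
```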
a level-two recognition module 30, configured to determine, when the length of the word input by the user is greater than 1 and less than or equal to 4, whether the word is an entry in the preset word segmentation dictionary or in the user dictionary and, if so, to ignore the word;
Specifically, when the level-one recognition module 20 determines that the length of the word input by the user is greater than 1 and less than or equal to 4, the level-two recognition module 30 performs the second level of recognition and determines whether the word is an entry in the preset word segmentation dictionary or in the user dictionary. The preset word segmentation dictionary in the embodiments of the present invention is a Chinese word segmentation dictionary from the prior art. The user dictionary is a pre-established set of words specific to a given application field, for example the health management field, such as “看片” (reading medical films), “食疗” (diet therapy), and “理疗” (physiotherapy). The user dictionary may also start out empty and be extended and enriched during subsequent user input. Those skilled in the art may implement the check in various ways, for example by traversing the word segmentation dictionary entry by entry and comparing each entry with the word input by the user, or by building an index in advance and performing the lookup against that index. No limitation is imposed here, as long as it can be determined whether the word input by the user is an entry in the preset word segmentation dictionary or in the user dictionary. If the word already exists in either dictionary, it is ignored; otherwise the third level of recognition is performed.
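The two lookup strategies mentioned above, entry-by-entry traversal and a prebuilt index, can be contrasted in a short sketch. The source does not prescribe a data structure; here a `frozenset` of dictionary entries stands in for the index as an assumption, and all function names are hypothetical.

```python
def found_by_traversal(word, entries):
    """Compare the word with every entry in turn (linear in dictionary size)."""
    for entry in entries:
        if entry == word:
            return True
    return False


def build_index(entries):
    """Build a hash-based index once so that later lookups take constant time on average."""
    return frozenset(entries)


def is_existing_word(word, segmentation_index, user_index):
    """Second-level check: is the word an entry in either dictionary?"""
    return word in segmentation_index or word in user_index


segmentation_index = build_index(["请问", "怎么"])
user_index = build_index(["食疗"])
```

Both strategies give the same answer; the index only changes the cost of each lookup, which matters when every word the user types triggers a check.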
a level-three recognition module 40, configured to determine, when the word input by the user is not an entry in the word segmentation dictionary or the user dictionary, whether the word is contained in some entry of the word segmentation dictionary or the user dictionary and, if so, to ignore the word;
Specifically, when the level-two recognition module 30 determines that the word input by the user does not exist in the preset word segmentation dictionary or the user dictionary, the level-three recognition module 40 further determines whether the word is contained in some entry of the word segmentation dictionary or the user dictionary. Containment here means that the word input by the user appears as a whole inside an entry of either dictionary. For example, if the word input by the user is “你好” ("hello") and one entry of the word segmentation dictionary or the user dictionary is “你好美” ("you are so beautiful"), then “你好” is contained in “你好美”; the dictionary is considered to already hold a word similar to “你好”, and the word input by the user is ignored. If the word input by the user is “你美” and the entry is “你好美”, the word is not considered to be contained in any entry of the word segmentation dictionary or the user dictionary, because its characters are not adjacent in the entry.
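The containment test in the example above corresponds to a contiguous substring check. The following sketch is illustrative only; the function name `is_contained_in_entry` is hypothetical.

```python
def is_contained_in_entry(word, entries):
    """Third-level check: does the word appear as a whole inside any entry?

    Python's `in` operator on strings tests for a contiguous substring,
    which matches the notion of containment described in the text.
    """
    return any(word in entry for entry in entries)


entries = ["你好美"]
hello_contained = is_contained_in_entry("你好", entries)
ni_mei_contained = is_contained_in_entry("你美", entries)
```

“你好” is found because it occupies the first two characters of “你好美”, whereas “你美” is not found because “好” separates its two characters in the entry.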
a user input word dictionary update module 50, configured to add the word input by the user to the user input word dictionary as a possible unregistered word when the level-three recognition module 40 determines that the word is not contained in any entry of the word segmentation dictionary or the user dictionary;
Specifically, when the level-three recognition module 40 determines that the word input by the user is not contained in any entry of the word segmentation dictionary or the user dictionary, the user input word dictionary update module 50 adds the word to the user input word dictionary as a possible unregistered word. The user input word dictionary temporarily stores words that the user entered and then deleted while entering a sentence and that, after the level-by-level recognition in the above steps, are finally identified as possible unregistered words.
a level-four recognition module 60, configured to add the word input by the user to the user dictionary as an unregistered word and delete it from the user input word dictionary when the word is a word among the network entries, and otherwise to ignore the word.
Specifically, once the level-three recognition module 40 has determined that the word input by the user is not contained in any entry of the word segmentation dictionary or the user dictionary and the user input word dictionary update module 50 has added the word to the user input word dictionary as a possible unregistered word, the level-four recognition module 60 determines whether the word is a word among the network entries. The network entries preferably refer to the entries currently provided by Baidu Baike. Baidu Baike is built in the Internet spirit of equality, collaboration, sharing, and freedom: it holds that everyone is equal on the network and has all users collaborate in writing an encyclopedia, so that knowledge is continuously combined and extended within certain technical rules and cultural contexts. It offers users a creative online platform that emphasizes participation and contribution, draws on the collective knowledge of hundreds of millions of Internet users, encourages exchange and sharing, and is integrated with a search engine so as to meet users' information needs at different levels. The Baidu Baike entries therefore include the currently most popular new words, and using them as the basis for recognition allows unregistered words to be identified to the greatest possible extent. If the word input by the user is a word among the network entries, it is added to the user dictionary and deleted from the user input word dictionary; otherwise it is ignored.
The above modules process all the words input by the user in turn. After the user has finished entering the sentence, the server, based on the word segmentation dictionary and the user dictionary, applies existing word segmentation, similarity calculation, and matching algorithms to match the content to be returned from a preset database. Because unregistered words have been added to the user dictionary, segmenting the user's sentence on the basis of the user dictionary gives a better segmentation result and raises the intelligence level of the intelligent interactive system.
The embodiments of the present invention determine level by level whether the length of the word input by the user is equal to 1 or greater than 4, whether the word is an entry in the preset word segmentation dictionary or the user dictionary, and whether the word is contained in some entry of the word segmentation dictionary or the user dictionary. Possible unregistered words that pass these filters are added to the user input word dictionary as a temporary record. When a word is further recognized as a word among the network entries, it is added to the user dictionary and at the same time deleted from the user input word dictionary. By recognizing the user's words level by level and adding possible unregistered words to the user dictionary, the embodiments enrich the user dictionary, so that when the user's sentence needs to be segmented on the basis of the user dictionary, the segmentation result is improved and the intelligence level of the intelligent interactive system is raised.
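The four levels of recognition summarized above can be put together in one sketch. It is a sketch under stated assumptions rather than the implementation of the invention: both dictionaries are modeled as Python sets, the user input word dictionary as a `Counter`, and the network entry lookup as a caller-supplied function `is_network_entry`, since the source does not specify how the network entries are queried. The function name `classify_word` and the returned labels are hypothetical.

```python
from collections import Counter
from typing import Callable


def classify_word(
    word: str,
    segmentation_dict: set,
    user_dict: set,
    candidates: Counter,
    is_network_entry: Callable[[str], bool],
) -> str:
    """Run one word through the four recognition levels and report the outcome."""
    # Level one: ignore single-character words and words longer than 4.
    if not 1 < len(word) <= 4:
        return "ignored:length"
    # Level two: ignore words that are already dictionary entries.
    if word in segmentation_dict or word in user_dict:
        return "ignored:known"
    # Level three: ignore words contained as a whole in some entry.
    if any(word in entry for entry in segmentation_dict | user_dict):
        return "ignored:contained"
    # Record the word as a possible unregistered word.
    candidates[word] += 1
    # Level four: a network entry is promoted to the user dictionary.
    if is_network_entry(word):
        user_dict.add(word)
        del candidates[word]
        return "added"
    return "candidate"


# Hypothetical dictionaries and a stand-in for the network entry lookup.
segmentation_dict = {"你好美", "请问"}
user_dict = {"食疗"}
candidates = Counter()
lookup = lambda w: w == "科技园"

outcomes = [
    classify_word(w, segmentation_dict, user_dict, candidates, lookup)
    for w in ["走", "请问", "你好", "科技园", "你美"]
]
```

Passing the network lookup in as a function keeps the sketch independent of any particular encyclopedia service, and words that remain in `candidates` stay available for the later word frequency check.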
In a preferred embodiment, the acquisition module is specifically configured to:
capture the changed content of the text box as the user types, and treat the changed content of the text box as the word input by the user.
When the user types a sentence into the text box of the client, the client captures the changed content of the text box as the user types, following the user's input habits. For example, to enter the sentence “请问科技园怎么走?” ("Excuse me, how do I get to the Science Park?"), a user habitually enters it piece by piece: “请问” (or “请” and then “问”), “科技园”, “怎么”, “走”. The client captures each change to the text box. For instance, it first captures “请问”, treats “请问” as a word input by the user, and determines level by level, as in the flowchart of the first preferred embodiment of the unregistered word recognition method of the present invention, whether “请问” is an unregistered word. Only after the user has finished entering the sentence is the sentence transmitted to the server. The server then, based on the preset word segmentation dictionary and the updated user dictionary, applies existing word segmentation, similarity calculation, and matching algorithms to match the content to be returned from a preset database. Treating the changed content of the text box as the word input by the user achieves asynchronous transmission of the user's words and sentences, so that once the sentence is complete it can be segmented with existing methods against the preset word segmentation dictionary and the updated user dictionary, which further improves the segmentation result and raises the intelligence level of the intelligent interactive system.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a second preferred embodiment of the unregistered word recognition device in the intelligent interactive system of the present invention.
In an embodiment, as shown in FIG. 4, based on the first preferred embodiment of the unregistered word recognition device shown in FIG. 3, the device for identifying unregistered words in the intelligent interactive system further includes:
a user input word dictionary word frequency statistics module 70, configured to count the word frequency of each word in the user input word dictionary;
a user dictionary update module 80, configured to add a word to the user dictionary as an unregistered word and delete it from the user input word dictionary if the word frequency of that word in the user input word dictionary is greater than a preset value.
The user input word dictionary temporarily stores words that the user entered and then deleted while entering a sentence and that, after the level-by-level recognition in the above steps, are finally identified as possible unregistered words. These are words that the user enters often but that do not appear among the network entries. The word frequency of each word in the user input word dictionary is counted, and any word whose frequency is greater than the preset value (likely a common word that is in neither the preset word segmentation dictionary nor the user dictionary) is added to the user dictionary, further enriching it, and is deleted from the user input word dictionary.
In a preferred embodiment, the device for identifying unregistered words in the intelligent interactive system further includes:
a user dictionary building module, configured to establish a user dictionary and to store in the user dictionary the common words of the user's specific application field.
In the embodiments of the present invention, the user dictionary is a pre-established set of words specific to a given application field, for example the health management field, such as “看片” (reading medical films), “食疗” (diet therapy), and “理疗” (physiotherapy). After it has been established, the user dictionary can be extended and enriched during subsequent user input.
In a preferred embodiment, the device for identifying unregistered words in the intelligent interactive system further includes:
a user input word dictionary building module, configured to establish the user input word dictionary, which stores the possible unregistered words entered by the user while entering sentences. What it mainly stores can be seen from the description of the user input word dictionary update module.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (14)

  1. A method for identifying unregistered words in an intelligent interactive system, characterized in that the method comprises the following steps:
    S10: obtain a word input by the user;
    S20: determine whether the length of the word input by the user is equal to 1 or greater than 4; if so, ignore the word; otherwise execute S30;
    S30: determine whether the word input by the user is an entry in a preset word segmentation dictionary or in a user dictionary; if so, ignore the word; otherwise execute S40;
    S40: determine whether the word input by the user is contained in some entry of the word segmentation dictionary or the user dictionary; if so, ignore the word; otherwise execute S50;
    S50: add the word input by the user to a user input word dictionary as a possible unregistered word;
    S60: determine whether the word input by the user is a word among the network entries; if so, add the word to the user dictionary as an unregistered word and delete it from the user input word dictionary; otherwise ignore the word.
  2. The method for identifying unregistered words in an intelligent interactive system according to claim 1, characterized in that the method further comprises the following step:
    S90: establish a user dictionary, and store in the user dictionary the common words of the user's specific application field.
  3. The method for identifying unregistered words in an intelligent interactive system according to claim 1, characterized in that the method further comprises the following step:
    S100: establish a user input word dictionary, and store possible unregistered words in the user input word dictionary.
  4. The method for identifying unregistered words in an intelligent interactive system according to claim 1, characterized in that step S10 comprises:
    S11: capture the changed content of the text box as the user types;
    S12: treat the changed content of the text box as the word input by the user.
  5. The method for identifying unregistered words in an intelligent interactive system according to claim 4, characterized in that the method further comprises the following step:
    S90: establish a user dictionary, and store in the user dictionary the common words of the user's specific application field.
  6. The method for identifying unregistered words in an intelligent interactive system according to claim 4, characterized in that the method further comprises the following step:
    S100: establish a user input word dictionary, and store possible unregistered words in the user input word dictionary.
  7. The method for identifying unregistered words in an intelligent interactive system according to claim 1, characterized in that the method further comprises the following steps:
    S70: count the word frequency of each word in the user input word dictionary;
    S80: if the word frequency of a word in the user input word dictionary is greater than a preset value, add that word to the user dictionary as an unregistered word and delete it from the user input word dictionary.
  8. The method for identifying unregistered words in an intelligent interactive system according to claim 7, characterized in that the method further comprises the following step:
    S90: establish a user dictionary, and store in the user dictionary the common words of the user's specific application field.
  9. The method for identifying unregistered words in an intelligent interactive system according to claim 7, characterized in that the method further comprises the following step:
    S100: establish a user input word dictionary, and store possible unregistered words in the user input word dictionary.
  10. A device for identifying unregistered words in an intelligent interactive system, characterized in that the device comprises:
    an acquisition module, configured to obtain a word input by the user;
    a level-one recognition module, configured to determine whether the length of the word input by the user is equal to 1 or greater than 4 and, if so, to ignore the word;
    a level-two recognition module, configured to determine, when the length of the word input by the user is greater than 1 and less than or equal to 4, whether the word is an entry in a preset word segmentation dictionary or in a user dictionary and, if so, to ignore the word;
    a level-three recognition module, configured to determine, when the word input by the user is not an entry in the word segmentation dictionary or the user dictionary, whether the word is contained in some entry of the word segmentation dictionary or the user dictionary and, if so, to ignore the word;
    a user input word dictionary update module, configured to add the word input by the user to a user input word dictionary as a possible unregistered word when the word is not contained in any entry of the word segmentation dictionary or the user dictionary;
    a level-four recognition module, configured to add the word input by the user to the user dictionary as an unregistered word and delete it from the user input word dictionary when the word is a word among the network entries, and otherwise to ignore the word.
  11. The device for identifying unregistered words in an intelligent interactive system according to claim 10, characterized in that the acquisition module is specifically configured to:
    capture the changed content of the text box as the user types, and treat the changed content of the text box as the word input by the user.
  12. The device for identifying unregistered words in an intelligent interactive system according to claim 10, characterized in that the device further comprises:
    a user input word dictionary word frequency statistics module, configured to count the word frequency of each word in the user input word dictionary;
    a user dictionary update module, configured to add a word to the user dictionary as an unregistered word and delete it from the user input word dictionary if the word frequency of that word in the user input word dictionary is greater than a preset value.
  13. The device for identifying unregistered words in an intelligent interactive system according to claim 10, characterized in that the device further comprises:
    a user dictionary building module, configured to establish a user dictionary and to store in the user dictionary the common words of the user's specific application field and the unregistered words.
  14. The device for identifying unregistered words in an intelligent interactive system according to claim 10, characterized in that the device further comprises:
    a user input word dictionary building module, configured to establish a user input word dictionary and to store possible unregistered words in the user input word dictionary.
PCT/CN2015/073842 2015-02-12 2015-03-07 Method and device for recognizing unlogged word in intelligent interaction system WO2016127459A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510074982.3 2015-02-12
CN201510074982.3A CN104714940A (en) 2015-02-12 2015-02-12 Method and device for identifying unregistered word in intelligent interaction system

Publications (1)

Publication Number Publication Date
WO2016127459A1 true WO2016127459A1 (en) 2016-08-18

Family

ID=53414286

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/073842 WO2016127459A1 (en) 2015-02-12 2015-03-07 Method and device for recognizing unlogged word in intelligent interaction system

Country Status (2)

Country Link
CN (1) CN104714940A (en)
WO (1) WO2016127459A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010665A (en) * 2019-12-20 2021-06-22 北京搜狗科技发展有限公司 Word processing method and related device
CN113111655A (en) * 2021-05-12 2021-07-13 数库(上海)科技有限公司 Construction method of separation dictionary, word segmentation method and device based on separation dictionary
CN115221872A (en) * 2021-07-30 2022-10-21 苏州七星天专利运营管理有限责任公司 Vocabulary extension method and system based on near-sense extension

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108877939A (en) * 2018-05-10 2018-11-23 重庆大学 It is a kind of with the health management system arranged of intelligent characteristic abstraction function

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1629836A (en) * 2003-12-17 2005-06-22 北京大学 Method and apparatus for learning Chinese new words
CN1912872A (en) * 2006-07-25 2007-02-14 北京搜狗科技发展有限公司 Method and system for abstracting new word
CN101079027A (en) * 2007-06-27 2007-11-28 腾讯科技(深圳)有限公司 Chinese character word distinguishing method and system
CN101118556A (en) * 2007-09-17 2008-02-06 中国科学院计算技术研究所 New word of short-text discovering method and system
CN101539940A (en) * 2009-05-04 2009-09-23 清华大学 Method for acquiring new words and device thereof

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19830225A1 (en) * 1998-07-07 2000-01-13 Wolfgang Hilberg Electronic system for input output and storage of lengthy text
US7403888B1 (en) * 1999-11-05 2008-07-22 Microsoft Corporation Language input user interface
CN100595760C (en) * 2007-08-31 2010-03-24 北京搜狗科技发展有限公司 Method for gaining oral vocabulary entry, device and input method system thereof
CN101556596B (en) * 2007-08-31 2012-04-18 北京搜狗科技发展有限公司 Input method system and intelligent word making method
CN101751386B (en) * 2009-12-28 2012-05-23 华建机器翻译有限公司 Identification method of unknown words
CN103020034A (en) * 2011-09-26 2013-04-03 北京大学 Chinese words segmentation method and device
CN103678684B (en) * 2013-12-25 2017-05-31 沈阳美行科技有限公司 A kind of Chinese word cutting method based on navigation information retrieval
CN104156349B (en) * 2014-03-19 2017-08-15 邓柯 Unlisted word discovery and Words partition system and method based on statistics dictionary model
CN103942190B (en) * 2014-04-16 2017-08-25 科大讯飞股份有限公司 Phonetic synthesis Chinese version segmenting method and system

Also Published As

Publication number Publication date
CN104714940A (en) 2015-06-17

Similar Documents

Publication Publication Date Title
WO2020009297A1 (en) Domain extraction based language comprehension performance enhancement apparatus and performance enhancement method
WO2017143692A1 (en) Smart television and voice control method therefor
WO2016127459A1 (en) Method and device for recognizing unlogged word in intelligent interaction system
WO2017156893A1 (en) Voice control method and smart television
WO2018034426A1 (en) Method for automatically correcting error in tagged corpus by using kernel pdr
WO2017041484A1 (en) Method, apparatus, and system for recommending real-time information
WO2019177182A1 (en) Multimedia content search apparatus and search method using attribute information analysis
WO2019080406A1 (en) Television voice interaction method, voice interaction control device and storage medium
WO2012134180A2 (en) Emotion classification method for analyzing inherent emotions in a sentence, and emotion classification method for multiple sentences using context information
WO2013170662A1 (en) Method and device for adding friend information, and computer storage medium
WO2017028601A1 (en) Voice control method and device for intelligent terminal, and television system
WO2015131803A1 (en) Application recommending method and system
WO2016167424A1 (en) Answer recommendation device, and automatic sentence completion system and method
WO2019242090A1 (en) Intelligent customer service response method, device, and apparatus, and storage medium
WO2013139239A1 (en) Method for recommending users in social network and the system thereof
WO2017197802A1 (en) Character string fuzzy matching method and apparatus
WO2018023926A1 (en) Interaction method and system for television and mobile terminal
WO2020224247A1 (en) Blockchain–based data provenance method, apparatus and device, and readable storage medium
WO2019051902A1 (en) Terminal control method, air conditioner and computer-readable storage medium
WO2019169814A1 (en) Method, apparatus and device for automatically generating chinese annotation, and storage medium
WO2012130145A1 (en) Method and device for acquiring and searching for relevant knowledge information
WO2016032021A1 (en) Apparatus and method for recognizing voice commands
WO2019218527A1 (en) Multi-system combined natural language processing method and apparatus
WO2019085543A1 (en) Television system and television control method
WO2019062112A1 (en) Method and device for controlling air conditioner, air conditioner, and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15881621; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 15881621; Country of ref document: EP; Kind code of ref document: A1)