WO2021227060A1 - Multi-node word segmentation system and method for keyword search - Google Patents

Multi-node word segmentation system and method for keyword search

Info

Publication number
WO2021227060A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
word segmentation
node
server
proxy server
Prior art date
Application number
PCT/CN2020/090648
Other languages
English (en)
French (fr)
Inventor
商良磊
Original Assignee
深圳市世强元件网络有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市世强元件网络有限公司
Priority to US17/420,268 (US11914658B2)
Priority to PCT/CN2020/090648 (WO2021227060A1)
Publication of WO2021227060A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9532Query formulation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/2866Architectures; Arrangements
    • H04L67/2876Pairs of inter-processing entities at each side of the network, e.g. split proxies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools

Definitions

  • the invention relates to the field of search engine search term recommendation, and more specifically, to a multi-node word segmentation system and method for keyword search.
  • FIG. 1 is a schematic diagram of the system structure of a word segmenter in the prior art.
  • the existing search engine word segmenter uses stand-alone in-memory storage, and its built-in dictionary holds 420,000 entries.
  • when the word segmenter is used in a specific field, third-party vocabulary is added as needed to supplement semantics.
  • this method can only meet a vocabulary requirement of about 500,000 entries. If the vocabulary reaches the million or ten-million level, large-scale memory expansion is required and the cost rises steeply.
  • in the electronic components industry, for example, categories are divided in extreme detail and the number of word elements in specification codes is also very large, easily reaching millions or even tens of millions.
  • the existing word segmenter cannot meet these requirements.
  • programming languages impose upper limits on software memory. For example, when the memory of a Java virtual machine reaches a certain size at the GB level, unpredictable memory leaks can occur, resulting in unavailability or even a system crash.
  • the technical problem to be solved by the present invention is to provide a multi-node word segmentation system and method for keyword search in view of the above-mentioned defects of the prior art.
  • the technical solution adopted by the present invention to solve its technical problems is to construct a multi-node word segmentation system for keyword search, including a word element distribution proxy server, a client proxy server and at least two word segmentation node servers;
  • the word element distribution proxy server is communicatively connected to each of the word segmentation node servers; the word element distribution proxy server receives a large number of word elements and distributes each word element to the corresponding word segmentation node server according to a preset word segmentation rule;
  • the client proxy server is communicatively connected to each of the word segmentation node servers; the client proxy server receives the word to be searched uploaded by the user terminal and distributes it to the corresponding word segmentation node server according to the preset word segmentation rule;
  • the word segmentation node server queries the word elements that match the word to be searched and returns the word elements obtained from the query to the client proxy server; the client proxy server delivers the word elements obtained from the query to the corresponding user terminal.
  • the preset word segmentation rule is word element attribute information, and each word segmentation node server sets corresponding word element attribute information;
  • the client proxy server distributes each word element to the corresponding word segmentation node server according to the word element attribute information of that word element, and distributes the word to be searched to the corresponding word segmentation node server according to the word element attribute information of the word to be searched.
  • the word element attribute information is the first letter of the Chinese pinyin of the word element; or
  • the word element attribute information is the first letter of the spelling of the word element in its language.
  • the multiple word segmentation node servers are multiple independent physical servers; or
  • the multiple word segmentation node servers are multiple independent virtual servers; or
  • the multiple word segmentation node servers are a mixture of multiple independent physical servers and multiple independent virtual servers.
  • a large number of word elements received by the word element distribution proxy server are vocabulary related to electronic components.
  • the multi-node word segmentation system for keyword search of the present invention further includes a spare word segmentation node server, which automatically starts the corresponding spare word segmentation node server when a certain word segmentation node server is damaged.
  • the present invention also provides a multi-node word segmentation method for keyword search, including:
  • the word element distribution proxy server receives a large number of word elements, and distributes each of the word elements to the corresponding word segmentation node server according to the preset word segmentation rules;
  • the client proxy server receives the word to be searched uploaded by the user terminal, and distributes the word to be searched to the corresponding word segmentation node server according to the preset word segmentation rule;
  • the word segmentation node server queries the word elements that match the word to be searched, and returns the word elements obtained from the query to the client proxy server;
  • the client proxy server delivers the word elements obtained from the query to the corresponding user terminal.
  • the preset word segmentation rule is word element attribute information, and each word segmentation node server sets corresponding word element attribute information;
  • in step A, distributing each word element to the corresponding word segmentation node server according to the preset word segmentation rule includes: the client proxy server distributes each word element to the corresponding word segmentation node server according to the word element attribute information of that word element;
  • in step b1, distributing the word to be searched to the corresponding word segmentation node server according to the preset word segmentation rule includes: the client proxy server distributes the word to be searched to the corresponding word segmentation node server according to the word element attribute information of the word to be searched.
  • the word element attribute information is the first letter of the Chinese pinyin of the word element; or
  • the word element attribute information is the first letter of the spelling of the word element in its language.
  • the multiple word segmentation node servers are multiple independent physical servers; or
  • the multiple word segmentation node servers are multiple independent virtual servers; or
  • the multiple word segmentation node servers are a mixture of multiple independent physical servers and multiple independent virtual servers.
  • a large number of word elements received by the word element distribution proxy server are vocabulary related to electronic components.
  • in step B, if a word segmentation node server is damaged, the corresponding spare word segmentation node server is started automatically.
  • the multi-node word segmentation method for keyword search of the present invention further includes:
  • C. lexicon expansion process: the word element distribution proxy server receives expanded word elements and distributes them to the corresponding word segmentation node servers according to the preset word segmentation rule.
  • the multi-node word segmentation method for keyword search of the present invention further includes:
  • D. word element deletion process: the word element distribution proxy server receives the word element to be deleted and distributes it to the corresponding word segmentation node server according to the preset word segmentation rule, and the word segmentation node server deletes the stored word element corresponding to the word element to be deleted.
  • the multi-node word segmentation method for keyword search of the present invention further includes:
  • the word element distribution proxy server receives the word element to be updated and its corresponding updated word element, and distributes the word element to be updated to the corresponding word segmentation node server according to the preset word segmentation rule;
  • the word segmentation node server where the word element to be updated is located deletes the word element corresponding to the word element to be updated;
  • the word element distribution proxy server distributes the updated word element to the corresponding word segmentation node server according to the preset word segmentation rule.
  • the present invention uses multiple word segmentation node servers to spread the data volume and distributes the words to be searched by means of a vocabulary distribution algorithm, achieving efficient, low-cost retrieval on a large platform with a large vocabulary.
  • FIG. 1 is a schematic diagram of the structure of a word segmenter in the prior art;
  • FIG. 2 is a schematic diagram of the structure of a multi-node word segmentation system for keyword search provided in Embodiment 1;
  • FIG. 3 is a flowchart of a multi-node word segmentation method for keyword search provided in Embodiment 2.
  • the multi-node word segmentation system for keyword search in this embodiment includes a word element distribution proxy server 10, a client proxy server 20, and at least two word segmentation node servers 30, wherein the word element distribution proxy server 10 is communicatively connected to each word segmentation node server 30.
  • the word element distribution proxy server 10 receives a large number of word elements, and distributes each word element to the corresponding word segmentation node server 30 according to a preset word segmentation rule.
  • a large number of vocabularies received by the word element distribution proxy server 10 are provided by users according to their needs.
  • a large number of word elements received by the word element distribution proxy server 10 are vocabulary related to electronic components.
  • the client proxy server 20 communicates with each word segmentation node server 30 respectively.
  • the client proxy server 20 receives the search words uploaded by the user terminal 40 and distributes the search words to the corresponding word segmentation node servers 30 according to preset word segmentation rules.
  • the word segmentation node server 30 inquires out the word elements that match the word to be searched, and returns the inquired word elements to the client proxy server 20; the client proxy server 20 delivers the inquired word elements to the corresponding user terminal 40.
  • the client proxy server 20 and the software and website on the user terminal 40 need to be deeply bound to ensure that the search term received by the user terminal 40 can be quickly sent to the client proxy server 20 and reduce response time.
  • the electronic component e-commerce software and website are deeply bound with the client proxy server 20.
  • the average response time of this embodiment is less than 100 ms under the million-level and tens of million-level word elements.
  • the word element distribution proxy server 10 and the client proxy server 20 use the same preset word segmentation rule, so that the client proxy server 20 can accurately distribute the word to be searched entered by the user to the corresponding word segmentation node server 30 and the search can be carried out.
  • the preset word segmentation rule is word element attribute information, and each word segmentation node server 30 sets corresponding word element attribute information; that is, the word element attribute information differs between word segmentation node servers 30, and each word segmentation node server 30 is only responsible for receiving the word elements corresponding to its own word element attribute information, so that the word segmentation node servers 30 share all the word elements in a balanced manner and the data pressure is spread out.
  • the word element attribute information is the first letter of the spelling of the word element in its language.
  • the languages here include but are not limited to Chinese, English, Japanese, German, French, Arabic numerals, etc. Each language has corresponding spelling rules.
  • the spelling rules of each language are used as the word element attribute information.
  • the word element attribute information is the first letter of the Chinese pinyin of the word element, each word segmentation node server 30 corresponds to one or more pinyin letters, and the word element is distributed to the corresponding word segmentation node server 30 according to its first letter.
  • the multiple word segmentation node servers 30 are multiple independent physical servers; or the multiple word segmentation node servers 30 are multiple independent virtual servers; or the multiple word segmentation node servers 30 are a mixture of multiple independent physical servers and multiple independent virtual servers.
  • the above setting methods can be selected according to user performance requirements.
  • the multi-node word segmentation system for keyword search in some embodiments further includes a spare word segmentation node server 301; when a word segmentation node server 30 is damaged, the corresponding spare word segmentation node server 301 is started automatically.
  • multiple word segmentation node servers are used to spread the data volume, and the words to be searched are distributed by means of a vocabulary distribution algorithm, achieving efficient, low-cost retrieval on a large platform with a large vocabulary.
  • the multi-node word segmentation method for keyword search in this embodiment includes the following steps:
  • the word element distribution proxy server 10 receives a large number of word elements, and distributes each word element to the corresponding word segmentation node server 30 according to a preset word segmentation rule.
  • optionally, the preset word segmentation rule is word element attribute information, each word segmentation node server 30 sets corresponding word element attribute information, and the client proxy server 20 distributes each word element to the corresponding word segmentation node server 30 according to the word element attribute information of that word element.
  • the large number of word elements received by the word element distribution proxy server 10 is provided by users according to their needs. For example, the word elements are vocabulary related to electronic components, including but not limited to electronic device names, numbers, various parameters, and technical terms.
  • the client proxy server 20 receives the word to be searched uploaded by the user terminal 40, and distributes the word to be searched to the corresponding word segmentation node server 30 according to a preset word segmentation rule.
  • optionally, the preset word segmentation rule is word element attribute information, each word segmentation node server 30 sets corresponding word element attribute information, and the client proxy server 20 distributes the word to be searched to the corresponding word segmentation node server 30 according to the word element attribute information of the word to be searched.
  • it can be understood that the client proxy server 20 needs to be deeply bound with the software and website on the user terminal 40, so that the word to be searched received by the user terminal 40 can be sent quickly to the client proxy server 20, reducing response time. For example, electronic component e-commerce software and websites are deeply bound with the client proxy server 20. In actual testing, the average response time of this embodiment was less than 100 ms with word elements at the million and ten-million level.
  • the word segmentation node server 30 searches out the word elements that match the word to be searched, and returns the searched word elements to the client proxy server 20.
  • the client proxy server 20 delivers the word elements obtained from the query to the corresponding user terminal 40.
  • the word element distribution proxy server 10 and the client proxy server 20 use the same preset word segmentation rule, so that the client proxy server 20 can accurately distribute the word to be searched entered by the user to the corresponding word segmentation node server 30 and the search can be carried out.
  • the preset word segmentation rule is word element attribute information, and each word segmentation node server 30 sets corresponding word element attribute information; that is, the word element attribute information differs between word segmentation node servers 30, and each word segmentation node server 30 is only responsible for receiving the word elements corresponding to its own word element attribute information, so that the word segmentation node servers 30 share all the word elements in a balanced manner and the data pressure is spread out.
  • the word element attribute information is the first letter of the spelling of the word element in its language.
  • the languages here include but are not limited to Chinese, English, Japanese, German, French, Arabic numerals, etc. Each language has corresponding spelling rules.
  • the spelling rules of each language are used as the word element attribute information.
  • the word element attribute information is the first letter of the Chinese pinyin of the word element, each word segmentation node server 30 corresponds to one or more pinyin letters, and the word element is distributed to the corresponding word segmentation node server 30 according to its first letter.
  • the multiple word segmentation node servers 30 are multiple independent physical servers; or the multiple word segmentation node servers 30 are multiple independent virtual servers; or the multiple word segmentation node servers 30 are a mixture of multiple independent physical servers and multiple independent virtual servers; the configuration can be selected according to user performance requirements.
  • multiple word segmentation node servers are used to spread the data volume, and the words to be searched are distributed by means of a vocabulary distribution algorithm, achieving efficient, low-cost retrieval on a large platform with a large vocabulary.
  • in step B, if a word segmentation node server 30 is damaged, the corresponding spare word segmentation node server 301 is started automatically, which improves the reliability of the system.
  • the multi-node word segmentation method for keyword search in this embodiment further includes: C. lexicon expansion process: the word element distribution proxy server 10 receives expanded word elements and distributes them to the corresponding word segmentation node servers 30 according to the preset word segmentation rule.
  • the multi-node word segmentation method for keyword search in this embodiment further includes: D. word element deletion process: the word element distribution proxy server 10 receives the word element to be deleted and distributes it to the corresponding word segmentation node server 30 according to the preset word segmentation rule, and the word segmentation node server 30 deletes the stored word element corresponding to the word element to be deleted.
  • the multi-node word segmentation method for keyword search in this embodiment further includes:
  • the word element distribution proxy server 10 receives the word element to be updated and its corresponding updated word element, and distributes the word element to be updated to the corresponding word segmentation node server 30 according to the preset word segmentation rule; where the word element to be updated is located The word segmentation node server 30 deletes the word element corresponding to the word element to be updated; the word element distribution proxy server 10 distributes the updated word element to the corresponding word segmentation node server 30 according to a preset word segmentation rule.
  • the steps of the method or algorithm described in the embodiments disclosed in this document can be directly implemented by hardware, a software module executed by a processor, or a combination of the two.
  • the software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

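A multi-node word segmentation system and method for keyword search. In the system, a word element distribution proxy server (10) is communicatively connected to each word segmentation node server (30); the word element distribution proxy server (10) receives a large number of word elements and distributes each word element to the corresponding word segmentation node server (30, A) according to a preset word segmentation rule. A client proxy server (20) is communicatively connected to each word segmentation node server (30); the client proxy server (20) receives a word to be searched uploaded by a user terminal (40) and distributes it to the corresponding word segmentation node server (30, b1) according to the preset word segmentation rule. The word segmentation node server (30) queries the word elements that match the word to be searched and returns the word elements obtained from the query to the client proxy server (20, b2). The client proxy server (20) delivers the word elements obtained from the query to the corresponding user terminal (40, b3).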

Description

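Multi-node word segmentation system and method for keyword search
Technical Field
The present invention relates to the field of search term recommendation for search engines, and more specifically to a multi-node word segmentation system and method for keyword search.
Background Art
Referring to FIG. 1, which is a schematic diagram of the system structure of a word segmenter in the prior art, existing search engine word segmenters use stand-alone in-memory storage with a built-in dictionary of 420,000 entries. When a segmenter is used in a specific field, third-party vocabulary is added as needed to supplement semantics. This approach can only meet a vocabulary requirement of about 500,000 entries; if the vocabulary reaches the million or ten-million level, large-scale memory expansion is required and the cost rises steeply. In the electronic components industry, for example, categories are divided in extreme detail and the number of word elements in specification codes is also very large, easily reaching millions or even tens of millions, which existing word segmenters cannot handle. In addition, programming languages impose upper limits on software memory; for example, when the memory of a Java virtual machine reaches a certain size at the GB level, unpredictable memory leaks can occur, leading to unavailability or even a system crash.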
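Technical Problem
The technical problem to be solved by the present invention is to provide a multi-node word segmentation system and method for keyword search that addresses the above-mentioned defects of the prior art.
Technical Solution
The technical solution adopted by the present invention to solve this technical problem is to construct a multi-node word segmentation system for keyword search, including a word element distribution proxy server, a client proxy server, and at least two word segmentation node servers;
the word element distribution proxy server is communicatively connected to each of the word segmentation node servers; the word element distribution proxy server receives a large number of word elements and distributes each word element to the corresponding word segmentation node server according to a preset word segmentation rule;
the client proxy server is communicatively connected to each of the word segmentation node servers; the client proxy server receives the word to be searched uploaded by the user terminal and distributes it to the corresponding word segmentation node server according to the preset word segmentation rule; the word segmentation node server queries the word elements that match the word to be searched and returns the word elements obtained from the query to the client proxy server; the client proxy server delivers the word elements obtained from the query to the corresponding user terminal.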
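Further, in the multi-node word segmentation system for keyword search of the present invention, the preset word segmentation rule is word element attribute information, and each word segmentation node server sets corresponding word element attribute information;
the client proxy server distributes each word element to the corresponding word segmentation node server according to the word element attribute information of that word element, and the client proxy server distributes the word to be searched to the corresponding word segmentation node server according to the word element attribute information of the word to be searched.
Further, in the multi-node word segmentation system for keyword search of the present invention, the word element attribute information is the first letter of the Chinese pinyin of the word element; or
the word element attribute information is the first letter of the spelling of the word element in its language.
Further, in the multi-node word segmentation system for keyword search of the present invention, the multiple word segmentation node servers are multiple independent physical servers; or
the multiple word segmentation node servers are multiple independent virtual servers; or
the multiple word segmentation node servers are a mixture of multiple independent physical servers and multiple independent virtual servers.
Further, in the multi-node word segmentation system for keyword search of the present invention, the large number of word elements received by the word element distribution proxy server is vocabulary related to electronic components.
Further, the multi-node word segmentation system for keyword search of the present invention also includes a spare word segmentation node server; when a word segmentation node server is damaged, the corresponding spare word segmentation node server is started automatically.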
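In addition, the present invention also provides a multi-node word segmentation method for keyword search, including:
A. Lexicon building process: the word element distribution proxy server receives a large number of word elements and distributes each word element to the corresponding word segmentation node server according to a preset word segmentation rule;
B. Search process for the word to be searched:
b1. The client proxy server receives the word to be searched uploaded by the user terminal and distributes it to the corresponding word segmentation node server according to the preset word segmentation rule;
b2. The word segmentation node server queries the word elements that match the word to be searched and returns the word elements obtained from the query to the client proxy server;
b3. The client proxy server delivers the word elements obtained from the query to the corresponding user terminal.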
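Further, in the multi-node word segmentation method for keyword search of the present invention, the preset word segmentation rule is word element attribute information, and each word segmentation node server sets corresponding word element attribute information;
in step A, distributing each word element to the corresponding word segmentation node server according to the preset word segmentation rule includes: the client proxy server distributes each word element to the corresponding word segmentation node server according to the word element attribute information of that word element;
in step b1, distributing the word to be searched to the corresponding word segmentation node server according to the preset word segmentation rule includes: the client proxy server distributes the word to be searched to the corresponding word segmentation node server according to the word element attribute information of the word to be searched.
Further, in the multi-node word segmentation method for keyword search of the present invention, the word element attribute information is the first letter of the Chinese pinyin of the word element; or
the word element attribute information is the first letter of the spelling of the word element in its language.
Further, in the multi-node word segmentation method for keyword search of the present invention, the multiple word segmentation node servers are multiple independent physical servers; or
the multiple word segmentation node servers are multiple independent virtual servers; or
the multiple word segmentation node servers are a mixture of multiple independent physical servers and multiple independent virtual servers.
Further, in the multi-node word segmentation method for keyword search of the present invention, the large number of word elements received by the word element distribution proxy server is vocabulary related to electronic components.
Further, in the multi-node word segmentation method for keyword search of the present invention, in step B, if a word segmentation node server is damaged, the corresponding spare word segmentation node server is started automatically.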
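Further, the multi-node word segmentation method for keyword search of the present invention also includes:
C. Lexicon expansion process: the word element distribution proxy server receives expanded word elements and distributes them to the corresponding word segmentation node servers according to the preset word segmentation rule.
Further, the multi-node word segmentation method for keyword search of the present invention also includes:
D. Word element deletion process: the word element distribution proxy server receives the word element to be deleted and distributes it to the corresponding word segmentation node server according to the preset word segmentation rule, and the word segmentation node server deletes the stored word element corresponding to the word element to be deleted.
Further, the multi-node word segmentation method for keyword search of the present invention also includes:
E. Lexicon update process: the word element distribution proxy server receives the word element to be updated and its corresponding updated word element, and distributes the word element to be updated to the corresponding word segmentation node server according to the preset word segmentation rule; the word segmentation node server holding the word element to be updated deletes the word element corresponding to it; the word element distribution proxy server distributes the updated word element to the corresponding word segmentation node server according to the preset word segmentation rule.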
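Beneficial Effects
Implementing the multi-node word segmentation system and method for keyword search of the present invention has the following beneficial effects: the present invention uses multiple word segmentation node servers to spread the data volume and distributes the words to be searched by means of a vocabulary distribution algorithm, achieving efficient, low-cost retrieval on a large platform with a large vocabulary.
Brief Description of the Drawings
The present invention is further described below with reference to the drawings and embodiments, in which:
FIG. 1 is a schematic diagram of the structure of a word segmenter in the prior art;
FIG. 2 is a schematic diagram of the structure of a multi-node word segmentation system for keyword search provided in Embodiment 1;
FIG. 3 is a flowchart of a multi-node word segmentation method for keyword search provided in Embodiment 2.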
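Best Mode for Carrying Out the Invention
For a clearer understanding of the technical features, objectives, and effects of the present invention, specific embodiments are now described in detail with reference to the drawings.
Embodiment 1
Referring to FIG. 2, the multi-node word segmentation system for keyword search of this embodiment includes a word element distribution proxy server 10, a client proxy server 20, and at least two word segmentation node servers 30. The word element distribution proxy server 10 is communicatively connected to each word segmentation node server 30; it receives a large number of word elements and distributes each word element to the corresponding word segmentation node server 30 according to a preset word segmentation rule. The word elements received by the word element distribution proxy server 10 are provided by users according to their needs; for example, they are vocabulary related to electronic components, which includes but is not limited to electronic device names, numbers, various parameters, and technical terms.
The client proxy server 20 is communicatively connected to each word segmentation node server 30. The client proxy server 20 receives the word to be searched uploaded by the user terminal 40 and distributes it to the corresponding word segmentation node server 30 according to the preset word segmentation rule. The word segmentation node server 30 queries the word elements that match the word to be searched and returns them to the client proxy server 20, which delivers them to the corresponding user terminal 40. It can be understood that the client proxy server 20 needs to be deeply bound with the software and website on the user terminal 40, so that the word to be searched received by the user terminal 40 can be sent quickly to the client proxy server 20, reducing response time. For example, electronic component e-commerce software and websites are deeply bound with the client proxy server 20. In actual testing, the average response time of this embodiment was less than 100 ms with word elements at the million and ten-million level.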
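In this embodiment, the word element distribution proxy server 10 and the client proxy server 20 use the same preset word segmentation rule, so that the client proxy server 20 can accurately distribute the word to be searched entered by the user to its corresponding word segmentation node server 30 and the search can be carried out. The preset word segmentation rule is word element attribute information, and each word segmentation node server 30 sets corresponding word element attribute information; that is, the word element attribute information differs between word segmentation node servers 30, and each word segmentation node server 30 is only responsible for receiving the word elements corresponding to its own word element attribute information. This ensures that the word segmentation node servers 30 share all the word elements in a balanced manner and spreads the data pressure.
It can be understood that there are many kinds of word element attribute information, and any scheme that classifies word elements according to some word element attribute information falls within the design concept of this embodiment. For example, the word element attribute information is the first letter of the spelling of the word element in its language; the languages here include but are not limited to Chinese, English, Japanese, German, French, and Arabic numerals. Each language has its own spelling rules, and the spelling rules of each language are used as the word element attribute information. Preferably, the word element attribute information is the first letter of the Chinese pinyin of the word element, each word segmentation node server 30 corresponds to one or more pinyin letters, and the word element is distributed to the corresponding word segmentation node server 30 according to its first letter.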
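The first-letter rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the node names, the letter ranges in `LETTER_RANGES`, the catch-all `FALLBACK_NODE` for digits and symbols, and the `to_spelling` hook (a stand-in for a pinyin or other romanization step) are all hypothetical, since the text only says that each node corresponds to one or more letters.

```python
from typing import Callable, Dict

# Hypothetical assignment of initial letters to nodes. The source only states
# that each node owns "one or more" letters, not which ones.
LETTER_RANGES: Dict[str, str] = {
    "node-0": "abcdefgh",
    "node-1": "ijklmnopq",
    "node-2": "rstuvwxyz",
}
# Assumption: tokens that start with a digit or symbol go to a catch-all node.
FALLBACK_NODE = "node-misc"


def first_letter(token: str, to_spelling: Callable[[str], str] = str) -> str:
    """Return the lowercase first character of the token's spelling.

    `to_spelling` is a placeholder for a romanizer such as a pinyin
    converter; by default the token is used unchanged.
    """
    spelling = to_spelling(token).strip().lower()
    if not spelling:
        raise ValueError("empty token")
    return spelling[0]


def node_for(token: str, to_spelling: Callable[[str], str] = str) -> str:
    """Map a word element (or a word to be searched) to its node name."""
    letter = first_letter(token, to_spelling)
    for node, letters in LETTER_RANGES.items():
        if letter in letters:
            return node
    return FALLBACK_NODE
```

Because both proxies would call the same `node_for` function, a word to be searched lands on the node that stored the word elements sharing its first letter, which is the property the embodiment relies on.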
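In the multi-node word segmentation system for keyword search of this embodiment, the multiple word segmentation node servers 30 are multiple independent physical servers; or the multiple word segmentation node servers 30 are multiple independent virtual servers; or the multiple word segmentation node servers 30 are a mixture of multiple independent physical servers and multiple independent virtual servers. The configuration can be selected according to user performance requirements.
Optionally, to ensure the reliability of the system, the multi-node word segmentation system for keyword search in some embodiments further includes a spare word segmentation node server 301; when a word segmentation node server 30 is damaged, the corresponding spare word segmentation node server 301 is started automatically.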
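One way to picture the spare-node behaviour is the Python sketch below. The source does not say how damage is detected or how the spare obtains its data, so two assumptions are made and labelled in the code: health is a simple boolean flag, and every write is mirrored to the spare so it can serve queries as soon as it is started. All class and attribute names are hypothetical.

```python
class SegmentationNode:
    """In-memory stand-in for one word segmentation node server."""

    def __init__(self, name):
        self.name = name
        self.healthy = True   # assumption: health is tracked as a flag
        self.tokens = set()


class NodeSlot:
    """A primary node paired with its spare."""

    def __init__(self, primary, spare):
        self.primary = primary
        self.spare = spare
        self.spare_started = False

    def add(self, token):
        # Assumption: writes are mirrored so the spare holds the same tokens.
        self.primary.tokens.add(token)
        self.spare.tokens.add(token)

    def active(self):
        """Return the node that should serve requests right now."""
        if self.primary.healthy:
            return self.primary
        # Primary is damaged: start the spare automatically and use it.
        self.spare_started = True
        return self.spare


slot = NodeSlot(SegmentationNode("node-0"), SegmentationNode("node-0-spare"))
slot.add("capacitor")
slot.primary.healthy = False          # simulate damage to the primary
serving_node = slot.active()          # the spare takes over
```

Mirroring writes doubles memory for each protected node; restoring the spare from a snapshot instead would trade memory for a slower takeover.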
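This embodiment uses multiple word segmentation node servers to spread the data volume and distributes the words to be searched by means of a vocabulary distribution algorithm, achieving efficient, low-cost retrieval on a large platform with a large vocabulary.
Embodiment 2
Referring to FIG. 3, the multi-node word segmentation method for keyword search of this embodiment includes the following steps:
A. Lexicon building process: the word element distribution proxy server 10 receives a large number of word elements and distributes each word element to the corresponding word segmentation node server 30 according to a preset word segmentation rule. Optionally, the preset word segmentation rule is word element attribute information, each word segmentation node server 30 sets corresponding word element attribute information, and the client proxy server 20 distributes each word element to the corresponding word segmentation node server 30 according to the word element attribute information of that word element. The word elements received by the word element distribution proxy server 10 are provided by users according to their needs; for example, they are vocabulary related to electronic components, which includes but is not limited to electronic device names, numbers, various parameters, and technical terms.
B. Search process for the word to be searched:
b1. The client proxy server 20 receives the word to be searched uploaded by the user terminal 40 and distributes it to the corresponding word segmentation node server 30 according to the preset word segmentation rule. Optionally, the preset word segmentation rule is word element attribute information, each word segmentation node server 30 sets corresponding word element attribute information, and the client proxy server 20 distributes the word to be searched to the corresponding word segmentation node server 30 according to the word element attribute information of the word to be searched. It can be understood that the client proxy server 20 needs to be deeply bound with the software and website on the user terminal 40, so that the word to be searched received by the user terminal 40 can be sent quickly to the client proxy server 20, reducing response time. For example, electronic component e-commerce software and websites are deeply bound with the client proxy server 20. In actual testing, the average response time of this embodiment was less than 100 ms with word elements at the million and ten-million level.
b2. The word segmentation node server 30 queries the word elements that match the word to be searched and returns the word elements obtained from the query to the client proxy server 20.
b3. The client proxy server 20 delivers the word elements obtained from the query to the corresponding user terminal 40.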
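Steps A and b1 to b3 can be sketched end to end in Python. The sketch is illustrative only and rests on assumptions that the source does not state: nodes are plain in-memory sets, the routing rule is the first character's code point modulo a hypothetical `NODE_COUNT`, and "match" is taken to mean case-insensitive prefix match.

```python
from collections import defaultdict

NODE_COUNT = 3  # hypothetical number of word segmentation node servers


def node_index(text):
    """Shared routing rule: first character of the lowercase text.

    Assumption: the character's code point modulo NODE_COUNT picks the node,
    so each node ends up owning several initial letters.
    """
    return ord(text.strip().lower()[0]) % NODE_COUNT


class Cluster:
    def __init__(self):
        self.nodes = defaultdict(set)  # node index -> stored word elements

    def build(self, tokens):
        """Step A: the distribution proxy routes each word element to a node."""
        for token in tokens:
            self.nodes[node_index(token)].add(token)

    def search(self, query):
        """Steps b1-b3: route the query, match on one node, return the hits."""
        node = self.nodes[node_index(query)]            # b1: pick the node
        prefix = query.strip().lower()
        hits = sorted(t for t in node
                      if t.lower().startswith(prefix))  # b2: query that node
        return hits                                     # b3: back to the user


cluster = Cluster()
cluster.build(["STM32F103", "STM32F407", "LM358", "resistor"])
stm_hits = cluster.search("stm32")
```

Prefix matching fits first-letter routing well: every word element that starts with the query also shares its first letter, so a single node can answer the query without consulting the others.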
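In this embodiment, the word element distribution proxy server 10 and the client proxy server 20 use the same preset word segmentation rule, so that the client proxy server 20 can accurately distribute the word to be searched entered by the user to its corresponding word segmentation node server 30 and the search can be carried out. The preset word segmentation rule is word element attribute information, and each word segmentation node server 30 sets corresponding word element attribute information; that is, the word element attribute information differs between word segmentation node servers 30, and each word segmentation node server 30 is only responsible for receiving the word elements corresponding to its own word element attribute information. This ensures that the word segmentation node servers 30 share all the word elements in a balanced manner and spreads the data pressure.
It can be understood that there are many kinds of word element attribute information, and any scheme that classifies word elements according to some word element attribute information falls within the design concept of this embodiment. For example, the word element attribute information is the first letter of the spelling of the word element in its language; the languages here include but are not limited to Chinese, English, Japanese, German, French, and Arabic numerals. Each language has its own spelling rules, and the spelling rules of each language are used as the word element attribute information. Preferably, the word element attribute information is the first letter of the Chinese pinyin of the word element, each word segmentation node server 30 corresponds to one or more pinyin letters, and the word element is distributed to the corresponding word segmentation node server 30 according to its first letter.
Optionally, in the multi-node word segmentation method for keyword search of this embodiment, the multiple word segmentation node servers 30 are multiple independent physical servers; or the multiple word segmentation node servers 30 are multiple independent virtual servers; or the multiple word segmentation node servers 30 are a mixture of multiple independent physical servers and multiple independent virtual servers. The configuration can be selected according to user performance requirements.
This embodiment uses multiple word segmentation node servers to spread the data volume and distributes the words to be searched by means of a vocabulary distribution algorithm, achieving efficient, low-cost retrieval on a large platform with a large vocabulary.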
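Embodiment 3
On the basis of Embodiment 2, in step B of the multi-node word segmentation method for keyword search of this embodiment, if a word segmentation node server 30 is damaged, the corresponding spare word segmentation node server 301 is started automatically, which improves the reliability of the system.
Embodiment 4
On the basis of Embodiment 2, after the lexicon has been built, and in order to let users add new word elements, the multi-node word segmentation method for keyword search of this embodiment further includes: C. Lexicon expansion process: the word element distribution proxy server 10 receives expanded word elements and distributes them to the corresponding word segmentation node servers 30 according to the preset word segmentation rule. It should be noted that the preset word segmentation rule of this embodiment is the same as in the above embodiments.
Embodiment 5
On the basis of Embodiment 2, in order to let users delete word elements that have already been added, the multi-node word segmentation method for keyword search of this embodiment further includes: D. Word element deletion process: the word element distribution proxy server 10 receives the word element to be deleted and distributes it to the corresponding word segmentation node server 30 according to the preset word segmentation rule, and the word segmentation node server 30 deletes the stored word element corresponding to the word element to be deleted.
Embodiment 6
On the basis of Embodiment 2, in order to let users update word elements that already exist, the multi-node word segmentation method for keyword search of this embodiment further includes:
E. Lexicon update process: the word element distribution proxy server 10 receives the word element to be updated and its corresponding updated word element, and distributes the word element to be updated to the corresponding word segmentation node server 30 according to the preset word segmentation rule; the word segmentation node server 30 holding the word element to be updated deletes the word element corresponding to it; the word element distribution proxy server 10 distributes the updated word element to the corresponding word segmentation node server 30 according to the preset word segmentation rule.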
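Processes C, D, and E from Embodiments 4 to 6 can be sketched together in Python. As an assumption for illustration, nodes are in-memory sets and the routing rule is the first character's code point modulo a hypothetical `NODE_COUNT`; the class and method names are not from the source. The point of the sketch is that an update is a delete on the old word element's node followed by a distribution of the new word element, which may land on a different node.

```python
from collections import defaultdict

NODE_COUNT = 3  # hypothetical number of word segmentation node servers


def node_index(token):
    # Assumption: route by the code point of the first lowercase character.
    return ord(token.strip().lower()[0]) % NODE_COUNT


class Lexicon:
    """Stand-in for the distribution proxy plus its in-memory nodes."""

    def __init__(self):
        self.nodes = defaultdict(set)

    def expand(self, tokens):
        """Process C: route each new word element to its node."""
        for token in tokens:
            self.nodes[node_index(token)].add(token)

    def delete(self, token):
        """Process D: route the word element to its node and remove it there."""
        self.nodes[node_index(token)].discard(token)

    def update(self, old, new):
        """Process E: delete on the old node, then distribute the new one."""
        self.delete(old)
        self.expand([new])

    def locate(self, token):
        """Return the index of the node that stores the token, or None."""
        for index, stored in self.nodes.items():
            if token in stored:
                return index
        return None


lexicon = Lexicon()
lexicon.expand(["capacitor", "LM358"])
lexicon.update("capacitor", "diode")   # the word element moves between nodes
lexicon.delete("LM358")
```

Routing the delete with the same rule as the original insert means only one node is touched per operation, which keeps maintenance cost independent of the number of nodes.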
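The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.
The above embodiments are intended only to illustrate the technical concept and features of the present invention, so that those familiar with the art can understand and implement it; they do not limit the scope of protection of the present invention. All equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the coverage of the claims of the present invention.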

Claims (15)

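  1. A multi-node word segmentation system for keyword search, characterized in that it comprises a word element distribution proxy server (10), a client proxy server (20), and at least two word segmentation node servers (30);
    the word element distribution proxy server (10) is communicatively connected to each of the word segmentation node servers (30); the word element distribution proxy server (10) receives a large number of word elements and distributes each word element to the corresponding word segmentation node server (30) according to a preset word segmentation rule;
    the client proxy server (20) is communicatively connected to each of the word segmentation node servers (30); the client proxy server (20) receives a word to be searched uploaded by a user terminal (40) and distributes the word to be searched to the corresponding word segmentation node server (30) according to the preset word segmentation rule; the word segmentation node server (30) queries the word elements that match the word to be searched and returns the word elements obtained from the query to the client proxy server (20); the client proxy server (20) delivers the word elements obtained from the query to the corresponding user terminal (40).
  2. The multi-node word segmentation system for keyword search according to claim 1, characterized in that the preset word segmentation rule is word element attribute information, and each word segmentation node server (30) sets corresponding word element attribute information;
    the client proxy server (20) distributes each word element to the corresponding word segmentation node server (30) according to the word element attribute information of that word element, and the client proxy server (20) distributes the word to be searched to the corresponding word segmentation node server (30) according to the word element attribute information of the word to be searched.
  3. The multi-node word segmentation system for keyword search according to claim 2, characterized in that the word element attribute information is the first letter of the Chinese pinyin of the word element; or
    the word element attribute information is the first letter of the spelling of the word element in its language.
  4. The multi-node word segmentation system for keyword search according to claim 1, characterized in that the multiple word segmentation node servers (30) are multiple independent physical servers; or
    the multiple word segmentation node servers (30) are multiple independent virtual servers; or
    the multiple word segmentation node servers (30) are a mixture of multiple independent physical servers and multiple independent virtual servers.
  5. The multi-node word segmentation system for keyword search according to claim 1, characterized in that the large number of word elements received by the word element distribution proxy server (10) is vocabulary related to electronic components.
  6. The multi-node word segmentation system for keyword search according to claim 1, characterized in that it further comprises a spare word segmentation node server (301); when a word segmentation node server (30) is damaged, the corresponding spare word segmentation node server (301) is started automatically.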
  7. 一种用于关键词搜索的多节点分词方法,其特征在于,包括:
    A、词库建立过程:词元分发代理服务器(10)接收大量词元,按照预设分词规则将每个所述词元分发至对应的分词节点服务器(30);
    B、待搜索词搜索过程:
    b1、客户端代理服务器(20)接收用户终端(40)上传的待搜索词,按照所述预设分词规则将所述待搜索词分发至对应的分词节点服务器(30);
    b2、所述分词节点服务器(30)查询出与所述待搜索词匹配的词元,并将查询所得词元返回至所述客户端代理服务器(20);
    b3、所述客户端代理服务器(20)将查询所得词元下发至对应的所述用户终端(40)。
  8. 根据权利要求7所述的用于关键词搜索的多节点分词方法,其特征在于,所述预设分词规则为词元属性信息,每个所述分词节点服务器(30)设置对应的词元属性信息;
    所述步骤A中按照预设分词规则将每个所述词元分发至对应的分词节点服务器(30)包括:所述客户端代理服务器(20)根据每个词元的词元属性信息将词元分发至对应的分词节点服务器(30);
    所述步骤b1中按照所述预设分词规则将所述待搜索词分发至对应的分词节点服务器(30)包括:所述客户端代理服务器(20)根据所述待搜索词的词元属性信息将所述待搜索词分发至对应的分词节点服务器(30)。
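  9. The multi-node word segmentation method for keyword search according to claim 8, characterized in that the token attribute information is the first letter of the Hanyu Pinyin corresponding to the token; or
    the token attribute information is the first letter of the spelling of the token in its corresponding language.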
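  10. The multi-node word segmentation method for keyword search according to claim 7, characterized in that the plurality of word segmentation node servers (30) are a plurality of independent physical servers; or
    the plurality of word segmentation node servers (30) are a plurality of independent virtual servers; or
    the plurality of word segmentation node servers (30) are a mixture of a plurality of independent physical servers and a plurality of independent virtual servers.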
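  11. The multi-node word segmentation method for keyword search according to claim 7, characterized in that the large number of tokens received by the token distribution proxy server (10) are terms related to electronic components.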
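  12. The multi-node word segmentation method for keyword search according to claim 7, characterized in that, in step B, if one of the word segmentation node servers (30) fails, its corresponding standby word segmentation node server (301) is started automatically.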
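  13. The multi-node word segmentation method for keyword search according to claim 7, characterized by further comprising:
    C. Lexicon expansion process: the token distribution proxy server (10) receives expansion tokens and distributes the expansion tokens to the corresponding word segmentation node servers (30) according to the preset word segmentation rule.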
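  14. The multi-node word segmentation method for keyword search according to claim 7, characterized by further comprising:
    D. Token deletion process: the token distribution proxy server (10) receives a token to be deleted and distributes the token to be deleted to the corresponding word segmentation node server (30) according to the preset word segmentation rule, and the word segmentation node server (30) deletes the stored token corresponding to the token to be deleted.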
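  15. The multi-node word segmentation method for keyword search according to claim 7, characterized by further comprising:
    E. Lexicon update process: the token distribution proxy server (10) receives a token to be updated and its corresponding updated token, and distributes the token to be updated to the corresponding word segmentation node server (30) according to the preset word segmentation rule; the word segmentation node server (30) holding the token to be updated deletes the token corresponding to the token to be updated; and the token distribution proxy server (10) distributes the updated token to the corresponding word segmentation node server (30) according to the preset word segmentation rule.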
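
The routing idea in the claims above (each token and each search term goes to the node server that owns its first letter) can be illustrated with a short single-process sketch in Python. This is only a sketch under stated assumptions, not the patented implementation: the names `first_letter`, `SegmentationNode` and `Cluster` are hypothetical, tokens are assumed to be already romanised (converting Chinese text to pinyin is out of scope), "matching" is assumed to mean prefix matching, and terms whose first character belongs to no node fall back to node 0. Network communication and standby nodes are not modelled.

```python
def first_letter(term):
    """Routing key: the lower-cased first character of the term's spelling."""
    return term[0].lower()


class SegmentationNode:
    """Stands in for one word segmentation node server (30)."""

    def __init__(self):
        self.tokens = set()

    def add(self, token):
        self.tokens.add(token)

    def delete(self, token):
        self.tokens.discard(token)  # no error if the token is absent

    def query(self, search_term):
        # Assumed matching rule: a token matches if it starts with the term.
        return sorted(t for t in self.tokens if t.startswith(search_term))


class Cluster:
    """Plays both proxy roles: token distribution (10) and client proxy (20)."""

    def __init__(self, letter_groups):
        # One node per group of letters, e.g. ["abcdefghijklm", "nopqrstuvwxyz"].
        self.nodes = [SegmentationNode() for _ in letter_groups]
        self.route = {ch: i for i, group in enumerate(letter_groups) for ch in group}

    def node_for(self, term):
        # The same rule is applied to tokens and to search terms.
        # Assumption: unmapped first characters are sent to node 0.
        return self.nodes[self.route.get(first_letter(term), 0)]

    # Token distribution proxy side: build, expand, delete, update.
    def add_token(self, token):
        self.node_for(token).add(token)

    def delete_token(self, token):
        self.node_for(token).delete(token)

    def update_token(self, old_token, new_token):
        # Delete on the node holding the old token, then route the new one.
        self.delete_token(old_token)
        self.add_token(new_token)

    # Client proxy side: route the search term and return the node's matches.
    def search(self, search_term):
        return self.node_for(search_term).query(search_term)


cluster = Cluster(["abcdefghijklm", "nopqrstuvwxyz"])
for word in ["capacitor", "connector", "relay", "resistor"]:
    cluster.add_token(word)

print(cluster.search("re"))   # ['relay', 'resistor']
cluster.update_token("relay", "mosfet")
print(cluster.search("re"))   # ['resistor']
print(cluster.search("mo"))   # ['mosfet']
```

Because the same first-letter rule routes both tokens and search terms, a query only touches the one node that can hold its matches. An update may move a token between nodes, which is why it is modelled as a delete followed by a fresh distribution.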
PCT/CN2020/090648 2020-05-15 2020-05-15 Multi-node word segmentation system and method for keyword search WO2021227060A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/420,268 US11914658B2 (en) 2020-05-15 2020-05-15 Multi-node word segmentation system and method for keyword search
PCT/CN2020/090648 WO2021227060A1 (zh) 2020-05-15 2020-05-15 Multi-node word segmentation system and method for keyword search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/090648 WO2021227060A1 (zh) 2020-05-15 2020-05-15 Multi-node word segmentation system and method for keyword search

Publications (1)

Publication Number Publication Date
WO2021227060A1 (zh) 2021-11-18

Family

ID=78525977

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/090648 WO2021227060A1 (zh) 2020-05-15 2020-05-15 Multi-node word segmentation system and method for keyword search

Country Status (2)

Country Link
US (1) US11914658B2 (zh)
WO (1) WO2021227060A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11914658B2 (en) * 2020-05-15 2024-02-27 Shenzhen Sekorm Component Network Co., Ltd Multi-node word segmentation system and method for keyword search

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040249795A1 (en) * 2003-06-05 2004-12-09 International Business Machines Corporation Semantics-based searching for information in a distributed data processing system
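CN101021856A (zh) * 2006-10-11 2007-08-22 鲍东山 Distributed voice retrieval system
CN101599886A (zh) * 2008-06-05 2009-12-09 华为技术有限公司 Query method, system and device in a distributed structured network
CN102156749A (zh) * 2011-04-22 2011-08-17 中国测绘科学研究院 Automatic search and identification method and system for map websites, and distributed server system thereof
CN102467570A (zh) * 2010-11-17 2012-05-23 日电(中国)有限公司 Join query system and method for a distributed data warehouse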

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6029195A (en) * 1994-11-29 2000-02-22 Herz; Frederick S. M. System for customized electronic identification of desirable objects
US5758257A (en) * 1994-11-29 1998-05-26 Herz; Frederick System and method for scheduling broadcast of and access to video programs and other data using customer profiles
US7181438B1 (en) * 1999-07-21 2007-02-20 Alberti Anemometer, Llc Database access system
CN102043833B (zh) 2010-11-25 2013-12-25 北京搜狗科技发展有限公司 Search method and search device based on query words
US10360269B2 (en) * 2015-10-23 2019-07-23 Oracle International Corporation Proxy databases
US11914658B2 (en) * 2020-05-15 2024-02-27 Shenzhen Sekorm Component Network Co., Ltd Multi-node word segmentation system and method for keyword search

Also Published As

Publication number Publication date
US11914658B2 (en) 2024-02-27
US20220207097A1 (en) 2022-06-30

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20935124; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20935124; Country of ref document: EP; Kind code of ref document: A1)