WO2017117782A1 - Word segmentation processing method and system for network information - Google Patents
Word segmentation processing method and system for network information
- Publication number
- WO2017117782A1 (PCT/CN2016/070406)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- preliminary
- word segmentation
- person name
- structure list
- name
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Definitions
- the present invention relates to the field of Internet, and in particular, to a word segmentation processing method and system for network information.
- the network consists of nodes and connections, representing many objects and their interconnections.
- a network is a kind of graph that is generally considered to be a weighted graph.
- the network has a specific physical meaning, that is, the network is abstracted from some practical problem of the same type.
- in the field of computers, the network is a virtual platform for transmitting, receiving, and sharing information. It links the information of various points, surfaces, and bodies together so that these resources can be shared.
- the network is the most important invention in the history of human development and has advanced science, technology, and human society.
- the existing word segmentation methods generally process vocabulary by comparison, symbols, or similar means. This works well for ordinary vocabulary, but person names have no distinguishing characteristics, so they are processed inaccurately.
- the application provides a word segmentation processing method for network information that addresses the inaccurate person name recognition of prior-art technical solutions.
- a method for word segmentation processing of network information comprising the following steps:
- each person-name word in the preliminary structure list is extended by the character that follows it to obtain an extended person name. If the extended person name appears in the preliminary structure list, the extended person name is confirmed as the final person name recognition result.
- the method further includes: replacing the person-name word in the preliminary structure list with the final person name recognition result.
- the method further includes: if the extended person name does not appear in the preliminary structure list, the person name in the preliminary structure list is the final person name recognition result.
- a word segmentation processing system for network information comprising:
- a word segmentation unit for performing preliminary word segmentation on network information to obtain preliminary word segmentation results
- a recording unit for recording the result of the preliminary word segmentation process in the preliminary result list
- the verification unit is configured to extend each person-name word in the preliminary structure list by the character that follows it to obtain an extended person name and, if the extended person name appears in the preliminary structure list, to confirm the extended person name as the final person name recognition result.
- the system further includes: an updating unit for replacing the person-name word in the preliminary structure list with the final person name recognition result.
- the verification unit is further configured so that, if the extended person name does not appear in the preliminary structure list, the person name in the preliminary structure list is the final person name recognition result.
- the technical solution provided by the present invention performs preliminary word segmentation on the network information, then extends a specific number of words by the following character and compares them again, thereby achieving the advantage of effectively recognizing person names.
- FIG. 1 is a flowchart of a method for processing word segmentation of network information according to a first preferred embodiment of the present invention
- FIG. 2 is a structural diagram of a word segmentation processing system for network information according to a second preferred embodiment of the present invention.
- FIG. 1 shows a word segmentation processing method for network information according to a first preferred embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
- Step S101: perform preliminary word segmentation on the network information to obtain a preliminary word segmentation result
- the preliminary word segmentation in the above step may be performed in various ways, for example with the Baidu word segmentation method; other prior-art methods may of course also be used.
- Step S102: record the result of the preliminary word segmentation in the preliminary result list
- Step S103: extend each person-name word in the preliminary structure list by the character that follows it to obtain an extended person name. If the extended person name appears in the preliminary structure list, confirm the extended person name as the final person name recognition result.
- the technical solution provided by the present invention performs preliminary word segmentation on the network information, then extends a specific number of words by the following character and compares them again, thereby achieving the advantage of effectively recognizing person names.
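- the foregoing method may further include, after step S103: replacing the person-name word in the preliminary structure list with the final person name recognition result.
- the method may further include, after step S103: if the extended person name does not appear in the preliminary structure list, the person name in the preliminary structure list is the final person name recognition result.
- FIG. 2 shows a word segmentation processing system for network information according to a second preferred embodiment of the present invention.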
- the system includes:
- the word segmentation unit 201 is configured to perform preliminary word segmentation processing on the network information to obtain a preliminary word segmentation result
- the preliminary word segmentation in the word segmentation unit 201 may be performed in various ways, for example with the Baidu word segmentation method; other prior-art methods may of course also be used.
- a recording unit 202 configured to record the result of the preliminary word segmentation process in the preliminary result list
- the verification unit 203 is configured to extend each person-name word in the preliminary structure list by the character that follows it to obtain an extended person name and, if the extended person name appears in the preliminary structure list, to confirm the extended person name as the final person name recognition result.
- the technical solution provided by the present invention performs preliminary word segmentation on the network information, then extends a specific number of words by the following character and compares them again, thereby achieving the advantage of effectively recognizing person names.
- the above system may further include:
- the updating unit 204 is configured to replace the person-name word in the preliminary structure list with the final person name recognition result.
- the verification unit 203 is further configured so that, if the extended person name does not appear in the preliminary structure list, the person name in the preliminary structure list is the final person name recognition result.
- the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.
- ROM Read-Only Memory
- RAM Random Access Memory
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Information Transfer Between Computers (AREA)
- Character Discrimination (AREA)
- Machine Translation (AREA)
Abstract
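A word segmentation processing method and system for network information. The method comprises the following steps: performing preliminary word segmentation on network information to obtain a preliminary word segmentation result (101); recording the result of the preliminary word segmentation in a preliminary result list (102); and extending a person-name word in the preliminary structure list by the character that follows it to obtain an extended person name and, if the extended person name appears in the preliminary structure list, confirming the extended person name as the final person name recognition result (103). The method has the advantage of a good word segmentation effect.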
Description
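The present invention relates to the field of the Internet, and in particular to a word segmentation processing method and system for network information.
A network consists of nodes and connections and represents many objects and their interconnections. Mathematically, a network is a kind of graph, generally taken to mean a weighted graph. Beyond its mathematical definition, a network also has a specific physical meaning: it is a model abstracted from practical problems of the same type. In the field of computers, a network is a virtual platform for transmitting, receiving, and sharing information; it links the information of various points, surfaces, and bodies together so that these resources can be shared. The network is the most important invention in the history of human development and has advanced science, technology, and human society.
Existing word segmentation methods generally process vocabulary by comparison, symbols, or similar means. This works well for ordinary vocabulary, but person names have no distinguishing characteristics, so they are processed inaccurately.
The present application provides a word segmentation processing method for network information that addresses the inaccurate person name recognition of prior-art technical solutions.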
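In a first aspect, a word segmentation processing method for network information is provided, the method comprising the following steps:
performing preliminary word segmentation on the network information to obtain a preliminary word segmentation result;
recording the result of the preliminary word segmentation in a preliminary result list;
extending a person-name word in the preliminary structure list by the character that follows it to obtain an extended person name and, if the extended person name appears in the preliminary structure list, confirming the extended person name as the final person name recognition result.
Optionally, the method further comprises:
replacing the person-name word in the preliminary structure list with the final person name recognition result.
Optionally, the method further comprises:
if the extended person name does not appear in the preliminary structure list, the person name in the preliminary structure list is the final person name recognition result.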
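In a second aspect, a word segmentation processing system for network information is provided, the system comprising:
a word segmentation unit for performing preliminary word segmentation on the network information to obtain a preliminary word segmentation result;
a recording unit for recording the result of the preliminary word segmentation in a preliminary result list;
a verification unit for extending a person-name word in the preliminary structure list by the character that follows it to obtain an extended person name and, if the extended person name appears in the preliminary structure list, confirming the extended person name as the final person name recognition result.
Optionally, the system further comprises:
an updating unit for replacing the person-name word in the preliminary structure list with the final person name recognition result.
Optionally, the verification unit is further configured so that, if the extended person name does not appear in the preliminary structure list, the person name in the preliminary structure list is the final person name recognition result.
The technical solution provided by the present invention performs preliminary word segmentation on the network information, then extends a specific number of words by the following character and compares them again, thereby achieving the advantage of effectively recognizing person names.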
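To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a word segmentation processing method for network information according to a first preferred embodiment of the present invention;
FIG. 2 is a structural diagram of a word segmentation processing system for network information according to a second preferred embodiment of the present invention.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.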
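Referring to FIG. 1, FIG. 1 shows a word segmentation processing method for network information according to a first preferred embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step S101: perform preliminary word segmentation on the network information to obtain a preliminary word segmentation result;
The preliminary word segmentation in the above step may be performed in various ways, for example with the Baidu word segmentation method; other prior-art methods may of course also be used.
Step S102: record the result of the preliminary word segmentation in the preliminary result list;
Step S103: extend a person-name word in the preliminary structure list by the character that follows it to obtain an extended person name and, if the extended person name appears in the preliminary structure list, confirm the extended person name as the final person name recognition result.
The technical solution provided by the present invention performs preliminary word segmentation on the network information, then extends a specific number of words by the following character and compares them again, thereby achieving the advantage of effectively recognizing person names.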
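Step S101 leaves the choice of preliminary segmenter open (the Baidu method or any other prior-art method). As a purely illustrative stand-in, the sketch below uses dictionary-based forward maximum matching, a technique this document does not name; the function name `forward_maximum_match` and the sample lexicon are assumptions made for the example. Its output is then stored as the preliminary result list of step S102. The sketch does not mark person names, since how the preliminary segmenter flags them is not specified here.

```python
from typing import Iterable, List


def forward_maximum_match(text: str, lexicon: Iterable[str]) -> List[str]:
    """Greedy dictionary segmentation: at each position take the longest lexicon word."""
    words = set(lexicon)
    # The longest lexicon entry bounds the window; the window is never below 1.
    max_len = max([len(word) for word in words] + [1])
    segments: List[str] = []
    pos = 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in words:
                segments.append(piece)
                pos += size
                break
    return segments


# Step S101 (stand-in segmenter) followed by step S102 (record the result).
sample_lexicon = {"网络", "信息", "分词"}  # hypothetical lexicon
preliminary_result_list = forward_maximum_match("网络的信息分词", sample_lexicon)
print(preliminary_result_list)  # ['网络', '的', '信息', '分词']
```

Any segmenter that returns an ordered list of words could be substituted here; the later steps only rely on that ordered list.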
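Optionally, after step S103 the above method may further include:
replacing the person-name word in the preliminary structure list with the final person name recognition result.
Optionally, after step S103 the above method may further include:
if the extended person name does not appear in the preliminary structure list, the person name in the preliminary structure list is the final person name recognition result.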
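Step S103 and the two optional steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the `Entry` layout, the function names, and the sample data are hypothetical; the "preliminary structure list" and the "preliminary result list" are assumed to be the same list; and "the character that follows" a name is assumed to be the first character of the next list entry.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """One entry of the preliminary list (hypothetical layout)."""
    word: str
    is_person_name: bool = False


def extended_name(entries: List[Entry], index: int) -> Optional[str]:
    """Append the character that follows entry `index` in the text, if there is one."""
    if index + 1 >= len(entries) or not entries[index + 1].word:
        return None
    return entries[index].word + entries[index + 1].word[0]


def recognize_names(entries: List[Entry]) -> List[Entry]:
    """Return a new list in which every person-name entry holds its final name."""
    recorded_words = {entry.word for entry in entries}  # words of the preliminary list
    updated: List[Entry] = []
    for index, entry in enumerate(entries):
        final = entry.word  # optional fallback: keep the preliminary name
        if entry.is_person_name:
            candidate = extended_name(entries, index)
            if candidate is not None and candidate in recorded_words:
                final = candidate  # step S103: the extended name is confirmed
        # Optional update step: the name entry is replaced by the final result.
        # Neighbouring entries are left untouched; the text above does not say
        # whether the absorbed character should be removed from the next entry.
        updated.append(Entry(final, entry.is_person_name))
    return updated


preliminary = [
    Entry("张小明", True), Entry("说"),
    Entry("张小", True), Entry("明"), Entry("很"), Entry("忙"),
]
final_entries = recognize_names(preliminary)
print([e.word for e in final_entries if e.is_person_name])  # ['张小明', '张小明']
```

In the sample, the truncated name "张小" is extended to "张小明", which already occurs in the list and is therefore confirmed; the first entry's extension "张小明说" does not occur, so that entry keeps its preliminary name.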
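Referring to FIG. 2, FIG. 2 shows a word segmentation processing system for network information according to a second preferred embodiment of the present invention. The system includes:
a word segmentation unit 201 for performing preliminary word segmentation on the network information to obtain a preliminary word segmentation result;
The preliminary word segmentation in the word segmentation unit 201 may be performed in various ways, for example with the Baidu word segmentation method; other prior-art methods may of course also be used.
a recording unit 202 for recording the result of the preliminary word segmentation in the preliminary result list;
a verification unit 203 for extending a person-name word in the preliminary structure list by the character that follows it to obtain an extended person name and, if the extended person name appears in the preliminary structure list, confirming the extended person name as the final person name recognition result.
The technical solution provided by the present invention performs preliminary word segmentation on the network information, then extends a specific number of words by the following character and compares them again, thereby achieving the advantage of effectively recognizing person names.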
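Optionally, the above system may further include:
an updating unit 204 for replacing the person-name word in the preliminary structure list with the final person name recognition result.
Optionally, the verification unit 203 is further configured so that, if the extended person name does not appear in the preliminary structure list, the person name in the preliminary structure list is the final person name recognition result.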
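One way to picture how units 201 to 204 might fit together is the sketch below. The class names mirror the units described above, but every method name, the `(word, is_person_name)` entry layout, and the stub segmenter are assumptions made for illustration; the preliminary segmenter is injected as a callable because the text leaves its choice open.

```python
from typing import Callable, Dict, List, Tuple

Entry = Tuple[str, bool]  # (word, is_person_name); hypothetical layout


class SegmentationUnit:
    """Unit 201: delegates to any preliminary segmenter supplied by the caller."""

    def __init__(self, segmenter: Callable[[str], List[Entry]]) -> None:
        self._segmenter = segmenter

    def segment(self, text: str) -> List[Entry]:
        return self._segmenter(text)


class RecordingUnit:
    """Unit 202: keeps the preliminary result list."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def record(self, entries: List[Entry]) -> None:
        self.entries = list(entries)


class VerificationUnit:
    """Unit 203: extends each name by the following character and looks it up."""

    def verify(self, entries: List[Entry]) -> Dict[int, str]:
        recorded = {word for word, _ in entries}
        final: Dict[int, str] = {}
        for i, (word, is_name) in enumerate(entries):
            if not is_name:
                continue
            following = entries[i + 1][0][:1] if i + 1 < len(entries) else ""
            candidate = word + following
            # Confirm the extended name if it is recorded; otherwise keep the original.
            final[i] = candidate if following and candidate in recorded else word
        return final


class UpdatingUnit:
    """Unit 204: writes the final names back into the list."""

    def update(self, entries: List[Entry], final: Dict[int, str]) -> List[Entry]:
        return [(final.get(i, word), is_name) for i, (word, is_name) in enumerate(entries)]


def stub_segmenter(text: str) -> List[Entry]:
    # Hypothetical fixed output standing in for a real segmenter; `text` is ignored.
    return [("李华东", True), ("到", False), ("了", False),
            ("李华", True), ("东", False), ("走", False), ("了", False)]


recorder = RecordingUnit()
recorder.record(SegmentationUnit(stub_segmenter).segment("李华东到了李华东走了"))
final_names = VerificationUnit().verify(recorder.entries)
updated_list = UpdatingUnit().update(recorder.entries, final_names)
print(final_names)  # {0: '李华东', 3: '李华东'}
```

Keeping the verification result separate from the update makes the updating unit optional, in line with the text above.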
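It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. A person skilled in the art should understand, however, that the present invention is not limited by the order of actions described, because according to the present invention some steps may be performed in another order or simultaneously. A person skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
A person of ordinary skill in the art can understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.
The content download method and the related devices and systems provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.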
Claims (6)
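- A word segmentation processing method for network information, characterized in that the method comprises the following steps: performing preliminary word segmentation on network information to obtain a preliminary word segmentation result; recording the result of the preliminary word segmentation in a preliminary result list; and extending a person-name word in the preliminary structure list by the character that follows it to obtain an extended person name and, if the extended person name appears in the preliminary structure list, confirming the extended person name as the final person name recognition result.
- The method according to claim 1, characterized in that the method further comprises: replacing the person-name word in the preliminary structure list with the final person name recognition result.
- The method according to claim 1, characterized in that the method further comprises: if the extended person name does not appear in the preliminary structure list, the person name in the preliminary structure list is the final person name recognition result.
- A word segmentation processing system for network information, characterized in that the system comprises: a word segmentation unit for performing preliminary word segmentation on network information to obtain a preliminary word segmentation result; a recording unit for recording the result of the preliminary word segmentation in a preliminary result list; and a verification unit for extending a person-name word in the preliminary structure list by the character that follows it to obtain an extended person name and, if the extended person name appears in the preliminary structure list, confirming the extended person name as the final person name recognition result.
- The system according to claim 4, characterized in that the system further comprises: an updating unit for replacing the person-name word in the preliminary structure list with the final person name recognition result.
- The system according to claim 4, characterized in that the verification unit is further configured so that, if the extended person name does not appear in the preliminary structure list, the person name in the preliminary structure list is the final person name recognition result.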
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
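CN201680000011.7A CN105723361A (zh) | 2016-01-07 | 2016-01-07 | Word segmentation processing method and system for network information |
PCT/CN2016/070406 WO2017117782A1 (zh) | 2016-01-07 | 2016-01-07 | Word segmentation processing method and system for network information |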
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2016/070406 WO2017117782A1 (zh) | 2016-01-07 | 2016-01-07 | Word segmentation processing method and system for network information |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017117782A1 (zh) | 2017-07-13 |
Family
ID=56162514
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2016/070406 WO2017117782A1 (zh) | Word segmentation processing method and system for network information | 2016-01-07 | 2016-01-07 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN105723361A (zh) |
WO (1) | WO2017117782A1 (zh) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070021956A1 (en) * | 2005-07-19 | 2007-01-25 | Yan Qu | Method and apparatus for generating ideographic representations of letter based names |
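CN101950284A (zh) * | 2010-09-27 | 2011-01-19 | 北京新媒传信科技有限公司 | Chinese word segmentation method and system |
CN102033879A (zh) * | 2009-09-27 | 2011-04-27 | 腾讯科技(深圳)有限公司 | Method and apparatus for recognizing Chinese person names |
CN104182423A (zh) * | 2013-05-27 | 2014-12-03 | 华东师范大学 | Automatic Chinese person name recognition method based on conditional random fields |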
-
2016
- 2016-01-07 CN CN201680000011.7A patent/CN105723361A/zh active Pending
- 2016-01-07 WO PCT/CN2016/070406 patent/WO2017117782A1/zh active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070021956A1 (en) * | 2005-07-19 | 2007-01-25 | Yan Qu | Method and apparatus for generating ideographic representations of letter based names |
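CN102033879A (zh) * | 2009-09-27 | 2011-04-27 | 腾讯科技(深圳)有限公司 | Method and apparatus for recognizing Chinese person names |
CN101950284A (zh) * | 2010-09-27 | 2011-01-19 | 北京新媒传信科技有限公司 | Chinese word segmentation method and system |
CN104182423A (zh) * | 2013-05-27 | 2014-12-03 | 华东师范大学 | Automatic Chinese person name recognition method based on conditional random fields |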
Also Published As
Publication number | Publication date |
---|---|
CN105723361A (zh) | 2016-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
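WO2017128362A1 (zh) | Big-data-based search method and system | |
CN104572243B (zh) | Method and system for sharing a Java virtual machine | |
WO2017161578A1 (zh) | Data crawling method and system | |
WO2023165538A1 (zh) | Speech recognition method and apparatus, computer-readable medium, and electronic device | |
WO2017117806A1 (zh) | Word search method and system for network information | |
WO2021184765A1 (zh) | Rule processing method and apparatus, medium, and electronic device | |
CN114519306A (zh) | Decentralized terminal node network model training method and system | |
WO2017117782A1 (zh) | Word segmentation processing method and system for network information | |
WO2017117783A1 (zh) | Search method and system for network information | |
WO2017173633A1 (zh) | Intelligent reply method and system for educational projects | |
WO2017120739A1 (zh) | Catering review analysis method and system | |
WO2017128357A1 (zh) | Big-data-based web page crawling method and system | |
WO2017117781A1 (zh) | Classification method and system for network information | |
WO2017128351A1 (zh) | Method and system for evaluating agents on a real estate website | |
WO2017173653A1 (zh) | Internet-based method and system for answering education questions | |
WO2017117716A1 (zh) | Outdoor positioning management method and system for a smart city | |
WO2017117785A1 (zh) | Search method and system for network links | |
WO2017128438A1 (zh) | Big data application method and system | |
WO2017117805A1 (zh) | Crawling method and system for network information | |
WO2017128440A1 (zh) | Big data monitoring and reminder method and system | |
US12106038B1 (en) | System and method for text-to-text transformation of qualitative responses | |
WO2017161576A1 (zh) | Data early-warning method and system | |
WO2018027576A1 (zh) | Method and system for working-hours statistics in the Internet of Things | |
WO2018027572A1 (zh) | Method and system for controlling robot battery power in the Internet of Things | |
WO2017128365A1 (zh) | Big-data-based automated information analysis method and system |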
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 16882927 Country of ref document: EP Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 16882927 Country of ref document: EP Kind code of ref document: A1 |