WO2020082890A1 - Text restoration method, apparatus, and electronic device - Google Patents

Text restoration method, apparatus, and electronic device

Info

Publication number
WO2020082890A1
Authority
WO
WIPO (PCT)
Prior art keywords
word segmentation
text
matched
texts
characters
Prior art date
Application number
PCT/CN2019/103103
Other languages
English (en)
French (fr)
Inventor
周书恒
刘金星
祝慧佳
赵智源
郭亚
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020082890A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation

Definitions

  • The embodiments of the present application relate to the technical field of network security, and in particular to a text restoration method, apparatus, and electronic device.
  • To bypass a platform's prevention and control measures, gray- and black-market operators spread spam written with split characters (拆字), in which a Chinese character is replaced by its component parts.
  • For example, the normal content "我是闪电借款,可以强开借呗5000-10000w" ("I am Lightning Loans and can force-open Jiebei for 5000-10000w") is written with split characters as "我是闪电亻昔款,可以弓虽开亻昔呗5000-10000w", where 借 is split into 亻 and 昔, and 强 into 弓 and 虽.
  • The purpose of the embodiments of the present application is to provide a text restoration method, apparatus, and electronic device that can restore mutated text written with split characters back to normal text.
  • A text restoration method, including:
  • obtaining a target text; performing word segmentation on the target text to obtain a segmented text that contains characters that cannot form a word; matching those characters against a split-character sample set to obtain at least one matched segmented text; inputting the at least one matched segmented text into a preset language model to obtain its confidence; and, based on the confidence, selecting the restored text of the target text from the at least one matched segmented text.
  • A text restoration device, including:
  • an obtaining module, which obtains the target text;
  • a word segmentation module, which performs word segmentation on the target text to obtain a segmented text, where the segmented text contains characters that cannot form a word;
  • a matching module, which matches, based on the split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
  • an evaluation module, which inputs the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
  • a selection module, which selects the restored text of the target text from the at least one matched segmented text based on that confidence.
  • An electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, performs the steps of the text restoration method above, ending with selecting the restored text of the target text from the at least one matched segmented text.
  • A computer-readable storage medium storing a computer program that, when executed by a processor, implements the same steps, ending with selecting the restored text of the target text from the at least one matched segmented text.
  • The embodiments of the present application first perform word segmentation on the target text to determine the characters that cannot form a word. These characters are the objects of split-character matching and are matched and restored, yielding at least one matched segmented text. The confidence of each matched segmented text is then evaluated with a preset language model, and the best matched segmented text is selected as the restored text of the target text based on the confidence.
  • The solution of the embodiments of the present application can effectively restore mutated text written with split characters to normal text, and can improve a network platform's ability to recognize spam.
  • FIG. 1 is a schematic diagram of the steps of a text restoration method provided by an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of the text restoration method provided by an embodiment of the present application in a practical application;
  • FIG. 3 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of the logical structure of a text restoration device provided by an embodiment of the present application.
  • As noted above, gray- and black-market operators send spam written with split characters to bypass the supervision of network platforms.
  • The present application therefore aims to provide a technical solution that restores mutated text written with split characters back to normal text, which can improve a network platform's ability to recognize spam.
  • FIG. 1 is a flowchart of a text restoration method according to an embodiment of the present application.
  • The text restoration method of FIG. 1 may be executed by a text restoration device.
  • The method includes:
  • Step S102: obtain the target text.
  • The embodiments of the present application do not specifically limit the source of the target text.
  • For example, the target text may be text information sent by a user and obtained from an online social platform.
  • For instance, reviews and chat messages sent by users may be obtained from an online shopping platform.
  • Step S104: perform word segmentation on the target text to obtain the segmented text of the target text.
  • The segmented text contains characters that cannot form a word.
  • Any existing word segmentation method may be used to segment the target text and thereby determine the characters in the target text that cannot form a word.
  • The characters determined to be unable to form a word may include any of: Chinese characters, radicals of Chinese characters, and roots of Chinese characters. Such characters are very likely the result of character splitting and are the key objects of the subsequent split-character recognition.
  • Step S106: based on the split-character sample set, match the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text.
  • The split-character sample set contains preset split-character expressions. At the word level, for example, "花口贝" corresponds to "花呗", "借口贝" to "借呗", "亻昔款" to "借款" (loan), and "亻昔钱" to "借钱" (borrow money).
  • An expression may also target a single Chinese character, such as "亻昔" corresponding to "借" and "口贝" corresponding to "呗".
  • Using the split-character sample set, the characters in the segmented text that cannot form a word can be matched and restored to normally expressed information.
  • Specifically, characters that are adjacent in the row direction of the segmented text and cannot form a word can be matched.
  • For example, the segmented text is "六合采彡月贝兼百万¥",
  • the split-character sample set records that "采彡" corresponds to "彩",
  • and that "贝兼" corresponds to "赚".
  • "采", "彡", "月", "贝", "兼", and "¥" are the characters in the segmented text that cannot be identified as words.
  • Matching the adjacent "采", "彡", "月", "贝", and "兼" against the split-character sample set yields the matched segmented text "六合彩月赚百万".
  • Similarly, characters that are adjacent in the column direction and cannot form a word can be matched. For example, the segmented text is "加手机号xx,可低自套现" with "心" on the next line, below "自".
  • The vertically adjacent "自" and "心" can then be matched against the split-character sample set, and the resulting matched segmented text is "加手机号xx,可低息套现" ("add mobile number xx for low-interest cash-out").
  • Step S108: input the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text.
  • A matched segmented text determined from the split-character sample set is not necessarily the correct restoration, so its confidence needs to be evaluated with a preset language model.
  • The confidence of a matched segmented text reflects its restoration accuracy.
  • The preset language model is configured flexibly according to the actual application scenario, which the embodiments of the present application do not specifically limit.
  • For example, suppose the solution is used to restore spam written with split characters. The preset language model can then be trained on a spam sample set. After the at least one matched segmented text is input, the model scores its confidence against spam evaluation criteria. The higher the confidence score of a matched segmented text, the more likely it is spam, and the higher the corresponding restoration accuracy.
  • Alternatively, the preset language model uses well-formed sentence expression as the evaluation criterion for scoring confidence, for example the correct "subject, predicate, object" sentence structure. The higher the confidence score of a matched segmented text, the higher the corresponding restoration accuracy.
  • Step S110: based on the confidence of the at least one matched segmented text, select the restored text of the target text from the at least one matched segmented text.
  • The matched segmented text with the highest confidence may be selected as the restored text of the target text.
  • In the embodiments of the present application, word segmentation is first performed on the target text to determine the characters that cannot form a word. These characters are the objects of split-character matching and are matched and restored, yielding at least one matched segmented text. The confidence of each matched segmented text is then evaluated with a preset language model, and the best one is selected as the restored text of the target text based on the confidence.
  • The solution of the embodiments of the present application can effectively restore mutated text written with split characters to normal text, and can improve a network platform's ability to recognize spam.
  • Step 1: obtain the target text.
  • The target text sent by a user can be obtained from an online social platform (such as messaging software or online shopping software).
  • Suppose the target text is "需要亻昔钱,力口我手机号"; it is clearly spam written with split characters.
  • Step 2: determine the segmented text.
  • Step 3: split-character matching.
  • A split-character table is used to match the segmented text: "亻 昔" can be matched as "借", "力 口" as "加", and "口 我" as "哦".
  • The resulting matched segmented texts are of two kinds:
  • the first is "需要借钱,加我手机号" ("need to borrow money, add my mobile number");
  • the second is "需要借钱,力哦手机号", which is not a well-formed sentence.
  • Step 4: confidence evaluation.
  • The two matched segmented texts from step 3 are input into the preset language model to calculate the confidence P1 of "需要借钱,加我手机号" and the confidence P2 of "需要借钱,力哦手机号".
  • The preset language model may be a classification model trained on spam samples of illegal lending.
  • For example, features common to illegal lending can serve as the input vector of the preset language model, and the model can be trained on spam samples so as to keep optimizing the weights of the input vector.
  • The embodiments of the present application do not specifically limit the function used by the preset language model; any function used for classification is applicable.
  • Step 5: compare the confidences P1 and P2; the candidate with the larger confidence is more likely to be the correct restoration.
  • Step 6: output the restored text.
  • Based on the comparison result of step 5 (P1 > P2), the restored text finally output is "需要借钱,加我手机号".
  • In summary, the text restoration method can recognize the characters of the target text that are written in split form and restore them by matching.
  • The target text is segmented first, so that only the characters that cannot serve as words become objects of split-character matching, which reduces the number of matches and improves matching accuracy.
  • A language model is then used to select the best matched segmented text as the restored text of the target text.
  • The whole scheme is computationally simple and uses relatively few processing resources, so it is especially suitable for network platforms that need to identify spam written with split characters.
  • At the hardware level, the electronic device includes a processor and optionally also an internal bus, a network interface, and a memory.
  • The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage.
  • The electronic device may also include hardware required by other services.
  • The processor, network interface, and memory can be interconnected through an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like.
  • The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one bidirectional arrow is used in FIG. 3, but this does not mean that there is only one bus or one type of bus.
  • The memory stores the program. The program may include program code, and the program code includes computer operation instructions.
  • The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
  • The processor reads the corresponding computer program from the non-volatile memory into internal memory and runs it, forming the text restoration device at the logical level.
  • The processor executes the program stored in the memory, specifically to perform the operations of the text restoration method: obtaining a target text, segmenting it, matching the characters that cannot form a word against the split-character sample set, scoring the resulting matched segmented texts with the preset language model, and selecting the restored text of the target text from them based on the confidence.
  • The text restoration method disclosed in the embodiment shown in FIG. 1 may be applied to a processor or implemented by a processor.
  • The processor may be an integrated circuit chip with signal processing capability.
  • In implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software.
  • The processor may be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • Such a processor can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • The general-purpose processor may be a microprocessor or any conventional processor.
  • The steps of the method disclosed in the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor.
  • The software module may reside in a storage medium that is mature in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register.
  • The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • The electronic device can also perform the method shown in FIG. 1 and implement the functions of the text restoration device in the embodiments shown in FIG. 1 and FIG. 2, which are not repeated here.
  • Besides a software implementation, the electronic device of the present application does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution body of the processing flow is not limited to logical units and may also be hardware or a logic device.
  • The embodiments of the present application also provide a computer-readable storage medium that stores one or more programs containing instructions. When executed by a portable electronic device that includes multiple application programs, the instructions cause the device to execute the method of the embodiment shown in FIG. 1, ending with selecting the restored text of the target text from the at least one matched segmented text.
  • FIG. 4 is a schematic structural diagram of a text restoration device 400 according to an embodiment of the present application, including:
  • an obtaining module 410, which obtains the target text;
  • a word segmentation module 420, which performs word segmentation on the target text to obtain a segmented text, where the segmented text contains characters that cannot form a word;
  • a matching module 430, which matches, based on the split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
  • an evaluation module 440, which inputs the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
  • a selection module 450, which selects the restored text of the target text from the at least one matched segmented text based on that confidence.
  • Word segmentation is first performed on the target text to determine the characters that cannot form a word. These characters are the objects of split-character matching and are matched and restored, yielding at least one matched segmented text. The confidence of each matched segmented text is then evaluated with a preset language model, and the best one is selected as the restored text of the target text based on the confidence.
  • The solution of the embodiments of the present application can effectively restore mutated text written with split characters to normal text, and can improve a network platform's ability to recognize spam.
  • Optionally, the matching module 430 is specifically configured to match the characters in the segmented text that are adjacent in the row direction and cannot form a word.
  • Optionally, the matching module 430 is specifically configured to match the characters in the segmented text that are adjacent in the column direction and cannot form a word.
  • Optionally, the selection module 450 is specifically configured to select the matched segmented text with the highest confidence as the restored text of the target text.
  • The characters in the segmented text that cannot form a word include any of: Chinese characters, radicals of Chinese characters, and roots of Chinese characters.
  • The preset language model is trained on a spam sample set.
  • Optionally, the obtaining module 410 is specifically configured to obtain, from an online social platform, the target text sent by a user.
  • The text restoration device of the embodiments of the present application may perform the method of FIG. 1 and implement the functions of that method in the embodiments shown in FIG. 1 and FIG. 2; details are not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

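The embodiments of the present application relate to a text restoration method, apparatus, and electronic device. The text restoration method includes: obtaining a target text; performing word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word; matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text; inputting the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text; and selecting, based on that confidence, the restored text of the target text from the at least one matched segmented text.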

Description

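Text restoration method, apparatus, and electronic device
Technical Field
The embodiments of the present application relate to the technical field of network security, and in particular to a text restoration method, apparatus, and electronic device.
Background
With the rise of the Internet, the convenience of information transfer has made the volume of Internet information grow geometrically. Users often receive spam sent by gray- and black-market operators, such as sales pitches, fraud messages, and illegal promotional messages. Such spam can generally be intercepted by the network platform. However, to bypass the platforms' various prevention and control measures, these operators now spread spam written with split characters (拆字). For example, the normal content "我是闪电借款,可以强开借呗5000-10000w" is written with split characters as "我是闪电亻昔款,可以弓虽开亻昔呗5000-10000w", where 借 is split into 亻 and 昔, and 强 into 弓 and 虽.
In view of this, the technical problem to be solved by the present application is how to restore mutated text written with split characters back to normal text, so as to improve a network platform's ability to recognize spam.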
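Summary
The purpose of the embodiments of the present application is to provide a text restoration method, apparatus, and electronic device that can restore mutated text written with split characters back to normal text.
To achieve this purpose, the embodiments of the present application are implemented as follows.
In a first aspect, a text restoration method is provided, including:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;
matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.
In a second aspect, a text restoration device is provided, including:
an obtaining module, which obtains a target text;
a word segmentation module, which performs word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;
a matching module, which matches, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
an evaluation module, which inputs the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
a selection module, which selects, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.
In a third aspect, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, performs:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;
matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.
In a fourth aspect, a computer-readable storage medium is provided that stores a computer program, where the computer program, when executed by a processor, implements the following steps:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;
matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.
As the technical solutions above show, the embodiments of the present application first perform word segmentation on the target text to determine the characters that cannot form a word. These characters are the objects of split-character matching and are matched and restored, yielding at least one matched segmented text. The confidence of each matched segmented text is then evaluated with a preset language model, and the best matched segmented text is selected as the restored text of the target text based on the confidence. The solution can effectively restore mutated text written with split characters to normal text and can improve a network platform's ability to recognize spam.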
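Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below are merely some of the embodiments recorded in the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the steps of a text restoration method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the text restoration method provided by an embodiment of the present application in a practical application;
FIG. 3 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the logical structure of a text restoration device provided by an embodiment of the present application.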
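Detailed Description
To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present application.
As noted above, gray- and black-market operators send spam written with split characters to bypass the supervision of network platforms. The present application therefore aims to provide a technical solution that restores mutated text written with split characters back to normal text, which can improve a network platform's ability to recognize spam.
FIG. 1 is a flowchart of a text restoration method according to an embodiment of the present application. The method of FIG. 1 may be executed by a text restoration device. The method includes:
Step S102: obtain the target text.
Regarding step S102:
The embodiments of the present application do not specifically limit the source of the target text.
For example, the target text may be text information sent by a user and obtained from an online social platform.
For instance, reviews and chat messages sent by users may be obtained from an online shopping platform.
It should be understood that any information object a network platform needs to supervise can serve as the target text.
Step S104: perform word segmentation on the target text to obtain the segmented text of the target text, where the segmented text contains characters that cannot form a word.
Regarding step S104:
This embodiment may use any existing word segmentation method to segment the target text and thereby determine the characters in the target text that cannot form a word.
For example, the characters determined to be unable to form a word may include any of: Chinese characters, radicals of Chinese characters, and roots of Chinese characters. Such characters are very likely the result of character splitting and are the key objects of the subsequent split-character recognition.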
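Step S104 leaves the choice of segmenter open. As a minimal sketch, assuming a toy vocabulary and forward maximum matching (both illustrative choices, not taken from the application; `VOCAB` and `segment` are hypothetical names), the characters that cannot form a word are simply the pieces that match no vocabulary entry:

```python
# Hypothetical toy vocabulary; a real system would use a full segmenter lexicon.
VOCAB = {"需要", "我", "手机号"}


def segment(text, vocab, max_len=4):
    """Forward maximum matching. Returns a list of (token, is_word) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a vocabulary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append((piece, True))
                i += size
                break
        else:
            # No vocabulary entry starts here: the character stands alone.
            tokens.append((text[i], False))
            i += 1
    return tokens


tokens = segment("需要亻昔钱,力口我手机号", VOCAB)
words = [t for t, is_word in tokens if is_word]
leftover = [t for t, is_word in tokens if not is_word]
```

With this toy vocabulary, `leftover` holds 亻, 昔, 钱, the comma, 力, and 口; only these positions would be passed on to split-character matching in step S106.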
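Step S106: based on the split-character sample set, match the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text.
Regarding step S106:
The split-character sample set contains preset split-character expressions. These may be expressions for whole words, for example "花口贝" corresponding to "花呗", "借口贝" to "借呗", "亻昔款" to "借款" (loan), and "亻昔钱" to "借钱" (borrow money), or expressions for a single Chinese character, such as "亻昔" corresponding to "借" and "口贝" to "呗".
In this step, the split-character sample set is used to match the characters in the segmented text that cannot form a word and restore them to normally expressed information.
Specifically, characters that are adjacent in the row direction of the segmented text and cannot form a word can be matched.
For example, the segmented text is "六合采彡月贝兼百万¥", and the split-character sample set records that "采彡" corresponds to "彩" and "贝兼" corresponds to "赚". Here "采", "彡", "月", "贝", "兼", and "¥" are the characters in the segmented text that cannot be identified as words. Matching the adjacent "采", "彡", "月", "贝", and "兼" against the split-character sample set yields the matched segmented text "六合彩月赚百万".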
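Row-direction matching can be sketched as a greedy longest-match lookup over adjacent characters. The table `SPLIT_TABLE` and the function `restore_row` are hypothetical names for illustration, and for brevity the sketch scans the whole string rather than only the characters left over by segmentation:

```python
# Hypothetical split-character sample set: split form -> restored character.
SPLIT_TABLE = {"采彡": "彩", "贝兼": "赚"}


def restore_row(text, table):
    """Replace adjacent split components with the character they form."""
    keys = sorted(table, key=len, reverse=True)  # prefer longer entries
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No entry starts at this position; keep the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


restored = restore_row("六合采彡月贝兼百万¥", SPLIT_TABLE)
```

In this sketch the trailing "¥" has no table entry and is left in place, whereas the example output in the text omits it.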
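Similarly, characters that are adjacent in the column direction of the segmented text and cannot form a word can also be matched.
For example, the segmented text is "加手机号xx,可低自套现", with "心" on the next line below "自".
The vertically adjacent "自" and "心" can then be matched against the split-character sample set, and the resulting matched segmented text is "加手机号xx,可低息套现" ("add mobile number xx for low-interest cash-out").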
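Column-direction matching can be sketched by treating the text as a grid of lines and looking up vertically stacked pairs. This assumes the lower component sits at the same character index as the upper one; `VERTICAL_TABLE` and `restore_columns` are hypothetical names:

```python
# Hypothetical vertical split table: (upper component, lower component) -> character.
VERTICAL_TABLE = {("自", "心"): "息"}


def restore_columns(lines, table):
    """Merge vertically adjacent components that the table knows about."""
    rows = [list(line) for line in lines]
    for r in range(len(rows) - 1):
        width = min(len(rows[r]), len(rows[r + 1]))
        for c in range(width):
            pair = (rows[r][c], rows[r + 1][c])
            if pair in table:
                rows[r][c] = table[pair]
                rows[r + 1][c] = " "  # the lower component is consumed
    merged = ["".join(row).rstrip() for row in rows]
    return [line for line in merged if line.strip()]


top = "加手机号xx,可低自套现"
bottom = " " * top.index("自") + "心"  # place 心 directly under 自
result = restore_columns([top, bottom], VERTICAL_TABLE)
```

Aligning by character index is a simplification; rendered text with mixed-width characters would need a layout-aware alignment step.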
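Step S108: input the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text.
Regarding step S108:
It should be understood that a matched segmented text determined from the split-character sample set is not necessarily the correct restoration, so its confidence needs to be evaluated with a preset language model. The confidence of a matched segmented text reflects its restoration accuracy.
It should be understood that the preset language model is configured flexibly according to the actual application scenario, which the embodiments of the present application do not specifically limit.
For example, suppose the solution is used to restore spam written with split characters on a network. The preset language model can then be trained on a spam sample set. After the at least one matched segmented text is input into the preset language model, the model scores its confidence against spam evaluation criteria. The higher the confidence score of a matched segmented text, the more likely it is spam, and the higher the corresponding restoration accuracy.
Alternatively, the preset language model uses well-formed sentence expression as the evaluation criterion for scoring the confidence of the at least one matched segmented text, for example the correct "subject, predicate, object" sentence structure. The higher the confidence score of a matched segmented text, the higher the corresponding restoration accuracy.
Because the preset language model can be implemented in more than one way, further examples are not given here.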
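One possible stand-in for the preset language model, offered only as an illustration, is a character bigram model with add-one smoothing trained on a handful of sample sentences. The class name, the tiny `CORPUS`, and the use of average log-probability as "confidence" are all assumptions; the application does not prescribe a model type:

```python
import math
from collections import Counter


class CharBigramModel:
    """Toy character bigram model with add-one smoothing."""

    def __init__(self, corpus):
        self.context = Counter()  # counts of each character as a left context
        self.bigrams = Counter()  # counts of adjacent character pairs
        vocab = set()
        for sentence in corpus:
            chars = ["<s>"] + list(sentence)
            vocab.update(chars)
            self.context.update(chars[:-1])
            self.bigrams.update(zip(chars, chars[1:]))
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen characters

    def confidence(self, text):
        """Average log-probability per character; higher means more plausible."""
        chars = ["<s>"] + list(text)
        log_prob = 0.0
        for prev, cur in zip(chars, chars[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.context[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return log_prob / max(len(text), 1)


# Hypothetical sample sentences standing in for a spam sample set.
CORPUS = ["需要借钱,加我手机号", "低息借款,加我微信", "加我手机号了解详情"]
model = CharBigramModel(CORPUS)
p1 = model.confidence("需要借钱,加我手机号")
p2 = model.confidence("需要借钱,力哦手机号")
```

Because the sequence "加我手机号" occurs in the sample sentences while "力哦" does not, `p1` comes out larger than `p2`, which is the ordering the selection step relies on.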
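Step S110: based on the confidence of the at least one matched segmented text, select the restored text of the target text from the at least one matched segmented text.
Regarding step S110:
In this step, the matched segmented text with the highest confidence may be selected as the restored text of the target text.
In the embodiments of the present application, word segmentation is first performed on the target text to determine the characters that cannot form a word. These characters are the objects of split-character matching and are matched and restored, yielding at least one matched segmented text. The confidence of each matched segmented text is then evaluated with a preset language model, and the best one is selected as the restored text of the target text based on the confidence. The solution can effectively restore mutated text written with split characters to normal text and can improve a network platform's ability to recognize spam.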
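The flow of the text restoration method in a practical application is described in detail below.
The main flow of the text restoration method of the embodiments of the present application includes:
Step 1: obtain the target text.
In this step, the target text sent by a user can be obtained from an online social platform (such as messaging software or online shopping software).
For example, suppose the content of the target text is "需要亻昔钱,力口我手机号". This target text is clearly spam written with split characters.
Step 2: determine the segmented text.
In this step, word segmentation is performed on "需要亻昔钱,力口我手机号". For ease of understanding, segments are separated by spaces, giving the segmented text "需要 亻 昔 钱,力 口 我 手机号".
It should be understood that in this target text "需要", "我", and "手机号" can be identified as words, while "亻", "昔", "钱", "力", and "口" are characters that cannot serve as words.
Step 3: split-character matching.
In this step, a split-character table is used to match the segmented text: "亻 昔" can be matched as "借", "力 口" as "加", and "口 我" as "哦". Based on the split-character table, the resulting matched segmented texts are of two kinds:
the first is "需要借钱,加我手机号" ("need to borrow money, add my mobile number");
the second is "需要借钱,力哦手机号", which is not a well-formed sentence.
Step 4: confidence evaluation.
In this step, the two matched segmented texts from step 3 are input into the preset language model to calculate the confidence P1 of "需要借钱,加我手机号" and the confidence P2 of "需要借钱,力哦手机号".
The preset language model may be a classification model trained on spam samples of illegal lending.
For example, features common to illegal lending can serve as the input vector of the preset language model, and the model can be trained on spam samples so as to keep optimizing the weights of the input vector.
After "需要借钱,加我手机号" and "需要借钱,力哦手机号" are input into the trained preset language model, the former clearly contains "加我手机号", a common feature of illegal lending, so the classification model gives it a higher confidence.
Note that the embodiments of the present application do not specifically limit the function used by the preset language model. Any function used for classification is applicable to the preset language model.
Step 5: probability comparison.
In this step, the confidence of the first matched segmented text is compared with that of the second (P1 > P2). The one with the larger confidence is more likely to be the correct restored text.
Step 6: output the restored text.
In this step, based on the comparison result of step 5 (P1 > P2), the restored text finally output is "需要借钱,加我手机号".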
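Steps 3 to 6 can be sketched end to end as follows. Several choices here are assumptions made for illustration: the table `SPLIT_TABLE`, the rule of keeping only candidates that apply the most replacements (which yields exactly the two candidates discussed above), and the feature-counting `confidence` function that stands in for the trained classification model:

```python
# Hypothetical split-character table with the overlapping entries from the example.
SPLIT_TABLE = {"亻昔": "借", "力口": "加", "口我": "哦"}
# Hypothetical features standing in for what a trained spam classifier has learned.
SPAM_FEATURES = ["借钱", "加我手机号"]


def enumerate_candidates(text, table):
    """Enumerate restorations; overlapping table matches create separate branches."""
    results = []

    def walk(i, parts, used):
        if i == len(text):
            results.append(("".join(parts), used))
            return
        for key, value in table.items():
            if text.startswith(key, i):
                walk(i + len(key), parts + [value], used + 1)
        walk(i + 1, parts + [text[i]], used)  # also try leaving the character

    walk(0, [], 0)
    most = max(used for _, used in results)
    # Keep only the candidates that restored as many split forms as possible.
    return sorted({cand for cand, used in results if used == most})


def confidence(candidate):
    """Fraction of known spam features present in the candidate."""
    hits = sum(feature in candidate for feature in SPAM_FEATURES)
    return hits / len(SPAM_FEATURES)


candidates = enumerate_candidates("需要亻昔钱,力口我手机号", SPLIT_TABLE)
restored = max(candidates, key=confidence)  # step 5 and step 6
```

Here the two candidates score 1.0 and 0.5 respectively, so the selection returns "需要借钱,加我手机号", matching the outcome P1 > P2 described in the text.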
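In summary, the text restoration method of the embodiments of the present application can recognize the characters of the target text that are written in split form and restore them by matching. In implementation, the target text is segmented first, so that only the characters that cannot serve as words become objects of split-character matching, which reduces the number of matches and improves matching accuracy. A language model is then used to select the best matched segmented text as the restored text of the target text. The whole scheme is computationally simple and uses relatively few processing resources, so it is especially suitable for network platforms that need to identify spam written with split characters.
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to FIG. 3, at the hardware level the electronic device includes a processor and optionally also an internal bus, a network interface, and a memory. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage. The electronic device may of course also include hardware required by other services.
The processor, network interface, and memory can be interconnected through an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one bidirectional arrow is used in FIG. 3, but this does not mean that there is only one bus or one type of bus.
The memory stores the program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into internal memory and runs it, forming the text restoration device at the logical level. The processor executes the program stored in the memory, specifically to perform the following operations:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;
matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.
The text restoration method disclosed in the embodiment shown in FIG. 1 may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium that is mature in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The electronic device can also perform the method shown in FIG. 1 and implement the functions of the text restoration device in the embodiments shown in FIG. 1 and FIG. 2, which are not repeated here.
Of course, besides a software implementation, the electronic device of the present application does not exclude other implementations, such as logic devices or a combination of software and hardware. That is, the execution body of the processing flow is not limited to logical units and may also be hardware or a logic device.
The embodiments of the present application also provide a computer-readable storage medium that stores one or more programs containing instructions. When executed by a portable electronic device that includes multiple application programs, the instructions cause the device to execute the method of the embodiment shown in FIG. 1, specifically the following method:
obtaining a target text;
performing word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;
matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
inputting the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.
It should be understood that, when its program is executed by a processor, the computer-readable storage medium of the present application can implement the functions of the text restoration device in the embodiments shown in FIG. 1 and FIG. 2, which are not repeated here.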
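FIG. 4 is a schematic structural diagram of a text restoration device 400 according to an embodiment of the present application, including:
an obtaining module 410, which obtains a target text;
a word segmentation module 420, which performs word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;
a matching module 430, which matches, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
an evaluation module 440, which inputs the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
a selection module 450, which selects, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.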
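The module structure of device 400 can be sketched as a small class whose fields are pluggable callables. The class and field names are hypothetical and the stub functions in the usage are placeholders; the sketch only shows how the five modules hand data to one another:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextRestorationDevice:
    obtain: Callable[[], str]                 # obtaining module 410
    segment: Callable[[str], List[str]]       # word segmentation module 420
    match: Callable[[List[str]], List[str]]   # matching module 430
    evaluate: Callable[[str], float]          # evaluation module 440

    def select(self, candidates, scores):
        """Selection module 450: return the candidate with the highest confidence."""
        best = max(range(len(candidates)), key=lambda k: scores[k])
        return candidates[best]

    def run(self):
        text = self.obtain()
        tokens = self.segment(text)
        candidates = self.match(tokens)
        scores = [self.evaluate(c) for c in candidates]
        return self.select(candidates, scores)


# Usage with placeholder stubs in place of real modules.
device = TextRestorationDevice(
    obtain=lambda: "亻昔钱",
    segment=list,
    match=lambda tokens: ["借钱", "亻昔钱"],
    evaluate=lambda c: 1.0 if c == "借钱" else 0.0,
)
output = device.run()
```

Passing the modules in as callables keeps the segmenter, the split-character table, and the language model independently replaceable, in line with the statement that none of them is specifically limited.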
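The embodiments of the present application first perform word segmentation on the target text to determine the characters that cannot form a word. These characters are the objects of split-character matching and are matched and restored, yielding at least one matched segmented text. The confidence of each matched segmented text is then evaluated with a preset language model, and the best one is selected as the restored text of the target text based on the confidence. The solution can effectively restore mutated text written with split characters to normal text and can improve a network platform's ability to recognize spam.
Optionally, as an embodiment, the matching module 430 is specifically configured to:
match, based on split-character sample resources, the characters in the segmented text that are adjacent in the row direction and cannot form a word.
Optionally, as an embodiment, the matching module 430 is specifically configured to:
match, based on split-character sample resources, the characters in the segmented text that are adjacent in the column direction and cannot form a word.
Optionally, as an embodiment, the selection module 450 is specifically configured to:
select, from the at least one matched segmented text, the one with the highest confidence as the restored text of the target text.
Optionally, as an embodiment, the characters in the segmented text that cannot form a word include any of: Chinese characters, radicals of Chinese characters, and roots of Chinese characters.
Optionally, as an embodiment, the preset language model is trained on a spam sample set.
Optionally, as an embodiment, the obtaining module 410 is specifically configured to:
obtain, from an online social platform, the target text sent by a user.
It should be understood that the text restoration device of the embodiments of the present application can perform the method of FIG. 1 and implement the functions of that method in the embodiments shown in FIG. 1 and FIG. 2, which are not repeated here.
Those skilled in the art should understand that the embodiments of this specification may be provided as a method, a system, or a computer program product. This specification may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
The above are merely embodiments of this specification and are not intended to limit it. Those skilled in the art can make various modifications and changes to this specification. Any modification, equivalent replacement, or improvement made within the spirit and principles of this specification shall fall within the scope of its claims.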

Claims (10)

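  1. A text restoration method, including:
    obtaining a target text;
    performing word segmentation on the target text to obtain a segmented text, the segmented text containing characters that cannot form a word;
    matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
    inputting the at least one matched segmented text into a preset language model to obtain the confidence of the at least one matched segmented text;
    selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.
  2. The text restoration method according to claim 1, wherein
    matching, based on split-character sample resources, the characters in the segmented text that cannot form a word includes:
    matching, based on the split-character sample resources, the characters in the segmented text that are adjacent in the row direction and cannot form a word.
  3. The text restoration method according to claim 1, wherein
    matching, based on split-character sample resources, the characters in the segmented text that cannot form a word includes:
    matching, based on the split-character sample resources, the characters in the segmented text that are adjacent in the column direction and cannot form a word.
  4. The text restoration method according to claim 1, wherein
    selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text includes:
    selecting, from the at least one matched segmented text, the one with the highest confidence as the restored text of the target text.
  5. The text restoration method according to claim 1, wherein
    the characters in the segmented text that cannot form a word include any of: Chinese characters, radicals of Chinese characters, and roots of Chinese characters.
  6. The text restoration method according to claim 1, wherein
    the preset language model is trained on a spam sample set.
  7. The text restoration method according to claim 1, wherein
    obtaining the target text includes:
    obtaining, from an online social platform, the target text sent by a user.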
  8. 一种文本还原装置,包括:
    获取模块,获取目标文本;
    分词模块,对所述目标文本进行分词处理,得到所述目标文本分词后的分词文本,所述分词文本包含无法组成分词的字符;
    匹配模块,基于拆字样本集,对所述分词文本中无法组成分词的字符进行匹配,得到至少一种匹配后分词文本;
    评估模块,将所述至少一组匹配后分词文本输入预设语言模型,得到所述至少一组匹配后分词文本的置信度;
    选取模块,基于所述至少一组匹配后分词文本的置信度,从所述至少一种匹配后分词文本中选取出所述目标文本的还原文本。
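  9. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the following steps:
    acquiring a target text;
    performing word segmentation on the target text to obtain a segmented text of the target text, the segmented text containing characters that cannot form a word;
    matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
    inputting the at least one matched segmented text into a preset language model to obtain a confidence of the at least one matched segmented text;
    selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.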
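  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following steps:
    acquiring a target text;
    performing word segmentation on the target text to obtain a segmented text of the target text, the segmented text containing characters that cannot form a word;
    matching, based on a split-character sample set, the characters in the segmented text that cannot form a word, to obtain at least one matched segmented text;
    inputting the at least one matched segmented text into a preset language model to obtain a confidence of the at least one matched segmented text;
    selecting, based on the confidence of the at least one matched segmented text, the restored text of the target text from the at least one matched segmented text.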
PCT/CN2019/103103 2018-10-25 2019-08-28 Text restoration method, apparatus, and electronic device WO2020082890A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811248320.3A CN109597987A (zh) 2018-10-25 2018-10-25 Text restoration method, apparatus, and electronic device
CN201811248320.3 2018-10-25

Publications (1)

Publication Number Publication Date
WO2020082890A1 (zh) 2020-04-30

Family

ID=65957463

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103103 WO2020082890A1 (zh) 2018-10-25 2019-08-28 Text restoration method, apparatus, and electronic device

Country Status (3)

Country Link
CN (1) CN109597987A (zh)
TW (1) TWI749349B (zh)
WO (1) WO2020082890A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114040409A (zh) * 2021-11-11 2022-02-11 中国联合网络通信集团有限公司 Short message recognition method, apparatus, device, and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
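CN109597987A (zh) * 2018-10-25 2019-04-09 阿里巴巴集团控股有限公司 Text restoration method, apparatus, and electronic device
CN117408248A (zh) * 2022-07-07 2024-01-16 马上消费金融股份有限公司 Text word segmentation method, apparatus, computer device, and storage medium
CN117033612B (zh) * 2023-08-18 2024-06-04 中航信移动科技有限公司 Text matching method, electronic device, and storage medium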

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6167367A (en) * 1997-08-09 2000-12-26 National Tsing Hua University Method and device for automatic error detection and correction for computerized text files
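CN101876968A (zh) * 2010-05-06 2010-11-03 复旦大学 Method for identifying objectionable content in web text and mobile phone short messages
CN102999533A (zh) * 2011-09-19 2013-03-27 腾讯科技(深圳)有限公司 Method and system for recognizing Martian script (火星文)
CN105550169A (zh) * 2015-12-11 2016-05-04 北京奇虎科技有限公司 Method and apparatus for identifying point-of-interest names based on character length
CN107357778A (zh) * 2017-06-22 2017-11-17 达而观信息科技(上海)有限公司 Method and system for recognizing and verifying variant words
CN109597987A (zh) * 2018-10-25 2019-04-09 阿里巴巴集团控股有限公司 Text restoration method, apparatus, and electronic device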

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7257564B2 (en) * 2003-10-03 2007-08-14 Tumbleweed Communications Corp. Dynamic message filtering
US8396927B2 (en) * 2004-12-21 2013-03-12 Alcatel Lucent Detection of unwanted messages (spam)
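CN102567304B (zh) * 2010-12-24 2014-02-26 北大方正集团有限公司 Method and apparatus for filtering objectionable network information
CN102231873A (zh) * 2011-06-22 2011-11-02 中兴通讯股份有限公司 Spam short message monitoring method, system, and monitoring processing apparatus
CN103874033B (zh) * 2012-12-12 2017-11-24 上海粱江通信系统股份有限公司 Method for identifying irregular spam short messages based on Chinese word segmentation
CN106156017A (zh) * 2015-03-23 2016-11-23 北大方正集团有限公司 Information recognition method and information recognition system
CN105574090B (zh) * 2015-12-10 2017-12-26 北京中科汇联科技股份有限公司 Sensitive word filtering method and system
CN106874253A (zh) * 2015-12-11 2017-06-20 腾讯科技(深圳)有限公司 Method and apparatus for identifying sensitive information
CN107239447B (zh) * 2017-06-05 2020-12-18 厦门美柚股份有限公司 Spam recognition method, apparatus, and system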

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
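CN114040409A (zh) * 2021-11-11 2022-02-11 中国联合网络通信集团有限公司 Short message recognition method, apparatus, device, and storage medium
CN114040409B (zh) * 2021-11-11 2023-06-06 中国联合网络通信集团有限公司 Short message recognition method, apparatus, device, and storage medium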

Also Published As

Publication number Publication date
TW202016765A (zh) 2020-05-01
TWI749349B (zh) 2021-12-11
CN109597987A (zh) 2019-04-09

Legal Events

Date Code Title Description
NENP Non-entry into the national phase Ref country code: DE
122 Ep: pct application non-entry in european phase Ref document number: 19876352; Country of ref document: EP; Kind code of ref document: A1