WO2020133960A1 - Text quality inspection method, electronic device, computer device and storage medium - Google Patents

Text quality inspection method, electronic device, computer device and storage medium

Info

Publication number
WO2020133960A1
Authority
WO
WIPO (PCT)
Prior art keywords
quality inspection
training
text
word
model
Prior art date
Application number
PCT/CN2019/091879
Other languages
English (en)
French (fr)
Inventor
任鹏飞
谢宇峰
张雨嘉
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020133960A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the field of intelligent decision-making technology, and in particular to a text quality inspection method, electronic device, computer equipment, and storage medium.
  • the keyword retrieval systems currently in use typically require business personnel to spend considerable effort summarizing keywords and deploying large numbers of regular expressions for searching; the search results are then reviewed by quality inspection personnel.
  • such retrieval-based systems cannot understand the semantics of the text, and their accuracy on some of the more complex quality inspection points is extremely low, which greatly increases the workload of quality inspection personnel.
  • this application proposes a text quality inspection method, electronic device, computer equipment, and storage medium that have a certain semantic understanding ability, improve the accuracy of quality inspection, reduce the pressure on quality inspection personnel, and greatly improve the efficiency of text quality inspection.
  • this application proposes a text quality inspection method, which is applied to an electronic device.
  • Prediction refers to checking the WeChat text with the saved quality inspection model.
  • each word is mapped to the word vector using the Word2vec model.
  • the quality inspection text data set is divided into the training set and the verification set at a ratio of 99:1 through the neural network.
  • the training set is shuffled, and the shuffled training set is then segmented from the beginning into pieces of a certain length so as to be divided into different sub-training sets.
  • the training of each iteration step includes forward propagation and back propagation
  • a prediction result is obtained through the forward propagation
  • the back propagation is used to calculate the difference between the prediction result and the real result, and to adjust the parameters in the network.
  • the accuracy rate (the number of correctly predicted violation messages / (the number of correctly predicted violation messages + the number of incorrectly predicted violation messages))
  • the recall rate (the number of correctly predicted violation messages / the number of actual violation messages in the verification set)
  • the present application also provides an electronic device, which includes a data collection and labeling module, a word segmentation and mapping module, a data processing module, a training module, and a prediction module.
  • the data collection and labeling module is used to collect multiple keywords of WeChat text and label the multiple keywords to obtain a quality inspection text data set with a quality inspection label.
  • the data processing module is used to construct a neural network, and the quality inspection text data set is divided into a training set and a verification set according to a fixed ratio through the neural network.
  • the word segmentation and mapping module is used to segment the text in the training set and the verification set with a Chinese word segmentation tool to obtain multiple words, and map each word to a word vector.
  • the training module is used to split the mapped training set into multiple sub-training sets, use the multiple sub-training sets to alternately train multiple quality inspection models, and save, during training, those of the multiple quality inspection models that meet the requirements.
  • the prediction module is used to make predictions using the quality inspection model that meets the requirements, and to review the prediction results. Prediction refers to checking the WeChat text with the saved quality inspection model.
  • the data processing module divides the quality inspection text data set into the training set and the verification set at a ratio of 99:1 through the neural network.
  • the present application also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
  • the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by the processor, the following steps are realized:
  • the text quality inspection method, electronic device, computer equipment and storage medium proposed in this application have a certain semantic understanding ability, which improves the accuracy of quality inspection, reduces the pressure on quality inspection personnel, and greatly improves the efficiency of text quality inspection.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the electronic device of the present application.
  • FIG. 2 is a schematic diagram of a program module of an embodiment of an electronic device of the present application.
  • FIG. 3 is a schematic flowchart of an embodiment of a text quality inspection method of the present application.
  • Electronic device 10; Memory 110; Processor 120; Text quality inspection system 130; Data collection and labeling module 210; Data processing module 220; Word segmentation and mapping module 230; Training module 240; Prediction module 250
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the electronic device 10 of the present application.
  • the electronic device 10 includes, but is not limited to, a memory 110, a processor 120, and a text quality inspection system 130 that can be communicatively connected to one another through a system bus.
  • FIG. 1 only shows the electronic device 10 having the components 110-130, but it should be understood that not all of the illustrated components are required to be implemented; more or fewer components may be implemented instead.
  • the memory 110 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, etc.
  • the memory 110 may be an internal storage unit of the electronic device 10, such as a hard disk or a memory of the electronic device 10.
  • the memory may also be an external storage device of the electronic device 10, for example, a plug-in hard disk equipped on the electronic device 10, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc.
  • the memory 110 may also include both the internal storage unit of the electronic device 10 and its external storage device.
  • the memory 110 is generally used to store an operating system and various application software installed in the electronic device 10, such as program codes of the text quality inspection system 130.
  • the memory 110 may also be used to temporarily store various types of data that have been output or will be output.
  • the processor 120 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip.
  • the processor 120 is generally used to control the overall operation of the electronic device 10.
  • the processor 120 is used to run the program code or process data stored in the memory 110, for example, to run the text quality inspection system 130.
  • FIG. 2 is a schematic diagram of a program module of an embodiment of an electronic device 10 of the present application.
  • the electronic device 10 includes a series of computer program instructions stored on the memory 110.
  • the computer program instructions are executed by the processor 120, the text quality inspection operations of the embodiments of the present application may be implemented.
  • the electronic device 10 may be divided into one or more modules based on specific operations implemented by the various parts of the computer program instructions. For example, in FIG. 2, the electronic device 10 may be divided into a data collection and annotation module 210, a data processing module 220, a word segmentation and mapping module 230, a training module 240, and a prediction module 250.
  • the data collection and labeling module 210 collects multiple keywords of the WeChat text and annotates the multiple keywords to obtain a keyword data set with a quality inspection label, also known as a quality inspection text data set. Keywords are words that violate the rules, such as curse words, offensive words, and keywords that must not appear under certain business regulations.
  • the data processing module 220 constructs a bidirectional long short-term memory recurrent neural network (Bi-directional Long Short-Term Memory Recurrent Neural Network, Bi-LSTM RNN), and divides the quality inspection text data set into a training set and a verification set at a ratio of 99:1. Randomly extract 99% of the data from the quality inspection text data set as the training set, with the remaining 1% as the verification set.
  • TensorFlow is an open source software library that uses Data Flow Graphs to express numerical operations. Nodes in the data flow graph are used to represent mathematical operations, while Edges are used to represent multi-dimensional data arrays, namely tensors, that are interconnected between nodes.
  • the Attention mechanism simulates the process by which a human first scans a text with the eyes and then picks out a few keywords to confirm its meaning.
  • the aforementioned quality inspection point is a violation point.
  • "swearing" is a quality inspection point
  • "fraud" is another quality inspection point.
  • the quality inspection model may give a corresponding result, that is, which quality inspection point is violated, or the quality inspection point is not violated.
  • the word segmentation and mapping module 230 uses a Jieba tool to segment the WeChat text message to obtain multiple words, and maps each word to a word vector using the Word2vec model to obtain the semantics of each word.
  • Word vectors are used to express semantics. Word vectors are generated by the Word2vec algorithm based on a large amount of text data. Specifically, each word is represented by a vector, so it is called a word vector.
  • the manually labeled data will be divided into a training set and a verification set.
  • the training set is used to train the model
  • the verification set is used to verify the accuracy of the model.
  • the Jieba ("stutter") tool is a Chinese word segmentation tool developed in Python that supports custom dictionaries and provides three word segmentation modes: (1) precise mode, which tries to cut the sentence as accurately as possible and is suitable for text analysis; (2) full mode, which scans out all the words in the sentence that could form terms; it is very fast but cannot resolve ambiguity; and (3) search engine mode, which, building on the precise mode, re-segments long words to improve the recall rate and is suitable for search engine word segmentation.
  • the Word2vec model is a tool for mapping words to numeric vectors, generated by training with the Word2vec algorithm on the corpus of the embodiments of the present application. After training, the Word2vec model can be used to map each word to a vector, which can be used to express the relationships between words. Word2vec represents each word as a multidimensional vector and projects the words into a vector space. Words with the same attributes may lie close together, and some vectors even have logical linear relationships.
  • the algorithm of the Word2vec model includes the following three main steps: (1) treat common word pairs or phrases as single "words"; (2) subsample high-frequency words to reduce the number of training samples; and (3) apply "negative sampling" to the optimization objective, so that training on each sample updates only a small fraction of the model weights, thereby reducing the computational burden.
  • there are two main ways of mapping: CBOW and skip-gram. CBOW uses the vectors of the words w(t-2), w(t-1), w(t+1) and w(t+2) in the context of the word w(t) and, through a three-layer network, predicts whether the middle position is the vector of w(t), thereby determining the real vectors that represent these words; skip-gram is the opposite, using w(t) to predict whether its context is w(t-2), w(t-1), w(t+1), w(t+2).
  • the training module 240 splits the training set into a plurality of sub-training sets, uses the plurality of sub-training sets to alternately train multiple quality inspection models, and saves, during the training process, those of the multiple quality inspection models that meet the requirements.
  • the length refers to the number of texts, such as 512 sentences.
  • Way 1 of saving the quality inspection model: save once every fixed number of training iteration steps.
  • the number of iteration steps refers to the number of times to repeat the operation before meeting the specified numerical conditions.
  • each iteration step consists of two parts: forward propagation and back propagation.
  • the forward propagation is responsible for calculating the input and the parameters in the network to obtain the prediction result
  • the back propagation is responsible for calculating the difference between the prediction result and the real result, and for adjusting the parameters in the network.
  • the combination of these two steps is a one-step iteration (or, one iteration step) in the training process.
  • the parameters in the model are saved in the hard disk in the form of files.
  • Way 2 of saving the quality inspection model: save the quality inspection models whose accuracy rate (the number of correctly predicted violation messages / (the number of correctly predicted violation messages + the number of incorrectly predicted violation messages)) and recall rate (the number of correctly predicted violation messages / the number of actual violation messages in the verification set) on the verification set are both relatively high; for example, the accuracy rate must be greater than 0.7 and the recall rate greater than 0.4.
  • the saved model is the trained quality inspection model. Physically, a saved model is a model file whose interior holds the learned parameters; given a passage of text as input, it can output whether the text violates the rules and which quality inspection point is violated. Training is an iterative process, and a model can be saved at every step, but the results of such a model are not necessarily good.
  • the prediction module 250 uses the quality inspection model that meets the requirements to make predictions, and submits the prediction results to the quality inspection personnel for review. Prediction refers to checking the WeChat text with the saved quality inspection model.
  • FIG. 3 is a schematic flowchart of an embodiment of a text quality inspection method of the present application.
  • the text quality inspection method is applied to the electronic device 10.
  • the execution order of the steps in the flowchart shown in FIG. 3 may be changed, and some steps may be omitted.
  • Step 301: Collect multiple keywords of WeChat text and label the multiple keywords to obtain a quality inspection text data set with a quality inspection label. Keywords are words that violate the rules, such as curse words, offensive words, and keywords that must not appear under certain business regulations.
  • Step 302: Construct a bidirectional long short-term memory recurrent neural network (Bi-directional Long Short-Term Memory Recurrent Neural Network, Bi-LSTM RNN), and divide the quality inspection text data set into a training set and a verification set at a ratio of 99:1. Randomly extract 99% of the data from the quality inspection text data set as the training set, with the remaining 1% as the verification set.
  • TensorFlow is an open source software library that uses Data Flow Graphs to express numerical operations. Nodes in the data flow graph are used to represent mathematical operations, while Edges are used to represent multi-dimensional data arrays, namely tensors, that are interconnected between nodes.
  • the Attention mechanism simulates the process by which a human first scans a text with the eyes and then picks out a few keywords to confirm its meaning.
  • the aforementioned quality inspection point is a violation point.
  • "swearing" is a quality inspection point
  • "fraud" is another quality inspection point.
  • the quality inspection model may give a corresponding result, that is, which quality inspection point is violated, or the quality inspection point is not violated.
  • Step 303 Use the Jieba tool to segment the text in the training set and the verification set to obtain multiple words, and use the Word2vec model to map each word to a word vector to obtain the semantics of each word.
  • the word vector is used to express semantics.
  • the word vector is generated by the word2vec algorithm based on a large amount of text data. Specifically, each word is represented by a vector, so it is called a word vector.
  • the manually labeled data will be divided into a training set and a validation set.
  • the training set is used to train the model, and the validation set is used to verify the accuracy of the model.
  • the Jieba ("stutter") tool is a Chinese word segmentation tool developed in Python that supports custom dictionaries and provides three word segmentation modes: (1) precise mode, which tries to cut the sentence as accurately as possible and is suitable for text analysis; (2) full mode, which scans out all the words in the sentence that could form terms; it is very fast but cannot resolve ambiguity; and (3) search engine mode, which, building on the precise mode, re-segments long words to improve the recall rate and is suitable for search engine word segmentation.
  • the Word2vec model is a tool for mapping words to numeric vectors, generated by training with the Word2vec algorithm on the corpus of the embodiments of the present application. After training, the Word2vec model can be used to map each word to a vector, which can be used to express the relationships between words.
  • the Word2vec model represents each word as a multidimensional vector and projects the words into a vector space. Words with the same attributes may lie close together, and some vectors even have logical linear relationships.
  • the algorithm of the Word2vec model includes the following three main steps: (1) treat common word pairs or phrases as single "words"; (2) subsample high-frequency words to reduce the number of training samples; and (3) apply "negative sampling" to the optimization objective, so that training on each sample updates only a small fraction of the model weights, thereby reducing the computational burden.
  • there are two main ways of mapping: CBOW and skip-gram. CBOW uses the vectors of the words w(t-2), w(t-1), w(t+1) and w(t+2) in the context of the word w(t) and, through a three-layer network, predicts whether the middle position is the vector of w(t), thereby determining the real vectors that represent these words; skip-gram is the opposite, using w(t) to predict whether its context is w(t-2), w(t-1), w(t+1), w(t+2).
  • Step 304: Split the mapped training set into multiple sub-training sets, use the multiple sub-training sets to alternately train multiple quality inspection models, and save, during training, those of the multiple quality inspection models that meet the requirements.
  • the length refers to the number of texts, such as 512 sentences.
  • Way 1 of saving the quality inspection model: save once every fixed number of training iteration steps.
  • the number of iteration steps refers to the number of times to repeat the operation before meeting the specified numerical conditions.
  • each iteration step consists of two parts: forward propagation and back propagation.
  • the forward propagation is responsible for calculating the input and the parameters in the network to obtain the prediction result
  • the back propagation is responsible for calculating the difference between the prediction result and the real result, and for adjusting the parameters in the network.
  • the combination of these two steps is a one-step iteration (or, one iteration step) in the training process.
  • the parameters in the model are saved in the hard disk in the form of files.
  • Way 2 of saving the quality inspection model: save the quality inspection models whose accuracy rate (the number of correctly predicted violation messages / (the number of correctly predicted violation messages + the number of incorrectly predicted violation messages)) and recall rate (the number of correctly predicted violation messages / the number of actual violation messages in the verification set) on the verification set are both relatively high; for example, the accuracy rate must be greater than 0.7 and the recall rate greater than 0.4.
  • the saved model is the trained quality inspection model. Physically, a saved model is a model file whose interior holds the learned parameters; given a passage of text as input, it can output whether the text violates the rules and which quality inspection point is violated. Training is an iterative process, and a model can be saved at every step, but the results of such a model are not necessarily good.
  • Step 305 Use the quality inspection model that meets the requirements to make predictions, and submit the prediction results to the quality inspection personnel for review.
  • Prediction refers to checking the WeChat text with the saved quality inspection model.
  • This application introduces deep learning methods for text quality inspection: it uses Jieba word segmentation to segment the text content, uses Word2vec to map words to word vectors, uses TensorFlow to construct a Bi-LSTM RNN, and introduces an Attention mechanism into the network. This provides a certain semantic understanding ability, improves the accuracy of quality inspection, reduces the pressure on quality inspection personnel, and greatly improves the efficiency of text quality inspection.
  • This application also provides a computer device capable of executing programs, such as a smartphone, tablet computer, notebook computer, desktop computer, rack server, blade server, tower server, or cabinet server (including a stand-alone server, or a server cluster composed of multiple servers).
  • the computer device of this embodiment includes at least but not limited to: a memory, a processor, etc. that can be communicatively connected to each other through a system bus.
  • This embodiment also provides a computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, a server, an app store, etc., on which a computer program is stored that realizes the corresponding function when executed by a processor.
  • the computer-readable storage medium of this embodiment is used to store the program of the electronic device 10 and, when executed by a processor, implements the text quality inspection method of the present application.
  • the methods in the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part contributing to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes a number of instructions for causing a terminal device to execute the methods described in the embodiments of the present application.
  • a terminal device may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text quality inspection method, an electronic device, a computer device and a storage medium. The method comprises: collecting multiple keywords from WeChat text and labeling the multiple keywords to obtain a quality inspection text data set with quality inspection labels (301); constructing a neural network, and dividing the quality inspection text data set into a training set and a verification set at a fixed ratio by means of the neural network (302); segmenting the text in the training set and the verification set with a Chinese word segmentation tool to obtain multiple words, and mapping each word to a word vector (303); splitting the mapped training set into multiple sub-training sets, using the multiple sub-training sets to alternately train multiple quality inspection models, and saving, during training, those of the multiple quality inspection models that meet the requirements (304); and making predictions with the quality inspection model that meets the requirements and reviewing the prediction results, where prediction means checking WeChat text with the saved quality inspection model (305). The text quality inspection method, electronic device, computer device and storage medium have a certain semantic understanding capability, improve the accuracy of quality inspection, reduce the pressure on quality inspection personnel, and greatly improve the efficiency of text quality inspection.

Description

Text quality inspection method, electronic device, computer device and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on December 25, 2018, entitled "Text quality inspection method, electronic device, computer device and storage medium" and assigned application number 201811589528.1, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of intelligent decision-making technology, and in particular to a text quality inspection method, electronic device, computer device and storage medium.
Background
In text quality inspection systems, the keyword retrieval systems currently in use typically require business personnel to spend considerable effort summarizing keywords and deploying large numbers of regular expressions for searching; the search results are then reviewed by quality inspection personnel. Such retrieval-based systems cannot understand the semantics of the text, and their accuracy on some of the more complex quality inspection points is extremely low, which greatly increases the workload of quality inspection personnel.
Summary
In view of this, this application proposes a text quality inspection method, electronic device, computer device and storage medium that have a certain semantic understanding capability, improve the accuracy of quality inspection, reduce the pressure on quality inspection personnel, and greatly improve the efficiency of text quality inspection.
To achieve the above objective, this application proposes a text quality inspection method applied to an electronic device, the method comprising the steps of:
collecting multiple keywords from WeChat text and labeling the multiple keywords to obtain a quality inspection text data set with quality inspection labels;
constructing a neural network, and dividing the quality inspection text data set into a training set and a verification set at a fixed ratio by means of the neural network;
segmenting the text in the training set and the verification set with a Chinese word segmentation tool to obtain multiple words, and mapping each word to a word vector;
splitting the mapped training set into multiple sub-training sets, using the multiple sub-training sets to alternately train multiple quality inspection models, and saving, during training, those of the multiple quality inspection models that meet the requirements; and
making predictions with the quality inspection model that meets the requirements, and reviewing the prediction results. Prediction means checking WeChat text with the saved quality inspection model.
Further, each word is mapped to the word vector using a Word2vec model.
Further, the quality inspection text data set is divided into the training set and the verification set at a ratio of 99:1 by means of the neural network.
Further, the training set is shuffled, and the shuffled training set is then segmented from the beginning into pieces of a certain length so as to form different sub-training sets.
Further, the model is saved once every fixed number of training iteration steps, wherein the training at each iteration step includes forward propagation and back propagation; a prediction result is obtained through the forward propagation, and the back propagation computes the difference between the prediction result and the real result and adjusts the parameters in the network.
Further, the quality inspection models whose accuracy rate and recall rate on the verification set are higher than the default values are saved, wherein accuracy rate = number of correctly predicted violation messages / (number of correctly predicted violation messages + number of incorrectly predicted violation messages), and recall rate = number of correctly predicted violation messages / number of actual violation messages in the verification set.
To achieve the above objective, this application also provides an electronic device, which includes a data collection and labeling module, a word segmentation and mapping module, a data processing module, a training module and a prediction module.
The data collection and labeling module is used to collect multiple keywords from WeChat text and label the multiple keywords to obtain a quality inspection text data set with quality inspection labels.
The data processing module is used to construct a neural network and divide the quality inspection text data set into a training set and a verification set at a fixed ratio by means of the neural network.
The word segmentation and mapping module is used to segment the text in the training set and the verification set with a Chinese word segmentation tool to obtain multiple words, and to map each word to a word vector.
The training module is used to split the mapped training set into multiple sub-training sets, use the multiple sub-training sets to alternately train multiple quality inspection models, and save, during training, those of the multiple quality inspection models that meet the requirements.
The prediction module is used to make predictions with the quality inspection model that meets the requirements and to review the prediction results. Prediction means checking WeChat text with the saved quality inspection model.
Further, the data processing module divides the quality inspection text data set into the training set and the verification set at a ratio of 99:1 by means of the neural network.
To achieve the above objective, this application also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
collecting multiple keywords from WeChat text and labeling the multiple keywords to obtain a quality inspection text data set with quality inspection labels;
constructing a neural network, and dividing the quality inspection text data set into a training set and a verification set at a fixed ratio by means of the neural network;
segmenting the text in the training set and the verification set with a Chinese word segmentation tool to obtain multiple words, and mapping each word to a word vector;
splitting the mapped training set into multiple sub-training sets, using the multiple sub-training sets to alternately train multiple quality inspection models, and saving, during training, those of the multiple quality inspection models that meet the requirements; and
making predictions with the quality inspection model that meets the requirements, and reviewing the prediction results.
To achieve the above objective, this application also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:
collecting multiple keywords from WeChat text and labeling the multiple keywords to obtain a quality inspection text data set with quality inspection labels;
constructing a neural network, and dividing the quality inspection text data set into a training set and a verification set at a fixed ratio by means of the neural network;
segmenting the text in the training set and the verification set with a Chinese word segmentation tool to obtain multiple words, and mapping each word to a word vector;
splitting the mapped training set into multiple sub-training sets, using the multiple sub-training sets to alternately train multiple quality inspection models, and saving, during training, those of the multiple quality inspection models that meet the requirements; and
making predictions with the quality inspection model that meets the requirements, and reviewing the prediction results.
Compared with the prior art, the text quality inspection method, electronic device, computer device and storage medium proposed in this application have a certain semantic understanding capability, improve the accuracy of quality inspection, reduce the pressure on quality inspection personnel, and greatly improve the efficiency of text quality inspection.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an optional hardware architecture of the electronic device of the present application;
FIG. 2 is a schematic diagram of the program modules of an embodiment of the electronic device of the present application; and
FIG. 3 is a schematic flowchart of an embodiment of the text quality inspection method of the present application.
Reference numerals:
Electronic device 10
Memory 110
Processor 120
Text quality inspection system 130
Data collection and labeling module 210
Data processing module 220
Word segmentation and mapping module 230
Training module 240
Prediction module 250
The realization of the objectives, functional features and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
To make the objectives, technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain this application and are not intended to limit it. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of this application.
It should be noted that descriptions involving "first", "second" and the like in this application are for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments can be combined with one another, but only on the basis that a person of ordinary skill in the art can realize the combination; when a combination of technical solutions is contradictory or cannot be realized, that combination should be considered not to exist and not to fall within the scope of protection claimed by this application.
FIG. 1 is a schematic diagram of an optional hardware architecture of the electronic device 10 of the present application. The electronic device 10 includes, but is not limited to, a memory 110, a processor 120 and a text quality inspection system 130 that can be communicatively connected to one another through a system bus. FIG. 1 shows only the electronic device 10 with the components 110-130, but it should be understood that not all of the illustrated components are required to be implemented; more or fewer components may be implemented instead.
The memory 110 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 110 may be an internal storage unit of the electronic device 10, for example a hard disk or internal memory of the electronic device 10. In other embodiments, the memory may also be an external storage device of the electronic device 10, for example a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) provided on the electronic device 10. Of course, the memory 110 may also include both the internal storage unit of the electronic device 10 and its external storage device. In this embodiment, the memory 110 is generally used to store the operating system and the various types of application software installed on the electronic device 10, such as the program code of the text quality inspection system 130. In addition, the memory 110 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 120 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 120 is generally used to control the overall operation of the electronic device 10. In this embodiment, the processor 120 is used to run the program code stored in the memory 110 or to process data, for example to run the text quality inspection system 130.
So far, the hardware structure and functions of the devices related to this application have been introduced in detail. Below, the various embodiments of this application are presented on the basis of the above introduction.
FIG. 2 is a schematic diagram of the program modules of an embodiment of the electronic device 10 of the present application.
In this embodiment, the electronic device 10 includes a series of computer program instructions stored in the memory 110; when these computer program instructions are executed by the processor 120, the text quality inspection operations of the various embodiments of this application can be implemented. In some embodiments, the electronic device 10 may be divided into one or more modules based on the specific operations implemented by the various parts of the computer program instructions. For example, in FIG. 2, the electronic device 10 may be divided into a data collection and labeling module 210, a data processing module 220, a word segmentation and mapping module 230, a training module 240 and a prediction module 250.
The data collection and labeling module 210 collects multiple keywords from WeChat text and labels the multiple keywords to obtain a keyword data set with quality inspection labels, also called a quality inspection text data set. Keywords are words that violate the rules, for example curse words, offensive words, and keywords that must not appear under certain business regulations.
For example, [你真是个傻子] ("You really are an idiot") contains the insulting word "傻子" ("idiot") and therefore violates the quality inspection point "insulting the customer"; it will be retrieved by keyword search and labeled "insulting the customer" by the quality inspection personnel.
[我真是个傻子,如果记得带钥匙,就不至于一直在门外等着了] ("I really am an idiot; if I had remembered to bring my keys, I would not have been waiting outside the door all this time") will likewise be retrieved by keyword search, but after review by the quality inspection personnel it will not be labeled "insulting the customer" and will instead be labeled "normal".
[我的联系方式是18911111111,请惠存] ("My contact number is 18911111111, please keep it") contains the violating term "联系方式" ("contact information") and violates the company rule strictly forbidding privately leaving contact information with customers; it is therefore retrieved, handed to the quality inspection personnel, and labeled "privately leaving contact information".
The data processing module 220 constructs a bidirectional long short-term memory recurrent neural network (Bi-directional Long Short-Term Memory Recurrent Neural Network, Bi-LSTM RNN) and divides the quality inspection text data set into a training set and a verification set at a ratio of 99:1: 99% of the data is randomly drawn from the quality inspection text data set as the training set, and the remaining 1% serves as the verification set.
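By way of a non-limiting illustration, the 99:1 random split described above can be sketched in a few lines of Python; the function and variable names here are illustrative assumptions, not taken from the application itself:

```python
import random

def split_dataset(samples, train_ratio=0.99, seed=42):
    """Randomly split labeled samples into a training set and a verification set."""
    shuffled = samples[:]                      # copy so the original order is kept
    random.Random(seed).shuffle(shuffled)      # random extraction of samples
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]      # (training set, verification set)
```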
The Bi-LSTM RNN is built with TensorFlow, and an Attention mechanism is introduced into the Bi-LSTM RNN so that the quality inspection model pays more attention to the words that affect the quality inspection points. The words that affect a quality inspection point are obtained through the attention mechanism in the neural network; concretely, every word of each sentence of text to be quality-inspected is assigned a weight, and these weights take the form of parameters in the network that are adjusted during the back-propagation phase of training.
TensorFlow is an open-source software library that expresses numerical computations as data flow graphs (Data Flow Graphs). The nodes (Nodes) in a data flow graph represent mathematical operations, while the edges (Edges) represent the multi-dimensional data arrays, i.e. tensors (Tensors), that flow between the nodes. The Attention mechanism simulates the process by which a human reading an article first scans it with the eyes and then picks out a few keywords to confirm its meaning.
The aforementioned quality inspection points are violation points: for example, "swearing" is one quality inspection point, and "fraud" is another. When a sentence or a passage is fed to the quality inspection model, the model can give a corresponding result, namely which quality inspection point is violated, or that no quality inspection point is violated.
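The application does not publish its exact network configuration; the following is a minimal TensorFlow/Keras sketch, under assumed hyperparameters (vocabulary size, vector dimensions, number of quality inspection points), of the kind of architecture described above. An embedding layer feeds a bidirectional LSTM, whose per-word outputs are pooled by a simple attention layer before a softmax over the quality inspection points, with class 0 standing for "no violation":

```python
import tensorflow as tf

def build_qc_model(vocab_size=50000, embed_dim=128, lstm_units=64, num_qc_points=10):
    """Bi-LSTM text classifier with a simple attention pooling layer (illustrative)."""
    tokens = tf.keras.Input(shape=(None,), dtype="int32")          # sequence of word ids
    x = tf.keras.layers.Embedding(vocab_size, embed_dim)(tokens)   # word vectors
    h = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_units, return_sequences=True))(x)
    scores = tf.keras.layers.Dense(1)(h)                  # one attention score per word
    weights = tf.keras.layers.Softmax(axis=1)(scores)     # normalized weights over words
    context = tf.keras.layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, weights])
    out = tf.keras.layers.Dense(num_qc_points + 1, activation="softmax")(context)
    return tf.keras.Model(tokens, out)  # class 0 = no violation, 1..N = violation points

model = build_qc_model()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The per-word attention scores in this sketch are exactly the trainable weights described in the preceding paragraph: they are adjusted during the back-propagation phase of training.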
The word segmentation and mapping module 230 uses the Jieba tool to segment the WeChat text messages to obtain multiple words, and uses a Word2vec model to map each word to a word vector so as to obtain the semantics of each word. Word vectors are used to express semantics; they are generated from a large amount of text data by the Word2vec algorithm. Concretely, each word is represented by one vector, hence the name word vector.
At project launch there is no labeled data, so keywords summarized by the business side that may indicate violations are used to search the historical WeChat chat text for data for the business side to label (there is far too much historical data for every message to be verified manually, so keyword search is the only option). The manually labeled data is divided into a training set and a verification set; the training set is used to train the model, and the verification set is used to verify the accuracy of the model.
The Jieba ("stutter") tool is a Chinese word segmentation tool developed in Python. It supports custom dictionaries and provides three word segmentation modes: (1) precise mode, which tries to cut the sentence as accurately as possible and is suitable for text analysis; (2) full mode, which scans out all the words in the sentence that could form terms; it is very fast but cannot resolve ambiguity; and (3) search engine mode, which, building on the precise mode, re-segments long words to improve the recall rate and is suitable for search engine word segmentation.
For example, 「李小春真的很笨,笨得跟猪一样」 ("Li Xiaochun really is stupid, as stupid as a pig") processed by Jieba yields 「李小春/真的/很笨/笨得/跟猪一样」, giving the tokens 「李小春」, 「真的」, 「很笨」, 「笨得」 and 「跟猪一样」; different kinds of segmentations can be obtained depending on the rules that are set.
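The three modes correspond directly to Jieba's public API: `jieba.cut` with `cut_all=False` is the precise mode, `cut_all=True` the full mode, and `jieba.cut_for_search` the search engine mode; `jieba.load_userdict` loads a custom dictionary. Using the sample sentence from the paragraph above:

```python
import jieba

sentence = "李小春真的很笨,笨得跟猪一样"

print("/".join(jieba.cut(sentence, cut_all=False)))  # precise mode
print("/".join(jieba.cut(sentence, cut_all=True)))   # full mode
print("/".join(jieba.cut_for_search(sentence)))      # search engine mode

# A custom dictionary of business terms could be loaded with:
# jieba.load_userdict("user_dict.txt")
```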
The Word2vec model is a tool that maps words to numeric vectors; it is generated by training with the Word2vec algorithm on the corpus of the embodiments of this application. After training, the Word2vec model can be used to map each word to a vector, which can be used to express the relationships between words. Word2vec represents each word as a multi-dimensional vector and projects the words into a vector space. Words with the same attributes may lie close together, and some vectors even exhibit logical linear relationships.
The algorithm of the Word2vec model includes the following three main steps: (1) treat common word pairs (word pairs) or phrases as single "words"; (2) subsample high-frequency words to reduce the number of training samples; and (3) apply "negative sampling" to the optimization objective, so that training on each sample updates only a small fraction of the model weights, thereby reducing the computational burden.
A word vector is a distributed representation of a word. The basic idea is to express each word as an n-dimensional dense, continuous real-valued vector, giving each word vector some capacity to express features. For example, the word "北京" (Beijing) can be mapped to a real-valued vector: 北京 = [0.85, -0.15, 0.64, 0.54, ..., 0.98]. It is produced through a distributed representation (Distributed Representation): a fixed-length dense word vector in which information is stored distributed across the dimensions of the vector, so that related or similar words are closer to each other in distance.
Likewise, words such as "中国" (China), "东京" (Tokyo) and "日本" (Japan) are mapped to vectors of their own such that "中国" - "北京" = "日本" - "东京". There are two main mapping approaches: CBOW and skip-gram. CBOW uses the vectors of the words w(t-2), w(t-1), w(t+1) and w(t+2) in the context of the word w(t) and, through a three-layer network, predicts whether the middle position is the vector of w(t), thereby determining the real vectors that represent these words; skip-gram is the opposite, using w(t) to predict whether its context is w(t-2), w(t-1), w(t+1), w(t+2).
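The application does not name a particular Word2vec implementation. As one hedged sketch, the gensim library's Word2Vec class exposes exactly the choices described above: `sg=0` selects CBOW and `sg=1` selects skip-gram, `negative` sets the number of negative samples, and `sample` controls the subsampling of high-frequency words. The two-sentence corpus below is only a placeholder for the application's real corpus of segmented WeChat text:

```python
from gensim.models import Word2Vec

# Each training sentence is a list of Jieba tokens; a real corpus is far larger.
corpus = [
    ["李小春", "真的", "很笨", "笨得", "跟猪一样"],
    ["我", "真是", "个", "傻子"],
]

model = Word2Vec(
    corpus,
    vector_size=100,   # dimensionality n of the word vectors
    window=2,          # context window: w(t-2) .. w(t+2)
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # negative sampling
    sample=1e-3,       # subsample high-frequency words
    min_count=1,
)

vector = model.wv["傻子"]               # the word's 100-dimensional vector
print(model.wv.most_similar("傻子"))    # nearby words in the vector space
```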
The training module 240 splits the training set into multiple sub-training sets, uses the multiple sub-training sets to alternately train multiple quality inspection models, and saves, during training, those of the multiple quality inspection models that meet the requirements.
Details of how the training set is split into multiple sub-training sets: the training set is shuffled, and the shuffled training set is then segmented from the beginning into pieces of a certain length so as to form different sub-training sets, where the length refers to a number of texts, for example 512 sentences.
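A minimal sketch of this shuffle-and-segment step, assuming each element of `training_set` is one sentence and using the 512-sentence length mentioned above as the default:

```python
import random

def make_sub_training_sets(training_set, length=512, seed=0):
    """Shuffle the training set, then cut it into sub-training sets of `length` sentences."""
    shuffled = training_set[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + length] for i in range(0, len(shuffled), length)]
```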
Way 1 of saving the quality inspection model: save once every fixed number of training iteration steps. The number of iteration steps refers to the number of times an operation is repeated before a specified numerical condition is met.
In deep learning, the training at each iteration step consists of two parts: forward propagation and back propagation. Forward propagation computes the prediction result from the input and the parameters in the network; back propagation computes the difference between the prediction result and the real result and adjusts the parameters in the network. Together, these two steps form one iteration (or, one iteration step) of the training process; generally, after a number of training steps, the parameters in the model are saved to the hard disk in the form of files.
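In TensorFlow, this one-iteration-step structure (forward propagation, back propagation, periodic saving) can be sketched as follows. The `model` and `dataset` objects are assumed to exist already (for example, the Keras model sketched earlier and a dataset yielding (batch_x, batch_y) pairs), and the interval of 1000 steps is an illustrative value, not one given in the application:

```python
import tensorflow as tf

# Assumed defined elsewhere: model, and dataset yielding (batch_x, batch_y) pairs.
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)

for step, (batch_x, batch_y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        pred = model(batch_x, training=True)    # forward propagation
        loss = loss_fn(batch_y, pred)           # difference from the true labels
    grads = tape.gradient(loss, model.trainable_variables)   # back propagation
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if step % 1000 == 0:                        # save every fixed number of steps
        checkpoint.save("checkpoints/qc_model")
```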
Way 2 of saving the quality inspection model: save the quality inspection models whose accuracy rate (number of correctly predicted violation messages / (number of correctly predicted violation messages + number of incorrectly predicted violation messages)) and recall rate (number of correctly predicted violation messages / number of actual violation messages in the verification set) on the verification set are both relatively high; for example, the accuracy rate must be greater than 0.7 and the recall rate greater than 0.4.
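Both verification-set quantities follow directly from their definitions. The sketch below assumes integer class labels in which 0 means "no violation" and any non-zero label names a quality inspection point, together with the 0.7 and 0.4 thresholds given above; `y_true`, `y_pred` and `model` are assumed to come from an evaluation run:

```python
def accuracy_and_recall(y_true, y_pred):
    """Accuracy rate and recall rate over violation messages, per the definitions above."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if p != 0 and p == t)
    predicted_violations = sum(1 for p in y_pred if p != 0)
    actual_violations = sum(1 for t in y_true if t != 0)
    accuracy = correct / predicted_violations if predicted_violations else 0.0
    recall = correct / actual_violations if actual_violations else 0.0
    return accuracy, recall

accuracy, recall = accuracy_and_recall(y_true, y_pred)
if accuracy > 0.7 and recall > 0.4:
    model.save("qc_model.keras")   # keep only models that meet the requirements
```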
The saved model is the trained quality inspection model. Physically, a saved model is a model file whose interior holds the learned parameters; given a passage of text as input, it can output whether the text violates the rules and which quality inspection point is violated. Training is an iterative process, and a model can be saved at every step, but the results of such a model are not necessarily good.
The prediction module 250 makes predictions using the quality inspection model that meets the requirements, and submits the prediction results to the quality inspection personnel for review. Prediction means checking WeChat text with the saved quality inspection model.
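Prediction with a saved model then reduces to segmenting an incoming WeChat message the same way the training data was segmented and running the word ids through the loaded model. A hedged sketch, assuming a `vocab` dictionary mapping words to the integer ids used in training:

```python
import jieba
import tensorflow as tf

model = tf.keras.models.load_model("qc_model.keras")

def inspect(text, vocab, unknown_id=1):
    """Return the predicted quality inspection point (0 = no violation) for one message."""
    ids = [[vocab.get(word, unknown_id) for word in jieba.cut(text)]]
    probs = model.predict(ids, verbose=0)
    return int(probs.argmax(axis=-1)[0])
```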
FIG. 3 is a schematic flowchart of an embodiment of the text quality inspection method of the present application. The text quality inspection method is applied to the electronic device 10. In this embodiment, depending on different requirements, the execution order of the steps in the flowchart shown in FIG. 3 may be changed, and some steps may be omitted.
Step 301: collect multiple keywords from WeChat text and label the multiple keywords to obtain a quality inspection text data set with quality inspection labels. Keywords are words that violate the rules, for example curse words, offensive words, and keywords that must not appear under certain business regulations.
For example, [你真是个傻子] ("You really are an idiot") contains the insulting word "傻子" ("idiot") and therefore violates the quality inspection point "insulting the customer"; it will be retrieved by keyword search and labeled "insulting the customer" by the quality inspection personnel.
[我真是个傻子,如果记得带钥匙,就不至于一直在门外等着了] ("I really am an idiot; if I had remembered to bring my keys, I would not have been waiting outside the door all this time") will likewise be retrieved by keyword search, but after review by the quality inspection personnel it will not be labeled "insulting the customer" and will instead be labeled "normal".
[我的联系方式是18911111111,请惠存] ("My contact number is 18911111111, please keep it") contains the violating term "联系方式" ("contact information") and violates the company rule strictly forbidding privately leaving contact information with customers; it is therefore retrieved, handed to the quality inspection personnel, and labeled "privately leaving contact information".
Step 302: construct a bidirectional long short-term memory recurrent neural network (Bi-directional Long Short-Term Memory Recurrent Neural Network, Bi-LSTM RNN), and divide the quality inspection text data set into a training set and a verification set at a ratio of 99:1: 99% of the data is randomly drawn from the quality inspection text data set as the training set, and the remaining 1% serves as the verification set.
The Bi-LSTM RNN is built with TensorFlow, and an Attention mechanism is introduced into the Bi-LSTM RNN so that the quality inspection model pays more attention to the words that affect the quality inspection points. The words that affect a quality inspection point are obtained through the attention mechanism in the neural network; concretely, every word of each sentence of text to be quality-inspected is assigned a weight, and these weights take the form of parameters in the network that are adjusted during the back-propagation phase of training.
TensorFlow is an open-source software library that expresses numerical computations as data flow graphs (Data Flow Graphs). The nodes (Nodes) in a data flow graph represent mathematical operations, while the edges (Edges) represent the multi-dimensional data arrays, i.e. tensors (Tensors), that flow between the nodes. The Attention mechanism simulates the process by which a human reading an article first scans it with the eyes and then picks out a few keywords to confirm its meaning.
The aforementioned quality inspection points are violation points: for example, "swearing" is one quality inspection point, and "fraud" is another. When a sentence or a passage is fed to the quality inspection model, the model can give a corresponding result, namely which quality inspection point is violated, or that no quality inspection point is violated.
Step 303: use the Jieba tool to segment the text in the training set and the verification set to obtain multiple words, and use the Word2vec model to map each word to a word vector so as to obtain the semantics of each word. Word vectors are used to express semantics; they are generated from a large amount of text data by the Word2vec algorithm. Concretely, each word is represented by one vector, hence the name word vector.
At project launch there is no labeled data, so keywords summarized by the business side that may indicate violations are used to search the historical WeChat chat text for data for the business side to label (there is far too much historical data for every message to be verified manually, so keyword search is the only option). The manually labeled data is divided into a training set and a verification set; the training set is used to train the model, and the verification set is used to verify the accuracy of the model.
The Jieba ("stutter") tool is a Chinese word segmentation tool developed in Python. It supports custom dictionaries and provides three word segmentation modes: (1) precise mode, which tries to cut the sentence as accurately as possible and is suitable for text analysis; (2) full mode, which scans out all the words in the sentence that could form terms; it is very fast but cannot resolve ambiguity; and (3) search engine mode, which, building on the precise mode, re-segments long words to improve the recall rate and is suitable for search engine word segmentation.
For example, 「李小春真的很笨,笨得跟猪一样」 processed by Jieba yields 「李小春/真的/很笨/笨得/跟猪一样」, giving the tokens 「李小春」, 「真的」, 「很笨」, 「笨得」 and 「跟猪一样」; different kinds of segmentations can be obtained depending on the rules that are set.
The Word2vec model is a tool that maps words to numeric vectors; it is generated by training with the Word2vec algorithm on the corpus of the embodiments of this application. After training, the Word2vec model can be used to map each word to a vector, which can be used to express the relationships between words. The Word2vec model represents each word as a multi-dimensional vector and projects the words into a vector space. Words with the same attributes may lie close together, and some vectors even exhibit logical linear relationships.
The algorithm of the Word2vec model includes the following three main steps: (1) treat common word pairs (word pairs) or phrases as single "words"; (2) subsample high-frequency words to reduce the number of training samples; and (3) apply "negative sampling" to the optimization objective, so that training on each sample updates only a small fraction of the model weights, thereby reducing the computational burden.
A word vector is a distributed representation of a word. The basic idea is to express each word as an n-dimensional dense, continuous real-valued vector, giving each word vector some capacity to express features. For example, the word "北京" (Beijing) can be mapped to a real-valued vector: 北京 = [0.85, -0.15, 0.64, 0.54, ..., 0.98]. It is produced through a distributed representation (Distributed Representation): a fixed-length dense word vector in which information is stored distributed across the dimensions of the vector, so that related or similar words are closer to each other in distance.
Likewise, words such as "中国" (China), "东京" (Tokyo) and "日本" (Japan) are mapped to vectors of their own such that "中国" - "北京" = "日本" - "东京". There are two main mapping approaches: CBOW and skip-gram. CBOW uses the vectors of the words w(t-2), w(t-1), w(t+1) and w(t+2) in the context of the word w(t) and, through a three-layer network, predicts whether the middle position is the vector of w(t), thereby determining the real vectors that represent these words; skip-gram is the opposite, using w(t) to predict whether its context is w(t-2), w(t-1), w(t+1), w(t+2).
Step 304: split the mapped training set into multiple sub-training sets, use the multiple sub-training sets to alternately train multiple quality inspection models, and save, during training, those of the multiple quality inspection models that meet the requirements.
Details of how the training set is split into multiple sub-training sets: the training set is shuffled, and the shuffled training set is then segmented from the beginning into pieces of a certain length so as to form different sub-training sets, where the length refers to a number of texts, for example 512 sentences.
Way 1 of saving the quality inspection model: save once every fixed number of training iteration steps. The number of iteration steps refers to the number of times an operation is repeated before a specified numerical condition is met.
In deep learning, the training at each iteration step consists of two parts: forward propagation and back propagation. Forward propagation computes the prediction result from the input and the parameters in the network; back propagation computes the difference between the prediction result and the real result and adjusts the parameters in the network. Together, these two steps form one iteration (or, one iteration step) of the training process; generally, after a number of training steps, the parameters in the model are saved to the hard disk in the form of files.
Way 2 of saving the quality inspection model: save the quality inspection models whose accuracy rate (number of correctly predicted violation messages / (number of correctly predicted violation messages + number of incorrectly predicted violation messages)) and recall rate (number of correctly predicted violation messages / number of actual violation messages in the verification set) on the verification set are both relatively high; for example, the accuracy rate must be greater than 0.7 and the recall rate greater than 0.4.
The saved model is the trained quality inspection model. Physically, a saved model is a model file whose interior holds the learned parameters; given a passage of text as input, it can output whether the text violates the rules and which quality inspection point is violated. Training is an iterative process, and a model can be saved at every step, but the results of such a model are not necessarily good.
Step 305: make predictions using the quality inspection model that meets the requirements, and submit the prediction results to the quality inspection personnel for review. Prediction means checking WeChat text with the saved quality inspection model.
This application introduces deep learning methods for text quality inspection: it uses Jieba word segmentation to segment the text content, uses Word2vec to map words to word vectors, uses TensorFlow to construct a Bi-LSTM RNN, and introduces an Attention mechanism into the network. This provides a certain semantic understanding capability, improves the accuracy of quality inspection, reduces the pressure on quality inspection personnel, and greatly improves the efficiency of text quality inspection.
This application also provides a computer device capable of executing programs, such as a smartphone, tablet computer, notebook computer, desktop computer, rack server, blade server, tower server or cabinet server (including a stand-alone server, or a server cluster composed of multiple servers). The computer device of this embodiment includes at least, but is not limited to, a memory and a processor that can be communicatively connected to each other through a system bus.
This embodiment also provides a computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, a server, an app store, and so on, on which a computer program is stored that realizes the corresponding function when executed by a processor. The computer-readable storage medium of this embodiment is used to store the program of the electronic device 10 and, when executed by a processor, implements the text quality inspection method of this application.
The above serial numbers of the embodiments of this application are for description only and do not represent the merits of the embodiments.
Through the description of the above embodiments, a person skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disk) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods described in the various embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit the scope of its patent; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims (20)

  1. A text quality inspection method applied to an electronic device, characterized in that the method comprises the steps of:
    collecting multiple keywords from WeChat text and labeling the multiple keywords to obtain a quality inspection text data set with quality inspection labels;
    constructing a neural network, and dividing the quality inspection text data set into a training set and a verification set at a fixed ratio by means of the neural network;
    segmenting the text in the training set and the verification set with a Chinese word segmentation tool to obtain multiple words, and mapping each word to a word vector;
    splitting the mapped training set into multiple sub-training sets, using the multiple sub-training sets to alternately train multiple quality inspection models, and saving, during training, those of the multiple quality inspection models that meet the requirements; and
    making predictions with the quality inspection model that meets the requirements, and reviewing the prediction results.
  2. The text quality inspection method of claim 1, characterized in that the method further comprises: mapping each word to the word vector using a Word2vec model.
  3. The text quality inspection method of claim 1, characterized in that the method further comprises: dividing the quality inspection text data set into the training set and the verification set at a ratio of 99:1 by means of the neural network.
  4. The text quality inspection method of claim 1, characterized in that the method further comprises:
    shuffling the training set, and then segmenting the shuffled training set from the beginning into pieces of a certain length so as to form different sub-training sets.
  5. The text quality inspection method of claim 1, characterized in that the operation of saving the quality inspection model further comprises:
    saving once every fixed number of training iteration steps, wherein the training at each iteration step includes forward propagation and back propagation; a prediction result is obtained through the forward propagation, and the back propagation computes the difference between the prediction result and the real result and adjusts the parameters in the network.
  6. The text quality inspection method of claim 1, characterized in that the operation of saving the quality inspection model further comprises:
    saving quality inspection models whose accuracy rate and recall rate on the verification set are higher than the default values, wherein accuracy rate = number of correctly predicted violation messages / (number of correctly predicted violation messages + number of incorrectly predicted violation messages), and recall rate = number of correctly predicted violation messages / number of actual violation messages in the verification set.
  7. An electronic device, characterized by comprising:
    a data collection and labeling module for collecting multiple keywords from WeChat text and labeling the multiple keywords to obtain a quality inspection text data set with quality inspection labels;
    a data processing module for constructing a neural network and dividing the quality inspection text data set into a training set and a verification set at a fixed ratio by means of the neural network;
    a word segmentation and mapping module for segmenting the text in the training set and the verification set with a Chinese word segmentation tool to obtain multiple words, and mapping each word to a word vector;
    a training module for splitting the mapped training set into multiple sub-training sets, using the multiple sub-training sets to alternately train multiple quality inspection models, and saving, during training, those of the multiple quality inspection models that meet the requirements; and
    a prediction module for making predictions with the quality inspection model that meets the requirements, and reviewing the prediction results.
  8. The electronic device of claim 7, characterized in that the word segmentation and mapping module is further configured to:
    map each word to the word vector using a Word2vec model.
  9. The electronic device of claim 7, characterized in that the data processing module is further configured to:
    divide the quality inspection text data set into the training set and the verification set at a ratio of 99:1 by means of the neural network.
  10. The electronic device of claim 7, characterized in that the training module is further configured to:
    shuffle the training set, and then segment the shuffled training set from the beginning into pieces of a certain length so as to form different sub-training sets.
  11. The electronic device of claim 7, characterized in that the training module is further configured to:
    save once every fixed number of training iteration steps, wherein the training at each iteration step includes forward propagation and back propagation; a prediction result is obtained through the forward propagation, and the back propagation computes the difference between the prediction result and the real result and adjusts the parameters in the network.
  12. The electronic device of claim 7, characterized in that the training module is further configured to:
    save quality inspection models whose accuracy rate and recall rate on the verification set are higher than the default values, wherein accuracy rate = number of correctly predicted violation messages / (number of correctly predicted violation messages + number of incorrectly predicted violation messages), and recall rate = number of correctly predicted violation messages / number of actual violation messages in the verification set.
  13. A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer program:
    collecting multiple keywords from WeChat text and labeling the multiple keywords to obtain a quality inspection text data set with quality inspection labels;
    constructing a neural network, and dividing the quality inspection text data set into a training set and a verification set at a fixed ratio by means of the neural network;
    segmenting the text in the training set and the verification set with a Chinese word segmentation tool to obtain multiple words, and mapping each word to a word vector;
    splitting the mapped training set into multiple sub-training sets, using the multiple sub-training sets to alternately train multiple quality inspection models, and saving, during training, those of the multiple quality inspection models that meet the requirements; and
    making predictions with the quality inspection model that meets the requirements, and reviewing the prediction results.
  14. The computer device of claim 13, characterized in that the computer program, when executed by the processor, further implements the following step: mapping each word to the word vector using a Word2vec model.
  15. The computer device of claim 13, characterized in that the computer program, when executed by the processor, further implements the following step: dividing the quality inspection text data set into the training set and the verification set at a ratio of 99:1 by means of the neural network.
  16. The computer device of claim 13, characterized in that the computer program, when executed by the processor, further implements the following step: shuffling the training set, and then segmenting the shuffled training set from the beginning into pieces of a certain length so as to form different sub-training sets.
  17. The computer device of claim 13, characterized in that the computer program, when executed by the processor, further implements the following step:
    saving once every fixed number of training iteration steps, wherein the training at each iteration step includes forward propagation and back propagation; a prediction result is obtained through the forward propagation, and the back propagation computes the difference between the prediction result and the real result and adjusts the parameters in the network.
  18. The computer device of claim 13, characterized in that the computer program, when executed by the processor, further implements the following step:
    saving quality inspection models whose accuracy rate and recall rate on the verification set are higher than the default values, wherein accuracy rate = number of correctly predicted violation messages / (number of correctly predicted violation messages + number of incorrectly predicted violation messages), and recall rate = number of correctly predicted violation messages / number of actual violation messages in the verification set.
  19. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the following steps:
    collecting multiple keywords from WeChat text and labeling the multiple keywords to obtain a quality inspection text data set with quality inspection labels;
    constructing a neural network, and dividing the quality inspection text data set into a training set and a verification set at a fixed ratio by means of the neural network;
    segmenting the text in the training set and the verification set with a Chinese word segmentation tool to obtain multiple words, and mapping each word to a word vector;
    splitting the mapped training set into multiple sub-training sets, using the multiple sub-training sets to alternately train multiple quality inspection models, and saving, during training, those of the multiple quality inspection models that meet the requirements; and
    making predictions with the quality inspection model that meets the requirements, and reviewing the prediction results.
  20. The computer-readable storage medium of claim 19, characterized in that the computer program, when executed by the processor, further implements the following step: dividing the quality inspection text data set into the training set and the verification set at a ratio of 99:1 by means of the neural network.
PCT/CN2019/091879 2018-12-25 2019-06-19 Text quality inspection method, electronic device, computer device and storage medium WO2020133960A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811589528.1A CN109815487B (zh) 2018-12-25 2018-12-25 Text quality inspection method, electronic device, computer device and storage medium
CN201811589528.1 2018-12-25

Publications (1)

Publication Number Publication Date
WO2020133960A1 true WO2020133960A1 (zh) 2020-07-02

Family

ID=66602469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/091879 WO2020133960A1 (zh) 2018-12-25 2019-06-19 Text quality inspection method, electronic device, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN109815487B (zh)
WO (1) WO2020133960A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723182A (zh) * 2020-07-10 2020-09-29 云南电网有限责任公司曲靖供电局 Key information extraction method and device for vulnerability text
CN111782684A (zh) * 2020-07-14 2020-10-16 广东电网有限责任公司电力调度控制中心 Electronic handover information matching method and device for distribution networks
CN112131345A (zh) * 2020-09-22 2020-12-25 腾讯科技(深圳)有限公司 Text quality recognition method, apparatus, device and storage medium
CN112685396A (zh) * 2020-12-30 2021-04-20 平安普惠企业管理有限公司 Financial data violation detection method, apparatus, computer device and storage medium
CN113590825A (zh) * 2021-07-30 2021-11-02 平安科技(深圳)有限公司 Text quality inspection method, apparatus and related device
CN114925920A (zh) * 2022-05-25 2022-08-19 中国平安财产保险股份有限公司 Offline position prediction method, apparatus, electronic device and storage medium
CN116029291A (zh) * 2023-03-29 2023-04-28 摩尔线程智能科技(北京)有限责任公司 Keyword recognition method, apparatus, electronic device and storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815487B (zh) 2018-12-25 2023-04-18 平安科技(深圳)有限公司 Text quality inspection method, electronic device, computer device and storage medium
CN111177380A (zh) 2019-12-21 2020-05-19 厦门快商通科技股份有限公司 Intent data quality inspection method and system
CN111291162B (zh) 2020-02-26 2024-04-09 深圳前海微众银行股份有限公司 Quality inspection example sentence mining method, apparatus, device and computer-readable storage medium
CN111581195A (zh) 2020-04-29 2020-08-25 厦门快商通科技股份有限公司 Method, system and apparatus for labeling quality inspection data
CN112465399A (zh) 2020-12-16 2021-03-09 作业帮教育科技(北京)有限公司 Intelligent quality inspection method and apparatus based on automatic policy iteration, and electronic device
CN112668857A (zh) 2020-12-23 2021-04-16 深圳壹账通智能科技有限公司 Data classification method, apparatus, device and storage medium for staged quality inspection

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107730087A (zh) * 2017-09-20 2018-02-23 平安科技(深圳)有限公司 Prediction model training method, data monitoring method, apparatus, device and medium
CN108446388A (zh) * 2018-03-22 2018-08-24 平安科技(深圳)有限公司 Text data quality inspection method, apparatus, device and computer-readable storage medium
US20180285740A1 (en) * 2017-04-03 2018-10-04 Royal Bank Of Canada Systems and methods for malicious code detection
CN109815487A (zh) * 2018-12-25 2019-05-28 平安科技(深圳)有限公司 Text quality inspection method, electronic device, computer device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10289678B2 (en) * 2013-12-16 2019-05-14 Fairwords, Inc. Semantic analyzer for training a policy engine
AU2016102425A4 (en) * 2015-04-28 2019-10-24 Red Marker Pty Ltd Device, process and system for risk mitigation
CN108491388B (zh) * 2018-03-22 2021-02-23 平安科技(深圳)有限公司 Data set acquisition method, classification method, apparatus, device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180285740A1 (en) * 2017-04-03 2018-10-04 Royal Bank Of Canada Systems and methods for malicious code detection
CN107730087A (zh) * 2017-09-20 2018-02-23 平安科技(深圳)有限公司 Prediction model training method, data monitoring method, apparatus, device and medium
CN108446388A (zh) * 2018-03-22 2018-08-24 平安科技(深圳)有限公司 Text data quality inspection method, apparatus, device and computer-readable storage medium
CN109815487A (zh) * 2018-12-25 2019-05-28 平安科技(深圳)有限公司 Text quality inspection method, electronic device, computer device and storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723182A (zh) * 2020-07-10 2020-09-29 云南电网有限责任公司曲靖供电局 Key information extraction method and device for vulnerability text
CN111723182B (zh) * 2020-07-10 2023-12-08 云南电网有限责任公司曲靖供电局 Key information extraction method and device for vulnerability text
CN111782684A (zh) * 2020-07-14 2020-10-16 广东电网有限责任公司电力调度控制中心 Electronic handover information matching method and device for distribution networks
CN111782684B (zh) * 2020-07-14 2023-12-29 广东电网有限责任公司电力调度控制中心 Electronic handover information matching method and device for distribution networks
CN112131345A (zh) * 2020-09-22 2020-12-25 腾讯科技(深圳)有限公司 Text quality recognition method, apparatus, device and storage medium
CN112131345B (zh) * 2020-09-22 2024-02-06 腾讯科技(深圳)有限公司 Text quality recognition method, apparatus, device and storage medium
CN112685396A (zh) * 2020-12-30 2021-04-20 平安普惠企业管理有限公司 Financial data violation detection method, apparatus, computer device and storage medium
CN113590825A (zh) * 2021-07-30 2021-11-02 平安科技(深圳)有限公司 Text quality inspection method, apparatus and related device
CN114925920A (zh) * 2022-05-25 2022-08-19 中国平安财产保险股份有限公司 Offline position prediction method, apparatus, electronic device and storage medium
CN114925920B (zh) * 2022-05-25 2024-05-03 中国平安财产保险股份有限公司 Offline position prediction method, apparatus, electronic device and storage medium
CN116029291A (zh) * 2023-03-29 2023-04-28 摩尔线程智能科技(北京)有限责任公司 Keyword recognition method, apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN109815487A (zh) 2019-05-28
CN109815487B (zh) 2023-04-18

Similar Documents

Publication Publication Date Title
WO2020133960A1 (zh) Text quality inspection method, electronic device, computer device and storage medium
CN107436922B (zh) Text label generation method and apparatus
WO2021174919A1 (zh) Resume data information parsing and matching method and apparatus, electronic device, and medium
US11016966B2 Semantic analysis-based query result retrieval for natural language procedural queries
US11693894B2 Conversation oriented machine-user interaction
CN108763510B (zh) Intent recognition method, apparatus, device and storage medium
WO2020143314A1 (zh) Search-engine-based question answering method and apparatus, storage medium, and computer device
CN110909122B (zh) Information processing method and related device
CN108304373B (zh) Semantic dictionary construction method and apparatus, storage medium and electronic apparatus
CN108491389B (zh) Clickbait title corpus recognition model training method and apparatus
WO2020259280A1 (zh) Log management method and apparatus, network device, and readable storage medium
US11232263B2 Generating summary content using supervised sentential extractive summarization
CN110096572B (zh) Sample generation method, apparatus, and computer-readable medium
CN111414746B (zh) Matching sentence determination method, apparatus, device and storage medium
US20170017716A1 Generating Probabilistic Annotations for Entities and Relations Using Reasoning and Corpus-Level Evidence
US20220067290A1 Automatically identifying multi-word expressions
CN110874536A (zh) Corpus quality evaluation model generation method and bilingual sentence pair translation quality evaluation method
CN111199151A (zh) Data processing method and data processing apparatus
CN113609847A (zh) Information extraction method and apparatus, electronic device, and storage medium
CN112613293A (zh) Summary generation method and apparatus, electronic device, and storage medium
CN116629238A (zh) Text augmentation quality evaluation method, electronic device, and storage medium
WO2020181800A1 (zh) Apparatus, method and storage medium for predicting the score of question-and-answer content
CN116662518A (zh) Question answering method and apparatus, electronic device, and readable storage medium
CN107729509B (zh) Document similarity determination method based on latent high-dimensional distributed feature representation
CN108733702B (zh) Method, apparatus, electronic device and medium for extracting hypernym-hyponym relations from user queries

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19904393

Country of ref document: EP

Kind code of ref document: A1