WO2020140620A1 - Text classification method, apparatus, server and storage medium based on intelligent decision-making - Google Patents

Text classification method, apparatus, server and storage medium based on intelligent decision-making

Info

Publication number
WO2020140620A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
comment
preset
comment text
word
Prior art date
Application number
PCT/CN2019/117861
Other languages
English (en)
French (fr)
Inventor
Jin Ge (金戈)
Xu Liang (徐亮)
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2020140620A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text classification method based on intelligent decision-making, and an apparatus, a server and a storage medium. The method comprises: constructing a first bag-of-words model from training text (S101), the first bag-of-words model comprising the word features of each comment text in the training text; determining, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition, and generating a second bag-of-words model from the word-feature set (S102); constructing, by means of the second bag-of-words model, a cascade forest model for text classification (S103); and, when a target comment text to be classified needs to be classified, calling the cascade forest model to classify the target comment text, obtaining a classification result for the target comment text (S104). The method can improve computation speed and classification accuracy.

Description

Text classification method, apparatus, server and storage medium based on intelligent decision-making
This application claims priority to the Chinese patent application No. 201910007838.6, filed with the Chinese Patent Office on January 4, 2019 and entitled "Text classification method, apparatus, server and medium based on intelligent decision-making", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a text classification method, apparatus, server and storage medium based on intelligent decision-making.
Background
In natural language processing, neural networks such as recurrent neural networks are commonly used to classify text. However, text classification with neural networks such as recurrent neural networks suffers from a number of problems, including low computational efficiency and limited classification accuracy.
Summary
Embodiments of this application provide a text classification method, apparatus, server and storage medium based on intelligent decision-making, which can improve computational efficiency and classification accuracy.
In a first aspect, an embodiment of this application provides a text classification method based on intelligent decision-making, including:
constructing a first bag-of-words model from training text; the first bag-of-words model including the word features of each comment text in the training text;
determining, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition, and generating a second bag-of-words model from the word-feature set;
constructing, by means of the second bag-of-words model, a cascade forest model for text classification;
when a target comment text to be classified needs to be classified, calling the cascade forest model to classify the target comment text, obtaining a classification result for the target comment text.
In a second aspect, an embodiment of this application provides a text classification apparatus based on intelligent decision-making, including:
a construction unit configured to construct a first bag-of-words model from training text; the first bag-of-words model including the word features of each comment text in the training text;
a processing unit configured to determine, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition, and to generate a second bag-of-words model from the word-feature set;
the construction unit being further configured to construct, by means of the second bag-of-words model, a cascade forest model for text classification;
the processing unit being further configured to, when a target comment text to be classified needs to be classified, call the cascade forest model to classify the target comment text, obtaining a classification result for the target comment text.
In a third aspect, an embodiment of this application provides a server including a processor, an input device, an output device and a memory that are connected to one another, where the memory is configured to store a computer program comprising program instructions, and the processor is configured to call the program instructions to perform the method of the first aspect.
In a fourth aspect, an embodiment of this application provides a computer non-volatile readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect.
In summary, the server can construct a first bag-of-words model from training text, filter out of the word features of the first bag-of-words model a word-feature set that satisfies a preset condition so as to build a second bag-of-words model, construct from the second bag-of-words model a cascade forest model for text classification, and call the cascade forest model to classify a target comment text, obtaining a classification result for that text. Classifying text with the constructed cascade forest model improves both computation speed and classification accuracy.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a text classification method based on intelligent decision-making provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of a text classification method based on intelligent decision-making provided by a further embodiment of this application;
FIG. 3 is a schematic structural diagram of a text classification apparatus based on intelligent decision-making provided by an embodiment of this application;
FIG. 4 is a schematic structural diagram of a server provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
Referring to FIG. 1, a schematic flowchart of a text classification method based on intelligent decision-making provided by an embodiment of this application. The method can be applied to a server and may include the following steps:
S101: Construct a first bag-of-words model from training text.
The first bag-of-words model may include the word features of each comment text in the training text, and may also include the values of those word features for each comment text.
In the embodiments of this application, each comment text may be a supervisor's comment on a subordinate; in one embodiment, such comments can be divided into categories such as work, study and personality. Alternatively, the comment texts may be user comments on insurance products; in one embodiment, these can be divided into categories such as service, quality, life cycle and price. Alternatively, the comment texts may be user comments on objects such as novels and videos; in one embodiment, user comments on a video can be divided into categories such as synopsis, advertisement, technical, historical, theoretical and impressionistic.
In one embodiment, the server may use functions in sk-learn to construct the first bag-of-words model from the training text.
In one embodiment, the first bag-of-words model may be a term frequency-inverse document frequency (TF-IDF) bag-of-words model, that is, a model obtained by combining the bag-of-words model with the TF-IDF model.
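By way of illustration, constructing such a TF-IDF bag-of-words model with sk-learn might look like the following minimal sketch; the toy corpus and variable names are assumptions introduced here, since the application only states that sk-learn functions may be used:

```python
# Minimal sketch (illustrative, not the application's own code): build a
# TF-IDF bag-of-words model with scikit-learn. The toy corpus is assumed.
from sklearn.feature_extraction.text import TfidfVectorizer

training_texts = [
    "works hard and delivers on time",
    "needs to improve communication at work",
    "studies new tools quickly and learns well",
]

vectorizer = TfidfVectorizer()
first_bow = vectorizer.fit_transform(training_texts)  # sparse matrix of TF-IDF values

print(vectorizer.get_feature_names_out())  # the word features of the model
print(first_bow.toarray())                 # value of each word feature per comment text
```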
S102: Determine, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition, and generate a second bag-of-words model from the word-feature set.
Because some texts are long, the first bag-of-words model can be rather large and inconvenient to store. A word-feature set satisfying a preset condition can therefore be determined from the word features of the first bag-of-words model by means such as the chi-square test or information gain, and a second bag-of-words model generated from that set. The second bag-of-words model may include the word-feature set, and may also include, for each comment text, the values of the word features in the set.
For the chi-square test, in one embodiment, the server determining a word-feature set that satisfies a preset condition from the word features of the first bag-of-words model and generating a second bag-of-words model from it may include: the server performs a chi-square operation on the word features in the first bag-of-words model to obtain the chi-square value of each word feature; the server selects, from these features, those whose chi-square value is greater than a preset value to build the word-feature set, and generates a second bag-of-words model that includes this set.
For example, the server performs a chi-square operation on the 1000 word features in the first bag-of-words model to obtain the chi-square value of each word feature; it selects, from the 1000 word features, those whose chi-square value is greater than the preset value, builds a word-feature set from them, and generates a second bag-of-words model that includes the set.
Alternatively, this step may include: the server performs a chi-square operation on the word features in the first bag-of-words model to obtain the chi-square value of each word feature; the server sorts the word features by chi-square value from high to low, selects the top preset number of word features to build the word-feature set, and generates a second bag-of-words model that includes this set.
For example, the server performs a chi-square operation on the 1000 word features in the first bag-of-words model to obtain the chi-square value of each; it sorts the 1000 features by chi-square value from high to low, selects the top 500 to build the word-feature set, and generates a second bag-of-words model that includes the set.
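A sketch of the top-k variant, reusing `first_bow` and `vectorizer` from the sketch above, could look as follows; the label array `y` is an assumed stand-in for the known categories of the training comments:

```python
# Sketch of the top-k chi-square selection described above. The threshold
# variant could instead keep the features where chi2(first_bow, y)[0]
# exceeds a preset value.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

y = np.array([0, 0, 1])            # assumed categories, e.g. 0 = work, 1 = study

selector = SelectKBest(chi2, k=5)  # keep the 5 word features with highest chi-square
second_bow = selector.fit_transform(first_bow, y)

kept = selector.get_support(indices=True)
print(vectorizer.get_feature_names_out()[kept])  # the retained word-feature set
```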
S103: Construct, by means of the second bag-of-words model, a cascade forest model for text classification.
The cascade forest model may include a preset number of cascade layers, for example 3 to 8 layers; in this solution the preset number of layers may be 5. Each cascade layer may include a preset number of random forests, such as 4 random forests. Alternatively, each cascade layer may include a first number of completely-random tree forests and a second number of random forests, such as 2 completely-random tree forests and 2 random forests.
In one embodiment, the server constructing the cascade forest model by means of the second bag-of-words model may include: the server divides the second bag-of-words model into a growth subset and an evaluation subset; the server trains the current cascade layer with the growth subset, and uses the evaluation subset to check whether the accuracy of the current cascade has improved; if there is no improvement, it stops adding cascade layers, yielding the final cascade forest model; if there is an improvement, it continues to add cascade layers and trains each added layer in turn with the growth subset.
The server using the evaluation subset to check whether the accuracy of the current cascade has improved may include: the server feeds the evaluation subset into the current cascade and obtains classification results at its output; it compares these results with the known categories to obtain the accuracy of the current cascade; and it compares this accuracy with that of the cascade of the previous layer to decide whether the accuracy of the current cascade has improved.
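The layer-growing loop just described might be sketched as follows, in the style of Zhou and Feng's gcForest. The growth/evaluation split, the 2 + 2 forest layout and the stop-when-accuracy-stops-improving rule come from the description above; re-appending each layer's class-probability vectors to the original features is an assumption borrowed from the usual gcForest design, which the application does not spell out:

```python
# Hedged sketch of cascade-forest growth. `X` is a dense feature array
# (e.g. second_bow.toarray()) and `y` holds the known categories.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

def grow_cascade(X, y, max_layers=8, seed=0):
    X_grow, X_eval, y_grow, y_eval = train_test_split(
        X, y, test_size=0.3, random_state=seed)      # growth / evaluation subsets
    layers, best_acc = [], 0.0
    grow_in, eval_in = X_grow, X_eval
    for _ in range(max_layers):
        layer = [RandomForestClassifier(random_state=seed),
                 RandomForestClassifier(random_state=seed + 1),
                 ExtraTreesClassifier(random_state=seed + 2),  # completely-random trees
                 ExtraTreesClassifier(random_state=seed + 3)]
        for forest in layer:
            forest.fit(grow_in, y_grow)
        grow_prob = np.hstack([f.predict_proba(grow_in) for f in layer])
        eval_prob = np.hstack([f.predict_proba(eval_in) for f in layer])
        # layer prediction on the evaluation subset: average the forests'
        # class probabilities, then take the most likely class
        pred = eval_prob.reshape(len(y_eval), len(layer), -1).mean(axis=1).argmax(axis=1)
        acc = np.mean(pred == y_eval)
        if acc <= best_acc:                          # no improvement: stop adding layers
            break
        best_acc = acc
        layers.append(layer)
        grow_in = np.hstack([X_grow, grow_prob])     # augmented input for the next layer
        eval_in = np.hstack([X_eval, eval_prob])
    return layers
```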
S104: When a target comment text to be classified needs to be classified, call the cascade forest model to classify the target comment text, obtaining a classification result for the target comment text.
The target comment text may be a new text or some other text, and the classification result includes the category of the target comment. For example, where the comment texts are supervisors' comments on subordinates, in one embodiment the result may be any one or more of categories such as work, study and personality. Where they are user comments on insurance products, in one embodiment the result may be any one or more of categories such as service, quality, life cycle and price. Where they are user comments on objects such as novels and videos, in one embodiment the result may be any one or more of categories such as synopsis, advertisement, technical, historical, theoretical and impressionistic.
Specifically, the server calling the cascade forest model to classify the target comment text and obtaining its classification result may include: the server inputs the word features of the target comment text into the cascade forest model for classification; the server then outputs the classification result of the target comment text through the cascade forest model.
The server inputting the word features of the target comment text into the cascade forest model may include: the server inputs the values of the word features of the target comment text into the cascade forest model.
In one embodiment, the server may determine the word features of the target comment text through the second bag-of-words model, and may likewise obtain their values through the second bag-of-words model; the value of a word feature of the target comment text is the value, for that text, of the corresponding word feature in the word-feature set.
The embodiments of this application may also obtain the word features of the target comment text and their values in other ways, which are not enumerated here.
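Continuing the sketches above (and assuming a realistically sized training set, so that `grow_cascade` kept at least one layer), inference on a target comment might look like this; the target sentence is purely illustrative:

```python
# Sketch of S104: vectorize the target comment with the same vectorizer,
# re-apply the chi-square selection, then pass the values through the
# cascade layer by layer, re-appending each layer's probabilities.
def cascade_predict(layers, x):
    feats, prob = x, None
    for layer in layers:
        prob = np.hstack([f.predict_proba(feats) for f in layer])
        feats = np.hstack([x, prob])               # re-append original word features
    # average the last layer's probabilities over its forests, then argmax
    return prob.reshape(len(x), len(layers[-1]), -1).mean(axis=1).argmax(axis=1)

target = vectorizer.transform(["keeps learning new skills and works carefully"])
x_new = selector.transform(target).toarray()       # values of the word-feature set
print(cascade_predict(layers, x_new))              # index of the predicted category
```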
As can be seen, in the embodiment shown in FIG. 1 the server can construct a first bag-of-words model from training text, filter out of its word features a word-feature set that satisfies a preset condition to build a second bag-of-words model, and use the second bag-of-words model to construct a cascade forest model for text classification, so that when a target comment text needs to be classified the cascade forest model is called to classify it and obtain its classification result. Classifying text with the constructed cascade forest model improves both computation speed and classification accuracy.
Referring to FIG. 2, a schematic flowchart of a text classification method based on intelligent decision-making provided by a further embodiment of this application. The method can be applied to a server and may include the following steps:
S201: Acquire a comment text collection from a designated platform.
In the embodiments of this application, the server acquiring the comment text collection from the designated platform may include the server downloading the collection from that platform; or, if the collection is stored in the server's database, the server may fetch it from the database.
The comment text collection includes multiple comment texts. The designated platform may differ with the classification object. For example, if the classification object is supervisors' comments on subordinates, the designated platform may be the server of the employing company. If it is user comments on insurance products, the designated platform may be the server of an insurance company. If it is user comments on novels, the designated platform may be a novel server. If it is user comments on videos, the designated platform may be a server such as a video server or a film-review server.
S202: Select training texts from the comment text collection as directed by preset filtering rules.
The preset filtering rules may include any one or more of the following: a useless-comment filtering rule, a comment-time filtering rule, a text-length filtering rule, and a comment-category filtering rule.
In one embodiment, where the preset filtering rules include the useless-comment filtering rule, the server selecting training texts from the collection as directed by the rules includes: the server identifies useless comment texts in the collection and deletes them from the collection; the server takes the collection after deletion as the training texts. A useless comment text is any one or more of the following: a vulgar comment text, a comment text whose usefulness index is below a first preset value, a comment text whose uselessness index is above a second preset value, a comment text that does not belong to the classification categories and/or the classification object. Filtering out useless comments improves the reliability of the training texts.
In one embodiment, vulgar comment texts can be detected through keyword matching and the like; for example, if a comment text is detected to contain the word "garbage", it may be marked as vulgar. The usefulness index can be determined from the number of clicks or views of a "useful" icon, or from parameters such as the number of shares and favourites; the uselessness index can likewise be determined from parameters such as clicks or views of a "useless" icon. Comment texts that do not belong to the classification categories and/or the classification object can be identified by manual screening or machine learning, which the embodiments of this application do not restrict; for example, if the classification object is supervisors' comments on subordinates, then comments in the collection such as employees' remarks on the company environment or on traffic near the company are marked as useless texts.
In one embodiment, where the preset filtering rules include the comment-time filtering rule, the server selecting training texts from the collection as directed by the rules includes: the server obtains the comment time of each comment text in the collection; it identifies, among the comment texts, those whose comment time falls within a preset time range, and takes the comment texts within the preset time range as the training texts. The preset time range may be, for example, the past year, the past half year, or the past quarter.
For example, the server obtains the comment time of each comment text in the collection, identifies those posted within the past half year, and takes the comments from the past half year as the training texts.
In one embodiment, the preset time range can be set with different policies for different scenarios. For example, in an employee-appraisal scenario the preset time range may follow a preset appraisal cycle, say the past half year; in a video-analysis scenario it may be set from the video's release time, for instance a certain period after the video is released.
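A minimal sketch of the comment-time rule for the "past half year" example might be the following, assuming the comments are available as (text, posted_at) pairs:

```python
# Illustrative sketch of the comment-time filtering rule; the data layout
# is an assumption introduced for illustration.
from datetime import datetime, timedelta

def filter_by_time(comments, days=183):
    cutoff = datetime.now() - timedelta(days=days)   # roughly the past half year
    return [(text, posted_at) for text, posted_at in comments if posted_at >= cutoff]
```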
In one embodiment, where the preset filtering rules include both the useless-comment filtering rule and the comment-time filtering rule, the server selecting training texts from the collection as directed by the rules includes: the server obtains the comment time of each comment text in the collection; it identifies the comment texts whose comment time falls within the preset time range and deletes the useless comment texts among them; it then takes the comment texts within the preset time range that remain after the deletion as the training texts.
In one embodiment, where the preset filtering rules include the text-length filtering rule, the server selecting training texts from the collection as directed by the rules includes: the server counts the text length of each comment text in the collection; it identifies, among the comment texts, those whose text length is greater than a preset text length, and takes the comment texts longer than the preset text length as the training texts.
For example, the server counts the text length of each comment text in the collection, identifies the comment texts with a text length greater than 30, and takes the comment texts longer than 30 as the training texts.
In one embodiment, the server taking the comment texts longer than the preset text length as training texts includes: from the comment texts longer than the preset text length, the server deletes those in which the number of repeated words is greater than a preset number; the comment texts longer than the preset text length that remain after the deletion are used as the training texts. Deleting comment texts with excessive word repetition improves the reliability of the training texts.
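An illustrative sketch of the text-length rule combined with the repeated-word rule follows; the preset values and the whitespace tokenisation are assumptions (Chinese text would first need a word segmenter such as jieba):

```python
# Keep comments longer than a preset length, then drop those whose most
# frequent word repeats more than a preset number of times.
from collections import Counter

MIN_LEN, MAX_REPEATS = 30, 5    # assumed preset length / preset number

def select_by_length(comments):
    long_enough = [c for c in comments if len(c) > MIN_LEN]
    return [c for c in long_enough
            if max(Counter(c.split()).values(), default=0) <= MAX_REPEATS]
```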
S203: Construct a first bag-of-words model from the training texts; the first bag-of-words model includes the word features of each comment text in the training texts.
S204: Determine, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition, and generate a second bag-of-words model from the word-feature set.
S205: Construct, by means of the second bag-of-words model, a cascade forest model for text classification.
S206: When a target comment text to be classified needs to be classified, call the cascade forest model to classify the target comment text, obtaining a classification result for the target comment text.
For steps S203-S206, refer to steps S101-S104 of the embodiment of FIG. 1, which are not repeated here.
As can be seen, in the embodiment shown in FIG. 2 the server selects training texts from the comment collection fetched from the designated platform according to certain filtering rules, which makes the training texts a more useful reference. The server can then obtain the cascade forest model from the filtered training texts and use it to classify the target comment text to be classified, improving both computation speed and classification accuracy.
Referring to FIG. 3, a schematic structural diagram of a text classification apparatus based on intelligent decision-making provided by an embodiment of this application. The apparatus can be applied to a server and may include:
a construction unit 31 configured to construct a first bag-of-words model from training text; the first bag-of-words model including the word features of each comment text in the training text;
a processing unit 32 configured to determine, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition, and to generate a second bag-of-words model from the word-feature set;
the construction unit 31 being further configured to construct, by means of the second bag-of-words model, a cascade forest model for text classification;
the processing unit 32 being further configured to, when a target comment text to be classified needs to be classified, call the cascade forest model to classify the target comment text, obtaining a classification result for the target comment text.
In an optional implementation, an acquisition unit 33 is configured to acquire a comment text collection from a designated platform; the comment text collection includes multiple comment texts.
In an optional implementation, a filtering unit 34 is configured to select training texts from the comment text collection as directed by preset filtering rules; the preset filtering rules include any one or more of the following: a useless-comment filtering rule, a comment-time filtering rule, a text-length filtering rule, a comment-category filtering rule.
In an optional implementation, the preset filtering rules include the useless-comment filtering rule, and the filtering unit 34 is specifically configured to identify useless comment texts in the collection and delete them from the collection, a useless comment text being any one or more of the following: a vulgar comment text, a comment text whose usefulness index is below a first preset value, a comment text whose uselessness index is above a second preset value, a comment text not belonging to the classification categories and/or the classification object; and to take the collection after deletion as the training texts.
In an optional implementation, the preset filtering rules include the comment-time filtering rule, and the filtering unit 34 is specifically configured to obtain the comment time of each comment text in the collection, identify the comment texts whose comment time falls within a preset time range, and take the comment texts within the preset time range as the training texts.
In an optional implementation, the preset filtering rules include the text-length filtering rule, and the filtering unit 34 is specifically configured to count the text length of each comment text in the collection, identify the comment texts whose text length is greater than a preset text length, and take the comment texts longer than the preset text length as the training texts.
In an optional implementation, the filtering unit 34 taking the comment texts longer than the preset text length as training texts is specifically: deleting, from the comment texts longer than the preset text length, those in which the number of repeated words is greater than a preset number, and using the comment texts longer than the preset text length that remain after the deletion as the training texts.
In an optional implementation, the processing unit 32 determining the word-feature set satisfying the preset condition from the word features of the first bag-of-words model and generating the second bag-of-words model from it is specifically: performing a chi-square operation on the word features in the first bag-of-words model to obtain the chi-square value of each word feature; sorting the word features by chi-square value from high to low; selecting the top preset number of word features to build the word-feature set; and generating a second bag-of-words model that includes the set.
As can be seen, in the embodiment shown in FIG. 3 the server can construct a first bag-of-words model from training text, filter out of its word features a word-feature set that satisfies a preset condition to build a second bag-of-words model, and use the second bag-of-words model to construct a cascade forest model for text classification, so that when a target comment text needs to be classified the cascade forest model is called to classify it and obtain its classification result. Classifying text with the constructed cascade forest model improves both computation speed and classification accuracy.
Referring to FIG. 4, a schematic structural diagram of a server provided by an embodiment of this application. The server described in this embodiment may include one or more processors 1000, one or more input devices 2000, one or more output devices 3000 and a memory 4000; the processor 1000, input device 2000, output device 3000 and memory 4000 may be connected by a bus.
The input device 2000 and the output device 3000 may be standard wired or wireless communication interfaces.
The processor 1000 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 4000 may be a high-speed RAM memory or a non-volatile memory, such as disk storage. The memory 4000 is configured to store a set of program codes, and the input device 2000, output device 3000 and processor 1000 can call the program codes stored in the memory 4000. Specifically:
the processor 1000 is configured to construct a first bag-of-words model from training text, the first bag-of-words model including the word features of each comment text in the training text; to determine, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition, and generate a second bag-of-words model from the word-feature set; to construct, by means of the second bag-of-words model, a cascade forest model for text classification; and, when a target comment text to be classified needs to be classified, to call the cascade forest model to classify the target comment text, obtaining a classification result for the target comment text.
Optionally, the processor 1000 is further configured to acquire a comment text collection, comprising multiple comment texts, from a designated platform, and to select training texts from the collection as directed by preset filtering rules; the preset filtering rules include any one or more of the following: a useless-comment filtering rule, a comment-time filtering rule, a text-length filtering rule, a comment-category filtering rule.
Optionally, the preset filtering rules include the useless-comment filtering rule, and the processor 1000 is specifically configured to identify useless comment texts in the collection and delete them from the collection, a useless comment text being any one or more of the following: a vulgar comment text, a comment text whose usefulness index is below a first preset value, a comment text whose uselessness index is above a second preset value, a comment text not belonging to the classification categories and/or the classification object; and to take the collection after deletion as the training texts.
Optionally, the preset filtering rules include the comment-time filtering rule, and the processor 1000 is specifically configured to obtain the comment time of each comment text in the collection, identify the comment texts whose comment time falls within a preset time range, and take the comment texts within the preset time range as the training texts.
Optionally, the preset filtering rules include the text-length filtering rule, and the processor 1000 is specifically configured to count the text length of each comment text in the collection, identify the comment texts whose text length is greater than a preset text length, and take the comment texts longer than the preset text length as the training texts.
Optionally, the processor 1000 taking the comment texts longer than the preset text length as training texts is specifically: deleting, from the comment texts longer than the preset text length, those in which the number of repeated words is greater than a preset number, and using the comment texts longer than the preset text length that remain after the deletion as the training texts.
Optionally, the processor 1000 is specifically configured to perform a chi-square operation on the word features in the first bag-of-words model to obtain the chi-square value of each word feature; to sort the word features by chi-square value from high to low; to select the top preset number of word features to build the word-feature set; and to generate a second bag-of-words model that includes the set.
In a specific implementation, the processor 1000, input device 2000 and output device 3000 described in the embodiments of this application can carry out the implementations described in the embodiments of FIGS. 1 and 2, as well as the other implementations described in the embodiments of this application, which are not repeated here.
The functional modules in the embodiments of this application may be integrated into one processing module, or each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated modules may be implemented in the form of hardware or in the form of software functional modules.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be carried out by a computer program instructing the relevant hardware. The program may be stored in a computer non-volatile readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The computer non-volatile readable storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is merely a preferred embodiment of this application and certainly cannot limit the scope of its claims; those of ordinary skill in the art will understand all or part of the processes for implementing the above embodiments, and equivalent changes made according to the claims of this application still fall within the scope covered by this application.

Claims (20)

  1. A text classification method based on intelligent decision-making, characterized in that it comprises:
    constructing a first bag-of-words model from training text; the first bag-of-words model comprising the word features of each comment text in the training text;
    determining, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition, and generating a second bag-of-words model from the word-feature set;
    constructing, by means of the second bag-of-words model, a cascade forest model for text classification;
    when a target comment text to be classified needs to be classified, calling the cascade forest model to classify the target comment text, obtaining a classification result for the target comment text.
  2. The method according to claim 1, characterized in that the method further comprises:
    acquiring a comment text collection from a designated platform; the comment text collection comprising multiple comment texts;
    selecting training texts from the comment text collection as directed by preset filtering rules; the preset filtering rules comprising any one or more of the following: a useless-comment filtering rule, a comment-time filtering rule, a text-length filtering rule, a comment-category filtering rule.
  3. The method according to claim 2, characterized in that the preset filtering rules comprise the useless-comment filtering rule, and selecting training texts from the comment text collection as directed by the preset filtering rules comprises:
    identifying the useless comment texts in the comment text collection and deleting the useless comment texts from the collection; a useless comment text being any one or more of the following: a vulgar comment text, a comment text whose usefulness index is below a first preset value, a comment text whose uselessness index is above a second preset value, a comment text not belonging to the classification categories and/or the classification object;
    taking the comment text collection on which the deletion has been performed as the training texts.
  4. The method according to claim 2, characterized in that the preset filtering rules comprise the comment-time filtering rule, and selecting training texts from the comment text collection as directed by the preset filtering rules comprises:
    obtaining the comment time of each comment text in the comment text collection;
    identifying, among the comment texts, those whose comment time falls within a preset time range, and taking the comment texts within the preset time range as the training texts.
  5. The method according to claim 2, characterized in that the preset filtering rules comprise the text-length filtering rule, and selecting training texts from the comment text collection as directed by the preset filtering rules comprises:
    counting the text length of each comment text in the comment text collection;
    identifying, among the comment texts, those whose text length is greater than a preset text length, and taking the comment texts whose text length is greater than the preset text length as the training texts.
  6. The method according to claim 5, characterized in that taking the comment texts whose text length is greater than the preset text length as the training texts comprises:
    deleting, from the comment texts whose text length is greater than the preset text length, those in which the number of repeated words is greater than a preset number;
    taking the comment texts whose text length is greater than the preset text length, on which the deletion has been performed, as the training texts.
  7. The method according to claim 2, characterized in that the preset filtering rules comprise the useless-comment filtering rule and the comment-time filtering rule, and selecting training texts from the comment text collection as directed by the preset filtering rules comprises:
    obtaining the comment time of each comment text in the comment text collection;
    identifying, among the comment texts, those whose comment time falls within a preset time range, and deleting the useless comment texts among the comment texts whose comment time falls within the preset time range;
    taking the comment texts whose comment time falls within the preset time range, on which the deletion has been performed, as the training texts.
  8. The method according to any one of claims 1-7, characterized in that determining, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition and generating a second bag-of-words model from the word-feature set comprises:
    performing a chi-square operation on the word features in the first bag-of-words model to obtain the chi-square value of each word feature;
    sorting the word features by chi-square value from high to low, selecting the top preset number of word features to build the word-feature set, and generating a second bag-of-words model comprising the word-feature set.
  9. A text classification apparatus based on intelligent decision-making, characterized in that it comprises:
    a construction unit configured to construct a first bag-of-words model from training text; the first bag-of-words model comprising the word features of each comment text in the training text;
    a processing unit configured to determine, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition, and to generate a second bag-of-words model from the word-feature set;
    the construction unit being further configured to construct, by means of the second bag-of-words model, a cascade forest model for text classification;
    the processing unit being further configured to, when a target comment text to be classified needs to be classified, call the cascade forest model to classify the target comment text, obtaining a classification result for the target comment text.
  10. The apparatus according to claim 9, characterized in that the apparatus further comprises an acquisition unit and a filtering unit;
    the acquisition unit being configured to acquire a comment text collection from a designated platform; the comment text collection comprising multiple comment texts;
    the filtering unit being configured to select training texts from the comment text collection as directed by preset filtering rules; the preset filtering rules comprising any one or more of the following: a useless-comment filtering rule, a comment-time filtering rule, a text-length filtering rule, a comment-category filtering rule.
  11. The apparatus according to claim 10, characterized in that the preset filtering rules comprise the useless-comment filtering rule, and the filtering unit selecting training texts from the comment text collection as directed by the preset filtering rules is specifically: identifying the useless comment texts in the comment text collection and deleting the useless comment texts from the collection, a useless comment text being any one or more of the following: a vulgar comment text, a comment text whose usefulness index is below a first preset value, a comment text whose uselessness index is above a second preset value, a comment text not belonging to the classification categories and/or the classification object; and taking the comment text collection on which the deletion has been performed as the training texts.
  12. The apparatus according to claim 10, characterized in that the preset filtering rules comprise the comment-time filtering rule, and the filtering unit selecting training texts from the comment text collection as directed by the preset filtering rules is specifically: obtaining the comment time of each comment text in the comment text collection; identifying, among the comment texts, those whose comment time falls within a preset time range, and taking the comment texts within the preset time range as the training texts.
  13. The apparatus according to claim 10, characterized in that the preset filtering rules comprise the text-length filtering rule, and the filtering unit selecting training texts from the comment text collection as directed by the preset filtering rules is specifically: counting the text length of each comment text in the comment text collection; identifying, among the comment texts, those whose text length is greater than a preset text length, and taking the comment texts whose text length is greater than the preset text length as the training texts.
  14. The apparatus according to claim 13, characterized in that the filtering unit taking the comment texts whose text length is greater than the preset text length as the training texts is specifically: deleting, from the comment texts whose text length is greater than the preset text length, those in which the number of repeated words is greater than a preset number; and taking the comment texts whose text length is greater than the preset text length, on which the deletion has been performed, as the training texts.
  15. The apparatus according to claim 10, characterized in that the preset filtering rules comprise the useless-comment filtering rule and the comment-time filtering rule, and the filtering unit selecting training texts from the comment text collection as directed by the preset filtering rules is specifically: obtaining the comment time of each comment text in the comment text collection; identifying, among the comment texts, those whose comment time falls within a preset time range, and deleting the useless comment texts among the comment texts whose comment time falls within the preset time range; and taking the comment texts whose comment time falls within the preset time range, on which the deletion has been performed, as the training texts.
  16. The apparatus according to any one of claims 9-15, characterized in that the processing unit determining, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition and generating a second bag-of-words model from the word-feature set is specifically: performing a chi-square operation on the word features in the first bag-of-words model to obtain the chi-square value of each word feature; sorting the word features by chi-square value from high to low, selecting the top preset number of word features to build the word-feature set, and generating a second bag-of-words model comprising the word-feature set.
  17. A server, characterized in that it comprises a processor, an input device, an output device and a memory that are connected to one another, wherein the memory is configured to store a computer program, the computer program comprises program instructions, and the processor is configured to call the program instructions to perform:
    constructing a first bag-of-words model from training text; the first bag-of-words model comprising the word features of each comment text in the training text;
    determining, from the word features of the first bag-of-words model, a word-feature set that satisfies a preset condition, and generating a second bag-of-words model from the word-feature set;
    constructing, by means of the second bag-of-words model, a cascade forest model for text classification;
    when a target comment text to be classified needs to be classified, calling the cascade forest model to classify the target comment text, obtaining a classification result for the target comment text.
  18. The server according to claim 17, characterized in that the processor is further configured to acquire a comment text collection from a designated platform, the comment text collection comprising multiple comment texts, and to select training texts from the comment text collection as directed by preset filtering rules; the preset filtering rules comprising any one or more of the following: a useless-comment filtering rule, a comment-time filtering rule, a text-length filtering rule, a comment-category filtering rule.
  19. The server according to claim 18, characterized in that the preset filtering rules comprise the useless-comment filtering rule, and the processor selecting training texts from the comment text collection as directed by the preset filtering rules is specifically: identifying the useless comment texts in the comment text collection and deleting the useless comment texts from the collection, a useless comment text being any one or more of the following: a vulgar comment text, a comment text whose usefulness index is below a first preset value, a comment text whose uselessness index is above a second preset value, a comment text not belonging to the classification categories and/or the classification object; and taking the comment text collection on which the deletion has been performed as the training texts.
  20. A computer non-volatile readable storage medium, characterized in that the computer non-volatile readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1-8.
PCT/CN2019/117861 2019-01-04 2019-11-13 Text classification method, apparatus, server and storage medium based on intelligent decision-making WO2020140620A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910007838.6 2019-01-04
CN201910007838.6A CN109857862B (zh) 2019-01-04 2024-04-19 Text classification method, apparatus, server and medium based on intelligent decision-making

Publications (1)

Publication Number Publication Date
WO2020140620A1 true WO2020140620A1 (zh) 2020-07-09

Family

ID=66893881

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117861 WO2020140620A1 (zh) 2019-01-04 2019-11-13 Text classification method, apparatus, server and storage medium based on intelligent decision-making

Country Status (2)

Country Link
CN (1) CN109857862B (zh)
WO (1) WO2020140620A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985836A (zh) * 2020-08-31 2020-11-24 Method, apparatus, device and storage medium for constructing a medical-insurance scoring index system
CN113495959A (zh) * 2021-05-20 2021-10-12 Financial public-opinion recognition method and system based on text data
CN114925373A (zh) * 2022-05-17 2022-08-19 Method for automatically identifying privacy-protection-policy vulnerabilities of mobile applications based on user comments
CN117786560A (zh) * 2024-02-28 2024-03-29 Elevator fault classification method based on multi-granularity cascade forest, and electronic device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109857862B (zh) 2019-01-04 2024-04-19 Text classification method, apparatus, server and medium based on intelligent decision-making
CN110825874A (zh) * 2019-10-29 2020-02-21 Chinese text classification method and apparatus, and computer-readable storage medium
CN112036146A (zh) * 2020-08-25 2020-12-04 Comment generation method, apparatus, terminal device and storage medium
CN112182207B (zh) * 2020-09-16 2023-07-11 Invoice false-deduction risk assessment method based on keyword extraction and fast text classification
CN113887193A (zh) * 2021-09-14 2022-01-04 Dissertation evaluation method, system, medium and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101414300A (zh) * 2008-11-28 2009-04-22 Classification processing method for Internet public-opinion information
CN103136352A (zh) * 2013-02-27 2013-06-05 Full-text retrieval system based on two-layer semantic analysis
CN104750833A (zh) * 2015-04-03 2015-07-01 Text classification method and apparatus
CN106874959A (zh) * 2017-03-01 2017-06-20 Training method for a multi-scale scanning cascade forest learning machine
CN109857862A (zh) * 2019-01-04 2019-06-07 Text classification method, apparatus, server and medium based on intelligent decision-making

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008021244A2 (en) * 2006-08-10 2008-02-21 Trustees Of Tufts College Systems and methods for identifying unwanted or harmful electronic text
CN105335350A (zh) * 2015-10-08 2016-02-17 Language identification method based on ensemble learning
CN107292186B (zh) * 2016-03-31 2021-01-12 Model training method and apparatus based on random forest
CN109002473B (zh) * 2018-06-13 2022-02-11 Sentiment analysis method based on word vectors and part of speech

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101414300A (zh) * 2008-11-28 2009-04-22 Classification processing method for Internet public-opinion information
CN103136352A (zh) * 2013-02-27 2013-06-05 Full-text retrieval system based on two-layer semantic analysis
CN104750833A (zh) * 2015-04-03 2015-07-01 Text classification method and apparatus
CN106874959A (zh) * 2017-03-01 2017-06-20 Training method for a multi-scale scanning cascade forest learning machine
CN109857862A (zh) * 2019-01-04 2019-06-07 Text classification method, apparatus, server and medium based on intelligent decision-making

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985836A (zh) * 2020-08-31 2020-11-24 Method, apparatus, device and storage medium for constructing a medical-insurance scoring index system
CN111985836B (zh) * 2020-08-31 2024-04-05 Method, apparatus, device and storage medium for constructing a medical-insurance scoring index system
CN113495959A (zh) * 2021-05-20 2021-10-12 Financial public-opinion recognition method and system based on text data
CN114925373A (zh) * 2022-05-17 2022-08-19 Method for automatically identifying privacy-protection-policy vulnerabilities of mobile applications based on user comments
CN114925373B (zh) * 2022-05-17 2023-12-08 Method for automatically identifying privacy-protection-policy vulnerabilities of mobile applications based on user comments
CN117786560A (zh) * 2024-02-28 2024-03-29 Elevator fault classification method based on multi-granularity cascade forest, and electronic device
CN117786560B (zh) * 2024-02-28 2024-05-07 Elevator fault classification method based on multi-granularity cascade forest, and electronic device

Also Published As

Publication number Publication date
CN109857862A (zh) 2019-06-07
CN109857862B (zh) 2024-04-19

Similar Documents

Publication Publication Date Title
WO2020140620A1 (zh) Text classification method, apparatus, server and storage medium based on intelligent decision-making
US11704325B2 Systems and methods for automatic clustering and canonical designation of related data in various data structures
US11416535B2 User interface for visualizing search data
US10489441B1 Models for classifying documents
TWI718643B Abnormal group identification method and apparatus
US9589299B2 Systems and user interfaces for dynamic and interactive investigation of bad actor behavior based on automatic clustering of related data in various data structures
US9202249B1 Data item clustering and analysis
WO2018014610A1 (zh) Specific-user mining system based on the C4.5 decision tree algorithm, and method thereof
WO2021047186A1 (zh) Consultation dialogue processing method, apparatus, device and storage medium
US10579659B2 Method, apparatus, electronic equipment and storage medium for performing screening and statistical operation on data
TWI745589B Risk feature screening and description message generation method, apparatus and electronic device
US20150120583A1 Process and mechanism for identifying large scale misuse of social media networks
US11042525B2 Extracting and labeling custom information from log messages
CN110458324B Risk probability calculation method, apparatus and computer device
CN106682096A Log data management method and apparatus
CN102945246B Network information data processing method and apparatus
US9582586B2 Massive rule-based classification engine
US20230010680A1 Business Lines
JP5128437B2 (ja) Entity classification apparatus and method based on time-series relation graphs
CN107305555A Data processing method and apparatus
CN106293650A Folder attribute setting method and apparatus
CN113535939A Text processing method and apparatus, electronic device, and computer-readable storage medium
US20150094032A1 Method and apparatus for managing interruptions from different modes of communication
US20140067874A1 Performing predictive analysis
CN110781211B Data parsing method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19907309

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19907309

Country of ref document: EP

Kind code of ref document: A1