WO2020232898A1 - Text classification method, apparatus, electronic device, and computer non-volatile readable storage medium - Google Patents

Text classification method, apparatus, electronic device, and computer non-volatile readable storage medium

Info

Publication number
WO2020232898A1
WO2020232898A1 PCT/CN2019/103441 CN2019103441W WO2020232898A1 WO 2020232898 A1 WO2020232898 A1 WO 2020232898A1 CN 2019103441 W CN2019103441 W CN 2019103441W WO 2020232898 A1 WO2020232898 A1 WO 2020232898A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
word
classified
dimensional
word vector
Prior art date
Application number
PCT/CN2019/103441
Other languages
English (en)
French (fr)
Inventor
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020232898A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • This application relates to the field of machine learning application technology, and in particular to a text classification method, device, electronic equipment, and computer non-volatile readable storage medium.
  • Text classification uses a computer to automatically classify and label a collection of texts according to a given classification system or standard.
  • Currently, text classification usually uses a deep learning model built from a neural network: the words in the text are represented as numerical word vectors, the word vectors are combined into sentence vectors, and the sentence vectors are input into the deep learning model to classify the text.
  • The inventors of the present application realized that traditional classification methods perform loop computation over the sentence vectors of the entire text, which imposes a heavy computational load, and that the very large amount of information limits the accuracy of text classification.
  • To solve the above technical problem, one objective of the present application is to provide a text classification method, device, electronic equipment, and computer non-volatile readable storage medium.
  • In one aspect, a text classification method includes: searching a multi-dimensional word vector dictionary according to the words in a text to be classified to obtain the multi-dimensional word vector corresponding to each word; obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified; obtaining the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and inputting them into a predetermined dimension machine learning model in the order of the words in the text to be classified to obtain a predetermined dimension classification result of the text to be classified; inputting the multi-dimensional word vector of each keyword into a keyword machine learning model in the order of the words in the text to be classified to obtain a keyword classification result of the text to be classified; and determining the classification result of the text to be classified based on the predetermined dimension classification result and the keyword classification result.
  • In another aspect, a text classification device includes: a search module for searching a multi-dimensional word vector dictionary according to the words in the text to be classified to obtain the multi-dimensional word vector corresponding to each word; an acquisition module for obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified; a first classification module for obtaining the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word and inputting them into the predetermined dimension machine learning model in the order of the words in the text to be classified to obtain the predetermined dimension classification result of the text to be classified; a second classification module for inputting the multi-dimensional word vector of each keyword into the keyword machine learning model in the order of the words in the text to be classified to obtain the keyword classification result of the text to be classified; and a classification determination module for determining the classification result of the text to be classified based on the predetermined dimension classification result and the keyword classification result.
  • In another aspect, a text classification device includes: a processor; and a memory for storing a text classification program of the processor; wherein the processor is configured to perform the text classification method described above by executing the text classification program.
  • In another aspect, a computer non-volatile readable storage medium has a text classification program stored thereon, wherein the text classification program, when executed by a processor, implements the text classification method described above.
  • In the above technical solution, the predetermined dimension classification result is obtained from an analysis of the full text, while the keyword classification result is obtained from the representative keywords of the text. Combining the two can effectively ensure the accuracy of text classification.
  • Fig. 1 schematically shows a flowchart of a text classification method.
  • Fig. 2 schematically shows an example diagram of an application scenario of a text classification method.
  • Fig. 3 schematically shows a flow chart of a method for determining a classification result of a text to be classified.
  • Fig. 4 schematically shows a block diagram of a text classification device.
  • Fig. 5 shows a block diagram of an electronic device for implementing the above-mentioned text classification method according to an exemplary embodiment.
  • Fig. 6 shows a schematic diagram of a computer non-volatile readable storage medium for implementing the above text classification method according to an exemplary embodiment.
  • This example embodiment first provides a text classification method.
  • The text classification method can run on a server, a server cluster, a cloud server, or the like.
  • The text classification method may include the following steps:
  • Step S110: Search the multi-dimensional word vector dictionary according to the words in the text to be classified, and obtain the multi-dimensional word vector corresponding to each word.
  • Step S120: Obtain the multi-dimensional word vector of each keyword in the text to be classified from the multi-dimensional word vectors corresponding to the words.
  • Step S130: Obtain the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and input them into the predetermined dimension machine learning model in the order of the words in the text to be classified to obtain the predetermined dimension classification result of the text to be classified.
  • Step S140: Input the multi-dimensional word vector of each keyword into the keyword machine learning model in the order of the words in the text to be classified to obtain the keyword classification result of the text to be classified.
  • Step S150: Determine the classification result of the text to be classified based on the predetermined dimension classification result and the keyword classification result.
  • In this method, the multi-dimensional word vector dictionary is first searched to obtain the multi-dimensional word vector corresponding to each word; representing the words in the text to be classified as multi-dimensional word vectors facilitates accurate computation by the machine learning models in subsequent steps.
  • Next, the multi-dimensional word vector of each keyword in the text to be classified is obtained from the multi-dimensional word vectors corresponding to the words; because the keywords represent the key subject of the text, using them can effectively ensure the accuracy of text classification and effectively reduce the amount of computation in subsequent steps.
  • The multi-dimensional word vector of each keyword is then input into the keyword machine learning model in the order of the words in the text to be classified to obtain the keyword classification result of the text to be classified; the keyword vectors are few in number yet highly representative of the text, which can effectively reduce the computational load of the machine learning model, improve computational efficiency, and effectively improve the accuracy of pre-classification.
  • Finally, the classification result of the text to be classified is determined based on the predetermined dimension classification result and the keyword classification result; the predetermined dimension classification result is obtained from an analysis of the full text, and the keyword classification result is obtained from the representative keywords of the text, so combining the two can effectively ensure the accuracy of text classification.
  • In step S110, the multi-dimensional word vector dictionary is searched according to the words in the text to be classified, and the multi-dimensional word vector corresponding to each word is obtained.
  • The server 201 crawls the text to be classified from the server 202, or obtains the text to be classified stored on the server 201 itself; the server 201 can then perform word segmentation and other processing on the text and search the multi-dimensional word vector dictionary to obtain the multi-dimensional word vector corresponding to each word.
  • The server 201 can be any terminal capable of executing program instructions and storing data, such as a cloud server, mobile phone, or computer; the server 202 can be any terminal with a storage function, such as a mobile phone or computer.
  • The multi-dimensional word vector dictionary is a dictionary that predefines the word corresponding to each multi-dimensional vector.
  • The multi-dimensional vectors corresponding to different words differ in the element value of at least one dimension.
  • When any element value of a vector changes, the word corresponding to the vector changes. For example, the vector (1,2,3) represents "You"; when one of its values is changed, the resulting vector (1,2,2) represents "I".
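The one-to-one relation between words and vectors described above can be pictured as a dictionary with a reverse index. The following is a minimal sketch, not the application's implementation: the names `WORD_VECTORS` and `build_reverse_index` are hypothetical, and the only entries are the two example vectors from the text.

```python
from typing import Dict, Tuple

Vector = Tuple[int, ...]

# Toy dictionary holding only the two example entries from the text.
WORD_VECTORS: Dict[str, Vector] = {
    "You": (1, 2, 3),
    "I": (1, 2, 2),
}


def build_reverse_index(dictionary: Dict[str, Vector]) -> Dict[Vector, str]:
    """Map each vector back to its word, rejecting duplicate vectors."""
    reverse: Dict[Vector, str] = {}
    for word, vector in dictionary.items():
        if vector in reverse:
            # Two words sharing every element value would violate the
            # "at least one dimension differs" property.
            raise ValueError(f"{word!r} and {reverse[vector]!r} share a vector")
        reverse[vector] = word
    return reverse


REVERSE_INDEX = build_reverse_index(WORD_VECTORS)
```

Building the reverse index doubles as a consistency check: it fails as soon as two words are assigned identical vectors.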
  • Searching the multi-dimensional word vector dictionary according to the words in the text to be classified to obtain the multi-dimensional word vector corresponding to each word includes:
  • The text to be classified usually consists of complete sentences, each of which contains many words.
  • An existing word segmentation method can accurately segment the text to be classified. For example, segmenting the sentence "Today the Sunshine put out to sea smoothly" yields "today", "Sunshine", "smoothly", "out", and "sea".
  • Each word can then be used to look up its multi-dimensional word vector in the multi-dimensional word vector dictionary.
  • Because the multi-dimensional word vectors of different words differ, representing each word by its vector keeps the semantics of each sentence consistent with the original text and supports the accuracy of text classification in subsequent steps.
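The segment-then-look-up flow of step S110 can be sketched as follows. This is only an illustration under stated assumptions: `segment` is a hypothetical placeholder that splits on whitespace (a real system would use a proper word segmenter, which the text does not name), and silently skipping words missing from the dictionary is also an assumption.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def segment(text: str) -> List[str]:
    """Placeholder segmenter (assumption): split the text on whitespace."""
    return text.split()


def text_to_vectors(text: str, dictionary: Dict[str, Vector]) -> List[Tuple[str, Vector]]:
    """Segment the text and pair each known word with its vector, in text order."""
    # Words absent from the dictionary are skipped; the source does not say
    # how unknown words should be handled.
    return [(word, dictionary[word]) for word in segment(text) if word in dictionary]
```

Returning (word, vector) pairs keeps the original word order available to later steps, which feed vectors to the models in that order.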
  • In step S120, the multi-dimensional word vector of each keyword in the text to be classified is obtained from the multi-dimensional word vectors corresponding to the words.
  • Obtaining the multi-dimensional word vector of each keyword in the text to be classified from the multi-dimensional word vectors corresponding to the words includes:
  • Because the keywords are the words that represent the key subject of the text, using them can ensure the accuracy of text classification and effectively reduce the amount of computation in subsequent steps.
  • In one embodiment, determining the keywords in the text to be classified includes:
  • A predetermined number of the most frequently occurring words are determined as keywords.
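Frequency-based keyword selection, followed by picking out the keyword vectors for step S120, might look like the sketch below. The function names are hypothetical, and the sketch does not filter stop words, which the text does not discuss.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

Vector = Tuple[float, ...]


def top_frequency_keywords(words: Sequence[str], count: int) -> Set[str]:
    """Return the `count` most frequent words in the text."""
    return {word for word, _ in Counter(words).most_common(count)}


def keyword_vectors(
    words: Sequence[str], vectors: Sequence[Vector], keywords: Set[str]
) -> List[Vector]:
    """Keep only the vectors of keywords, preserving their order in the text."""
    return [vec for word, vec in zip(words, vectors) if word in keywords]
```

Preserving text order matters because the keyword vectors are later input into the keyword model in the order the words appear.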
  • In another embodiment, determining the keywords in the text to be classified includes:
  • Words are determined as keywords according to a score computed from the following quantities: A is the number of times a word appears in the text, B is the total number of words in the text, C is the total number of texts in the text library, D is the number of texts in the text library that contain the word, and E is the weight of the paragraph of the text in which the word appears.
  • The frequency of the word in the text is given by A/B.
  • The text library is a pre-collected store of a large number of texts.
  • log(C/(D+1)) measures how common a word is across all texts. When a word appears in many of the texts, it is a commonly used word: the larger the denominator D+1, the smaller the value of log(C/(D+1)) and the closer it is to 0.
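The quantities A through E can be combined into a single keyword score. The source text gives the factors A/B and log(C/(D+1)) and the paragraph weight E, but the combining formula itself did not survive extraction; multiplying the three factors, as in conventional TF-IDF weighting, is an assumption made for this sketch, and `keyword_score` is a hypothetical name.

```python
import math


def keyword_score(a: int, b: int, c: int, d: int, e: float = 1.0) -> float:
    """Score one word; combining the factors by multiplication is an assumption."""
    term_frequency = a / b                          # A / B
    inverse_doc_frequency = math.log(c / (d + 1))   # log(C / (D + 1))
    return term_frequency * inverse_doc_frequency * e
```

A word that occurs in nearly every text of the library gets a score near 0 regardless of how often it appears in the current text, which is what keeps commonly used words from being picked as keywords.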
  • In another embodiment, among the words in the text to be classified, the words located at a specific position relative to a specific word are determined as the keywords of the text to be classified.
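Position-based keyword selection can be sketched as below. The text does not say which specific words or positions are used, so the `specific_word` and `offset` parameters are hypothetical stand-ins.

```python
from typing import List, Sequence


def keywords_by_position(words: Sequence[str], specific_word: str, offset: int) -> List[str]:
    """Collect words located `offset` positions from each occurrence of `specific_word`."""
    found = []
    for index, word in enumerate(words):
        target = index + offset
        # Ignore positions that fall outside the text.
        if word == specific_word and 0 <= target < len(words):
            found.append(words[target])
    return found
```

For instance, an offset of 1 selects the word immediately after each occurrence of the specific word, and an offset of -1 selects the word immediately before it.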
  • In step S130, the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word are obtained and input into the predetermined dimension machine learning model in the order of the words in the text to be classified to obtain the predetermined dimension classification result of the text to be classified.
  • The predetermined dimension refers to a particular dimension of the multi-dimensional word vectors of the words in the text to be classified.
  • For example, if the vector of "you" is (1,2,3), then 1 is the element value of the first dimension, 2 is the element value of the second dimension, and 3 is the element value of the third dimension.
  • To obtain the sub-classification results of the text to be classified, the first-dimension element value of each word's vector is taken out and input into the machine learning model in word order. The element values of the second through the last dimension are then taken out in turn and input into the machine learning model in the same order, yielding the predetermined dimension classification results of the text to be classified. Extracting the element values of the predetermined dimensions and using the trained predetermined dimension machine learning model can effectively reduce the scale of computation, improve computational efficiency, and give an accurate initial classification of the text.
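Taking out the element values dimension by dimension amounts to transposing the list of word vectors. A minimal sketch, assuming all vectors have the same length; `dimension_sequences` is a hypothetical name:

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def dimension_sequences(vectors: Sequence[Vector]) -> List[List[float]]:
    """Return one sequence per dimension, each ordered as the words are in the text."""
    # zip(*vectors) groups the first elements together, then the second, and so on.
    return [list(values) for values in zip(*vectors)]
```

Each returned sequence is what would be fed to the predetermined dimension model for that dimension, so a text of n words with k-dimensional vectors produces k sequences of length n rather than one sequence of n full vectors.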
  • The training method of the predetermined dimension machine learning model includes:
  • The coefficients of the machine learning model are adjusted until the predetermined dimension classification result is consistent with the pre-calibrated category of the text sample.
  • When the predetermined dimension classification results of all text samples are consistent with the pre-calibrated categories of those samples, the training ends.
  • Using text samples with pre-calibrated categories, the element values of the predetermined dimensions of the multi-dimensional word vectors of the words in each sample are input into the machine learning model in order, and the model is made to output the pre-calibrated category; in this way the predetermined dimension machine learning model can be trained accurately.
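The "adjust the coefficients until every sample matches its calibrated category" loop can be illustrated with a simple perceptron. This is a stand-in chosen for brevity, not the model used in the application, which the text does not specify; the sketch also assumes two categories (0 and 1), fixed-length inputs, and an epoch cap so that training terminates even when the samples cannot all be matched.

```python
from typing import List, Sequence, Tuple

# (element values in word order, pre-calibrated category 0 or 1)
Sample = Tuple[Sequence[float], int]


def predict(weights: Sequence[float], bias: float, values: Sequence[float]) -> int:
    """Classify one sequence of element values with a linear threshold."""
    score = sum(w * v for w, v in zip(weights, values)) + bias
    return 1 if score > 0 else 0


def train(samples: Sequence[Sample], max_epochs: int = 100) -> Tuple[List[float], float]:
    """Adjust coefficients until every sample matches its calibrated category."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for values, category in samples:
            error = category - predict(weights, bias, values)
            if error != 0:
                # Inconsistent with the calibrated category: adjust coefficients.
                mistakes += 1
                weights = [w + error * v for w, v in zip(weights, values)]
                bias += error
        if mistakes == 0:
            # All samples are consistent with their categories: training ends.
            break
    return weights, bias


SAMPLES: List[Sample] = [
    ([1.0, 1.0], 1),
    ([-1.0, -1.0], 0),
    ([2.0, 0.5], 1),
    ([-0.5, -2.0], 0),
]
```

The stopping condition mirrors the text: training ends once a full pass over the samples produces no disagreement with the pre-calibrated categories.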
  • In step S140, the multi-dimensional word vector of each keyword is input into the keyword machine learning model in the order of the words in the text to be classified to obtain the keyword classification result of the text to be classified.
  • The keyword vectors are few in number yet highly representative of the text, which can effectively reduce the computational load of the machine learning model, improve computational efficiency, and effectively improve the accuracy of pre-classification.
  • The method for training the keyword machine learning model includes: setting up a text sample set in which each text sample has a known classification result; obtaining the keyword vectors of each text sample and inputting them into the keyword machine learning model, which outputs a sub-classification result for the text sample; and comparing the sub-classification result with the known classification result of the text sample. If they are inconsistent, the machine learning model is adjusted so that the sub-classification result becomes consistent with the known classification result of the text sample.
  • In step S150, the classification result of the text to be classified is determined based on the predetermined dimension classification result and the keyword classification result.
  • The predetermined dimension classification result is obtained from an analysis of the full text, and the keyword classification result is obtained from the representative keywords of the text. Combining the two can effectively ensure the accuracy of text classification.
  • Determining the classification result of the text to be classified based on the predetermined dimension classification result and the keyword classification result includes step S310, step S320, and step S330:
  • Step S310: Obtain the classification results of all dimensions.
  • Step S320: Obtain the classification results of all keywords.
  • Step S330: Use the classification result that occurs most often among the classification results of all dimensions and the classification results of all keywords as the classification result of the text to be classified.
  • The classification result that occurs most often among the classification results of all dimensions and all keywords is the one most closely related to the text; using it as the classification result of the text to be classified effectively ensures the accuracy of text classification.
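Steps S310 to S330 amount to a majority vote over the per-dimension and per-keyword results. A minimal sketch; `final_category` is a hypothetical name, and tie-breaking, which the text does not specify, is not handled specially here.

```python
from collections import Counter
from typing import Sequence


def final_category(dimension_results: Sequence[str], keyword_results: Sequence[str]) -> str:
    """Return the category that occurs most often across both result lists."""
    votes = Counter(list(dimension_results) + list(keyword_results))
    return votes.most_common(1)[0][0]
```

Pooling both result lists into one tally gives every dimension result and every keyword result equal weight; a weighted vote would be a different design that the text does not describe.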
  • The text classification apparatus may include a search module 410, an acquisition module 420, a first classification module 430, a second classification module 440, and a classification determination module 450. Specifically:
  • The search module 410 can be used to search the multi-dimensional word vector dictionary according to the words in the text to be classified, and obtain the multi-dimensional word vector corresponding to each word;
  • The acquisition module 420 may be used to obtain the multi-dimensional word vector of each keyword in the text to be classified from the multi-dimensional word vectors corresponding to the words;
  • The first classification module 430 may be used to obtain the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and input them into the predetermined dimension machine learning model in the order of the words in the text to be classified to obtain the predetermined dimension classification result of the text to be classified;
  • The second classification module 440 may be used to input the multi-dimensional word vector of each keyword into the keyword machine learning model in the order of the words in the text to be classified to obtain the keyword classification result of the text to be classified;
  • The classification determination module 450 may be configured to determine the classification result of the text to be classified based on the predetermined dimension classification result and the keyword classification result.
  • Although several modules or units of the device for performing actions are mentioned in the above detailed description, this division is not mandatory.
  • The features and functions of two or more modules or units described above may be embodied in one module or unit.
  • Conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units.
  • The exemplary embodiments described herein can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, server, mobile terminal, or network device) to execute the method according to the embodiments of the present application.
  • The electronic device 500 according to this embodiment of the present invention will be described below with reference to FIG. 5.
  • The electronic device 500 shown in FIG. 5 is only an example, and should not impose any limitation on the function or scope of application of the embodiments of the present invention.
  • The electronic device 500 takes the form of a general-purpose computing device.
  • The components of the electronic device 500 may include, but are not limited to: the aforementioned at least one processing unit 510, the aforementioned at least one storage unit 520, and a bus 530 connecting different system components (including the storage unit 520 and the processing unit 510).
  • The storage unit stores program code that can be executed by the processing unit 510, so that the processing unit 510 performs the steps of the various exemplary embodiments described in the "Exemplary Methods" section of this specification.
  • For example, the processing unit 510 may perform the steps shown in FIG. 1: step S110, searching the multi-dimensional word vector dictionary according to the words in the text to be classified to obtain the multi-dimensional word vector corresponding to each word; step S120, obtaining the multi-dimensional word vector of each keyword in the text to be classified from the multi-dimensional word vectors corresponding to the words; step S130, obtaining the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word and inputting them into the predetermined dimension machine learning model in the order of the words in the text to be classified to obtain the predetermined dimension classification result of the text to be classified; step S140, inputting the multi-dimensional word vector of each keyword into the keyword machine learning model in the order of the words in the text to be classified to obtain the keyword classification result of the text to be classified; and step S150, determining the classification result of the text to be classified based on the predetermined dimension classification result and the keyword classification result.
  • The storage unit 520 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 5201 and/or a cache storage unit 5202, and may further include a read-only storage unit (ROM) 5203.
  • The storage unit 520 may also include a program/utility tool 5204 having a set of (at least one) program modules 5205.
  • Such program modules 5205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of multiple bus structures.
  • The electronic device 500 may also communicate with one or more external devices 700 (such as a keyboard, pointing device, or Bluetooth device), with one or more devices that enable a user to interact with the electronic device 500, and/or with any device (such as a router or modem) that enables the electronic device 500 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 550.
  • The electronic device 500 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 560. As shown in the figure, the network adapter 560 communicates with the other modules of the electronic device 500 through the bus 530.
  • The exemplary embodiments described herein can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, server, terminal device, or network device) to execute the method according to the embodiments of the present application.
  • A computer-readable storage medium is also provided, on which is stored a program product capable of implementing the method described in this specification.
  • Various aspects of the present invention may also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section of this specification.
  • A program product 600 for implementing the above method according to an embodiment of the present invention is described. It may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code, and may run on a terminal device such as a personal computer.
  • However, the program product of the present invention is not limited thereto.
  • The readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
  • The program product can use any combination of one or more readable media.
  • The readable medium may be a readable signal medium or a readable storage medium.
  • The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • The computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code carried therein. Such a propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • The readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
  • The program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the foregoing.
  • The program code used to perform the operations of the present invention can be written in any combination of one or more programming languages.
  • The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • The program code can be executed entirely on the client computing device, partly on the client device, as a stand-alone software package, partly on the client computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • The remote computing device can be connected to the client computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computing device (for example, through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

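The present application provides a text classification method, apparatus, electronic device, and computer non-volatile readable storage medium, belonging to the technical field of machine learning applications. The method includes: searching a multi-dimensional word vector dictionary according to the words in a text to be classified to obtain the multi-dimensional word vector corresponding to each word; obtaining the multi-dimensional word vector of each keyword in the text to be classified; obtaining the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and inputting them into a predetermined dimension machine learning model in word order to obtain a predetermined dimension classification result; inputting the multi-dimensional word vector of each keyword into a keyword machine learning model in word order to obtain a keyword classification result; and determining the classification result of the text to be classified based on the predetermined dimension classification result and the keyword classification result. By combining keyword classification and predetermined dimension classification through machine learning models, the present application effectively reduces the computational load while effectively improving the accuracy of text classification.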

Description

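Text classification method, apparatus, electronic device, and computer non-volatile readable storage medium
This application claims priority to Chinese patent application No. 201910435075.5, filed on May 23, 2019 and entitled "Text classification method, apparatus, medium, and electronic device", the entire contents of which are incorporated herein by reference.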
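Technical Field
The present application relates to the technical field of machine learning applications, and in particular to a text classification method, apparatus, electronic device, and computer non-volatile readable storage medium.
Background
Text classification is the use of a computer to automatically classify and label a collection of texts according to a given classification system or standard.
At present, text classification usually relies on deep learning models built from neural networks: after the words in a text are represented as numerical word vectors, the word vectors are merged into sentence vectors, which are input into a deep learning model for text classification, and the text is classified accordingly. The inventors of the present application have recognized that, in conventional classification methods, performing recurrent computation over the sentence vectors of an entire passage imposes a heavy computational load, and the very large amount of information limits the accuracy of text classification.
Therefore, a new text classification method, apparatus, medium, and electronic device need to be provided.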
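Summary
To solve the above technical problem, one object of the present application is to provide a text classification method, apparatus, electronic device, and computer non-volatile readable storage medium.
The technical solutions adopted by the present application are as follows:
In one aspect, a text classification method includes: looking up a multi-dimensional word vector dictionary according to the words in a text to be classified, to obtain the multi-dimensional word vector corresponding to each word; obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified; obtaining the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and inputting them into a predetermined-dimension machine learning model in the order in which the words appear in the text to be classified, to obtain a predetermined-dimension classification result of the text to be classified; inputting the multi-dimensional word vectors of the keywords into a keyword machine learning model in the order in which the words appear in the text to be classified, to obtain a keyword classification result of the text to be classified; and determining the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result.
In another aspect, a text classification apparatus is characterized in that it includes: a lookup module, configured to look up a multi-dimensional word vector dictionary according to the words in a text to be classified, to obtain the multi-dimensional word vector corresponding to each word; an obtaining module, configured to obtain, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified; a first classification module, configured to obtain the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and input them into a predetermined-dimension machine learning model in the order in which the words appear in the text to be classified, to obtain a predetermined-dimension classification result of the text to be classified; a second classification module, configured to input the multi-dimensional word vectors of the keywords into a keyword machine learning model in the order in which the words appear in the text to be classified, to obtain a keyword classification result of the text to be classified; and a classification determination module, configured to determine the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result.
In another aspect, a text classification apparatus includes: a processor; and a memory for storing a text classification program of the processor; wherein the processor is configured to perform the above text classification method by executing the text classification program.
In another aspect, a computer non-volatile readable storage medium has a text classification program stored thereon, characterized in that the text classification program, when executed by a processor, implements the above text classification method.
In the above technical solutions, the predetermined-dimension classification result is obtained from an analysis of the full text, while the keyword classification result is obtained from the representative keywords of the text; combining the two can effectively ensure the accuracy of text classification.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.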
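Brief Description of the Drawings
The accompanying drawings are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present application, and together with the specification serve to explain the principles of the present application.
FIG. 1 schematically shows a flowchart of a text classification method.
FIG. 2 schematically shows an example application scenario of a text classification method.
FIG. 3 schematically shows a flowchart of a method for determining the classification result of a text to be classified.
FIG. 4 schematically shows a block diagram of a text classification apparatus.
FIG. 5 shows a block diagram of an electronic device for implementing the above text classification method according to an exemplary embodiment.
FIG. 6 shows a schematic diagram of a computer non-volatile readable storage medium for implementing the above text classification method according to an exemplary embodiment.
The above drawings show specific embodiments of the present application, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the concept of the present application in any way, but to explain the concept of the present application to those skilled in the art by reference to particular embodiments.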
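Detailed Description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application as detailed in the appended claims.
Example implementations will now be described more fully with reference to the accompanying drawings. However, the example implementations can be carried out in many forms and should not be construed as limited to the examples set forth here; rather, these implementations are provided so that the present application will be thorough and complete, and will fully convey the concept of the example implementations to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more implementations.
This example implementation first provides a text classification method. The method may run on a server, or on a server cluster or cloud server; those skilled in the art may of course also run the method of the present invention on other platforms as needed, and this exemplary embodiment places no particular limitation on this. Referring to FIG. 1, the text classification method may include the following steps: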
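Step S110. Look up a multi-dimensional word vector dictionary according to the words in the text to be classified, to obtain the multi-dimensional word vector corresponding to each word.
Step S120. Obtain, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified.
Step S130. Obtain the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and input them into a predetermined-dimension machine learning model in the order in which the words appear in the text to be classified, to obtain a predetermined-dimension classification result of the text to be classified.
Step S140. Input the multi-dimensional word vectors of the keywords into a keyword machine learning model in the order in which the words appear in the text to be classified, to obtain a keyword classification result of the text to be classified.
Step S150. Determine the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result.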
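In the above text classification method, first, a multi-dimensional word vector dictionary is looked up according to the words in the text to be classified, to obtain the multi-dimensional word vector corresponding to each word; representing the words of the text as multi-dimensional word vectors makes accurate computation by the machine learning models possible in the subsequent steps. Then, the multi-dimensional word vector of each keyword in the text to be classified is obtained from the multi-dimensional word vectors corresponding to the words; because the keywords represent the central theme of the text, extracting them can effectively ensure the accuracy of text classification while effectively reducing the amount of computation in the subsequent steps. Then, the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word are obtained and input into the predetermined-dimension machine learning model in the order in which the words appear in the text, to obtain the predetermined-dimension classification result of the text; extracting only the element values of the predetermined dimension and using the trained predetermined-dimension machine learning model reduces the scale of the computation, improves computational efficiency, and yields an accurate preliminary classification of the text. Then, the multi-dimensional word vectors of the keywords are input into the keyword machine learning model in the order in which the words appear in the text, to obtain the keyword classification result of the text; the keyword vectors are few in number yet highly representative of the text, which effectively reduces the computational load of the machine learning model, improves computational efficiency, and effectively improves the accuracy of the pre-classification. Finally, the classification result of the text to be classified is determined based on the predetermined-dimension classification result and the keyword classification result; the predetermined-dimension classification result is obtained from an analysis of the full text, while the keyword classification result is obtained from the representative keywords of the text, and combining the two can effectively ensure the accuracy of text classification.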
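The steps of the above text classification method in this example implementation are explained and described in detail below with reference to the drawings.
In step S110, a multi-dimensional word vector dictionary is looked up according to the words in the text to be classified, to obtain the multi-dimensional word vector corresponding to each word.
In this example implementation, referring to FIG. 2, server 201 crawls the text to be classified from server 202, or obtains the text to be classified stored on server 201 itself. Server 201 can then segment the text to be classified into words and perform other processing, and look up the multi-dimensional word vector dictionary to obtain the multi-dimensional word vector corresponding to each word. Server 201 may be any terminal capable of executing program instructions and storing data, such as a cloud server, mobile phone, or computer; server 202 may be any terminal with a storage function, such as a mobile phone or computer.
The multi-dimensional word vector dictionary is a dictionary that specifies in advance the word corresponding to each multi-dimensional vector. The multi-dimensional vectors corresponding to different words differ in the element value of at least one dimension; that is, when any one element value of a vector changes, the word that the vector corresponds to changes. For example, if the vector (1,2,3) represents "你" ("you"), then after one value is changed, the vector (1,2,2) represents "我" ("I"). Obtaining the multi-dimensional word vector of each word allows the machine learning models to perform accurate computation and analysis in the subsequent steps.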
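In one implementation of this example, looking up the multi-dimensional word vector dictionary according to the words in the text to be classified, to obtain the multi-dimensional word vector corresponding to each word, includes:
segmenting the text to be classified, to obtain each word that makes up the text to be classified;
looking up, in the multi-dimensional word vector dictionary, the multi-dimensional word vector corresponding to each word.
A text to be classified usually consists of whole sentences, and each sentence in turn contains many words. An existing word segmentation method can segment the text accurately. For example, the sentence "今天阳光号顺利出海" ("Today the Sunshine set out to sea smoothly") is segmented into "今天", "阳光号", "顺利", "出", and "海". Once the text has been segmented, each word can be used to look up its multi-dimensional word vector in the dictionary. This both yields the multi-dimensional word vector of each word and, because the vectors of different words differ, keeps the semantics of each sentence consistent with the original text, which safeguards the accuracy of text classification in the subsequent steps.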
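The segmentation and lookup of step S110 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source only says that an existing segmentation method is used, so the greedy forward maximum-matching segmenter, the toy dictionary `WORD_VECTORS` with its three-dimensional values, and the helper names `segment` and `lookup_vectors` are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]

# Hypothetical toy dictionary; the vector values are made up for illustration.
WORD_VECTORS: Dict[str, Vector] = {
    "今天": (0.1, 0.2, 0.3),
    "阳光号": (0.4, 0.5, 0.6),
    "顺利": (0.7, 0.8, 0.9),
    "出": (1.0, 1.1, 1.2),
    "海": (1.3, 1.4, 1.5),
}


def segment(text: str, vocabulary: Dict[str, Vector]) -> List[str]:
    """Greedy forward maximum matching, standing in for any existing segmenter."""
    max_len = max(len(word) for word in vocabulary)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary or size == 1:
                words.append(candidate)
                i += size
                break
    return words


def lookup_vectors(words: List[str], dictionary: Dict[str, Vector]) -> List[Vector]:
    """Return the vector of each word, in text order.

    Assumption: words missing from the dictionary are skipped; the source
    does not say how out-of-dictionary words are handled.
    """
    return [dictionary[word] for word in words if word in dictionary]


words = segment("今天阳光号顺利出海", WORD_VECTORS)
vectors = lookup_vectors(words, WORD_VECTORS)
```

Because the vectors are returned in text order, the later steps can feed them to a model in the order in which the words appear.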
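In step S120, the multi-dimensional word vector of each keyword in the text to be classified is obtained from the multi-dimensional word vectors corresponding to the words.
In this example implementation, because keywords are the words at every level that represent the central theme of the text, extracting the keywords of the text to be classified can ensure the accuracy of text classification while effectively reducing the amount of computation in the subsequent steps.
In one implementation of this example, obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified includes:
determining the keywords in the text to be classified;
obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vectors of the keywords.
Because keywords are the words at every level that represent the central theme of the text, extracting the keywords of the text to be classified can ensure the accuracy of text classification while effectively reducing the amount of computation in the subsequent steps.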
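In one implementation of this example, determining the keywords in the text to be classified includes:
counting the number of times each word occurs in the text to be classified;
determining a predetermined number of the most frequently occurring words as the keywords.
In general, the more often a word occurs in a text, the more important it is to that text. By counting the occurrences of each word in the text to be classified and taking a predetermined number of the most frequent words as keywords, the keywords of the text can be determined quickly.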
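A minimal sketch of this frequency-based selection, assuming the text has already been segmented into a list of words. The function name `top_n_keywords` is hypothetical, and the source does not specify how ties are broken; here `Counter.most_common` keeps the word that was seen first.

```python
from collections import Counter
from typing import List


def top_n_keywords(words: List[str], n: int) -> List[str]:
    """Return the n most frequent words of a segmented text."""
    counts = Counter(words)
    return [word for word, _ in counts.most_common(n)]


keywords = top_n_keywords(
    ["西红柿", "富含", "营养", "山东", "西红柿", "山东", "西红柿"], 2
)
```

In practice very common function words would dominate a raw count, which is one reason the weighted relevance measure of the next implementation may be preferable.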
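In one implementation of this example, determining the keywords in the text to be classified includes:
computing, for each word in the text to be classified, the word-text relevance M=E*A/B*log(C/(D+1)), and determining the word to be a keyword when the word-text relevance M is greater than a predetermined threshold, where A is the number of times the word occurs in the text, B is the total number of words in the text, C is the total number of texts in the text library, D is the number of texts in the text library that contain the word, and E is the weight of the paragraph of the text in which the word occurs.
A is the number of times the word occurs in the text and B is the total number of words in the text, so A/B gives the frequency of the word in the text. C is the total number of texts in the text library and D is the number of texts in the library that contain the word, where the text library is a collection of a large number of texts gathered in advance. The term log(C/(D+1)) reflects how widely the word is used across all texts: when a word appears in many of the texts, it is a common word, the denominator D+1 is large, and log(C/(D+1)) is small and close to 0. A larger value of A/B*log(C/(D+1)) indicates that the word occurs often in the text to be classified and rarely in the text library as a whole, and therefore that the word is more important in the text to be classified. E is the weight of the paragraph of the text in which the word occurs; multiplying the paragraph weight E by A/B*log(C/(D+1)) gives the word-text relevance M of the word, and the higher this value, the more central the word. Determining a word to be a keyword when its word-text relevance M exceeds the predetermined threshold can effectively ensure the accuracy of the keywords, and in turn the accuracy of text classification.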
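The relevance formula can be written out directly. The sketch below is illustrative only: the source does not state the base of the logarithm (the natural logarithm is assumed here), and the function names, the `stats` layout, and the sample counts are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def word_text_relevance(a: int, b: int, c: int, d: int, e: float = 1.0) -> float:
    """M = E * A/B * log(C/(D+1)), with the natural logarithm assumed.

    a: occurrences of the word in the text
    b: total number of words in the text
    c: total number of texts in the text library
    d: number of texts in the library that contain the word
    e: weight of the paragraph in which the word occurs
    """
    return e * (a / b) * math.log(c / (d + 1))


def select_keywords(
    stats: Dict[str, Tuple[int, int, int, int, float]], threshold: float
) -> List[str]:
    """Keep the words whose relevance M exceeds the threshold."""
    return [
        word
        for word, (a, b, c, d, e) in stats.items()
        if word_text_relevance(a, b, c, d, e) > threshold
    ]


# Hypothetical counts: a topical word and a very common word.
stats = {
    "西红柿": (5, 100, 1000, 9, 2.0),   # rare in the library, frequent here
    "的": (8, 100, 1000, 999, 1.0),     # appears in almost every text
}
selected = select_keywords(stats, threshold=0.1)
```

With these numbers the common word gets M = 0 because C/(D+1) = 1, which matches the description: the more texts contain a word, the closer log(C/(D+1)) is to 0.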
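In one implementation of this example, among the words into which the text to be classified has been segmented, the words located at specific positions relative to specific words are determined to be the keywords of the text.
For example, if the subject of a text is tomatoes and their place of origin, Shandong, the text will repeatedly contain statements such as "tomatoes are rich in various nutrients" and "tomatoes produced in Shandong". In this case templates such as "*** 富含" ("*** is rich in") and "产自 **" ("produced in **") can be set, and the word immediately before "富含" and the word immediately after "产自" are determined to be keywords of the text to be classified. This is fast, convenient, and highly accurate.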
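A small sketch of the template approach, assuming the text is already segmented and that a template is expressed as a marker word plus a side (before or after). The function name and parameter names are hypothetical; only the two marker words come from the example above.

```python
from typing import List, Sequence


def keywords_by_template(
    words: List[str],
    take_before: Sequence[str] = ("富含",),
    take_after: Sequence[str] = ("产自",),
) -> List[str]:
    """Collect the word before each `take_before` marker and the word
    after each `take_after` marker, in text order."""
    keywords: List[str] = []
    for i, word in enumerate(words):
        if word in take_before and i > 0:
            keywords.append(words[i - 1])
        if word in take_after and i + 1 < len(words):
            keywords.append(words[i + 1])
    return keywords


template_keywords = keywords_by_template(["西红柿", "富含", "营养", "产自", "山东"])
```

The boundary checks keep the function from reading outside the word list when a marker is the first or last word.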
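In step S130, the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word are obtained and input into the predetermined-dimension machine learning model in the order in which the words appear in the text to be classified, to obtain the predetermined-dimension classification result of the text to be classified.
In this example implementation, the predetermined dimension is one particular dimension of the multi-dimensional word vectors of the words in the text to be classified. For example, if the vector of "你" ("you") is (1,2,3), then 1 is the element value of the first dimension, 2 is the element value of the second dimension, and 3 is the element value of the third dimension.
The element value of the predetermined dimension is taken from the multi-dimensional vector of each word in the text to be classified, and the values are input, in the order of the words in the text, into the machine learning model corresponding to that predetermined dimension, which outputs a sub-classification result for the text. For example, the first-dimension element value of each word's vector is taken and the values are input into the machine learning model in word order. The element values of the second through the last dimension are then taken in turn and likewise input into the machine learning model in word order, yielding the predetermined-dimension classification results of the text to be classified. Extracting only the element values of a predetermined dimension and using the trained predetermined-dimension machine learning model reduces the scale of the computation, improves computational efficiency, and yields an accurate preliminary classification of the text.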
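Extracting the per-dimension input sequences can be sketched as follows. The helper names are hypothetical, dimensions are indexed from 0 here, and the machine learning model that consumes each sequence is left out because the source does not specify its type.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def dimension_sequence(vectors: Sequence[Vector], dim: int) -> List[float]:
    """Element values of one dimension, in the order of the words in the text."""
    return [vector[dim] for vector in vectors]


def all_dimension_sequences(vectors: Sequence[Vector]) -> List[List[float]]:
    """One sequence per dimension; each would be fed to that dimension's model."""
    return [list(column) for column in zip(*vectors)]


# Vectors of the two example words, (1,2,3) and (1,2,2), in text order.
text_vectors = [(1, 2, 3), (1, 2, 2)]
third_dimension = dimension_sequence(text_vectors, 2)
per_dimension = all_dimension_sequences(text_vectors)
```

Each sequence has one value per word instead of one full vector per word, which is where the reduction in input size comes from.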
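In one implementation of this example, the training method of the predetermined-dimension machine learning model includes:
collecting a set of text samples whose categories have been labeled in advance;
looking up the multi-dimensional word vector dictionary according to the words in each text sample, to obtain the multi-dimensional word vector corresponding to each word;
obtaining the element values of the predetermined dimension in the multi-dimensional word vector corresponding to each word, and inputting them into the predetermined-dimension machine learning model in the order in which the words appear in the text sample, to output the predetermined-dimension classification result of the text sample;
when the predetermined-dimension classification result is inconsistent with the category labeled in advance for the text sample, adjusting the coefficients of the machine learning model until the predetermined-dimension classification result is consistent with the category labeled in advance for the text sample.
Training ends when, for all text samples in the set, the predetermined-dimension classification results output by the machine learning model are consistent with the categories labeled in advance for the text samples.
By taking text samples whose categories have been labeled in advance, inputting the predetermined-dimension element values of the multi-dimensional word vectors of the words in each sample into the machine learning model in order, and having the model output the labeled category, the predetermined-dimension machine learning model can be trained accurately.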
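The "adjust the coefficients whenever the output disagrees with the label, stop when every sample agrees" loop can be illustrated with a perceptron. This is a stand-in chosen for brevity, not the model used in the source, which does not name a model type. The sketch further assumes binary labels (0 or 1), input sequences of equal length, and a cap on the number of passes, since a perceptron only reaches full agreement on linearly separable data.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (element-value sequence, label 0 or 1)


def predict(weights: List[float], bias: float, sequence: Sequence[float]) -> int:
    """Linear decision rule: 1 if the weighted sum is positive, else 0."""
    score = bias + sum(w * x for w, x in zip(weights, sequence))
    return 1 if score > 0 else 0


def train(
    samples: List[Sample], length: int, rate: float = 0.1, max_epochs: int = 1000
) -> Tuple[List[float], float]:
    """Adjust coefficients on every mismatch; stop when all samples match."""
    weights = [0.0] * length
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for sequence, label in samples:
            predicted = predict(weights, bias, sequence)
            if predicted != label:
                mistakes += 1
                delta = rate * (label - predicted)
                weights = [w + delta * x for w, x in zip(weights, sequence)]
                bias += delta
        if mistakes == 0:  # every result agrees with its labeled category
            break
    return weights, bias


# Hypothetical labeled sequences for one dimension (linearly separable).
samples: List[Sample] = [
    ([1.0, 1.0], 1),
    ([-1.0, -1.0], 0),
    ([2.0, 0.5], 1),
    ([-0.5, -2.0], 0),
]
weights, bias = train(samples, length=2)
```

A sequence model such as a recurrent network would handle texts of varying length and would normally be trained by minimizing a loss rather than by requiring exact agreement on every sample.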
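In step S140, the multi-dimensional word vectors of the keywords are input into the keyword machine learning model in the order in which the words appear in the text to be classified, to obtain the keyword classification result of the text to be classified.
In this example implementation, the keyword vectors are few in number yet highly representative of the text, which effectively reduces the computational load of the machine learning model, improves computational efficiency, and effectively improves the accuracy of the pre-classification.
In one implementation of this example, the training method of the keyword machine learning model includes: setting up a set of text samples, each of which has a known classification result; obtaining the vectors of the keywords in each text sample; inputting the vectors of the keywords of the text sample into the keyword machine learning model, which outputs a sub-classification result for the text sample; comparing the sub-classification result with the known classification result of the text sample; and, if they are inconsistent, adjusting the machine learning model so that the sub-classification result is consistent with the known classification result of the text sample.
By taking text samples whose categories have been labeled in advance, inputting the multi-dimensional word vectors of the keywords in each sample into the machine learning model in order, and having the model output the labeled category, the keyword machine learning model can be trained accurately.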
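In step S150, the classification result of the text to be classified is determined based on the predetermined-dimension classification result and the keyword classification result.
In this example implementation, the predetermined-dimension classification result is obtained from an analysis of the full text, while the keyword classification result is obtained from the representative keywords of the text; combining the two can effectively ensure the accuracy of text classification.
In one implementation of this example, referring to FIG. 3, determining the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result includes steps S310, S320, and S330:
Step S310. Obtain the classification results of all dimensions;
Step S320. Obtain the classification results of all keywords;
Step S330. Take the classification result that occurs most often among the classification results of all dimensions and the classification results of all keywords as the classification result of the text to be classified.
The classification result that occurs most often among the classification results of all dimensions and of all keywords is the category most closely related to the text and most central to it; taking that category as the classification result of the text to be classified effectively ensures the accuracy of text classification.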
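The combination step amounts to a majority vote over the pooled results. A minimal sketch, assuming each result is a category label; the function name and the sample labels are hypothetical, and the source does not say how ties are resolved (here the label seen first wins).

```python
from collections import Counter
from typing import Hashable, Sequence


def final_classification(
    dimension_results: Sequence[Hashable], keyword_results: Sequence[Hashable]
) -> Hashable:
    """Return the label that occurs most often across both groups of results."""
    votes = Counter(list(dimension_results) + list(keyword_results))
    return votes.most_common(1)[0][0]


label = final_classification(
    ["sports", "finance", "sports"],  # one result per dimension
    ["sports", "technology"],         # keyword results
)
```

Pooling the two groups means that a full-text signal and a keyword signal can outvote an outlier from either side.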
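The present application also provides a text classification apparatus. Referring to FIG. 4, the text classification apparatus may include a lookup module 410, an obtaining module 420, a first classification module 430, a second classification module 440, and a classification determination module 450, where:
the lookup module 410 may be configured to look up a multi-dimensional word vector dictionary according to the words in the text to be classified, to obtain the multi-dimensional word vector corresponding to each word;
the obtaining module 420 may be configured to obtain, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified;
the first classification module 430 may be configured to obtain the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and input them into a predetermined-dimension machine learning model in the order in which the words appear in the text to be classified, to obtain a predetermined-dimension classification result of the text to be classified;
the second classification module 440 may be configured to input the multi-dimensional word vectors of the keywords into a keyword machine learning model in the order in which the words appear in the text to be classified, to obtain a keyword classification result of the text to be classified;
the classification determination module 450 may be configured to determine the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result.
The specific details of each module of the above text classification apparatus have already been described in detail in the corresponding text classification method, and are therefore not repeated here.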
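It should be noted that although several modules or units of the device for performing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to the implementations of the present application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by multiple modules or units.
In addition, although the steps of the method of the present application are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all of the steps shown must be performed, to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be split into multiple steps.
From the description of the above implementations, those skilled in the art will readily understand that the example implementations described here may be implemented in software, or in software combined with the necessary hardware. The technical solutions according to the implementations of the present application may therefore be embodied in the form of a software product, which may be stored on a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) or on a network, and which includes a number of instructions that cause a computing device (such as a personal computer, server, mobile terminal, or network device) to perform the method according to the implementations of the present application.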
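In an exemplary embodiment of the present application, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will understand that the various aspects of the present invention may be implemented as a system, a method, or a program product. Accordingly, the various aspects of the present invention may take the form of an entirely hardware implementation, an entirely software implementation (including firmware, microcode, and the like), or an implementation combining hardware and software aspects, which may be collectively referred to here as a "circuit", "module", or "system".
An electronic device 500 according to this implementation of the present invention is described below with reference to FIG. 5. The electronic device 500 shown in FIG. 5 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in FIG. 5, the electronic device 500 takes the form of a general-purpose computing device. The components of the electronic device 500 may include, but are not limited to: the at least one processing unit 510, the at least one storage unit 520, and a bus 530 connecting the different system components (including the storage unit 520 and the processing unit 510).
The storage unit stores program code that can be executed by the processing unit 510, so that the processing unit 510 performs the steps according to the various exemplary implementations of the present invention described in the "Exemplary Method" section of this specification. For example, the processing unit 510 may perform the steps shown in FIG. 1. Step S110: look up a multi-dimensional word vector dictionary according to the words in the text to be classified, to obtain the multi-dimensional word vector corresponding to each word. Step S120: obtain, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified. Step S130: obtain the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and input them into a predetermined-dimension machine learning model in the order in which the words appear in the text to be classified, to obtain a predetermined-dimension classification result of the text to be classified. Step S140: input the multi-dimensional word vectors of the keywords into a keyword machine learning model in the order in which the words appear in the text to be classified, to obtain a keyword classification result of the text to be classified. Step S150: determine the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result.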
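The storage unit 520 may include readable media in the form of volatile storage units, such as a random access memory (RAM) unit 5201 and/or a cache storage unit 5202, and may further include a read-only memory (ROM) unit 5203.
The storage unit 520 may also include a program/utility 5204 having a set of (at least one) program modules 5205. Such program modules 5205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 500 may also communicate with one or more external devices 700 (such as a keyboard, pointing device, or Bluetooth device), with one or more devices that enable a user to interact with the electronic device 500, and/or with any device (such as a router or modem) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 550. The electronic device 500 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 560. As shown in the figure, the network adapter 560 communicates with the other modules of the electronic device 500 through the bus 530. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.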
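From the description of the above implementations, those skilled in the art will readily understand that the example implementations described here may be implemented in software, or in software combined with the necessary hardware. The technical solutions according to the implementations of the present application may therefore be embodied in the form of a software product, which may be stored on a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) or on a network, and which includes a number of instructions that cause a computing device (such as a personal computer, server, terminal device, or network device) to perform the method according to the implementations of the present application.
In an exemplary embodiment of the present application, a computer-readable storage medium is also provided, on which is stored a program product capable of implementing the above method of this specification. In some possible implementations, the various aspects of the present invention may also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary implementations of the present invention described in the "Exemplary Method" section of this specification.
Referring to FIG. 6, a program product 600 for implementing the above method according to an implementation of the present invention is described. It may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present invention is not limited to this. In this document, a readable storage medium may be any tangible medium that contains or stores a program, which can be used by or in combination with an instruction execution system, apparatus, or device.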
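The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
The program code contained on a readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.
The program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the client computing device, partly on the client device, as a stand-alone software package, partly on the client computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the client computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).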
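In addition, the above drawings are merely schematic illustrations of the processing included in the method according to the exemplary embodiments of the present invention, and are not intended to be limiting. It is easy to understand that the processing shown in the drawings does not indicate or restrict the chronological order of that processing. It is also easy to understand that the processing may be performed, for example, synchronously or asynchronously in multiple modules.
Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common general knowledge or customary technical means in the art not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present application indicated by the claims.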

Claims (20)

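  1. A text classification method, characterized in that it comprises:
    looking up a multi-dimensional word vector dictionary according to the words in a text to be classified, to obtain the multi-dimensional word vector corresponding to each word;
    obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified;
    obtaining the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and inputting them into a predetermined-dimension machine learning model in the order in which the words appear in the text to be classified, to obtain a predetermined-dimension classification result of the text to be classified;
    inputting the multi-dimensional word vectors of the keywords into a keyword machine learning model in the order in which the words appear in the text to be classified, to obtain a keyword classification result of the text to be classified;
    determining the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result.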
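  2. The method according to claim 1, characterized in that looking up the multi-dimensional word vector dictionary according to the words in the text to be classified, to obtain the multi-dimensional word vector corresponding to each word, comprises:
    segmenting the text to be classified, to obtain each word that makes up the text to be classified;
    looking up, in the multi-dimensional word vector dictionary, the multi-dimensional word vector corresponding to each word.
  3. The method according to claim 1, characterized in that obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified comprises:
    determining the keywords in the text to be classified;
    obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vectors of the keywords.
  4. The method according to claim 3, characterized in that determining the keywords in the text to be classified comprises:
    counting the number of times each word occurs in the text to be classified;
    determining a predetermined number of the most frequently occurring words as the keywords.
  5. The method according to claim 3, characterized in that determining the keywords in the text to be classified comprises:
    computing, for each word in the text to be classified, the word-text relevance M=E*A/B*log(C/(D+1)), and determining the word to be a keyword when the word-text relevance M is greater than a predetermined threshold, where A is the number of times the word occurs in the text, B is the total number of words in the text, C is the total number of texts in the text library, D is the number of texts in the text library that contain the word, and E is the weight of the paragraph of the text in which the word occurs.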
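  6. The method according to claim 1, characterized in that the training method of the predetermined-dimension machine learning model comprises:
    collecting a set of text samples whose categories have been labeled in advance;
    looking up the multi-dimensional word vector dictionary according to the words in each text sample, to obtain the multi-dimensional word vector corresponding to each word;
    obtaining the element values of the predetermined dimension in the multi-dimensional word vector corresponding to each word, and inputting them into the predetermined-dimension machine learning model in the order in which the words appear in the text sample, to output the predetermined-dimension classification result of the text sample;
    when the predetermined-dimension classification result is inconsistent with the category labeled in advance for the text sample, adjusting the coefficients of the machine learning model until the predetermined-dimension classification result is consistent with the category labeled in advance for the text sample;
    ending the training when, for all text samples in the set, the predetermined-dimension classification results output by the machine learning model are consistent with the categories labeled in advance for the text samples.
  7. The method according to claim 1, characterized in that determining the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result comprises:
    obtaining the classification results of all dimensions;
    obtaining the classification results of all keywords;
    taking the classification result that occurs most often among the classification results of all dimensions and the classification results of all keywords as the classification result of the text to be classified.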
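  8. A text classification apparatus, characterized in that it comprises:
    a lookup module, configured to look up a multi-dimensional word vector dictionary according to the words in a text to be classified, to obtain the multi-dimensional word vector corresponding to each word;
    an obtaining module, configured to obtain, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified;
    a first classification module, configured to obtain the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and input them into a predetermined-dimension machine learning model in the order in which the words appear in the text to be classified, to obtain a predetermined-dimension classification result of the text to be classified;
    a second classification module, configured to input the multi-dimensional word vectors of the keywords into a keyword machine learning model in the order in which the words appear in the text to be classified, to obtain a keyword classification result of the text to be classified;
    a classification determination module, configured to determine the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result.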
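  9. The apparatus according to claim 8, wherein the lookup module is configured to:
    segment the text to be classified, to obtain each word that makes up the text to be classified;
    look up, in the multi-dimensional word vector dictionary, the multi-dimensional word vector corresponding to each word.
  10. The apparatus according to claim 8, wherein the obtaining module is configured to:
    determine the keywords in the text to be classified;
    obtain, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vectors of the keywords.
  11. The apparatus according to claim 10, wherein the obtaining module is configured to:
    compute, for each word in the text to be classified, the word-text relevance M=E*A/B*log(C/(D+1)), and determine the word to be a keyword when the word-text relevance M is greater than a predetermined threshold, where A is the number of times the word occurs in the text, B is the total number of words in the text, C is the total number of texts in the text library, D is the number of texts in the text library that contain the word, and E is the weight of the paragraph of the text in which the word occurs.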
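  12. The apparatus according to claim 8, wherein the first classification module is configured to:
    collect a set of text samples whose categories have been labeled in advance;
    look up the multi-dimensional word vector dictionary according to the words in each text sample, to obtain the multi-dimensional word vector corresponding to each word;
    obtain the element values of the predetermined dimension in the multi-dimensional word vector corresponding to each word, and input them into the predetermined-dimension machine learning model in the order in which the words appear in the text sample, to output the predetermined-dimension classification result of the text sample;
    when the predetermined-dimension classification result is inconsistent with the category labeled in advance for the text sample, adjust the coefficients of the machine learning model until the predetermined-dimension classification result is consistent with the category labeled in advance for the text sample;
    end the training when, for all text samples in the set, the predetermined-dimension classification results output by the machine learning model are consistent with the categories labeled in advance for the text samples.
  13. The apparatus according to claim 8, wherein the classification determination module is configured to:
    obtain the classification results of all dimensions;
    obtain the classification results of all keywords;
    take the classification result that occurs most often among the classification results of all dimensions and the classification results of all keywords as the classification result of the text to be classified.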
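  14. An electronic device, characterized in that it comprises: a processor; and a memory for storing a text classification program of the processor; wherein the processor is configured to perform the following processing by executing the text classification program:
    looking up a multi-dimensional word vector dictionary according to the words in a text to be classified, to obtain the multi-dimensional word vector corresponding to each word;
    obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified;
    obtaining the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and inputting them into a predetermined-dimension machine learning model in the order in which the words appear in the text to be classified, to obtain a predetermined-dimension classification result of the text to be classified;
    inputting the multi-dimensional word vectors of the keywords into a keyword machine learning model in the order in which the words appear in the text to be classified, to obtain a keyword classification result of the text to be classified;
    determining the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result.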
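  15. The electronic device according to claim 14, characterized in that looking up the multi-dimensional word vector dictionary according to the words in the text to be classified, to obtain the multi-dimensional word vector corresponding to each word, comprises:
    segmenting the text to be classified, to obtain each word that makes up the text to be classified;
    looking up, in the multi-dimensional word vector dictionary, the multi-dimensional word vector corresponding to each word.
  16. The electronic device according to claim 14, characterized in that obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified comprises:
    determining the keywords in the text to be classified;
    obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vectors of the keywords.
  17. The electronic device according to claim 16, characterized in that determining the keywords in the text to be classified comprises:
    computing, for each word in the text to be classified, the word-text relevance M=E*A/B*log(C/(D+1)), and determining the word to be a keyword when the word-text relevance M is greater than a predetermined threshold, where A is the number of times the word occurs in the text, B is the total number of words in the text, C is the total number of texts in the text library, D is the number of texts in the text library that contain the word, and E is the weight of the paragraph of the text in which the word occurs.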
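  18. The electronic device according to claim 14, characterized in that the processing further comprises:
    collecting a set of text samples whose categories have been labeled in advance;
    looking up the multi-dimensional word vector dictionary according to the words in each text sample, to obtain the multi-dimensional word vector corresponding to each word;
    obtaining the element values of the predetermined dimension in the multi-dimensional word vector corresponding to each word, and inputting them into the predetermined-dimension machine learning model in the order in which the words appear in the text sample, to output the predetermined-dimension classification result of the text sample;
    when the predetermined-dimension classification result is inconsistent with the category labeled in advance for the text sample, adjusting the coefficients of the machine learning model until the predetermined-dimension classification result is consistent with the category labeled in advance for the text sample;
    ending the training when, for all text samples in the set, the predetermined-dimension classification results output by the machine learning model are consistent with the categories labeled in advance for the text samples.
  19. The electronic device according to claim 14, characterized in that determining the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result comprises:
    obtaining the classification results of all dimensions;
    obtaining the classification results of all keywords;
    taking the classification result that occurs most often among the classification results of all dimensions and the classification results of all keywords as the classification result of the text to be classified.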
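  20. A computer non-volatile readable storage medium having a text classification program stored thereon, characterized in that the text classification program, when executed by a processor, performs the following processing:
    looking up a multi-dimensional word vector dictionary according to the words in a text to be classified, to obtain the multi-dimensional word vector corresponding to each word;
    obtaining, from the multi-dimensional word vectors corresponding to the words, the multi-dimensional word vector of each keyword in the text to be classified;
    obtaining the element values of a predetermined dimension in the multi-dimensional word vector corresponding to each word, and inputting them into a predetermined-dimension machine learning model in the order in which the words appear in the text to be classified, to obtain a predetermined-dimension classification result of the text to be classified;
    inputting the multi-dimensional word vectors of the keywords into a keyword machine learning model in the order in which the words appear in the text to be classified, to obtain a keyword classification result of the text to be classified;
    determining the classification result of the text to be classified based on the predetermined-dimension classification result and the keyword classification result.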
PCT/CN2019/103441 2019-05-23 2019-08-29 Text classification method, apparatus, electronic device, and computer non-volatile readable storage medium WO2020232898A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910435075.5 2019-05-23
CN201910435075.5A CN110334209B (zh) 2019-05-23 2019-05-23 Text classification method, apparatus, medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2020232898A1 true WO2020232898A1 (zh) 2020-11-26

Family

ID=68139167

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103441 WO2020232898A1 (zh) 2019-05-23 2019-08-29 Text classification method, apparatus, electronic device, and computer non-volatile readable storage medium

Country Status (2)

Country Link
CN (1) CN110334209B (zh)
WO (1) WO2020232898A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
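CN113011178A (zh) * 2021-03-29 2021-06-22 广州博冠信息科技有限公司 Text generation method, text generation apparatus, electronic device, and storage medium
CN113407722A (zh) * 2021-07-09 2021-09-17 平安国际智慧城市科技股份有限公司 Text classification method and apparatus based on text summarization, electronic device, and medium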

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
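CN111259158B (zh) * 2020-02-25 2023-06-02 北京小米松果电子有限公司 Text classification method, apparatus, and medium
CN111291189B (zh) * 2020-03-10 2020-12-04 北京芯盾时代科技有限公司 Text processing method, device, and computer-readable storage medium
CN111507099A (zh) * 2020-06-19 2020-08-07 平安科技(深圳)有限公司 Text classification method, apparatus, computer device, and storage medium
CN111966830A (zh) * 2020-06-30 2020-11-20 北京来也网络科技有限公司 Text classification method, apparatus, device, and medium combining RPA and AI
CN112507117B (zh) * 2020-12-16 2024-02-13 中国南方电网有限责任公司 Deep-learning-based method and system for automatic classification of maintenance opinions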

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
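KR20130059511A (ko) * 2011-11-29 2013-06-07 건국대학교 산학협력단 System and method for automatic keyword extraction from still images
CN106815194A (zh) * 2015-11-27 2017-06-09 北京国双科技有限公司 Model training method and apparatus, and keyword recognition method and apparatus
CN107168992A (zh) * 2017-03-29 2017-09-15 北京百度网讯科技有限公司 Artificial-intelligence-based article classification method and apparatus, device, and readable medium
CN107436875A (zh) * 2016-05-25 2017-12-05 华为技术有限公司 Text classification method and apparatus
CN107908635A (zh) * 2017-09-26 2018-04-13 百度在线网络技术(北京)有限公司 Method and apparatus for building a text classification model and for text classification
CN109408636A (zh) * 2018-09-29 2019-03-01 新华三大数据技术有限公司 Text classification method and apparatus
CN109460472A (zh) * 2018-11-09 2019-03-12 北京京东金融科技控股有限公司 Text classification method and apparatus, and electronic device
CN109739989A (zh) * 2018-12-29 2019-05-10 北京奇安信科技有限公司 Text classification method and computer device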

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
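CN105574105B (zh) * 2015-12-14 2019-05-28 北京锐安科技有限公司 Method for determining a text classification model
CN105975478A (zh) * 2016-04-09 2016-09-28 北京交通大学 Method and apparatus for detecting the event to which an online article belongs, based on word vector analysis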
US10216724B2 (en) * 2017-04-07 2019-02-26 Conduent Business Services, Llc Performing semantic analyses of user-generated textual and voice content

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
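KR20130059511A (ko) * 2011-11-29 2013-06-07 건국대학교 산학협력단 System and method for automatic keyword extraction from still images
CN106815194A (zh) * 2015-11-27 2017-06-09 北京国双科技有限公司 Model training method and apparatus, and keyword recognition method and apparatus
CN107436875A (zh) * 2016-05-25 2017-12-05 华为技术有限公司 Text classification method and apparatus
CN107168992A (zh) * 2017-03-29 2017-09-15 北京百度网讯科技有限公司 Artificial-intelligence-based article classification method and apparatus, device, and readable medium
CN107908635A (zh) * 2017-09-26 2018-04-13 百度在线网络技术(北京)有限公司 Method and apparatus for building a text classification model and for text classification
CN109408636A (zh) * 2018-09-29 2019-03-01 新华三大数据技术有限公司 Text classification method and apparatus
CN109460472A (zh) * 2018-11-09 2019-03-12 北京京东金融科技控股有限公司 Text classification method and apparatus, and electronic device
CN109739989A (zh) * 2018-12-29 2019-05-10 北京奇安信科技有限公司 Text classification method and computer device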

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
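CN113011178A (zh) * 2021-03-29 2021-06-22 广州博冠信息科技有限公司 Text generation method, text generation apparatus, electronic device, and storage medium
CN113011178B (zh) * 2021-03-29 2023-05-16 广州博冠信息科技有限公司 Text generation method, text generation apparatus, electronic device, and storage medium
CN113407722A (zh) * 2021-07-09 2021-09-17 平安国际智慧城市科技股份有限公司 Text classification method and apparatus based on text summarization, electronic device, and medium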

Also Published As

Publication number Publication date
CN110334209A (zh) 2019-10-15
CN110334209B (zh) 2024-05-07

Similar Documents

Publication Publication Date Title
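WO2020232898A1 (zh) Text classification method, apparatus, electronic device, and computer non-volatile readable storage medium
WO2021017721A1 (zh) Intelligent question answering method, apparatus, medium, and electronic device
US10657325B2 (en) Method for parsing query based on artificial intelligence and computer device
CN107992596B (zh) Text clustering method, apparatus, server, and storage medium
CN111898366B (zh) Method and apparatus for aggregating document subject terms, computer device, and readable storage medium
CN108460011B (zh) Entity concept annotation method and system
CN110019732B (zh) Intelligent question answering method and related apparatus
CN112035730B (zh) Semantic retrieval method, apparatus, and electronic device
US20160328467A1 (en) Natural language question answering method and apparatus
CN110619051B (zh) Question sentence classification method, apparatus, electronic device, and storage medium
CN111444320A (zh) Text retrieval method, apparatus, computer device, and storage medium
TW202020691A (zh) Method, apparatus, and server for determining feature words
CN103971677A (zh) Acoustic language model training method and apparatus
CN106708929B (zh) Video program search method and apparatus
CN114861889A (zh) Deep learning model training method, target object detection method, and apparatus
CN110727769B (zh) Corpus generation method and apparatus, and human-computer interaction processing method and apparatus
KR102608867B1 (ko) Method for augmenting industry text, related apparatus, and computer program stored on a medium
CN106570196B (zh) Video program search method and apparatus
CN114202443A (zh) Policy classification method, apparatus, device, and storage medium
CN114116997A (zh) Knowledge question answering method, apparatus, electronic device, and storage medium
WO2023246849A1 (zh) Feedback data graph generation method and refrigerator
CN117076636A (zh) Information query method, system, and device for intelligent customer service
CN113157857B (zh) News-oriented hot topic detection method, apparatus, and device
WO2021227951A1 (zh) Naming of front-end page elements
CN114676227A (zh) Sample generation method, model training method, and retrieval method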

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19930001

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19930001

Country of ref document: EP

Kind code of ref document: A1