CN115292487A - Text classification method, device, equipment and medium based on naive Bayes - Google Patents

Text classification method, device, equipment and medium based on naive Bayes

Info

Publication number
CN115292487A
Authority
CN
China
Prior art keywords
target
text
text data
sample
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210867479.3A
Other languages
Chinese (zh)
Inventor
刘强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yiyou Material Technology Co ltd
Original Assignee
Hangzhou Yiyou Material Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yiyou Material Technology Co ltd filed Critical Hangzhou Yiyou Material Technology Co ltd
Priority to CN202210867479.3A
Publication of CN115292487A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure disclose a naive Bayes-based text classification method, apparatus, device, and medium. One embodiment of the method comprises: acquiring target text data; performing word segmentation processing on the target text data to obtain a target text vocabulary set; extracting target text vocabularies meeting a preset keyword condition from the target text vocabulary set as target key vocabularies to obtain a target key vocabulary set; vectorizing each target key vocabulary in the target key vocabulary set to generate a target key vocabulary vector, obtaining a target key vocabulary vector set; and inputting the target key vocabulary vector set into a naive Bayes text classification model to obtain the text category corresponding to the target text data. The method and apparatus classify text based on a naive Bayes model and improve text classification efficiency.

Description

Text classification method, device, equipment and medium based on naive Bayes
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a text classification method, apparatus, device, and medium based on naive Bayes.
Background
In the big data era, with the rapid development of network platforms, the volume of text information is growing rapidly, and classifying text efficiently and accurately has become a technical problem in urgent need of a solution. At present, text is generally classified using a text classification method based on a deep learning model.
However, when text is classified in the above manner, the following technical problems often arise:
Firstly, the text classification method based on the deep learning model has a complex model structure and takes a long time to classify text, so text classification efficiency is relatively low.
Secondly, the text classification method based on the deep learning model depends heavily on data and cannot handle text classification in complex scenes.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose naive Bayes-based text classification methods, apparatuses, electronic devices, and computer-readable media to address one or more of the technical problems set forth in the background section above.
In a first aspect, some embodiments of the present disclosure provide a naive bayes-based text classification method, the method comprising: acquiring target text data; performing word segmentation processing on the target text data to obtain a target text vocabulary set; extracting target text vocabularies meeting preset keyword conditions from the target text vocabulary set to serve as target key vocabularies, and obtaining a target key vocabulary set; vectorizing each target key vocabulary in the target key vocabulary set to generate a target key vocabulary vector to obtain a target key vocabulary vector set; and inputting the target key vocabulary vector set into a naive Bayes text classification model to obtain a text category corresponding to the target text data, wherein the naive Bayes text classification model is obtained by pre-training.
In a second aspect, some embodiments of the present disclosure provide an apparatus for text classification based on naive bayes, the apparatus comprising: an acquisition unit configured to acquire target text data; the first processing unit is configured to perform word segmentation processing on the target text data to obtain a target text vocabulary set; the extraction unit is configured to extract target text vocabularies meeting preset keyword conditions from the target text vocabulary set as target key vocabularies, and a target key vocabulary set is obtained; the second processing unit is configured to carry out vectorization processing on each target key vocabulary in the target key vocabulary set so as to generate a target key vocabulary vector and obtain a target key vocabulary vector set; and the input unit is configured to input the target key vocabulary vector set into a naive Bayesian text classification model to obtain a text category corresponding to the target text data, wherein the naive Bayesian text classification model is obtained by pre-training.
In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method described in any of the implementations of the first aspect.
In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in any of the implementations of the first aspect.
The above embodiments of the present disclosure have the following beneficial effects: through the naive Bayes-based text classification method of some embodiments of the present disclosure, text can be classified based on naive Bayes, and text classification efficiency is improved. Specifically, the reason text classification efficiency is relatively low is that the text classification method based on the deep learning model has a complex model structure and takes a long time to classify text. Based on this, the naive Bayes-based text classification method of some embodiments of the present disclosure first acquires target text data, so that the target text data can be classified. Second, word segmentation processing is performed on the target text data to obtain a target text vocabulary set, which can then be used for vectorization processing. Then, target text vocabularies meeting the preset keyword condition are extracted from the target text vocabulary set as target key vocabularies to obtain a target key vocabulary set; restricting to key vocabularies reduces the amount of vectorization processing. Next, each target key vocabulary in the target key vocabulary set is vectorized to generate a target key vocabulary vector, giving a target key vocabulary vector set; the target key vocabulary features of the target text data are thus obtained and can be used directly for text classification. Finally, the target key vocabulary vector set is input into a naive Bayes text classification model, which is obtained by pre-training, to obtain the text category corresponding to the target text data. Because the naive Bayes-based text classification method has a simple model structure, classifying text takes a short time, and text classification efficiency is thereby improved.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements are not necessarily drawn to scale.
FIG. 1 is a flow diagram of some embodiments of a naive Bayes based text classification method in accordance with the present disclosure;
FIG. 2 is a schematic block diagram of some embodiments of a naive Bayes based text classification apparatus according to the present disclosure;
FIG. 3 is a schematic block diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand them as "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates a flow 100 of some embodiments of a naive Bayes-based text classification method according to the present disclosure. The text classification method based on naive Bayes comprises the following steps:
Step 101, target text data is obtained.
In some embodiments, an execution body (e.g., a computer device) of the naive Bayes-based text classification method may obtain the target text data through a wired connection or a wireless connection. The target text data may be text data that needs to be subjected to text classification processing. It is noted that the wireless connection means may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a ZigBee connection, a UWB (ultra-wideband) connection, and other wireless connection means now known or developed in the future.
Optionally, before step 101, the executing body may execute the following steps:
in a first step, text data input by a user is received.
And step two, storing the text data to a text data cache. The text data cache may be a cache for storing text data. The cache may be a local cache.
In practice, first, the execution body described above may receive text data input by a user. Secondly, the text data can be stored in a text data cache.
In some optional implementations of some embodiments, the execution body may, in response to determining that the current time satisfies a preset interval duration condition, acquire the target text data from the text data cache. The preset interval duration condition may specify how long after the previous acquisition the next acquisition occurs. For example, the condition may be that the time interval between the current time and the last acquisition of target text data from the text data cache equals a preset interval duration, for example, 10 seconds. In practice, the execution body may obtain the target text data from the text data cache according to a message queue technique.
In some optional implementations of some embodiments, the execution body may obtain the target text data from the text data cache by the following steps:
and step one, acquiring each text data stored in a target historical time period from the text data cache as batch text data. The target historical time period may be a time period from a first preset time before the current time to the current time. The first preset time period may be a preset time period. For example, the first preset time period may be 10 seconds.
And step two, selecting the text data meeting a preset screening condition from the batch text data as the target text data. The preset screening condition may be that the text data has not previously been selected from the batch text data.
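A minimal sketch of these two steps (not part of the original disclosure); the cache structure, constants, and names are assumptions:

    import time

    text_data_cache = []    # (storage_time, text) pairs written by the receiving step
    selected_texts = set()  # texts already returned as target text data

    FIRST_PRESET_DURATION = 10  # seconds; defines the target historical time period

    def get_target_text_data(now=None):
        """Fetch batch text data from the target historical time period and keep
        only entries satisfying the preset screening condition (not yet selected)."""
        now = time.time() if now is None else now
        batch = [text for stored_at, text in text_data_cache
                 if now - FIRST_PRESET_DURATION <= stored_at <= now]
        target = [text for text in batch if text not in selected_texts]
        selected_texts.update(target)
        return target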
Step 102, performing word segmentation processing on the target text data to obtain a target text vocabulary set.
In some embodiments, the execution body may perform word segmentation on the target text data to obtain a target text vocabulary set. The target text vocabulary set may be the set of text vocabularies in the target text data. In practice, the execution body may perform the word segmentation processing through a jieba word segmentation model.
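A minimal sketch of this segmentation step, assuming the jieba library and a hypothetical input sentence:

    import jieba  # the jieba word segmentation model referenced above

    target_text_data = "本田发布了新款汽车"  # hypothetical target text data
    # jieba.lcut returns the segmentation as a list of vocabulary strings
    target_text_vocabulary_set = jieba.lcut(target_text_data)
    print(target_text_vocabulary_set)  # e.g. ['本田', '发布', '了', '新款', '汽车']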
Step 103, extracting target text vocabularies meeting the preset keyword condition from the target text vocabulary set as target key vocabularies to obtain a target key vocabulary set.
In some embodiments, the execution body may extract target text vocabularies satisfying the preset keyword condition from the target text vocabulary set as target key vocabularies to obtain a target key vocabulary set. The preset keyword condition may be that the target text vocabulary is different from every stop word.
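A minimal sketch of the keyword extraction under the stop-word condition above; the stop-word list is a hypothetical stand-in:

    # Hypothetical stop-word list; in practice loaded from a stop-word file
    stop_words = {"了", "的", "是"}

    target_text_vocabulary_set = ["本田", "发布", "了", "新款", "汽车"]

    # Preset keyword condition: keep vocabularies that differ from every stop word
    target_key_vocabulary_set = [w for w in target_text_vocabulary_set
                                 if w not in stop_words]
    print(target_key_vocabulary_set)  # ['本田', '发布', '新款', '汽车']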
Step 104, performing vectorization processing on each target key vocabulary in the target key vocabulary set to generate a target key vocabulary vector and obtain a target key vocabulary vector set.
In some embodiments, the execution body may perform vectorization processing on each target key vocabulary in the target key vocabulary set to generate a target key vocabulary vector, obtaining a target key vocabulary vector set. The target key vocabulary vector may be the vector of a target key vocabulary. In practice, the execution body may vectorize the target key vocabulary based on a one-hot representation method and a TF-IDF (Term Frequency-Inverse Document Frequency) method to obtain the target key vocabulary vector.
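A minimal sketch of the TF-IDF vectorization step, assuming scikit-learn as the implementation choice and a hypothetical corpus; in practice the vectorizer fitted during training would be reused:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical corpus of whitespace-joined segmented texts
    corpus = ["本田 发布 新款 汽车", "球队 比分 获胜"]

    # token_pattern=r"(?u)\S+" keeps the single-character Chinese tokens that the
    # default pattern would drop
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")
    vectorizer.fit(corpus)

    target_key_vocabulary_set = ["本田", "新款", "汽车"]
    target_key_vocabulary_vector = vectorizer.transform(
        [" ".join(target_key_vocabulary_set)])
    print(target_key_vocabulary_vector.toarray())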
Step 105, inputting the target key vocabulary vector set into a naive Bayes text classification model to obtain the text category corresponding to the target text data.
In some embodiments, the execution body may input the target key vocabulary vector set into the naive Bayes text classification model to obtain the text category corresponding to the target text data. The naive Bayes text classification model is obtained by pre-training. The text category may be the category to which the text belongs, and may be, but is not limited to, one of the following: junk mail, ordinary mail, sports news, automobile news, valid messages, invalid messages. The naive Bayes text classification model is obtained by training through the following steps:
first, a sample set is obtained. The samples in the sample set comprise sample texts and category labels corresponding to the sample texts. The sample text may be, but is not limited to, one of the following: spam text, general mail text, sports news text, automobile news text. The category label may be, but is not limited to, one of the following: junk mail, general mail, sports news, and automobile news. In practice, the executing body may obtain the sample set through a wired connection manner or a wireless connection manner.
And secondly, performing word segmentation processing on each sample text included in the sample set to generate a sample text vocabulary group, so as to obtain a sample text vocabulary group set. The sample text vocabulary group may be a vocabulary group composed of the vocabularies in a sample text. For example, a sample text vocabulary group may be (Honda, car) or (football, score, win). In practice, the execution body may perform word segmentation processing on each sample text included in the sample set through a jieba word segmentation model.
And thirdly, extracting sample key vocabularies from each sample text vocabulary group in the sample text vocabulary group set to obtain a sample key vocabulary group set. Each sample key vocabulary group in the sample key vocabulary group set corresponds to a sample text vocabulary group in the sample text vocabulary group set. The sample key vocabulary group set may be stored in the form of a table. In practice, the execution body may extract sample text vocabularies different from the stop words from each sample text vocabulary group as sample key vocabularies to obtain the sample key vocabulary group.
And fourthly, vectorizing each sample key vocabulary group in the sample key vocabulary group set to generate a sample key vocabulary vector group and obtain a sample key vocabulary vector group set. The sample key vocabulary vector group may be a vector group formed by sample key vocabulary vectors, and a sample key vocabulary vector may be the vector of a sample key vocabulary. In practice, the execution body may perform vectorization processing on the sample key vocabulary groups based on a one-hot representation method and a TF-IDF (Term Frequency-Inverse Document Frequency) method to obtain the sample key vocabulary vector groups.
And fifthly, generating a sample key vocabulary matrix according to the sample key vocabulary vector group set. The sample key vocabulary matrix may be a matrix using each sample key vocabulary vector as an element. In practice, the execution subject may use each sample key vocabulary vector group in the sample key vocabulary vector group set as a column vector group to form a sample key vocabulary matrix.
And sixthly, training to obtain a naive Bayes text classification model according to the sample key vocabulary matrix, each category label included in the sample set, and an initial naive Bayes text classification model. The naive Bayes text classification model comprises the prior probabilities and the conditional probabilities corresponding to the category labels. The initial naive Bayes text classification model may be a naive Bayes model. In practice, the execution body may first substitute the sample key vocabulary matrix and the category labels included in the sample set into the initial naive Bayes text classification model, and then solve for the prior probability and the conditional probabilities corresponding to each category label to generate the naive Bayes text classification model.
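As an illustrative aside (not part of the original disclosure): a naive Bayes classifier picks the category label c maximizing P(c) * P(w1|c) * ... * P(wn|c), where P(c) is the prior probability and the P(wi|c) are the conditional probabilities referred to in the sixth step. A minimal end-to-end training and inference sketch, assuming jieba and scikit-learn and a hypothetical two-sample set:

    import jieba
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical sample set: (sample text, category label) pairs
    sample_set = [
        ("本田发布了新款汽车", "automobile news"),
        ("球队以两球优势获胜", "sports news"),
    ]
    texts, labels = zip(*sample_set)

    # Second and third steps: segment each sample text (stop-word removal omitted)
    segmented = [" ".join(jieba.lcut(text)) for text in texts]

    # Fourth and fifth steps: vectorize and assemble the sample key vocabulary matrix
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")
    sample_key_vocabulary_matrix = vectorizer.fit_transform(segmented)

    # Sixth step: fitting MultinomialNB estimates the prior probability of each
    # category label and the conditional probabilities of vocabularies given a label
    model = MultinomialNB()
    model.fit(sample_key_vocabulary_matrix, list(labels))

    # Inference as in step 105
    target = " ".join(jieba.lcut("新款汽车上市"))
    print(model.predict(vectorizer.transform([target]))[0])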
Optionally, the executing body may execute the following steps:
firstly, determining target text data cache corresponding to the text type. The target text data cache may be a cache of the target text data.
As an example, the text category may be general mail. The execution body may determine the cache corresponding to the general mail as the target text data cache.
And secondly, determining the size of the cache space of the target text data cache. The size of the cache space may be the size of the remaining cache space of the target text data cache.
And thirdly, determining whether the size of the cache space satisfies a preset cache space condition. The preset cache space condition may be that the size of the remaining cache space is greater than or equal to the size of the cache space required by the target text data.
And fourthly, in response to the fact that the size of the cache space meets the preset cache space condition, storing the target text data into the target text data cache.
And fifthly, in response to determining that the size of the cache space does not satisfy the preset cache space condition, deleting each piece of text data meeting a preset time condition from the target text data cache. The preset time condition may be that the buffering time falls within the period from the first buffering time to a second preset duration after the first buffering time. The first buffering time may be the buffering time of the earliest target text data in the target text data cache. The second preset duration may be preset, for example, 5 seconds.
And sixthly, in response to determining that the size of the cache space of the target text data cache after deletion satisfies the preset cache space condition, storing the target text data into the target text data cache after deletion.
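A minimal sketch of this cache-management logic under the stated conditions; the byte-based capacity accounting and the constants are assumptions:

    from collections import deque

    CACHE_CAPACITY = 1024        # bytes; hypothetical total cache space
    SECOND_PRESET_DURATION = 5   # seconds after the first buffering time

    target_text_data_cache = deque()  # (buffer_time, text), oldest entry first

    def cache_size(cache):
        return sum(len(text.encode("utf-8")) for _, text in cache)

    def store_target_text_data(text, now):
        needed = len(text.encode("utf-8"))
        # Preset cache space condition: enough remaining space for the new text
        if (CACHE_CAPACITY - cache_size(target_text_data_cache) < needed
                and target_text_data_cache):
            # Preset time condition: delete entries buffered between the first
            # buffering time and the second preset duration after it
            first_time = target_text_data_cache[0][0]
            while (target_text_data_cache
                   and target_text_data_cache[0][0] <= first_time + SECOND_PRESET_DURATION):
                target_text_data_cache.popleft()
        if CACHE_CAPACITY - cache_size(target_text_data_cache) >= needed:
            target_text_data_cache.append((now, text))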
Optionally, before step 105, the executing body may execute the following steps:
firstly, determining the scene type corresponding to the target text data. The scene type may be an application scene of the target text data. The scene type may be, but is not limited to, one of the following: mail type, news type, community type. In practice, the execution subject may determine a scene type corresponding to the target text data in response to receiving a classification request from a user.
As an example, in response to receiving a classification request from a user characterizing mail classification, the execution body may determine the mail type as the scene type corresponding to the target text data.
In some optional implementations of some embodiments, the execution body may determine the scene type corresponding to the target text data by the following steps:
the first step is to collect the interface image of the input interface of the user. In practice, the execution subject may capture an interface image of the input interface of the user by means of screen capture.
And secondly, determining the scene type corresponding to the target text data according to the interface image. In practice, first, the execution subject may perform image recognition on the interface image by using an interface picture recognition method. And then, determining the interface image category obtained after the identification as the scene type corresponding to the target text data.
And secondly, extracting a preset naive Bayes text classification model corresponding to the scene type from a preset naive Bayes text classification model set as the naive Bayes text classification model. Each preset naive Bayes text classification model in the set corresponds to one scene type and is obtained by pre-training.
As an example, the scene type may be the mail type. The execution body can extract, from the preset naive Bayes text classification model set, the preset naive Bayes text classification model whose corresponding scene type is the mail type, and use it as the naive Bayes text classification model.
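A minimal sketch of the per-scene model lookup; the registry keys and placeholder models are assumptions (in practice each model would be pre-trained on its own scene-specific sample set):

    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical preset naive Bayes text classification model set, keyed by
    # scene type; untrained placeholders stand in for the pre-trained models
    preset_model_set = {
        "mail type": MultinomialNB(),
        "news type": MultinomialNB(),
        "community type": MultinomialNB(),
    }

    scene_type = "mail type"  # e.g. determined from the user's classification request
    naive_bayes_model = preset_model_set[scene_type]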
Optionally, the executing body may execute the following steps:
the first step is to select candidate push information matched with the text type from the candidate push information set as push information to obtain a push information set. The candidate push information may be candidate push information. The push information may be information for pushing.
In a second step, a push window is displayed on the associated display device. Wherein, at least one piece of push information in the push information set is displayed in the push window. The display device may be a display screen. The push window may be a page window for displaying push related information.
And thirdly, responding to the detected selection operation acting on the push window, and displaying a push page corresponding to the push window on the display equipment. Wherein, the push information set is displayed in the push page. The selection operation may include, but is not limited to, at least one of: click, slide, hover, drag. The push page may be a page for displaying a push information set.
The relevant content of the above technical scheme is an invention point of the embodiments of the present disclosure, and solves the second technical problem noted in the background: the text classification method based on the deep learning model depends heavily on data and cannot handle text classification in complex scenes. To address this, the naive Bayes-based text classification method of some embodiments of the present disclosure first determines the scene type corresponding to the target text data, so that the corresponding preset naive Bayes text classification model can be selected and text classification in complex scenes becomes more targeted. Then, a preset naive Bayes text classification model corresponding to the scene type is extracted from a preset naive Bayes text classification model set to serve as the naive Bayes text classification model; each preset model in the set corresponds to one scene type and is obtained by pre-training, so a model matched with the scene type is obtained for text classification in the complex scene. Because a naive Bayes text classification model achieves good classification results on small data sets and has low data dependency, a preset model corresponding to each scene type can be trained in advance. Therefore, the naive Bayes-based text classification method can handle text classification in complex scenes.
With further reference to fig. 2, as an implementation of the methods illustrated in the above figures, the present disclosure provides some embodiments of a naive Bayes-based text classification apparatus, which correspond to the method embodiments illustrated in fig. 1, and which can be applied in particular to various electronic devices.
As shown in fig. 2, the naive Bayes-based text classification apparatus 200 of some embodiments includes: an acquisition unit 201, a first processing unit 202, an extraction unit 203, a second processing unit 204, and an input unit 205. The acquisition unit 201 is configured to acquire target text data; the first processing unit 202 is configured to perform word segmentation processing on the target text data to obtain a target text vocabulary set; the extraction unit 203 is configured to extract target text vocabularies meeting the preset keyword condition from the target text vocabulary set as target key vocabularies to obtain a target key vocabulary set; the second processing unit 204 is configured to perform vectorization processing on each target key vocabulary in the target key vocabulary set to generate a target key vocabulary vector and obtain a target key vocabulary vector set; the input unit 205 is configured to input the target key vocabulary vector set into a naive Bayes text classification model, which is trained in advance, to obtain the text category corresponding to the target text data.
It will be appreciated that the units described in the apparatus 200 correspond to the various steps in the method described with reference to fig. 1. Thus, the operations, features, and resulting advantages described above with respect to the method are also applicable to the apparatus 200 and the units included therein, and are not described herein again.
Referring now to FIG. 3, shown is a block diagram of an electronic device (e.g., a computing device or terminal device) 300 suitable for use in implementing some embodiments of the present disclosure. The electronic device in some embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle-mounted terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device/terminal device/server shown in fig. 3 is only an example, and should not bring any limitation to the functions and use range of the embodiments of the present disclosure.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic device 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication means 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 3 illustrates an electronic device 300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided. Each block shown in fig. 3 may represent one device or may represent multiple devices, as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through the communication device 309, or installed from the storage device 308, or installed from the ROM 302. The computer program, when executed by the processing apparatus 301, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring target text data; performing word segmentation processing on the target text data to obtain a target text vocabulary set; extracting target text vocabularies meeting preset keyword conditions from the target text vocabulary set to serve as target key vocabularies, and obtaining a target key vocabulary set; vectorizing each target key vocabulary in the target key vocabulary set to generate a target key vocabulary vector to obtain a target key vocabulary vector set; and inputting the target key vocabulary vector set into a naive Bayes text classification model to obtain a text category corresponding to the target text data, wherein the naive Bayes text classification model is obtained by pre-training.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software, and may also be implemented by hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a first processing unit, an extraction unit, a second processing unit, and an input unit. Here, the names of these units do not constitute a limitation to the unit itself in some cases, and for example, the acquisition unit may also be described as "a unit that acquires target text data".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description is only of the preferred embodiments of the present disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by interchanging the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (9)

1. A text classification method based on naive Bayes comprises the following steps:
acquiring target text data;
performing word segmentation processing on the target text data to obtain a target text vocabulary set;
extracting target text vocabularies meeting preset keyword conditions from the target text vocabulary set to serve as target key vocabularies, and obtaining a target key vocabulary set;
vectorizing each target key word in the target key word set to generate a target key word vector to obtain a target key word vector set;
and inputting the target key vocabulary vector set into a naive Bayes text classification model to obtain a text category corresponding to the target text data, wherein the naive Bayes text classification model is obtained by pre-training.
2. The method of claim 1, wherein the naive bayes text classification model is trained by:
obtaining a sample set, wherein samples in the sample set comprise sample texts and category labels corresponding to the sample texts;
performing word segmentation processing on each sample text included in the sample set to generate a sample text vocabulary group, and obtaining a sample text vocabulary group set;
extracting sample key words from each sample text word group in the sample text word group set to obtain a sample key word group set, wherein the sample key words in the sample key word group set correspond to the sample text word groups in the sample text word group set;
vectorizing each sample key vocabulary group in the sample key vocabulary group set to generate a sample key vocabulary vector group and obtain a sample key vocabulary vector group set;
generating a sample key vocabulary matrix according to the sample key vocabulary vector group set;
and training to obtain a naive Bayesian text classification model according to the sample key vocabulary matrix and each class label included in the sample set by using an initial naive Bayesian text classification model, wherein the naive Bayesian text classification model comprises each prior probability and each conditional probability corresponding to each class label.
3. The method of claim 1, wherein prior to the obtaining target text data, the method further comprises:
receiving text data input by a user;
and storing the text data to a text data cache.
4. The method of claim 3, wherein the obtaining target text data comprises:
and responding to the fact that the current time meets the preset interval duration condition, and obtaining target text data from the text data cache.
5. The method of claim 4, wherein the retrieving target text data from the text data cache comprises:
acquiring each text data stored in a target historical time period from the text data cache as batch text data;
and selecting the text data meeting the preset screening condition from the batch of text data as target text data.
6. The method according to one of claims 1-5, wherein the method further comprises:
determining a target text data cache corresponding to the text category;
determining the size of a cache space of the target text data cache;
determining whether the size of the cache space meets a preset cache space condition;
in response to determining that the size of the cache space meets the preset cache space condition, storing the target text data into the target text data cache;
deleting each text data meeting a preset time condition from the target text data cache in response to determining that the size of the cache space does not meet the preset cache space condition;
and in response to the fact that the size of the cache space of the deleted target text data cache meets the preset cache space condition, storing the target text data into the deleted target text data cache.
7. A naive bayes-based text classification apparatus, comprising:
an acquisition unit configured to acquire target text data;
the first processing unit is configured to perform word segmentation processing on the target text data to obtain a target text vocabulary set;
the extraction unit is configured to extract target text vocabularies meeting preset keyword conditions from the target text vocabulary set to serve as target key vocabularies, and a target key vocabulary set is obtained;
the second processing unit is configured to perform vectorization processing on each target key vocabulary in the target key vocabulary set to generate a target key vocabulary vector to obtain a target key vocabulary vector set;
an input unit configured to input the target key vocabulary vector set into a naive Bayes text classification model, which is pre-trained, to obtain a text category corresponding to the target text data.
8. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-6.
9. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN202210867479.3A 2022-07-22 2022-07-22 Text classification method, device, equipment and medium based on naive Bayes Pending CN115292487A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210867479.3A CN115292487A (en) 2022-07-22 2022-07-22 Text classification method, device, equipment and medium based on naive Bayes


Publications (1)

Publication Number Publication Date
CN115292487A true CN115292487A (en) 2022-11-04

Family

ID=83824780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210867479.3A Pending CN115292487A (en) 2022-07-22 2022-07-22 Text classification method, device, equipment and medium based on naive Bayes

Country Status (1)

Country Link
CN (1) CN115292487A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060074630A1 (en) * 2004-09-15 2006-04-06 Microsoft Corporation Conditional maximum likelihood estimation of naive bayes probability models
CN106294718A (en) * 2016-08-08 2017-01-04 北京邮电大学 Information processing method and device
CN110442709A (en) * 2019-06-24 2019-11-12 厦门美域中央信息科技有限公司 A kind of file classification method based on model-naive Bayesian
CN111639159A (en) * 2020-05-28 2020-09-08 深圳壹账通智能科技有限公司 Real-time generation method and device for phrase dictionary, electronic equipment and storage medium
CN114564582A (en) * 2022-02-25 2022-05-31 苏州浪潮智能科技有限公司 Short text classification method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117473511A (en) * 2023-12-27 2024-01-30 中国联合网络通信集团有限公司 Edge node vulnerability data processing method, device, equipment and storage medium
CN117473511B (en) * 2023-12-27 2024-04-02 中国联合网络通信集团有限公司 Edge node vulnerability data processing method, device, equipment and storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 20221104)