CN110990577A

CN110990577A - Text classification method and device

Info

Publication number: CN110990577A
Application number: CN201911360415.9A
Authority: CN
Inventors: 孙宇浩; 孙龙超; 唐劭; 张斌; 龚平
Original assignee: Beijing Asiainfo Data Co ltd
Current assignee: Beijing Asiainfo Data Co ltd
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2020-04-10

Abstract

The present disclosure provides a text classification method and device. The text classification method comprises: calculating the similarity between a text to be classified and each of at least one pre-constructed word list; and judging whether the calculated similarity is greater than a preset threshold. If so, the text to be classified is classified into the pre-constructed word list whose similarity is greater than the preset threshold; otherwise, the text to be classified is not placed in any pre-constructed word list. The method and device remove the need for extensive manual item-by-item screening to obtain two or more specific word lists from disordered mass data, which saves manpower and material resources, improves working efficiency, allows multiple word lists to be constructed effectively, and reduces manual errors.

Description

Text classification method and device

Technical Field

The present disclosure relates to the field of computers, and in particular, to a text classification method and apparatus.

Background

In an actual production project, two or more types of word lists need to be constructed. Before the word lists are constructed, the mass data obtained through a web crawler is highly disordered, with two or more types of text mixed together. In the prior art, the required data is screened from the mass texts manually, item by item, and added to the two or more types of word lists respectively. This requires a large amount of manpower and material resources and is inefficient.

Disclosure of Invention

To solve or at least alleviate at least one of the above technical problems, the present disclosure provides a text classification method and apparatus.

In a first aspect, the present disclosure provides a text classification method, including:

calculating the similarity between the text to be classified and each pre-constructed word list of at least one pre-constructed word list;

judging whether the calculated similarity is greater than a preset threshold; if so, classifying the text to be classified into the pre-constructed word list whose similarity is greater than the preset threshold; otherwise, not placing the text to be classified in any pre-constructed word list.
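The judging step reduces to a strict comparison against the preset threshold. The following is a minimal sketch in Python, assuming the similarity of the text to each word list has already been calculated. The function name `classify` and the word-list names are hypothetical, and returning every list that exceeds the threshold is an assumption, since the disclosure does not say how a text matching several lists is handled.

```python
from typing import Dict, List


def classify(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Return the names of the word lists whose similarity exceeds the threshold.

    `similarities` maps a (hypothetical) word-list name to the similarity
    between the text to be classified and that word list.
    An empty result means the text is not placed in any word list.
    """
    return [name for name, score in similarities.items() if score > threshold]
```

With a threshold of 0.95, a text scoring 0.97 against one list is placed in that list, while a text scoring 0.86 is placed in none.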

Optionally, each pre-constructed word list contains at least one predefined text belonging to the same class.

Optionally, the text classification method further includes:

after classifying the text to be classified into the pre-constructed word list whose similarity is greater than the preset threshold, updating that pre-constructed word list.

In a second aspect, the present disclosure provides a text classification apparatus comprising: a calculation module and a judgment module, wherein,

the calculation module is used for calculating the similarity between the text to be classified and each pre-constructed word list of at least one pre-constructed word list;

the judgment module is used for judging whether the calculated similarity is greater than a preset threshold; if so, classifying the text to be classified into the pre-constructed word list whose similarity is greater than the preset threshold; otherwise, not placing the text to be classified in any pre-constructed word list.

Optionally, the text classification apparatus further includes an updating module, which is used for updating the pre-constructed word list after the text to be classified has been classified into the pre-constructed word list whose similarity is greater than the preset threshold.

In a third aspect, the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the text classification method of any one of claims 1 to 3.

In a fourth aspect, the present disclosure provides a computing device comprising a memory storing a computer program and a processor implementing the text classification method of any one of claims 1 to 3 when executing the computer program.

Compared with the prior art, the method has the following beneficial effects:

The method and device remove the need for extensive manual item-by-item screening to obtain two or more specific word lists from disordered mass data, which saves manpower and material resources, improves working efficiency, allows multiple word lists to be constructed effectively, and reduces manual errors.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.

FIG. 1 is a schematic flow chart of a text classification method provided by the present disclosure;

FIG. 2 is a schematic flow chart diagram of another text classification method provided by the present disclosure;

FIG. 3 is a structural block diagram of a text classification apparatus provided by the present disclosure.

Detailed Description

The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

As shown in fig. 1, the present disclosure provides a text classification method, including:

In one embodiment of the present disclosure, each pre-constructed word list includes at least one predefined text belonging to the same class.

In this embodiment, each text is segmented into words, word vectors are generated from the segmentation result, and matrix calculation is then performed on the word vectors to obtain the similarity between the text to be classified and the texts in the pre-constructed word list. A text to be classified whose similarity is greater than the preset threshold is classified into the corresponding pre-constructed word list. For example, suppose pre-constructed word list 1 contains text a, text b, and text c, and the texts to be classified are text A and text B. The similarity between text A and texts a, b, and c is calculated; if the result is 97% and the preset threshold is 95%, then 97% > 95% and text A is classified into pre-constructed word list 1. The similarity between text B and texts a, b, and c is likewise calculated; if the result is 86% and the preset threshold is 95%, then text B cannot be classified into pre-constructed word list 1, because 86% is less than 95%. In this example, text a, text b, and text c are the predefined texts belonging to the same class in the pre-constructed word list.
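The segmentation, vectorization, and similarity steps can be sketched as follows. The disclosure does not name a segmenter, a word-vector model, or a similarity formula, so everything concrete here is an assumption made for illustration: whitespace splitting stands in for word segmentation, bag-of-words counts stand in for word vectors, cosine similarity stands in for the matrix calculation, and the score for a word list is taken as the best match over its texts. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def segment(text: str) -> List[str]:
    # Stand-in for a real word segmenter: lowercase and split on whitespace.
    return text.lower().split()


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def list_similarity(text: str, word_list: List[str]) -> float:
    # Assumption: the similarity to a word list is the highest similarity
    # to any single text in that list; an empty list scores 0.0.
    query = Counter(segment(text))
    return max(
        (cosine(query, Counter(segment(entry))) for entry in word_list),
        default=0.0,
    )
```

A production system would replace `segment` with a language-appropriate segmenter and the count vectors with trained word vectors; the thresholding logic is unaffected by that choice.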

As shown in fig. 2, in an embodiment provided by the present disclosure, the text classification method further includes:

The pre-constructed word list in the present disclosure may also be referred to as a pre-constructed word bank. According to the types of word lists to be constructed, a set of texts is customized for each type (that is, predefined texts belonging to the same class, chosen so that similar texts, including individual words, can be retrieved as required), and each set is placed into the corresponding word list. Similarity calculation is then performed between the acquired mass texts and the texts (including words) in each word list, and texts whose similarity is greater than the preset threshold are placed into the corresponding word list, so that the word lists are continuously expanded. The preset threshold can be adjusted according to actual requirements. Similarity calculation is then performed again between the expanded word lists and the mass texts, which further increases the number of texts in each word list, building a larger word list and, correspondingly, a larger database. In the example above, after text A to be classified is placed into pre-constructed word list 1, the texts in pre-constructed word list 1 are text A, text a, text b, and text c.
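The repeated expansion described above can be sketched as a loop that keeps re-scoring the unplaced texts against the growing word lists. This is an illustrative sketch only: the names `expand_word_lists` and `char_jaccard` are hypothetical, the `max_rounds` cap and the stop-when-nothing-was-added rule are assumptions not stated in the disclosure, and the character-overlap measure is a toy stand-in for the word-vector similarity.

```python
from typing import Callable, Dict, List

Similarity = Callable[[str, List[str]], float]


def char_jaccard(text: str, texts: List[str]) -> float:
    # Toy similarity (hypothetical): best character-set Jaccard overlap.
    best = 0.0
    for other in texts:
        a, b = set(text), set(other)
        if a | b:
            best = max(best, len(a & b) / len(a | b))
    return best


def expand_word_lists(
    word_lists: Dict[str, List[str]],
    corpus: List[str],
    similarity: Similarity,
    threshold: float,
    max_rounds: int = 5,
) -> Dict[str, List[str]]:
    # Work on copies so the caller's seed lists are left untouched.
    lists = {name: list(texts) for name, texts in word_lists.items()}
    remaining = list(corpus)
    for _ in range(max_rounds):
        unplaced: List[str] = []
        added = False
        for text in remaining:
            placed = False
            for texts in lists.values():
                if similarity(text, texts) > threshold:
                    texts.append(text)  # update the word list immediately
                    placed = True
                    added = True
            if not placed:
                unplaced.append(text)
        remaining = unplaced
        if not added:  # nothing new was placed, so further rounds cannot help
            break
    return lists
```

Because each accepted text is added to its list straight away, a text that is too far from the seed texts in one round can still be accepted in a later round once an intermediate text has joined the list.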

The following explains the text classification method provided in the present disclosure again by taking emotion analysis as an example.

In this example, two types of word lists, positive and negative, are pre-constructed, and a set of customized words is placed into each. For example, the positive word list contains words such as happy and pleasant, and the negative word list contains words such as difficult, poor, and damaged.

Similarity calculation is performed between the words in the positive and negative word lists and the mass texts, and any text whose similarity to a word list is greater than ninety percent is placed into that word list. Here the preset threshold is set to ninety percent.

Similarity calculation is then performed again with the two expanded word lists, which continues to increase the number of words in the word lists and builds a larger database.
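The emotion analysis example can be made concrete with a small sketch. The two-dimensional vectors below are hand-picked, hypothetical values chosen only so that the arithmetic is easy to follow; they do not come from the disclosure or from any trained model. The candidate words `joyful`, `broken`, and `table` are likewise invented for illustration, and cosine similarity is an assumed choice of measure.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, float]

# Hypothetical 2-D word vectors, hand-picked for illustration only.
TOY_VECTORS: Dict[str, Vector] = {
    "happy": (0.9, 0.1), "pleasant": (0.8, 0.2),
    "difficult": (0.15, 0.85), "poor": (0.1, 0.9), "damaged": (0.2, 0.8),
    "joyful": (0.85, 0.15), "broken": (0.1, 0.95), "table": (0.7, 0.7),
}


def cosine(u: Vector, v: Vector) -> float:
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))


def best_similarity(word: str, word_list: List[str]) -> float:
    # Highest cosine similarity between `word` and any word in the list.
    return max(cosine(TOY_VECTORS[word], TOY_VECTORS[w]) for w in word_list)


def expand(
    word_lists: Dict[str, List[str]],
    candidates: List[str],
    threshold: float = 0.9,  # the ninety percent threshold from the example
) -> Dict[str, List[str]]:
    expanded = {name: list(words) for name, words in word_lists.items()}
    for word in candidates:
        for name, seeds in word_lists.items():
            if best_similarity(word, seeds) > threshold:
                expanded[name].append(word)
    return expanded


seed_lists = {
    "positive": ["happy", "pleasant"],
    "negative": ["difficult", "poor", "damaged"],
}
result = expand(seed_lists, ["joyful", "broken", "table"])
```

With these toy vectors, `joyful` joins the positive list, `broken` joins the negative list, and `table` stays below ninety percent for both lists and is placed in neither.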

As shown in fig. 3, the present disclosure also provides a text classification apparatus including: a calculation module and a judgment module, wherein,

In one embodiment of the present disclosure, the text classification apparatus further includes an updating module, which is used for updating the pre-constructed word list after the text to be classified has been classified into the pre-constructed word list whose similarity is greater than the preset threshold.
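One possible way to arrange the calculation, judgment, and updating modules in software is sketched below. The class and method names are hypothetical, the wiring through a `TextClassifier` wrapper is an assumption rather than the structure shown in FIG. 3, and `token_jaccard` is a toy similarity used only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Similarity = Callable[[str, List[str]], float]


def token_jaccard(text: str, texts: List[str]) -> float:
    # Toy similarity (hypothetical): best token-set Jaccard overlap.
    a = set(text.split())
    scores = [len(a & set(t.split())) / len(a | set(t.split())) for t in texts if a or t]
    return max(scores, default=0.0)


@dataclass
class CalculationModule:
    similarity: Similarity

    def calculate(self, text: str, word_lists: Dict[str, List[str]]) -> Dict[str, float]:
        # Similarity of the text to each pre-constructed word list.
        return {name: self.similarity(text, texts) for name, texts in word_lists.items()}


@dataclass
class JudgmentModule:
    threshold: float

    def judge(self, scores: Dict[str, float]) -> List[str]:
        # Word lists whose similarity is strictly greater than the threshold.
        return [name for name, score in scores.items() if score > self.threshold]


class UpdateModule:
    def update(self, text: str, matched: List[str], word_lists: Dict[str, List[str]]) -> None:
        # Add the classified text to each matched word list (in place).
        for name in matched:
            if text not in word_lists[name]:
                word_lists[name].append(text)


@dataclass
class TextClassifier:
    calculation: CalculationModule
    judgment: JudgmentModule
    updater: UpdateModule

    def classify(self, text: str, word_lists: Dict[str, List[str]]) -> List[str]:
        scores = self.calculation.calculate(text, word_lists)
        matched = self.judgment.judge(scores)
        self.updater.update(text, matched, word_lists)
        return matched
```

Keeping the similarity function as a constructor argument means the same three modules can be reused with a different vector model or threshold without changing their code.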

For the information interaction between the modules in the apparatus, their execution processes, and related details, refer to the description of the method embodiments of the present disclosure, since the apparatus is based on the same concept; these details are not repeated here.

The present disclosure also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the text classification methods of the present disclosure.

The computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a device of test software, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the means for testing software over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The present disclosure also provides a computing device comprising a memory and a processor, the memory storing a computer program, the processor implementing any of the text classification methods of the present disclosure when executing the computer program.

The computing devices of the disclosed embodiments exist in a variety of forms, including but not limited to:

(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communications. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones, among others.

(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as iPads.

(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.

(4) Other electronic devices with data processing capabilities.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other similar elements in a process, method, article, or apparatus that comprises the element.

In the description herein, reference to the terms "one embodiment/mode," "some embodiments/modes," "example," "specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment/mode or example is included in at least one embodiment/mode or example of the application. In this specification, schematic uses of the above terms do not necessarily refer to the same embodiment/mode or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/modes or examples. In addition, those skilled in the art can combine the various embodiments/modes or examples described in this specification, and the features of those embodiments/modes or examples, provided they do not conflict with one another.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of illustration of the disclosure and are not intended to limit the scope of the disclosure. Other variations or modifications may occur to those skilled in the art, based on the foregoing disclosure, and are still within the scope of the present disclosure.

Claims

1. A text classification method is characterized by comprising the following steps:

2. The method of claim 1, wherein each of the pre-constructed word lists comprises at least one predefined text belonging to the same class.

3. The text classification method according to claim 2, characterized in that the text classification method further comprises:

4. A text classification apparatus, characterized by comprising: a calculation module and a judgment module, wherein,

5. The apparatus according to claim 4, wherein each of said pre-constructed word lists comprises at least one predefined text belonging to the same class.

6. The text classification apparatus according to claim 5, further comprising: an updating module, which is used for updating the pre-constructed word list after the text to be classified has been classified into the pre-constructed word list whose similarity is greater than the preset threshold.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the text classification method of any one of claims 1 to 3.

8. A computing device comprising a memory and a processor, wherein the memory stores a computer program and the processor implements the text classification method of any one of claims 1 to 3 when executing the computer program.