CN113535965A - Method and system for large-scale classification of texts - Google Patents


Info

Publication number
CN113535965A
CN113535965A (application CN202111084923.6A)
Authority
CN
China
Prior art keywords
text
classified
word
segmentation result
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111084923.6A
Other languages
Chinese (zh)
Inventor
沈伟
杨红飞
Current Assignee
Hangzhou Firestone Technology Co ltd
Original Assignee
Hangzhou Firestone Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Firestone Technology Co ltd
Priority to CN202111084923.6A
Publication of CN113535965A
Legal status: Pending

Classifications

    • G06F 16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F 40/194: Handling natural language data; text processing; calculation of difference between files
    • G06F 40/242: Natural language analysis; lexical tools; dictionaries
    • G06F 40/279: Natural language analysis; recognition of textual entities

Abstract

The application relates to a method and a system for large-scale text classification. The method comprises: obtaining a first simhash value for each initially classified text in a database and calculating a second simhash value for the text to be classified; comparing the first simhash value with the second simhash value to obtain the distance between each initially classified text and the text to be classified; selecting, from each class of the initially classified texts, a preset number of texts with the smallest distance as the classified texts; obtaining keywords and word vectors of the classified texts; obtaining a second word segmentation result of the text to be classified; and calculating the similarity between the classified texts and the text to be classified through a text vector similarity calculation method according to the second word segmentation result, the keywords, and their word vectors, thereby completing the text classification.

Description

Method and system for large-scale classification of texts
Technical Field
The present application relates to the field of data classification, and in particular, to a method and system for large-scale text classification.
Background
At present, common text classification models perform relatively poorly when the number of classes is extremely large, for example several hundred. Existing text classification schemes therefore usually rely on methods such as resampling and reweighting to manually supervise the distribution of the training data; a balanced data set inherently relaxes the demands on algorithm robustness and, to a certain extent, guarantees the reliability of the resulting model. However, as the number of classes increases on a large scale, the long-tail distribution of the data becomes more and more severe, and the need to manually maintain balance among the classes brings an exponentially increasing sampling cost.
At present, the related art provides no effective solution to the problem of long-tail data distribution in large-scale text classification.
Disclosure of Invention
The embodiments of the present application provide a method and a system for large-scale text classification, to solve at least the problem of long-tail data distribution in large-scale text classification in the related art.
In a first aspect, an embodiment of the present application provides a method for text large-scale classification, where the method includes:
acquiring a first simhash value of an initially classified text in a database, and calculating a second simhash value of the text to be classified through a locality sensitive hash algorithm;
comparing the first simhash value with the second simhash value to obtain the distance between the initial classified text and the text to be classified, and selecting a preset number of texts with the minimum distance from each type of the initial classified text as the classified text;
acquiring a first word segmentation result of the classified text in the database, determining a keyword from the first word segmentation result according to a keyword word list, and obtaining a word vector of the keyword;
and acquiring a second word segmentation result of the text to be classified, calculating the similarity between the classified text and the text to be classified through a text vector similarity algorithm according to the second word segmentation result, the keyword word list and the word vectors of the keywords, and determining the category of the classified text with the maximum similarity as the category of the text to be classified.
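The four steps above can be sketched end to end as follows. This is a toy illustration under stated assumptions, not the patented implementation: a token-set "signature", a symmetric-difference "distance", and token overlap stand in for the simhash, Hamming distance, and WMD-based similarity that the method actually uses.

```python
def toy_signature(tokens):
    # Stand-in for the simhash signature: just the set of tokens.
    return frozenset(tokens)

def toy_distance(a, b):
    # Stand-in for the Hamming distance: size of the symmetric difference.
    return len(a ^ b)

def classify(tokens, db, n=2):
    """db maps a class label to a list of tokenized texts (a hypothetical
    layout for the database of initially classified texts)."""
    sig = toy_signature(tokens)
    # Step 2: from each class keep the n texts nearest to the input.
    candidates = []
    for label, texts in db.items():
        ranked = sorted(texts, key=lambda t: toy_distance(sig, toy_signature(t)))
        candidates.extend((label, t) for t in ranked[:n])
    # Steps 3-4: the class of the most similar candidate wins.
    label, _ = max(candidates, key=lambda lt: len(set(lt[1]) & set(tokens)))
    return label
```

The point of the structure is that the cheap signature comparison prunes the database to a small, per-class-balanced candidate set before the expensive similarity computation runs.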
In some embodiments, before obtaining the first simhash value of the initial classified text in the database, the method comprises:
performing word segmentation on the obtained initial classified text to obtain a first word segmentation result, and respectively calculating a first simhash value of the initial classified text through a locality sensitive hash algorithm;
storing the text label of the initial classified text, the first simhash value and the first segmentation result in a database.
In some embodiments, before determining a keyword from the first segmentation result according to a keyword vocabulary and obtaining a word vector of the keyword, the method includes:
determining a keyword word list from the first word segmentation result of the classified text through a TF-IDF algorithm;
and calculating word vectors of the keywords in the keyword word list through a word2vec algorithm.
In some embodiments, after determining the category of the classified text with the maximum similarity as the category of the text to be classified, the method further includes:
and verifying the classification result through manual checking and/or a text classification model, and if the classification result passes the verification, updating the database by using the text to be classified.
In some embodiments, before obtaining the word segmentation result of the text to be classified, the method further includes:
and performing word segmentation on the text to be classified to obtain a second word segmentation result.
In a second aspect, an embodiment of the present application provides a system for large-scale classification of texts, where the system includes a database module, a preprocessing module, a calculation module, and a classification module;
the database module acquires a first simhash value of the initial classified text;
the preprocessing module calculates a second simhash value of the text to be classified through a locality sensitive hash algorithm, compares the first simhash value with the second simhash value to obtain the distance between the initial classified text and the text to be classified, and selects a preset number of texts with the minimum distance from each type of the initial classified text as the classified text;
the calculation module acquires a first word segmentation result of the classified text in the database, determines a keyword from the first word segmentation result according to a keyword word list and obtains a word vector of the keyword;
the classification module obtains a second word segmentation result of the text to be classified, calculates the similarity between the classified text and the text to be classified through a text vector similarity algorithm according to the second word segmentation result, the keyword word list and the word vectors of the keywords, and determines the category of the classified text with the maximum similarity as the category of the text to be classified.
In some embodiments, before the database module obtains the first simhash value of the initial classified text, the method further includes:
the preprocessing module carries out word segmentation on the obtained initial classified text to obtain a first word segmentation result, and first simhash values of the initial classified text are calculated through a local sensitive hash algorithm;
the preprocessing module stores the text label of the initial classified text, the first simhash value and the first segmentation result in a database.
In some embodiments, before the calculating module determines a keyword from the first segmentation result according to a keyword vocabulary and obtains a word vector of the keyword, the calculating module further includes:
the calculation module determines a keyword word list from the first word segmentation result of the classified text through a TF-IDF algorithm;
and the calculation module calculates word vectors of the keywords in the keyword word list through a word2vec algorithm.
In some embodiments, after the classification module determines the class of the classified text with the maximum similarity as the class of the text to be classified, the system further comprises a verification update module;
and the verification updating module verifies the classification result through manual checking and/or a text classification model, and if the classification result passes the verification, the database is updated by using the text to be classified.
In some embodiments, before the classifying module obtains the word segmentation result of the text to be classified, the method further includes:
and the preprocessing module performs word segmentation on the text to be classified to obtain a second word segmentation result.
Compared with the related art, the method and system for large-scale text classification provided by the embodiments of the present application obtain a first simhash value for the initially classified texts in a database, calculate a second simhash value for the text to be classified, compare the two to obtain the distance between the initially classified texts and the text to be classified, select from each class of the initially classified texts the texts with the smallest distance as the classified texts, obtain keywords and word vectors of the classified texts, obtain the second word segmentation result of the text to be classified, and calculate the similarity between the classified texts and the text to be classified through a text vector similarity calculation method according to the second word segmentation result, the keywords, and their word vectors, thereby completing the text classification. This solves the problem of long-tail data distribution in large-scale text classification without requiring manual supervision or balancing of the data distribution, and reduces the cost of data processing in large-scale text classification.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of steps of a method for large-scale classification of text according to an embodiment of the present application;
FIG. 2 is a block diagram of a text large-scale classification system according to an embodiment of the present application;
FIG. 3 is a block diagram of the expanded text large-scale classification system;
fig. 4 is an internal structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 21, database module; 22, preprocessing module; 23, calculation module; 24, classification module; 25, verification update module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. The words "a", "an", "the", and similar words in this application do not denote a limitation of quantity and may refer to the singular or the plural. The terms "including", "comprising", "having", and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The words "connected", "coupled", and the like are not restricted to physical or mechanical connections and may include electrical connections, whether direct or indirect. The term "plurality" means two or more. "And/or" describes an association relationship between associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects. The terms "first", "second", "third", and the like merely distinguish similar objects and do not denote a particular ordering.
The embodiment of the present application provides a method for large-scale classification of a text, fig. 1 is a flowchart illustrating steps of the method for large-scale classification of a text according to the embodiment of the present application, and as shown in fig. 1, the method includes the following steps:
step S102, acquiring a first simhash value of an initially classified text in a database, and calculating a second simhash value of the text to be classified through a locality sensitive hash algorithm;
specifically, a first simhash value of an initial classified text is obtained from a database;
calculating a second simhash value of the text to be classified by a locality sensitive hash algorithm, wherein the detailed steps are as follows:
the locality sensitive hash (simhash) algorithm is an algorithm for solving the task of removing the duplicate on the order of billions of web pages.
Step one: segment the text to be classified into feature words, form a word sequence with noise words removed, and assign a weight to each word.
For example, take the sentence "An employee of the US 'Area 51' claims there are 9 flying saucers inside and that he once saw a gray alien". After segmentation and weighting it becomes: US(4) Area 51(5) employee(3) claims(1) inside(2) there are(1) 9(3) flying saucers(5) once(1) saw(3) gray(4) alien(5). Here the weights are divided into five levels (1-5); the number in parentheses indicates the importance of the word in the whole sentence, and a larger number means a more important word.
Step two: convert each word into a hash value through a hash algorithm.
For example, suppose "US" hashes to 100101 and "Area 51" hashes to 101011; the word sequence thus becomes a sequence of numbers.
Step three: obtain a weighted sequence for each word from its hash value and weight, adding the weight where a bit is 1 and subtracting it where a bit is 0.
For example, the hash value of "US" is 100101 with weight 4, which gives the weighted sequence "4 -4 -4 4 -4 4"; the hash value of "Area 51" is 101011 with weight 5, which gives "5 -5 5 -5 5 5".
Step four: accumulate the weighted sequences of all the words, position by position, into a single accumulated sequence.
For example, adding "4 -4 -4 4 -4 4" for "US" and "5 -5 5 -5 5 5" for "Area 51" gives "9 -9 1 -1 1 9".
Step five: reduce the dimensionality of the accumulated sequence to obtain the simhash signature of the text to be classified.
The value "9 -9 1 -1 1 9" from step four is converted into a 0/1 string: values greater than 0 become 1 and values less than 0 become 0, giving the final signature "101011".
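The five steps above can be sketched as follows. This assumes MD5 as the per-word hash function (the text does not name one) and takes already-weighted tokens as input, i.e. step one has been done by the caller.

```python
import hashlib

def simhash(weighted_tokens, bits=64):
    """Steps two to five: weighted_tokens is a list of (word, weight)
    pairs produced by segmentation and weighting (step one)."""
    v = [0] * bits
    for word, weight in weighted_tokens:
        # Step 2: hash each word to a fixed-width integer (MD5 here).
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Steps 3-4: add the weight where the bit is 1, subtract
            # where it is 0, accumulating over all words.
            v[i] += weight if (h >> i) & 1 else -weight
    # Step 5: reduce to a 0/1 signature by the sign of each position.
    return sum(1 << i for i in range(bits) if v[i] > 0)
```

Because high-weight words dominate the accumulated sequence, two texts that share their important words produce signatures differing in few bit positions.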
Step S104, comparing the first simhash value with the second simhash value to obtain the distance between the initial classified text and the text to be classified, and selecting a preset number of texts with the minimum distance from each type of the initial classified text as the classified text;
specifically, the initial classified texts contain several categories of texts, each category of text contains several numbers of texts, and the numbers of texts contained in different categories of texts are different;
calculating the hamming distance between the initial classified text and the text to be classified according to the first simhash value and the second simhash value;
and respectively selecting the first n items of texts with the minimum Hamming distance from each type of the initial classified texts as classified texts by taking the type in the initial classified texts as a unit, wherein the type with the minimum text number in the initial classified texts can be selected, and the text number of the type is taken as the value of n.
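The Hamming-distance comparison and per-class top-n selection can be sketched as below; the dictionary layout for the database (class label mapped to (text_id, simhash) pairs) is a hypothetical assumption for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    # Number of bit positions in which the two simhash signatures differ.
    return bin(a ^ b).count("1")

def nearest_per_class(sig, classified, n):
    """classified maps a class label to a list of (text_id, simhash) pairs;
    returns, for each class, the n texts nearest to signature `sig`."""
    return {
        label: sorted(items, key=lambda t: hamming_distance(sig, t[1]))[:n]
        for label, items in classified.items()
    }
```

Setting n to the size of the smallest class, as suggested above, gives every class the same number of candidates, which is what counteracts the long-tail imbalance.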
Step S106, obtaining a first word segmentation result of the classified text in the database, determining keywords from the first word segmentation result according to a keyword word list, and obtaining word vectors of the keywords;
step S108, obtaining a second word segmentation result of the text to be classified, calculating the similarity between the classified text and the text to be classified through a text vector similarity algorithm according to the second word segmentation result, the keyword word list and the word vectors of the keywords, and determining the category of the classified text with the maximum similarity as the category of the text to be classified.
Specifically, according to the second word segmentation result, the keyword word list, and the word vectors of the keywords, the similarity between the classified text and the text to be classified is calculated through the WMD algorithm. (Word Mover's Distance, abbreviated WMD, measures text similarity on the basis of word2vec by computing the distance the words of one text must be moved to match the words of the other.) The detailed steps are as follows:
Obtain a d x n word vector matrix X, where n is the size of the dictionary and d is the dimensionality of the word vectors;
obtain the importance degree $d_i$ of each word by TF-IDF calculation;
let $d$ and $d'$ denote the normalized bag-of-words representations of the classified text and the text to be classified, respectively. Each word $i$ in $d$ may be transferred, wholly or partially, to each word $j$ in $d'$, giving a sparse m x n transfer matrix $T$, where $T_{ij}$ is the amount of word $i$ in $d$ transferred to word $j$ in $d'$;
solve, by linear programming, for the minimum cumulative global transfer cost from $d$ to $d'$, which measures the similarity between the two texts (WMD is a distance, so a smaller value means more similar texts):

$$\min_{T \ge 0} \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij}\, c(i, j)$$

where $c(i, j)$ is the distance between the word vectors of words $i$ and $j$, subject to constraint 1: the total outflow of the $i$-th word in $d$ must equal $d_i$,

$$\sum_{j=1}^{n} T_{ij} = d_i \quad \text{for all } i,$$

and constraint 2: the total inflow into the $j$-th word in $d'$ must equal $d'_j$,

$$\sum_{i=1}^{m} T_{ij} = d'_j \quad \text{for all } j.$$
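The linear program above can be sketched with scipy's solver. This is an illustrative implementation under assumptions the text does not fix: Euclidean distance between word vectors as the cost c(i, j), and one row per word in the input matrices. In practice a ready-made routine such as gensim's KeyedVectors.wmdistance would be typical; either way, the most similar classified text is the one with the smallest WMD value.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(X_d, d, X_dp, dp):
    """Word Mover's Distance between two documents.

    X_d  : (m, dim) word vectors of the words in document d
    d    : (m,) normalized word weights of d (sums to 1)
    X_dp : (n, dim) word vectors of the words in document d'
    dp   : (n,) normalized word weights of d' (sums to 1)
    """
    m, n = len(d), len(dp)
    # Cost c(i, j): Euclidean distance between word vectors, flattened
    # row-major to match the flattened transfer matrix T.
    c = np.linalg.norm(X_d[:, None, :] - X_dp[None, :, :], axis=2).ravel()
    # Constraint 1: total outflow of word i in d equals d_i.
    A_out = np.zeros((m, m * n))
    for i in range(m):
        A_out[i, i * n:(i + 1) * n] = 1
    # Constraint 2: total inflow into word j in d' equals d'_j.
    A_in = np.zeros((n, m * n))
    for j in range(n):
        A_in[j, j::n] = 1
    res = linprog(c, A_eq=np.vstack([A_out, A_in]),
                  b_eq=np.concatenate([d, dp]),
                  bounds=(0, None), method="highs")
    return res.fun
```

For identical documents the optimal transfer keeps all mass in place at zero cost, so the distance is 0; for two single-word documents it reduces to the distance between their word vectors.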
Through the steps S102 to S108 in the embodiment of the application, the problem of long tail distribution of data in large-scale text classification is solved, manual supervision and balance on the distribution of the data are not needed, and the cost of data processing in the large-scale text classification is reduced.
In some embodiments, in step S102, before obtaining the first simhash value of the initial classified text in the database, the method further includes:
performing word segmentation on the obtained initial classified text to obtain a first word segmentation result, and respectively calculating a first simhash value of the initial classified text through a locality sensitive hash algorithm;
and storing the text label, the first simhash value and the first segmentation result of the initial classified text into a database.
Because the texts of different classes are independent of one another, storing the data of the initially classified texts in the database means that expanding the set of classes or adding training data requires only a small amount of parameter updating and simple expansion, which makes model iteration easier.
In some embodiments, before determining the keyword from the first segmentation result according to the keyword vocabulary and obtaining the word vector of the keyword in step S106, the method further includes:
determining a keyword word list from the first word segmentation result of the classified text through a TF-IDF algorithm;
and calculating word vectors of the keywords in the keyword word list through a word2vec algorithm.
It should be noted that TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. The main idea is that if a word occurs frequently in one article but rarely in other articles, the word or phrase is considered to have good discriminating power.
word2vec is a text vectorization method that can learn distributed vector representation of words through unsupervised training on a large corpus of text.
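The keyword-vocabulary step can be sketched with a hand-rolled, smoothed TF-IDF; the exact weighting variant the scheme uses is not specified, so this is an assumption. In practice an off-the-shelf implementation such as scikit-learn's TfidfVectorizer, with gensim's Word2Vec supplying the word vectors, would be typical.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=5):
    """docs: list of tokenized documents (the first word segmentation
    results). Returns the union of each document's top_k TF-IDF words."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    vocab = set()
    for doc in docs:
        tf = Counter(doc)
        scores = {
            # Smoothed IDF so words unique to one document keep a
            # positive score even in tiny corpora.
            w: (tf[w] / len(doc)) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w in tf
        }
        vocab.update(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return vocab
```

Words that are frequent within one document but spread across the whole corpus (like "dog" below) score low and are kept out of the keyword list.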
In some embodiments, in step S108, after determining the category of the classified text with the maximum similarity as the category of the text to be classified, the method further includes:
and verifying the classification result through manual checking and/or a text classification model, and updating the database by using the text to be classified if the classification result passes the verification.
Specifically, the classification result may be verified by manual checking and/or by other text classification models different from the present scheme; if their prediction is consistent with the classification result of the present scheme, the result may be regarded as a high-confidence classification.
The text label, first simhash value, and first word segmentation result of the verified text to be classified are then added to the database, improving the final effect of the whole model.
In some embodiments, before obtaining the word segmentation result of the text to be classified, the word segmentation of the text to be classified is further performed to obtain a second word segmentation result.
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
The embodiment of the application provides a system for large-scale classification of texts, and fig. 2 is a structural block diagram of the system for large-scale classification of texts according to the embodiment of the application, and as shown in fig. 2, the system includes a database module 21, a preprocessing module 22, a calculation module 23 and a classification module 24;
the database module 21 acquires a first simhash value of the initial classified text;
the preprocessing module 22 calculates a second simhash value of the text to be classified through a locality sensitive hash algorithm, compares the first simhash value with the second simhash value to obtain a distance between the initial classified text and the text to be classified, and selects a preset number of texts with the smallest distance from each type of the initial classified text as the classified text;
the calculation module 23 obtains a first segmentation result of the classified text in the database, determines a keyword from the first segmentation result according to the keyword word list, and obtains a word vector of the keyword;
the classification module 24 obtains a second segmentation result of the text to be classified, calculates the similarity between the classified text and the text to be classified according to the second segmentation result, the keyword word list and the word vectors of the keywords by a text vector similarity algorithm, and determines the category of the classified text with the maximum similarity as the category of the text to be classified.
By the system in the embodiment of the application, the problem of long-tail data distribution in large-scale text classification is solved, manual supervision and balance on data distribution are not needed, and the cost of data processing in large-scale text classification is reduced.
In some embodiments, before the database module obtains the first simhash value of the initial classified text, the method further includes:
the preprocessing module 22 performs word segmentation on the obtained initial classified text to obtain a first word segmentation result, and respectively calculates a first simhash value of the initial classified text through a locality sensitive hash algorithm;
the pre-processing module 22 stores the text label, the first simhash value, and the first segmentation result of the initial classified text in a database.
In some embodiments, before the calculating module 23 determines the keyword from the first segmentation result according to the keyword vocabulary and obtains the word vector of the keyword, the method further includes:
the calculation module 23 determines a keyword vocabulary from the first segmentation result of the classified text through a TF-IDF algorithm;
the calculation module 23 calculates word vectors of the keywords in the keyword vocabulary by word2vec algorithm.
In some embodiments, after the classification module 24 determines the category of the classified text with the maximum similarity as the category of the text to be classified, the system further comprises a verification update module 25; FIG. 3 is a block diagram of the expanded large-scale text classification system.
the verification updating module 25 verifies the classification result through manual checking and/or a text classification model, and if the classification result passes verification, the database is updated by using the text to be classified.
In some embodiments, before the classification module 24 obtains the word segmentation result of the text to be classified, the method further includes:
the preprocessing module 22 performs word segmentation on the text to be classified to obtain a second word segmentation result.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
The present embodiment also provides an electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.
In addition, in combination with the method for text large-scale classification in the foregoing embodiments, the embodiments of the present application may provide a storage medium to implement. The storage medium having stored thereon a computer program; the computer program, when executed by a processor, implements any of the methods of large-scale classification of text described in the above embodiments.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the method for large-scale classification of texts. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, or mouse.
In one embodiment, fig. 4 is a schematic diagram of the internal structure of an electronic device according to an embodiment of the present application. As shown in fig. 4, an electronic device is provided, which may be a server. The electronic device comprises a processor, a network interface, an internal memory, and a non-volatile memory connected by an internal bus, wherein the non-volatile memory stores an operating system, a computer program, and a database. The processor is used to provide computing and control capabilities; the network interface is used to communicate with an external terminal through a network connection; the internal memory provides an environment for running the operating system and the computer program; the computer program, when executed by the processor, implements a method for large-scale classification of texts; and the database is used to store data.
Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with the present application and does not constitute a limitation on the electronic device to which the present application is applied; a particular electronic device may include more or fewer components than those shown in the drawings, combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing the relevant hardware through a computer program; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be understood by those skilled in the art that the technical features of the above-described embodiments may be combined arbitrarily; for the sake of brevity, not all possible combinations of features are described, but any combination of these features that is not contradictory should be considered within the scope of the present disclosure.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and such variations and modifications fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for large-scale classification of text, the method comprising:
acquiring a first simhash value of initial classified text in a database, and calculating a second simhash value of a text to be classified through a locality sensitive hash algorithm;
comparing the first simhash value with the second simhash value to obtain the distance between the initial classified text and the text to be classified, and selecting, from each category of the initial classified text, a preset number of texts with the smallest distance as the classified texts;
acquiring a first word segmentation result of the classified text in the database, determining a keyword from the first word segmentation result according to a keyword word list, and obtaining a word vector of the keyword;
and acquiring a second word segmentation result of the text to be classified, calculating the similarity between the classified text and the text to be classified through a text vector similarity algorithm according to the second word segmentation result, the keyword word list and the word vectors of the keywords, and determining the category of the classified text with the maximum similarity as the category of the text to be classified.
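The candidate-selection step of claim 1 — fingerprinting texts with simhash and keeping, per category, the pre-classified texts closest in Hamming distance — can be sketched as follows. This is a simplified, hedged illustration: the bit width (64), the MD5 word hash, the uniform word weights, and the function names are all assumptions, not taken from the patent.

```python
import hashlib

def simhash(words, bits=64):
    """Simplified simhash: sum per-bit votes from each word's hash."""
    v = [0] * bits
    for w in words:
        h = int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 when the vote is positive.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def nearest_per_class(query_words, corpus, k=2):
    """corpus: {category: [word_list, ...]}; keep the k closest per category."""
    q = simhash(query_words)
    selected = {}
    for cat, docs in corpus.items():
        ranked = sorted(docs, key=lambda d: hamming(q, simhash(d)))
        selected[cat] = ranked[:k]
    return selected
```

In practice the first simhash values would be precomputed and read from the database rather than recomputed per query, and near-duplicate buckets (the usual locality-sensitive-hashing trick of splitting the fingerprint into bands) would avoid the full linear scan shown here.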
2. The method of claim 1, wherein prior to obtaining the first simhash value of the initial classified text in the database, the method comprises:
performing word segmentation on the obtained initial classified text to obtain a first word segmentation result, and respectively calculating a first simhash value of the initial classified text through a locality sensitive hash algorithm;
storing the text label of the initial classified text, the first simhash value and the first segmentation result in a database.
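A minimal sketch of the storage step of claim 2 — persisting each initial classified text's label, first simhash value, and first word segmentation result. SQLite, the table name, and the column layout are assumptions for illustration; the patent only requires "a database":

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB for the sketch
conn.execute(
    "CREATE TABLE classified_text ("
    " label TEXT, simhash INTEGER, segmentation TEXT)"
)

# Hypothetical record: category label, fingerprint, segmentation result.
record = ("finance", 0x9F3A, ["股市", "上涨"])
conn.execute(
    "INSERT INTO classified_text VALUES (?, ?, ?)",
    (record[0], record[1], json.dumps(record[2], ensure_ascii=False)),
)

label, sh, seg = conn.execute("SELECT * FROM classified_text").fetchone()
print(label, hex(sh), json.loads(seg))
```

Serializing the word list as JSON keeps the segmentation result queryable without re-running the segmenter when the text is later used for similarity scoring.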
3. The method of claim 1, wherein before determining keywords from the first segmentation result according to a keyword vocabulary and obtaining word vectors of the keywords, the method comprises:
determining a keyword word list from the first word segmentation result of the classified text through a TF-IDF algorithm;
and calculating word vectors of the keywords in the keyword word list through a word2vec algorithm.
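The TF-IDF step of claim 3 can be sketched as below. The patent does not fix the TF-IDF variant, so a plain count-based TF and unsmoothed IDF are assumed here; the word2vec step (e.g. training with gensim) is omitted and only the keyword-vocabulary construction is shown:

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_n=2):
    """docs: list of word lists; returns the top_n TF-IDF keywords per doc."""
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for d in docs for w in set(d))
    keywords = []
    for d in docs:
        tf = Counter(d)
        def score(w):
            # Term frequency times inverse document frequency.
            return (tf[w] / len(d)) * math.log(n / df[w])
        keywords.append(sorted(tf, key=score, reverse=True)[:top_n])
    return keywords

# Toy corpus: words shared across documents score zero IDF and drop out.
docs = [["股市", "股市", "上涨"], ["上涨", "利率", "利率"]]
print(tfidf_keywords(docs, top_n=1))  # [['股市'], ['利率']]
```

The union of the per-document keywords would form the keyword word list; word2vec vectors would then be trained (or looked up) only for words in that list, keeping the vector table small.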
4. The method according to claim 1, wherein after the category of the classified text with the highest similarity is determined as the category of the text to be classified, the method further comprises:
and verifying the classification result through manual checking and/or a text classification model, and if the classification result passes the verification, updating the database by using the text to be classified.
5. The method according to claim 1, wherein before obtaining the segmentation result of the text to be classified, the method further comprises:
and performing word segmentation on the text to be classified to obtain a second word segmentation result.
6. A system for large-scale classification of texts is characterized by comprising a database module, a preprocessing module, a calculation module and a classification module;
the database module acquires a first simhash value of the initial classified text;
the preprocessing module calculates a second simhash value of the text to be classified through a locality sensitive hash algorithm, compares the first simhash value with the second simhash value to obtain the distance between the initial classified text and the text to be classified, and selects a preset number of texts with the minimum distance from each type of the initial classified text as the classified text;
the calculation module acquires a first word segmentation result of the classified text in the database, determines a keyword from the first word segmentation result according to a keyword word list and obtains a word vector of the keyword;
the classification module obtains a second word segmentation result of the text to be classified, calculates the similarity between the classified text and the text to be classified through a text vector similarity algorithm according to the second word segmentation result, the keyword word list and the word vectors of the keywords, and determines the category of the classified text with the maximum similarity as the category of the text to be classified.
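The final similarity step can be illustrated as follows: represent each text by the average of the word vectors of its keywords, then compare texts with cosine similarity. This is one plausible reading of "text vector similarity algorithm" — the patent does not specify averaging or cosine, and the two-dimensional toy vectors stand in for real word2vec embeddings:

```python
import math

def text_vector(words, word_vecs, keyword_set):
    """Average the vectors of the words that appear in the keyword list."""
    kws = [w for w in words if w in keyword_set]
    dims = len(next(iter(word_vecs.values())))
    vec = [0.0] * dims
    for w in kws:
        for i, x in enumerate(word_vecs[w]):
            vec[i] += x
    # A text with no keywords maps to the zero vector.
    return [x / len(kws) for x in vec] if kws else vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

The category of the classified text whose vector has the highest cosine similarity to the vector of the text to be classified is then assigned, matching the argmax decision rule in the claim.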
7. The system of claim 6, further comprising, prior to the database module obtaining the first simhash value for the initial classified text:
the preprocessing module carries out word segmentation on the obtained initial classified text to obtain a first word segmentation result, and calculates the first simhash value of the initial classified text through a locality sensitive hash algorithm;
the preprocessing module stores the text label of the initial classified text, the first simhash value and the first segmentation result in a database.
8. The system of claim 6, further comprising, before the computing module determines a keyword from the first segmentation result according to a keyword vocabulary and obtains a word vector of the keyword:
the calculation module determines a keyword word list from the first word segmentation result of the classified text through a TF-IDF algorithm;
and the calculation module calculates word vectors of the keywords in the keyword word list through a word2vec algorithm.
9. The system according to claim 6, wherein after the classification module determines the category of the classified text with the highest similarity as the category of the text to be classified, the system further comprises a verification update module;
and the verification updating module verifies the classification result through manual checking and/or a text classification model, and if the classification result passes the verification, the database is updated by using the text to be classified.
10. The system according to claim 6, before the classifying module obtains the segmentation result of the text to be classified, further comprising:
and the preprocessing module performs word segmentation on the text to be classified to obtain a second word segmentation result.
CN202111084923.6A 2021-09-16 2021-09-16 Method and system for large-scale classification of texts Pending CN113535965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111084923.6A CN113535965A (en) 2021-09-16 2021-09-16 Method and system for large-scale classification of texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111084923.6A CN113535965A (en) 2021-09-16 2021-09-16 Method and system for large-scale classification of texts

Publications (1)

Publication Number Publication Date
CN113535965A true CN113535965A (en) 2021-10-22

Family

ID=78092682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111084923.6A Pending CN113535965A (en) 2021-09-16 2021-09-16 Method and system for large-scale classification of texts

Country Status (1)

Country Link
CN (1) CN113535965A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844314A (en) * 2017-02-21 2017-06-13 北京焦点新干线信息技术有限公司 A kind of duplicate checking method and device of article
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
CN109241274A (en) * 2017-07-04 2019-01-18 腾讯科技(深圳)有限公司 text clustering method and device
CN109697641A (en) * 2017-10-20 2019-04-30 北京京东尚科信息技术有限公司 The method and apparatus for calculating commodity similarity
CN110990577A (en) * 2019-12-25 2020-04-10 北京亚信数据有限公司 Text classification method and device


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304065A (en) * 2023-05-23 2023-06-23 美云智数科技有限公司 Public opinion text classification method, device, electronic equipment and storage medium
CN116304065B (en) * 2023-05-23 2023-09-29 美云智数科技有限公司 Public opinion text classification method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11017178B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
RU2678716C1 (en) Use of autoencoders for learning text classifiers in natural language
CN109948149B (en) Text classification method and device
US10943068B2 (en) N-ary relation prediction over text spans
CN111626048A (en) Text error correction method, device, equipment and storage medium
JP2005158010A (en) Apparatus, method and program for classification evaluation
CN114443850B (en) Label generation method, system, device and medium based on semantic similar model
JP6848091B2 (en) Information processing equipment, information processing methods, and programs
CN112256822A (en) Text search method and device, computer equipment and storage medium
CN109710921B (en) Word similarity calculation method, device, computer equipment and storage medium
CN112183994A (en) Method and device for evaluating equipment state, computer equipment and storage medium
CN112541079A (en) Multi-intention recognition method, device, equipment and medium
CN112818686A (en) Domain phrase mining method and device and electronic equipment
CN114064852A (en) Method and device for extracting relation of natural language, electronic equipment and storage medium
CN113590815A (en) Method and system for classifying hierarchical multi-element label texts
CN111291563A (en) Word vector alignment method and training method of word vector alignment model
CN113535965A (en) Method and system for large-scale classification of texts
CN117591547A (en) Database query method and device, terminal equipment and storage medium
CN114741499B (en) Text abstract generation method and system based on sentence semantic model
CN116451072A (en) Structured sensitive data identification method and device
CN114817523A (en) Abstract generation method and device, computer equipment and storage medium
CN112528621B (en) Text processing method, text processing model training device and storage medium
CN114218462A (en) Data classification method, device, equipment and storage medium based on LSTM
CN113837294A (en) Model training and calling method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211022)