CN115563281A - Text classification method and device based on text data enhancement - Google Patents

Text classification method and device based on text data enhancement

Info

Publication number
CN115563281A
Authority
CN
China
Prior art keywords
text
word
text database
word segmentation
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211255742.XA
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd filed Critical Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202211255742.XA
Publication of CN115563281A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to the technical field of text processing, and provides a text classification method and device based on text data enhancement. The method comprises the following steps: acquiring a text database, wherein the text database comprises a plurality of documents, and each document comprises a plurality of sentences; performing word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result comprises a plurality of words; calculating an importance evaluation value of each word; sampling the text database multiple times according to the importance evaluation value of each word to obtain a data-enhanced text database; and performing text classification training by using the data-enhanced text database. By adopting these technical means, the embodiment solves the problem in the prior art that a text classification model trained with a conventional text data enhancement method has weak generalization capability.

Description

Text classification method and device based on text data enhancement
Technical Field
The present disclosure relates to the field of text processing technologies, and in particular, to a text classification method and apparatus based on text data enhancement.
Background
In model training, data enhancement is often applied to the training data in order to obtain a large amount of it, and the same is true for model training in the field of text classification. Current text data enhancement means usually replace part of the words of an original text at random according to a certain rule, randomly add or delete part of the words, back-translate sentences, and the like. These modes all modify the original text directly according to fixed rules, which often leaves the enhanced sentences semantically unsmooth or deviating from the meaning of the original sentence; at the same time, they merely increase the quantity of training data and do not actually improve the generalization capability of the model.
In the course of implementing the disclosed concept, the inventors found that there are at least the following technical problems in the related art: the text classification model trained based on the traditional text data enhancement method has the problem of weak generalization capability.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a text classification method and apparatus based on text data enhancement, an electronic device, and a computer-readable storage medium, so as to solve the problem in the prior art that a text classification model trained based on a conventional text data enhancement method is weak in generalization capability.
In a first aspect of the embodiments of the present disclosure, a text classification method based on text data enhancement is provided, including: acquiring a text database, wherein the text database comprises a plurality of documents, and each document comprises a plurality of sentences; performing word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result comprises a plurality of words; calculating an importance evaluation value of each word; sampling the text database for multiple times according to the importance evaluation value of each word to obtain a text database with enhanced data; and performing text classification training by using the data-enhanced text database.
In a second aspect of the embodiments of the present disclosure, a text classification device based on text data enhancement is provided, including: an acquisition module configured to acquire a text database, wherein the text database comprises a plurality of documents, and each document comprises a plurality of sentences; a word segmentation module configured to perform word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result comprises a plurality of words; a calculation module configured to calculate an importance evaluation value of each word; a sampling module configured to sample the text database multiple times according to the importance evaluation value of each word to obtain a data-enhanced text database; and a training module configured to perform text classification training by using the data-enhanced text database.
In a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program, which when executed by a processor, implements the steps of the above-mentioned method.
Compared with the prior art, the embodiment of the disclosure has the following beneficial effects: because the embodiment of the present disclosure obtains the text database, wherein the text database includes a plurality of documents, each document including a plurality of sentences; performing word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result comprises a plurality of words; calculating an importance evaluation value of each word; sampling the text database for multiple times according to the importance evaluation value of each word to obtain a text database with enhanced data; the text classification training is carried out by utilizing the text database after data enhancement, so that the technical means can solve the problem that the text classification model trained based on the traditional text data enhancement method in the prior art is weak in generalization capability, and further improve the generalization capability of the text classification model.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without inventive efforts.
FIG. 1 is a scenario diagram of an application scenario of an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a text classification method based on text data enhancement according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a text classification device based on text data enhancement according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. It will be apparent, however, to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
A text classification method and apparatus based on text data enhancement according to an embodiment of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a scene schematic diagram of an application scenario of an embodiment of the present disclosure. The application scenario may include terminal devices 101, 102, and 103, server 104, and network 105.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices having a display screen and supporting communication with the server 104, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like; when the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices as above. The terminal devices 101, 102, and 103 may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not limited by the embodiments of the present disclosure. Further, various applications, such as data processing applications, instant messaging tools, social platform software, search-type applications, shopping-type applications, etc., may be installed on the terminal devices 101, 102, and 103.
The server 104 may be a server providing various services, for example, a backend server receiving a request sent by a terminal device establishing a communication connection with the server, and the backend server may receive and analyze the request sent by the terminal device and generate a processing result. The server 104 may be a server, may also be a server cluster composed of a plurality of servers, or may also be a cloud computing service center, which is not limited in this disclosure.
The server 104 may be hardware or software. When the server 104 is hardware, it may be various electronic devices that provide various services to the terminal devices 101, 102, and 103. When the server 104 is software, it may be multiple software or software modules providing various services for the terminal devices 101, 102, and 103, or may be a single software or software module providing various services for the terminal devices 101, 102, and 103, which is not limited by the embodiment of the present disclosure.
The network 105 may be a wired network connected by coaxial cable, twisted pair, or optical fiber, or may be a wireless network that interconnects communication devices without wiring, for example Bluetooth, Near-Field Communication (NFC), or infrared, which is not limited in the embodiment of the present disclosure.
A user can establish a communication connection with the server 104 via the network 105 through the terminal apparatuses 101, 102, and 103 to receive or transmit information or the like. It should be noted that the specific types, numbers and combinations of the terminal devices 101, 102 and 103, the server 104 and the network 105 may be adjusted according to the actual requirements of the application scenario, and the embodiment of the present disclosure does not limit this.
Fig. 2 is a schematic flowchart of a text classification method based on text data enhancement according to an embodiment of the present disclosure. The text classification method based on text data enhancement of fig. 2 may be performed by a terminal device or the server of fig. 1, or by software running on them. As shown in fig. 2, the text classification method based on text data enhancement includes:
s201, acquiring a text database, wherein the text database comprises a plurality of documents, and each document comprises a plurality of sentences;
s202, performing word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result comprises a plurality of words;
s203, calculating an importance evaluation value of each word;
s204, sampling the text database for multiple times according to the importance evaluation value of each word to obtain a data-enhanced text database;
and S205, performing text classification training by using the text database after data enhancement.
Topics or scenarios for text classification training include, but are not limited to, the following types: sentiment analysis (Sentiment Analysis), topic classification (Topic Labeling), question answering (Question Answering), intent recognition (Dialog Act Classification), and natural language inference (Natural Language Inference). The text database differs for different topics or scenarios. The word segmentation device can be any common word segmentation device, such as the jieba word segmentation device; the word segmentation process divides a sentence into a plurality of words.
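The acquisition and segmentation steps (S201-S202) can be sketched as follows; this is a minimal illustration only, in which the document IDs, the sample sentences, and the default whitespace tokenizer are hypothetical stand-ins (for Chinese text, `jieba.lcut` would typically be passed as the tokenizer):

```python
from typing import Callable, Dict, List

def segment_database(
    text_database: Dict[str, List[str]],
    tokenizer: Callable[[str], List[str]] = str.split,  # stand-in; e.g. jieba.lcut for Chinese
) -> Dict[str, List[List[str]]]:
    """Apply the tokenizer to every sentence of every document (steps S201-S202)."""
    return {
        doc_id: [tokenizer(sentence) for sentence in sentences]
        for doc_id, sentences in text_database.items()
    }

# Hypothetical two-document text database, one list of sentences per document.
db = {
    "doc1": ["the cat sat on the mat", "the dog barked"],
    "doc2": ["a cat and a dog"],
}
segmented = segment_database(db)
```

Each word segmentation result is then a list of words, ready for the importance evaluation of step S203.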
According to the technical scheme provided by the embodiment of the disclosure, a text database is obtained, wherein the text database comprises a plurality of documents, and each document comprises a plurality of sentences; performing word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result comprises a plurality of words; calculating an importance evaluation value of each word; sampling the text database for multiple times according to the importance evaluation value of each word to obtain a text database with enhanced data; the text classification training is carried out by utilizing the text database after data enhancement, so that the technical means can solve the problem that the text classification model trained based on the traditional text data enhancement method in the prior art is weak in generalization capability, and further improve the generalization capability of the text classification model.
In step S203, an importance evaluation value for each word is calculated, including: determining the word frequency of each word based on the occurrence frequency of each word in the word segmentation result of the word and the total number of words in the word segmentation result of the word; determining the inverse document rating of each term based on the number of the documents with each term and the total number of all the documents in the text database; an importance evaluation value of each word is determined based on the word frequency and the inverse document rating of each word.
It should be noted that calculating the importance evaluation value of each word is similar to TF-IDF, but not identical. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining; TF is the term frequency and IDF is the inverse document frequency. Determining the word frequency of each word in the disclosed embodiment differs from determining TF in the TF-IDF technique: TF in the TF-IDF technique is the frequency of occurrence of each word in a document, whereas in the disclosed embodiment the number of occurrences of each word in its word segmentation result may be divided by the total number of words in that word segmentation result to give the word frequency of the word. Determining the inverse document rating of each word in the embodiments of the present disclosure is similar to determining IDF in the TF-IDF technique. The importance evaluation value of each word is determined based on the word frequency and the inverse document rating of that word, for example as the product of the two.
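A minimal sketch of step S203 under these assumptions: the word frequency is computed within each word segmentation result as described above, while the exact form of the inverse document rating is not fixed by the disclosure, so a common smoothed variant, log((1+N)/(1+df))+1, is substituted here; the importance evaluation value is taken as their product:

```python
import math
from typing import Dict, List

def importance_scores(documents: List[List[List[str]]]) -> List[List[Dict[str, float]]]:
    """Per-word importance evaluation value (step S203): word frequency within the
    word segmentation result times a document-level inverse rating."""
    num_docs = len(documents)
    doc_freq: Dict[str, int] = {}  # number of documents in which each word appears
    for doc in documents:
        for word in {w for sentence in doc for w in sentence}:
            doc_freq[word] = doc_freq.get(word, 0) + 1

    scored = []
    for doc in documents:
        sentence_scores = []
        for sentence in doc:
            total = len(sentence)
            sentence_scores.append({
                w: (sentence.count(w) / total)  # word frequency within this segmentation result
                   * (math.log((1 + num_docs) / (1 + doc_freq[w])) + 1)  # assumed smoothed inverse rating
                for w in sentence
            })
        scored.append(sentence_scores)
    return scored

# Toy corpus: each document is a list of tokenized sentences.
docs = [
    [["cat", "sat"]],  # document 1
    [["cat", "ran"]],  # document 2
]
scores = importance_scores(docs)
```

With this assumed inverse rating, a word appearing in fewer documents receives a higher importance evaluation value than a corpus-wide word with the same word frequency.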
In step S204, the text database is sampled multiple times according to the importance evaluation value of each word to obtain a data-enhanced text database, including: sampling the text database multiple times as follows: replacing words whose importance evaluation value is smaller than a target threshold with masks with a first probability, and retaining the original values of the words that are not replaced with masks.
For example, the sentence "上海自来水来自海上" ("Shanghai's tap water comes from the sea") is segmented into five words, whose importance evaluation values are 0.7, 0.63, 0.5, 0.42, and 0.17, respectively. With a target threshold of 0.6, the importance evaluation values of the last three words are less than the target threshold, so each of those three words is replaced with the mask with the first probability, and the words not replaced with the mask keep their original values. With a first probability of 0.8, two of the three low-importance words end up replaced with masks in this pass, and the other three words retain their original values. The data-enhanced text database obtained by replacing part of the words with masks can improve the anti-interference capability of the finally trained model.
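One sampling pass of step S204 can be sketched as follows, using the numbers from the example above (target threshold 0.6, first probability 0.8); the `[MASK]` token, the placeholder words, and the seeded random generator are illustrative assumptions:

```python
import random
from typing import List

def mask_sample(
    tokens: List[str],
    scores: List[float],
    target_threshold: float,
    first_probability: float,
    rng: random.Random,
    mask_token: str = "[MASK]",  # assumed mask symbol
) -> List[str]:
    """One sampling pass of step S204: each word whose importance evaluation value
    is below the target threshold is replaced by the mask with the first
    probability; all other words keep their original value."""
    return [
        mask_token if s < target_threshold and rng.random() < first_probability else tok
        for tok, s in zip(tokens, scores)
    ]

rng = random.Random(0)
tokens = ["w1", "w2", "w3", "w4", "w5"]          # placeholder words
scores = [0.7, 0.63, 0.5, 0.42, 0.17]            # values from the worked example
# Sampling the sentence multiple times yields multiple augmented variants.
variants = [mask_sample(tokens, scores, 0.6, 0.8, rng) for _ in range(3)]
```

Because masking is probabilistic, repeated sampling of the same sentence produces different enhanced variants while the high-importance words are always preserved.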
In addition, model training often employs a pre-trained model because in order to make the pre-trained model suitable for most scenarios, part of the data in the training data set is often obscured when training the pre-trained model. The text database after data enhancement obtained by replacing partial words with the mask is closer to the input form of the pre-training model in form, so that semantic knowledge learned in the pre-training stage of the pre-training model is more effectively utilized, meanwhile, the main information of sentences is kept when data enhancement is considered, the main meaning of the sentences is kept to a greater extent, the anti-interference capability of the model and the extraction capability of the main information are improved to a certain extent, and the generalization capability of the model is further improved.
In step S205, text classification training is performed by using the data-enhanced text database, including: acquiring a training task of the text classification training; labeling a label for each word in the data-enhanced text database based on the training task; and performing text classification training by using the label-annotated text database.
The training tasks are constructed based on the topics and scenarios of the text classification training, and the text database and the labels of the words in it differ for different training tasks. Examples of labels in different training tasks: labels in a sentiment analysis training task: positive, negative, neutral; labels in a topic classification training task: finance, sports, military, society; labels in a question answering training task: yes, no; labels in an intent recognition training task: weather inquiry, song search, casual chat; labels in a natural language inference training task: entailment, contradiction, neutral.
Replacing words whose importance evaluation values are smaller than the target threshold value with masks with a first probability, and before leaving the original values of the words that are not replaced with masks, the method further includes: obtaining a model generalization index of text classification training; and adjusting the target threshold value and the first probability according to the model generalization index.
The model generalization index represents the actual generalization ability of the model, and if the model generalization index is low, the target threshold and the first probability should be increased.
In step S205, a text classification training is performed by using the data-enhanced text database, including: carrying out covering treatment on words in the text database after the data enhancement according to the second probability; carrying out text classification training by using the text database subjected to the masking treatment; wherein the second probability is determined by the model generalization index, the target threshold, and the first probability.
Replacing the words whose importance evaluation values are smaller than the target threshold with masks with the first probability alone may fail to satisfy the model generalization index; in that case, the words in the data-enhanced text database are additionally covered with the second probability so as to further satisfy the model generalization index. The covering process is the same as replacing words with masks.
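The additional covering pass can be sketched as follows; how the second probability is derived from the model generalization index, the target threshold, and the first probability is not specified by the disclosure, so it is passed in as a plain parameter here, and the `[MASK]` token is again an assumed placeholder:

```python
import random
from typing import List

def second_pass_mask(
    tokens: List[str],
    second_probability: float,
    rng: random.Random,
    mask_token: str = "[MASK]",
) -> List[str]:
    """Additional covering pass: every still-unmasked word in the data-enhanced
    text is masked with the second probability; already-masked words stay masked."""
    return [
        mask_token if tok != mask_token and rng.random() < second_probability else tok
        for tok in tokens
    ]

rng = random.Random(1)
enhanced = ["w1", "w2", "[MASK]", "w4", "[MASK]"]  # output of the first sampling pass
covered = second_pass_mask(enhanced, 0.3, rng)
```

A second probability of 0.0 leaves the enhanced text unchanged, so the pass degrades gracefully when the first pass already satisfies the generalization index.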
In an alternative embodiment, the method comprises: performing text classification on texts in various fields by using a model corresponding to text classification training, and determining the accuracy, precision and recall rate of the model according to the text classification; and determining a model generalization index according to the accuracy, precision and recall rate of the model.
To determine the model generalization index, the performance of the model is actually tested with a test data set, yielding the accuracy, precision, and recall of the model, from which the model generalization index is determined. Precision describes the number of samples that the model predicts as positive and that are actually positive as a proportion of the total number of samples the model predicts as positive. Recall describes the number of samples that the model predicts as positive and that are actually positive as a proportion of the total number of actually positive samples in the test data set. Accuracy describes the number of samples the model predicts correctly (both positive and negative cases) as a proportion of the total number of samples, i.e., the correctness of the model's predictions.
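A minimal sketch of computing the three metrics from a binary test set, using their standard definitions; combining them into a single model generalization index is not fixed by the disclosure, so the unweighted mean is used purely as an illustrative assumption:

```python
from typing import List, Tuple

def classification_metrics(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Binary accuracy, precision, and recall (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

def generalization_index(accuracy: float, precision: float, recall: float) -> float:
    """Illustrative combination only -- the disclosure does not fix the formula."""
    return (accuracy + precision + recall) / 3

# Hypothetical test-set labels and model predictions.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc, prec, rec = classification_metrics(y_true, y_pred)
```

On this toy set there are 2 true positives, 1 false positive, and 1 false negative, giving accuracy 4/6 and precision and recall of 2/3 each.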
In an optional embodiment, the text classification method based on text data enhancement comprises the following steps: acquiring a text database, wherein the text database comprises a plurality of documents, and each document comprises a plurality of sentences; performing word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result comprises a plurality of words; determining the word frequency of each word based on the occurrence frequency of each word in the word segmentation result of the word and the total number of words in the word segmentation result of the word; determining an inverse document rating for each term based on the number of documents in which each term occurs and the total number of all documents in the text database; determining an importance evaluation value of each word based on the word frequency and the inverse document evaluation rate of each word; obtaining a model generalization index of the text classification training; adjusting the target threshold and the first probability according to the model generalization index; sampling the text database for multiple times according to the following mode to obtain a text database after data enhancement: replacing the words with the importance evaluation values smaller than the target threshold value with masks according to a first probability, and keeping original values of the words which are not replaced with the masks; acquiring a training task of the text classification training; labeling a label of each word in the text database after the data enhancement based on the training task; and performing text classification training by using the text database labeled with the labels.
According to the technical scheme provided by the embodiment of the disclosure, a text database is obtained, wherein the text database comprises a plurality of documents, and each document comprises a plurality of sentences; performing word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result comprises a plurality of words; calculating an importance evaluation value of each word; sampling the text database for multiple times according to the importance evaluation value of each word to obtain a text database after data enhancement; the text classification training is carried out by utilizing the text database after data enhancement, so that the technical means can solve the problem that the text classification model trained based on the traditional text data enhancement method is weak in generalization capability in the prior art, and further improve the generalization capability of the text classification model.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present disclosure, and are not described herein again.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic diagram of a text classification device based on text data enhancement according to an embodiment of the present disclosure. As shown in fig. 3, the text classification apparatus based on text data enhancement includes:
an obtaining module 301 configured to obtain a text database, where the text database includes a plurality of documents, and each document includes a plurality of sentences;
a word segmentation module 302 configured to perform word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result includes a plurality of words;
a calculation module 303 configured to calculate an importance evaluation value of each word;
the sampling module 304 is configured to sample the text database for multiple times according to the importance evaluation value of each word, so as to obtain a data-enhanced text database;
and a training module 305 configured to perform text classification training by using the data-enhanced text database.
Topics or scenarios for text classification training include, but are not limited to, the following types: sentiment analysis (Sentiment Analysis), topic classification (Topic Labeling), question answering (Question Answering), intent recognition (Dialog Act Classification), and natural language inference (Natural Language Inference). The text database differs for different topics or scenarios. The word segmentation device can be any common word segmentation device, such as the jieba word segmentation device; the word segmentation process divides a sentence into a plurality of words.
According to the technical scheme provided by the embodiment of the disclosure, a text database is obtained, wherein the text database comprises a plurality of documents, and each document comprises a plurality of sentences; performing word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result comprises a plurality of words; calculating an importance evaluation value of each word; sampling the text database for multiple times according to the importance evaluation value of each word to obtain a text database with enhanced data; the text classification training is carried out by utilizing the text database after data enhancement, so that the technical means can solve the problem that the text classification model trained based on the traditional text data enhancement method in the prior art is weak in generalization capability, and further improve the generalization capability of the text classification model.
Optionally, the calculating module 303 is further configured to determine a word frequency of each word based on the number of times that each word appears in the word segmentation result of the word and the total number of words of the word segmentation result of the word; determining the inverse document rating of each term based on the number of documents in which each term appears and the total number of all documents in the text database; an importance evaluation value of each word is determined based on the word frequency and the inverse document rating of each word.
It should be noted that calculating the importance evaluation value of each word is similar to TF-IDF, but not identical. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining; TF is the term frequency and IDF is the inverse document frequency. Determining the word frequency of each word in the disclosed embodiment differs from determining TF in the TF-IDF technique: TF in the TF-IDF technique is the frequency of occurrence of each word in a document, whereas in the disclosed embodiment the number of occurrences of each word in its word segmentation result may be divided by the total number of words in that word segmentation result to give the word frequency of the word. Determining the inverse document rating of each word in the embodiments of the present disclosure is similar to determining IDF in the TF-IDF technique. The importance evaluation value of each word is determined based on the word frequency and the inverse document rating of that word, for example as the product of the two.
Optionally, the sampling module 304 is further configured to sample the text database multiple times as follows: words whose importance evaluation value is smaller than a target threshold are replaced with a mask with a first probability, and words that are not replaced with a mask retain their original values.
For example, for the sentence "Shanghai tap water comes from the sea", word segmentation yields the words "Shanghai", "tap water", "comes from", "the sea" and "above", with importance evaluation values of 0.7, 0.63, 0.5, 0.42 and 0.17, respectively. If the target threshold is 0.6, the importance evaluation values of the three words "comes from", "the sea" and "above" are less than the target threshold, so each of these three words is replaced with the mask with the first probability, while words not replaced with the mask retain their original values. With a first probability of 0.8, in this instance the two words "the sea" and "above" end up replaced with masks, while "Shanghai", "tap water" and "comes from" retain their original values. The data-enhanced text database obtained by replacing some words with masks can improve the anti-interference capability of the finally trained model.
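The sampling step described above can be sketched as a small function. This is a hedged illustration: the mask token string, the default threshold and probability, and the function names are assumptions for the example, not values fixed by the disclosure.

```python
import random

MASK = "[MASK]"  # assumed mask token, matching common pre-trained models

def sample_once(segmented_sentence, scores,
                target_threshold=0.6, first_probability=0.8, rng=random):
    """One sampling pass: a word whose importance evaluation value is
    below the target threshold is replaced with the mask with the first
    probability; every other word keeps its original value."""
    out = []
    for word in segmented_sentence:
        if scores[word] < target_threshold and rng.random() < first_probability:
            out.append(MASK)
        else:
            out.append(word)
    return out

def augment(segmented_sentence, scores, n_samples=5, **kwargs):
    """Sample the same sentence several times to build the
    data-enhanced text database."""
    return [sample_once(segmented_sentence, scores, **kwargs)
            for _ in range(n_samples)]
```

Because each low-importance word is masked independently, repeated sampling of the same sentence produces different masked variants, which is what enlarges the training data.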
In addition, model training often employs a pre-trained model, and to make a pre-trained model suitable for most scenarios, part of the data in its training data set is usually masked during pre-training. The data-enhanced text database obtained by replacing some words with masks is therefore closer in form to the input of the pre-trained model, so the semantic knowledge learned in the pre-training stage is used more effectively. At the same time, the data enhancement preserves the main information of each sentence, so the main meaning of the sentence is retained to a greater extent. This improves the anti-interference capability of the model and its ability to extract main information to a certain extent, and thus further improves the generalization capability of the model.
Optionally, the training module 305 is further configured to obtain a training task of the text classification training; label each word in the data-enhanced text database based on the training task; and perform the text classification training by using the labeled text database.
The training task is constructed based on the subject and scene of the text classification training; different training tasks use different text databases and different labels for each word. Examples of labels in different training tasks: labels in an emotion analysis task: positive, negative, neutral; labels in a topic classification task: finance, sports, military, society; labels in a question answering task: yes, no; labels in an intent recognition task: weather query, song search, casual chat; labels in a natural language inference task: entailment, contradiction, neutral.
Optionally, the training module 305 is further configured to obtain a model generalization index of the text classification training, and adjust the target threshold and the first probability according to the model generalization index.
The model generalization index represents the actual generalization capability of the model; if the model generalization index is low, the target threshold and the first probability should be increased.
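The adjustment rule above can be sketched as a simple feedback step. The disclosure does not specify a step size, an update rule, or bounds, so the values below (a fixed 0.05 step, clipping to 1.0, and a `desired_index` target) are illustrative assumptions.

```python
def adjust(target_threshold, first_probability,
           generalization_index, desired_index, step=0.05):
    """If the measured model generalization index falls short of the
    desired value, raise the target threshold and the first probability
    so that more words become candidates for masking and candidates are
    masked more often; otherwise leave both parameters unchanged."""
    if generalization_index < desired_index:
        target_threshold = min(target_threshold + step, 1.0)
        first_probability = min(first_probability + step, 1.0)
    return target_threshold, first_probability
```

Raising the target threshold enlarges the set of maskable words, and raising the first probability masks each of them more often; both changes make the enhanced data noisier and push the model toward stronger generalization.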
Optionally, the training module 305 is further configured to perform masking processing on words in the data-enhanced text database according to a second probability, and perform the text classification training by using the text database subjected to the masking processing, where the second probability is determined by the model generalization index, the target threshold, and the first probability.
It is possible that replacing words whose importance evaluation value is smaller than the target threshold with masks at the first probability still fails to meet the model generalization index. In that case, words in the data-enhanced text database are additionally masked at a second probability so as to further meet the model generalization index. This masking processing is the same as replacing words with masks.
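The second masking pass can be sketched as follows. The disclosure states that the second probability is determined by the model generalization index, the target threshold, and the first probability, but gives no formula, so this sketch takes the second probability as a ready-made input; the function name and the choice to leave already-masked positions untouched are assumptions for the example.

```python
import random

MASK = "[MASK]"  # assumed mask token, same as in the first sampling pass

def second_pass_mask(sentence, second_probability, rng=random):
    """Additionally mask words in a data-enhanced sentence with the
    second probability; positions already holding the mask are left
    untouched, so the two passes compound rather than overlap."""
    return [MASK if w != MASK and rng.random() < second_probability else w
            for w in sentence]
```

The pass operates on the output of the first, importance-based sampling, so it adds uniform noise on top of the targeted masking when the generalization index is still not met.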
Optionally, the training module 305 is further configured to use the model corresponding to the text classification training to classify texts in a plurality of fields; determine the accuracy, precision, and recall of the model according to the classification results; and determine the model generalization index according to the accuracy, precision, and recall of the model.
Determining the model generalization index means actually testing the performance of the model with a test data set to obtain the accuracy, precision, and recall of the model, and then determining the model generalization index from them. Precision is the number of samples the model predicts as positive that are actually positive, as a proportion of all samples the model predicts as positive. Recall is the number of samples the model predicts as positive that are actually positive, as a proportion of all actually positive samples in the test data set. Accuracy is the proportion of correctly predicted samples (true positives and true negatives) among all samples, that is, the overall correctness of the model's predictions.
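The three metrics can be computed from a binary confusion matrix as below. The disclosure does not give a formula for combining them into the model generalization index, so no combination is shown; only the standard metric definitions restated above are implemented.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from a binary confusion matrix.

    precision: of the samples predicted positive, the fraction that
               are actually positive;
    recall:    of the actually positive samples, the fraction the
               model predicted positive;
    accuracy:  correctly predicted samples (true positives and true
               negatives) over all samples.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall
```

In the multi-field setting described above, these metrics would be computed per field on the test data set and then aggregated into the generalization index in some manner the disclosure leaves open.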
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure.
Fig. 4 is a schematic diagram of an electronic device 4 provided by an embodiment of the present disclosure. As shown in fig. 4, the electronic device 4 of this embodiment includes: a processor 401, a memory 402, and a computer program 403 stored in the memory 402 and executable on the processor 401. When the processor 401 executes the computer program 403, the steps in the method embodiments described above are implemented; alternatively, the functions of the modules/units in the apparatus embodiments described above are implemented.
The electronic device 4 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another electronic device. The electronic device 4 may include, but is not limited to, the processor 401 and the memory 402. Those skilled in the art will appreciate that fig. 4 is merely an example of the electronic device 4 and does not constitute a limitation of the electronic device 4, which may include more or fewer components than shown, or different components.
The processor 401 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device 4. The memory 402 may also include both an internal storage unit and an external storage device of the electronic device 4. The memory 402 is used to store the computer program and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the methods in the above embodiments by instructing related hardware through a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the above method embodiments may be implemented. The computer program may comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be subject to suitable additions or deletions according to legislative and patent practice within a jurisdiction; for example, in some jurisdictions, computer-readable media may not include electrical carrier signals or telecommunications signals.
The above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present disclosure, and are intended to be included within the scope of the present disclosure.

Claims (10)

1. A text classification method based on text data enhancement is characterized by comprising the following steps:
acquiring a text database, wherein the text database comprises a plurality of documents, and each document comprises a plurality of sentences;
performing word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result comprises a plurality of words;
calculating an importance evaluation value of each word;
sampling the text database for multiple times according to the importance evaluation value of each word to obtain a text database after data enhancement;
and performing text classification training by using the data-enhanced text database.
2. The method of claim 1, wherein said calculating an importance assessment value for each word comprises:
determining the word frequency of each word based on the occurrence frequency of each word in the word segmentation result of the word and the total number of words in the word segmentation result of the word;
determining an inverse document rating for each term based on the number of documents in which each term occurs and the total number of all documents in the text database;
an importance evaluation value of each word is determined based on the word frequency and the inverse document rating of each word.
3. The method of claim 1, wherein the sampling the text database multiple times according to the importance evaluation value of each word to obtain a data-enhanced text database comprises:
the text database is sampled a plurality of times as follows:
words of which the importance evaluation value is smaller than the target threshold value are replaced with masks with a first probability, and words not replaced with masks are kept as original values.
4. The method of claim 1, wherein the text classification training using the data-enhanced text database comprises:
acquiring a training task of the text classification training;
labeling a label of each word in the text database after the data enhancement based on the training task;
and performing text classification training by using the text database labeled with the labels.
5. The method of claim 3, wherein the words whose importance estimates are less than a target threshold are replaced with masks with a first probability, and words that are not replaced with masks remain as they are, the method further comprising:
obtaining a model generalization index of the text classification training;
and adjusting the target threshold value and the first probability according to the model generalization index.
6. The method of claim 1, wherein the text classification training using the data-enhanced text database comprises:
performing masking processing on words in the data-enhanced text database according to a second probability;
performing the text classification training by using the text database subjected to the masking processing;
wherein the second probability is determined by a model generalization index, a target threshold, and the first probability.
7. The method of claim 5 or 6, comprising:
carrying out text classification on texts in various fields by using a model corresponding to the text classification training, and determining the accuracy, precision and recall rate of the model according to the text classification;
and determining the model generalization index according to the accuracy, precision and recall rate of the model.
8. A text classification apparatus based on text data enhancement, comprising:
an obtaining module configured to obtain a text database, wherein the text database comprises a plurality of documents, and each document comprises a plurality of sentences;
the word segmentation module is configured to perform word segmentation processing on each sentence by using a word segmentation device to obtain a word segmentation result corresponding to each sentence, wherein each word segmentation result comprises a plurality of words;
a calculation module configured to calculate an importance evaluation value for each word;
the sampling module is configured to sample the text database for multiple times according to the importance evaluation value of each word, so as to obtain a data-enhanced text database;
and the training module is configured to perform text classification training by using the data-enhanced text database.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202211255742.XA 2022-10-13 2022-10-13 Text classification method and device based on text data enhancement Pending CN115563281A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211255742.XA CN115563281A (en) 2022-10-13 2022-10-13 Text classification method and device based on text data enhancement


Publications (1)

Publication Number Publication Date
CN115563281A true CN115563281A (en) 2023-01-03

Family

ID=84745934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211255742.XA Pending CN115563281A (en) 2022-10-13 2022-10-13 Text classification method and device based on text data enhancement

Country Status (1)

Country Link
CN (1) CN115563281A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127925A (en) * 2023-04-07 2023-05-16 北京龙智数科科技服务有限公司 Text data enhancement method and device based on destruction processing of text
CN116127925B (en) * 2023-04-07 2023-08-29 北京龙智数科科技服务有限公司 Text data enhancement method and device based on destruction processing of text

Similar Documents

Publication Publication Date Title
CN110442712B (en) Risk determination method, risk determination device, server and text examination system
CN107193974B (en) Regional information determination method and device based on artificial intelligence
CN109284502B (en) Text similarity calculation method and device, electronic equipment and storage medium
CN111339295A (en) Method, apparatus, electronic device and computer readable medium for presenting information
CN113011169B (en) Method, device, equipment and medium for processing conference summary
CN117131281B (en) Public opinion event processing method, apparatus, electronic device and computer readable medium
CN116882372A (en) Text generation method, device, electronic equipment and storage medium
CN115840808B (en) Technological project consultation method, device, server and computer readable storage medium
CN111915086A (en) Abnormal user prediction method and equipment
CN113407814A (en) Text search method and device, readable medium and electronic equipment
CN116108149A (en) Intelligent question-answering method, device, equipment, medium and product thereof
CN112052297A (en) Information generation method and device, electronic equipment and computer readable medium
CN116402166A (en) Training method and device of prediction model, electronic equipment and storage medium
CN115563281A (en) Text classification method and device based on text data enhancement
CN116933800B (en) Template-based generation type intention recognition method and device
CN110377706B (en) Search sentence mining method and device based on deep learning
CN108268443A (en) It determines the transfer of topic point and obtains the method, apparatus for replying text
CN112487188A (en) Public opinion monitoring method and device, electronic equipment and storage medium
CN116010606A (en) Training method and device for text auditing model and text auditing method and device
CN110634024A (en) User attribute marking method and device, electronic equipment and storage medium
CN116431912A (en) User portrait pushing method and device
CN116108810A (en) Text data enhancement method and device
CN115098665A (en) Method, device and equipment for expanding session data
CN114943590A (en) Object recommendation method and device based on double-tower model
CN113420723A (en) Method and device for acquiring video hotspot, readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination