WO2022160449A1 - Text classification method and apparatus, electronic device and storage medium - Google Patents

Text classification method and apparatus, electronic device and storage medium

Info

Publication number
WO2022160449A1
WO2022160449A1 PCT/CN2021/083560 CN2021083560W WO2022160449A1 WO 2022160449 A1 WO2022160449 A1 WO 2022160449A1 CN 2021083560 W CN2021083560 W CN 2021083560W WO 2022160449 A1 WO2022160449 A1 WO 2022160449A1
Authority
WO
WIPO (PCT)
Prior art keywords
classification
model
text
confidence
label
Prior art date
Application number
PCT/CN2021/083560
Other languages
English (en)
Chinese (zh)
Inventor
谢馥芯
王磊
陈又新
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022160449A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present application relates to the technical field of natural language processing, and in particular, to a text classification method, apparatus, electronic device, and computer-readable storage medium.
  • a text classification method provided by this application includes:
  • the processed text is input into the multi-model structure classification voting model, and the processed text is classified by a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, where the first confidence space includes a first confidence that the processed text belongs to a first classification label;
  • the classification label to which the text to be classified belongs and the classification confidence degree corresponding to the classification label are determined.
  • the present application also provides a text classification device, the device comprising:
  • a model acquisition module used for acquiring a multi-model structure classification voting model and a multi-task classification model, the multi-model structure classification voting model and the multi-task classification model are obtained through a pre-built classification model and a training sample set;
  • a text preprocessing module used to obtain the text to be classified, and preprocess the text to be classified to obtain the processed text;
  • a first model analysis module used to input the processed text into the multi-model structure classification voting model, and classify the processed text through a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, where the first confidence space includes a first confidence that the processed text belongs to a first classification label;
  • a second model analysis module configured to input the processed text into the multi-task classification model, and obtain a second confidence space by classifying the processed text with the multi-task classification model, where the second confidence space includes a second confidence that the processed text belongs to a second classification label;
  • a result processing module configured to determine, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  • the present application also provides an electronic device, the electronic device comprising:
  • the memory stores computer program instructions executable by the at least one processor, the computer program instructions being executed by the at least one processor to enable the at least one processor to perform the steps of:
  • the processed text is input into the multi-model structure classification voting model, and the processed text is classified by a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, where the first confidence space includes a first confidence that the processed text belongs to a first classification label;
  • the classification label to which the text to be classified belongs and the classification confidence degree corresponding to the classification label are determined.
  • the present application also provides a computer-readable storage medium, including a storage data area and a storage program area, where the storage data area stores created data and the storage program area stores a computer program; the computer program implements the following steps when executed by a processor:
  • the processed text is input into the multi-model structure classification voting model, and the processed text is classified by a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, where the first confidence space includes a first confidence that the processed text belongs to a first classification label;
  • the classification label to which the text to be classified belongs and the classification confidence degree corresponding to the classification label are determined.
  • FIG. 1 is a schematic flowchart of a text classification method provided by an embodiment of the present application.
  • FIG. 2 is a schematic block diagram of a text classification apparatus according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the internal structure of an electronic device implementing a text classification method provided by an embodiment of the present application.
  • the embodiment of the present application provides a text classification method.
  • the execution subject of the text classification method includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server and a terminal.
  • the text classification method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • the text classification method includes:
  • the multi-model structure classification voting model is obtained by training multiple base models through a pre-built training sample set, and then performing performance ranking and weight setting on the output results of each base model.
  • before the multi-model structure classification voting model and the multi-task classification model are acquired, the method further includes:
  • the multi-model structure classification voting model is constructed using the plurality of text classification models.
  • the classification model is a BERT model.
  • before acquiring the training sample set, the method includes: acquiring a pre-built corpus set, and performing quantization and cleaning operations on the corpus set to obtain the training sample set.
  • the corpus set consists of texts that have previously been classified, or corpus texts of known types collected from the network.
  • a quantization operation is performed on the corpus set to obtain quantified data
  • a cleaning operation is performed on the quantified data to obtain the training sample set.
  • the quantization operation includes converting data of the float32 type in the corpus set into the uint8 data type suitable for training a text classification model; the cleaning operation includes deduplicating the quantized data and filling empty values, as sketched below.
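  • As a minimal illustrative sketch of the quantization and cleaning operations above (the per-column handling, the scaling into [0, 255] before the uint8 cast, and the pandas-based cleaning are assumptions, since the patent does not fix an implementation):

```python
import numpy as np
import pandas as pd

def quantize_and_clean(corpus: pd.DataFrame) -> pd.DataFrame:
    """Quantize float32 columns to uint8, then deduplicate and fill empty values."""
    out = corpus.copy()
    for col in out.select_dtypes(include=["float32"]).columns:
        v = out[col].to_numpy(dtype=np.float32)
        span = float(v.max() - v.min())
        # Scale into [0, 255] before casting; a raw float32 -> uint8 cast
        # would clip or wrap values (the scaling step is an assumption).
        out[col] = ((255 * (v - v.min()) / span).astype(np.uint8)
                    if span else np.zeros(len(v), dtype=np.uint8))
    out = out.drop_duplicates()  # deduplicate the quantized data
    return out.fillna("")        # fill empty values
```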
  • the random forest algorithm is an ensemble learning algorithm for classification.
  • the random forest algorithm is used to randomly select, with replacement, 25% of the data from the training sample set a preset number Q of times to train the text classification model, thereby obtaining Q text classification models.
  • the preset value Q is 5.
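  • A hedged sketch of this sampling-and-training step (the `train_fn` callback stands in for fine-tuning one BERT-based classifier and is a hypothetical placeholder):

```python
import random

Q = 5               # preset number of base models
SAMPLE_FRAC = 0.25  # fraction of the training sample set drawn per model

def train_base_models(train_samples: list, train_fn) -> list:
    """Train Q base classifiers on bootstrap subsamples drawn with replacement,
    in the spirit of the random-forest-style procedure described above."""
    k = max(1, int(len(train_samples) * SAMPLE_FRAC))
    models = []
    for _ in range(Q):
        subset = random.choices(train_samples, k=k)  # 25% drawn with replacement
        models.append(train_fn(subset))
    return models
```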
  • the construction of the multi-model structure classification voting model by using the multiple text classification models includes:
  • weights are set for the multiple text classification models according to preset weight gradient values, so as to obtain the multi-model structure classification voting model.
  • the confidence formula of the multi-model structure classification voting model is $y(x) = \sum_{q=1}^{Q} p_q \, y_q(x)$, where $p_q$ is the weight of the q-th text classification model and $y_q(x)$ is the confidence result of the q-th text classification model.
  • for example, the model testing samples are texts of a known type. Suppose the analysis results of the 5 text classification models are [Model 1: negative emotion class, confidence 90%; Model 2: negative emotion class, confidence 86%; Model 3: negative emotion class, confidence 96%; Model 4: negative emotion class, confidence 82%; Model 5: negative emotion class, confidence 79%]. The 5 text classification models are ranked by confidence to obtain the base model ranking [Model 3; Model 1; Model 2; Model 4; Model 5], and the weights are allocated as [Model 3: weight 0.3; Model 1: weight 0.25; Model 2: weight 0.2; Model 4: weight 0.15; Model 5: weight 0.1]. According to these weights, the five text classification models are combined to obtain the multi-model structure classification voting model, as sketched below.
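  • A minimal sketch of this ranking-and-weighting step, assuming `eval_confidence` is a hypothetical helper that scores one base model on the testing samples:

```python
WEIGHT_GRADIENT = [0.3, 0.25, 0.2, 0.15, 0.1]  # preset weight gradient values

def build_voting_model(models: list, eval_confidence) -> list:
    """Rank the base models by their confidence on the testing samples and
    pair each with a preset gradient weight, best model first."""
    ranked = sorted(models, key=eval_confidence, reverse=True)
    return list(zip(ranked, WEIGHT_GRADIENT))
```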
  • before the multi-model structure classification voting model and the multi-task classification model are acquired, the method further includes:
  • the optimized classification model is trained by using the sentence vectors, and the multi-task classification model is obtained when the descending gradient of the improved loss of the optimized classification model becomes smaller than a preset loss threshold within a preset number of training steps.
  • in addition to the corpus, the training sample set includes standard sentences of different types, where N is the number of standard sentences in the training sample set and each standard sentence in the training sample represents one type; the pre-built similarity loss measures the distance between the sentence vector of a specified corpus text and the sentence vector x_j of each standard sentence, and the improvement loss is obtained by combining the classification loss with this similarity loss.
  • the confidence calculation formula of the classification label to which each corpus text belongs is $p_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$, where $z_j$ is the classification result of the j-th short sentence in the corpus, that is, the score of the classification label described by the j-th short sentence, and K is the number of classification results.
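  • A short sketch of this confidence calculation, assuming the formula above is the standard softmax over the K classification scores:

```python
import numpy as np

def label_confidence(z: np.ndarray) -> np.ndarray:
    """Softmax over the classification scores z_1..z_K."""
    z = z - z.max()        # subtract the max to stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()
```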
  • the embodiment of the present application can continuously train and optimize the classification model by using the sentence vectors through a twin network. The improvement loss is continuously minimized; when the decrease of the improvement loss of the optimized classification model falls below the preset loss threshold, the training process is stopped and the multi-task classification model is obtained.
  • a pre-built recall engine may be used to obtain the text to be classified from the Internet or a local storage space.
  • the S2 includes:
  • punctuation-based segmentation or sentence-length segmentation is performed on the text to be classified to obtain the processed text.
  • the text to be classified is segmented according to punctuation, that is, the text to be classified is divided at punctuation marks; when the volume of the text to be classified is greater than 512 characters, the text to be classified is segmented by sentence length, for example, divided into processed texts whose volume is each less than 512 characters (a sketch follows below).
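  • A minimal sketch of this preprocessing step (the punctuation set and the deterministic length cut are assumptions; the patent mentions a random division):

```python
import re

MAX_LEN = 512  # character budget per processed text

def preprocess(text: str) -> list[str]:
    """Split the text to be classified at punctuation, then cut any piece
    that is still longer than MAX_LEN characters by sentence length."""
    pieces = [p.strip() for p in re.split(r"[。！？!?；;.]", text) if p.strip()]
    chunks = []
    for piece in pieces:
        while len(piece) > MAX_LEN:
            chunks.append(piece[:MAX_LEN])
            piece = piece[MAX_LEN:]
        if piece:
            chunks.append(piece)
    return chunks
```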
  • specifically, each base model in the multi-model structure classification voting model is used to classify the processed text to obtain a type result and a confidence level corresponding to the type result; the confidence levels generated by the five models are then combined through the weights to obtain the first classification label and the first confidence level of the processed text.
  • for example, if the confidence levels obtained by the processed text through the five models are [0.8, 0.9, 0.6, 0.5, 0.7] and the weights of the five models are [0.25, 0.2, 0.3, 0.15, 0.1], the first confidence level is 0.8×0.25 + 0.9×0.2 + 0.6×0.3 + 0.5×0.15 + 0.7×0.1 = 0.705.
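  • The weighted combination itself is a one-line sum; a sketch reproducing the worked example above:

```python
def first_confidence(confidences: list[float], weights: list[float]) -> float:
    """Weighted vote y(x) = sum_q p_q * y_q(x) over the Q base models."""
    return sum(c * w for c, w in zip(confidences, weights))

print(first_confidence([0.8, 0.9, 0.6, 0.5, 0.7],
                       [0.25, 0.2, 0.3, 0.15, 0.1]))  # ≈ 0.705
```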
  • the multi-task classification model includes a classification task and a similarity task.
  • the multi-task classification model is used to analyze the processed text to obtain a set of similarities between the processed text and each type of standard sentence, together with a confidence set corresponding to the similarities; the similarity set is then filtered, the type corresponding to the standard sentence with the highest similarity to the processed text is taken as the second classification label of the processed text, and the confidence set is queried according to the second classification label to obtain the second confidence level (see the sketch below).
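  • A hedged sketch of the similarity task (cosine similarity is an assumption; the patent only states that the similarity set is filtered for the standard sentence with the highest similarity):

```python
import numpy as np

def second_label(text_vec, standard_vecs, labels, confidences):
    """Return the type of the most similar standard sentence and, from the
    confidence set, the confidence queried for that second classification label."""
    sims = [float(np.dot(text_vec, s) /
                  (np.linalg.norm(text_vec) * np.linalg.norm(s)))
            for s in standard_vecs]
    best = int(np.argmax(sims))
    return labels[best], confidences[best]
```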
  • S5. Determine, according to the first confidence space and the second confidence space, a classification label to which the text to be classified belongs and a classification confidence corresponding to the classification label.
  • the classification label to which the text to be classified belongs can be determined according to a confidence threshold. For example, a confidence greater than a threshold such as 0.8 (or, alternatively, such as 0.5) is selected from the first confidence and the second confidence, and the classification label corresponding to that confidence is taken as the classification result.
  • the S5 includes:
  • if the first classification label is the same as the second classification label, it is determined that the classification label to which the text to be classified belongs is the first classification label (equivalently, the second classification label), and the confidence corresponding to the classification label is the average of the first confidence and the second confidence.
  • for example, if the multi-model structure classification voting model outputs the prediction (label 1, confidence 0.8) and the multi-task classification model outputs the prediction (label 1, confidence 0.7), the two predicted labels are the same, so the confidences are added and averaged; the type of the text to be classified is determined as label 1 with confidence 0.75, and the output is (label 1, confidence 0.75).
  • the S5 further includes:
  • if the first confidence level is greater than the second confidence level, it is determined that the category label to which the text to be classified belongs is the first category label, and the confidence level corresponding to the category result is the first confidence level multiplied by a first coefficient;
  • if the first confidence level is not greater than the second confidence level, it is determined that the category label to which the text to be classified belongs is the second category label, and the confidence level corresponding to the category result is the second confidence level multiplied by a second coefficient.
  • the values of the first coefficient and the second coefficient may be the same or different, for example, the values of the first coefficient and the second coefficient are both 0.5.
  • for example, if the prediction result of the processed text output by the multi-model structure classification voting model is (label 1, confidence 0.8) and the prediction result of the multi-task classification model is (label 2, confidence 0.9), the type result with the larger confidence is kept and its confidence is multiplied by 0.5; the type of the text to be classified is thus determined as label 2 with confidence 0.45, and the output is (label 2, confidence 0.45).
  • if the prediction result of the processed text output by the multi-model structure classification voting model is (label 1, confidence 0.8) and the prediction result of the multi-task classification model is (label 2, confidence 0.8), the confidences are the same; a type result is then selected at random and its confidence is multiplied by 0.5, so the type of the text to be classified is determined as label 1 or label 2 with confidence 0.4, and the output is (label 1 or label 2, confidence 0.4). A combined sketch of these rules follows below.
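  • A minimal sketch combining the three rules above (same label: average the confidences; different labels: keep the higher-confidence label times the coefficient 0.5; tie: random pick):

```python
import random

def fuse(label1, conf1, label2, conf2, coeff=0.5):
    """Combine the voting-model and multi-task-model outputs as described above."""
    if label1 == label2:
        return label1, (conf1 + conf2) / 2
    if conf1 > conf2:
        return label1, conf1 * coeff
    if conf2 > conf1:
        return label2, conf2 * coeff
    return random.choice([(label1, conf1 * coeff), (label2, conf2 * coeff)])

# fuse("label 1", 0.8, "label 1", 0.7)  -> ("label 1", 0.75)
# fuse("label 1", 0.8, "label 2", 0.9)  -> ("label 2", 0.45)
# fuse("label 1", 0.8, "label 2", 0.8)  -> ("label 1", 0.4) or ("label 2", 0.4)
```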
  • the classification confidence level corresponding to the classification label to which the text to be classified belongs may also be added to the training sample set.
  • the multi-task classification model and/or the multi-model structure classification voting model can be further optimized by continuously expanding the training sample set, which helps increase the accuracy of the confidence results.
  • the multi-model structure classification voting model and the multi-task classification model are used to classify the text to be classified, respectively, to obtain the first confidence that the processed text belongs to the first classification label and the second confidence that the processed text belongs to the second classification label, and the classification label of the processed text is then determined according to the first confidence and the second confidence. Combining different models for category judgment improves the accuracy of text category judgment, and no manual intervention is needed in the classification process, which improves the efficiency of text category judgment; at the same time, because the first confidence is obtained by analyzing the processed text with each base model in the multi-model structure classification voting model, the repeated analyses improve the accuracy of text category judgment. Therefore, the text classification method proposed in this application can improve both the reliability of the text classification result and the efficiency of text classification.
  • FIG. 2 is a schematic block diagram of the modules of the text classification apparatus of the present application.
  • the text classification apparatus 100 described in this application can be installed in an electronic device.
  • the text classification apparatus may include a model acquisition module 101 , a text preprocessing module 102 , a first model analysis module 103 , a second model analysis module 104 and a result processing module 105 .
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the model obtaining module 101 is configured to obtain a multi-model structure classification voting model and a multi-task classification model, and the multi-model structure classification voting model and the multi-task classification model are obtained through a pre-built classification model and a training sample set.
  • the multi-model structure classification voting model is obtained by training multiple base models through a pre-built training sample set, and then performing performance ranking and weight setting on the output results of each base model.
  • the device further includes a multi-model structure classification voting model building module, and the multi-model structure classification voting model building module includes:
  • a first training unit used for training a pre-built classification model according to the random forest algorithm and the training sample set, to obtain a plurality of text classification models
  • a construction unit configured to construct the multi-model structure classification voting model by using the plurality of text classification models.
  • the classification model is a BERT model.
  • the obtaining unit is configured to, before the training sample set is obtained, obtain a pre-built corpus set and perform quantization and cleaning operations on the corpus set to obtain the training sample set.
  • the corpus set consists of texts that have previously been classified, or corpus texts of known types collected from the network.
  • a quantization operation is performed on the corpus set to obtain quantified data
  • a cleaning operation is performed on the quantified data to obtain the training sample set.
  • the quantization operation includes converting data of the float32 type in the corpus set into the uint8 data type suitable for training a text classification model; the cleaning operation includes deduplicating the quantized data and filling empty values.
  • the random forest algorithm is an ensemble learning algorithm for classification.
  • the random forest algorithm is used to randomly select, with replacement, 25% of the data from the training sample set a preset number Q of times to train the text classification model, thereby obtaining Q text classification models.
  • the preset value Q is 5.
  • the construction unit is specifically configured to:
  • weights are set for the multiple text classification models according to preset weight gradient values, so as to obtain the multi-model structure classification voting model.
  • the confidence formula of the multi-model structure classification voting model is $y(x) = \sum_{q=1}^{Q} p_q \, y_q(x)$, where $p_q$ is the weight of the q-th text classification model and $y_q(x)$ is the confidence result of the q-th text classification model.
  • for example, the model testing samples are texts of a known type. Suppose the analysis results of the 5 text classification models are [Model 1: negative emotion class, confidence 90%; Model 2: negative emotion class, confidence 86%; Model 3: negative emotion class, confidence 96%; Model 4: negative emotion class, confidence 82%; Model 5: negative emotion class, confidence 79%]. The 5 text classification models are ranked by confidence to obtain the base model ranking [Model 3; Model 1; Model 2; Model 4; Model 5], and the weights are allocated as [Model 3: weight 0.3; Model 1: weight 0.25; Model 2: weight 0.2; Model 4: weight 0.15; Model 5: weight 0.1]. According to these weights, the five text classification models are combined to obtain the multi-model structure classification voting model.
  • the device further includes a multi-task classification model acquisition module, and the multi-task classification model acquisition module includes:
  • an optimized classification model acquisition unit configured to combine the classification loss in the classification model with the pre-built similarity loss to obtain an improved loss, and to replace the classification loss in the classification model with the improved loss to obtain an optimized classification model;
  • a feature extraction unit configured to perform feature extraction on the training sample set by utilizing the feature extraction neural network in the optimized classification model to obtain sentence vectors;
  • a second training unit configured to train the optimized classification model by using the sentence vectors until the descending gradient of the improved loss of the optimized classification model becomes smaller than a preset loss threshold within a preset number of training steps, to obtain the multi-task classification model.
  • in addition to the corpus, the training sample set includes standard sentences of different types, where N is the number of standard sentences in the training sample set and each standard sentence in the training sample represents one type; the pre-built similarity loss measures the distance between the sentence vector of a specified corpus text and the sentence vector x_j of each standard sentence, and the improvement loss is obtained by combining the classification loss with this similarity loss.
  • the confidence calculation formula of the classification label to which each corpus text belongs is $p_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$, where $z_j$ is the classification result of the j-th short sentence in the corpus, that is, the score of the classification label described by the j-th short sentence, and K is the number of classification results.
  • the embodiment of the present application can continuously train and optimize the classification model by using the sentence vectors through a twin network. The improvement loss is continuously minimized; when the decrease of the improvement loss of the optimized classification model falls below the preset loss threshold, the training process is stopped and the multi-task classification model is obtained.
  • the text preprocessing module 102 is configured to acquire the text to be classified, and preprocess the text to be classified to obtain the processed text.
  • a pre-built recall engine may be used to obtain the text to be classified from the Internet or a local storage space.
  • the text preprocessing module 102 is specifically used for:
  • punctuation-based segmentation or sentence-length segmentation is performed on the text to be classified to obtain the processed text.
  • the text to be classified is segmented according to punctuation, that is, the text to be classified is divided at punctuation marks; when the volume of the text to be classified is greater than 512 characters, the text to be classified is segmented by sentence length, for example, divided into processed texts whose volume is each less than 512 characters.
  • the first model analysis module 103 is configured to input the processed text into the multi-model structure classification voting model, and classify the processed text through a plurality of base models in the multi-model structure classification voting model, A first confidence space is obtained, where the first confidence space includes a first confidence that the processed text belongs to a first classification label.
  • specifically, each base model in the multi-model structure classification voting model is used to classify the processed text to obtain a type result and a confidence level corresponding to the type result; the confidence levels generated by the five models are then combined through the weights to obtain the first classification label and the first confidence level of the processed text.
  • for example, if the confidence levels obtained by the processed text through the five models are [0.8, 0.9, 0.6, 0.5, 0.7] and the weights of the five models are [0.25, 0.2, 0.3, 0.15, 0.1], the first confidence level is 0.8×0.25 + 0.9×0.2 + 0.6×0.3 + 0.5×0.15 + 0.7×0.1 = 0.705.
  • the second model analysis module 104 is used for inputting the processed text into the multi-task classification model, and by classifying the processed text in the multi-task classification model, a second confidence space is obtained.
  • the second confidence space includes a second confidence that the processed text belongs to a second classification label.
  • the multi-task classification model includes a classification task and a similarity task.
  • the multi-task classification model is used to analyze the processed text to obtain a set of similarities between the processed text and each type of standard sentence, together with a confidence set corresponding to the similarities; the similarity set is then filtered, the type corresponding to the standard sentence with the highest similarity to the processed text is taken as the second classification label of the processed text, and the confidence set is queried according to the second classification label to obtain the second confidence level.
  • the result processing module 105 is configured to determine, according to the first confidence space and the second confidence space, the classification label to which the text to be classified belongs and the classification confidence corresponding to the classification label.
  • the classification label to which the text to be classified belongs can be determined according to a confidence threshold. For example, a confidence greater than a threshold such as 0.8 (or, alternatively, such as 0.5) is selected from the first confidence and the second confidence, and the classification label corresponding to that confidence is taken as the classification result.
  • the result processing module 105 is specifically configured to:
  • if the first classification label is the same as the second classification label, it is determined that the classification label to which the text to be classified belongs is the first classification label (equivalently, the second classification label), and the confidence corresponding to the classification label is the average of the first confidence and the second confidence.
  • for example, if the multi-model structure classification voting model outputs the prediction (label 1, confidence 0.8) and the multi-task classification model outputs the prediction (label 1, confidence 0.7), the two predicted labels are the same, so the confidences are added and averaged; the type of the text to be classified is determined as label 1 with confidence 0.75, and the output is (label 1, confidence 0.75).
  • the result processing module 105 is also specifically configured to:
  • if the first confidence level is greater than the second confidence level, it is determined that the category label to which the text to be classified belongs is the first category label, and the confidence level corresponding to the category result is the first confidence level multiplied by a first coefficient;
  • if the first confidence level is not greater than the second confidence level, it is determined that the category label to which the text to be classified belongs is the second category label, and the confidence level corresponding to the category result is the second confidence level multiplied by a second coefficient.
  • the values of the first coefficient and the second coefficient may be the same or different, for example, the values of the first coefficient and the second coefficient are both 0.5.
  • for example, if the prediction result of the processed text output by the multi-model structure classification voting model is (label 1, confidence 0.8) and the prediction result of the multi-task classification model is (label 2, confidence 0.9), the type result with the larger confidence is kept and its confidence is multiplied by 0.5; the type of the text to be classified is thus determined as label 2 with confidence 0.45, and the output is (label 2, confidence 0.45).
  • if the prediction result of the processed text output by the multi-model structure classification voting model is (label 1, confidence 0.8) and the prediction result of the multi-task classification model is (label 2, confidence 0.8), the confidences are the same; a type result is then selected at random and its confidence is multiplied by 0.5, so the type of the text to be classified is determined as label 1 or label 2 with confidence 0.4, and the output is (label 1 or label 2, confidence 0.4).
  • the apparatus described in the present application may further include a sample adding module, which is configured to add the classification confidence corresponding to the classification label to which the text to be classified belongs to the training sample set.
  • the multi-task classification model and/or the multi-model structure classification voting model can be further optimized by continuously expanding the training sample set, which helps increase the accuracy of the confidence results.
  • the multi-model structure classification voting model and the multi-task classification model are used to classify the text to be classified, respectively, to obtain the first confidence that the processed text belongs to the first classification label and the second confidence that the processed text belongs to the second classification label, and the classification label of the processed text is then determined according to the first confidence and the second confidence. Combining different models for category judgment improves the accuracy of text category judgment, and no manual intervention is needed in the classification process, which improves the efficiency of text category judgment; at the same time, because the first confidence is obtained by analyzing the processed text with each base model in the multi-model structure classification voting model, the repeated analyses improve the accuracy of text category judgment. Therefore, the text classification device proposed in the present application can improve both the reliability of the text classification result and the efficiency of text classification.
  • FIG. 3 is a schematic structural diagram of an electronic device implementing the text classification method of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as a text classification program 12.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1 .
  • the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card (Flash Card) equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as a code of a text classification program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple integrated circuits packaged with the same function or different functions, including one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like.
  • the processor 10 is the control core (Control Unit) of the electronic device; it connects the various components of the entire electronic device using various interfaces and lines, and runs or executes the programs or modules stored in the memory 11 (for example, a text classification program) and calls data stored in the memory 11 to perform the various functions of the electronic device 1 and process data.
  • the bus may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 3 only shows an electronic device with certain components. Those skilled in the art can understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, which may include fewer or more components than those shown in the figure, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components; preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • a text classification program 12 stored in the memory 11 in the electronic device 1 is a combination of multiple computer programs, and when running in the processor 10, can realize:
  • the processed text is input into the multi-model structure classification voting model, and the processed text is classified by a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, where the first confidence space includes a first confidence that the processed text belongs to a first classification label;
  • the classification label to which the text to be classified belongs and the classification confidence degree corresponding to the classification label are determined.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), etc.
  • the computer-usable storage medium may mainly include a stored program area and a stored data area, where the stored program area may store an operating system, an application program required for at least one function, and the like, and the stored data area may store data created during use, etc.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the readable storage medium stores a computer program which, when executed by a processor of an electronic device, can achieve:
  • the processed text is input into the multi-model structure classification voting model, and the processed text is classified by a plurality of base models in the multi-model structure classification voting model to obtain a first confidence space, where the first confidence space includes a first confidence that the processed text belongs to a first classification label;
  • the classification label to which the text to be classified belongs and the classification confidence degree corresponding to the classification label are determined.
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database, consisting of a series of data blocks associated with one another through cryptographic methods. Each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a text classification method, relating to the technical field of natural language processing, comprising: acquiring a multi-model structure classification voting model and a multi-task classification model; preprocessing a text to be classified to obtain a processed text; inputting the processed text into the multi-model structure classification voting model to obtain a first confidence that the processed text belongs to a first classification label; inputting the processed text into the multi-task classification model to obtain a second confidence that the processed text belongs to a second classification label; and determining, according to a first confidence space and a second confidence space, a classification label to which the text to be classified belongs and a classification confidence corresponding to the classification label. The method also relates to blockchain technology, and the confidence spaces can be stored in a blockchain node. The method can improve not only the reliability of text classification results but also the efficiency of text classification.
PCT/CN2021/083560 2021-01-28 2021-03-29 Text classification method and apparatus, electronic device and storage medium WO2022160449A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110121141.9 2021-01-28
CN202110121141.9A CN112883190A (zh) 2021-01-28 2021-01-28 Text classification method, apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2022160449A1 (fr)

Family

ID=76053277

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083560 WO2022160449A1 (fr) 2021-01-28 2021-03-29 Text classification method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN112883190A (fr)
WO (1) WO2022160449A1 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049836A (zh) * 2022-08-16 2022-09-13 平安科技(深圳)有限公司 Image segmentation method, apparatus, device and storage medium
CN115168594A (zh) * 2022-09-08 2022-10-11 北京星天地信息科技有限公司 Police incident information processing method and apparatus, electronic device and storage medium
CN115409104A (zh) * 2022-08-25 2022-11-29 贝壳找房(北京)科技有限公司 Method, apparatus, device, medium and program product for identifying object types
CN115827875A (zh) * 2023-01-09 2023-03-21 无锡容智技术有限公司 Processing terminal search method for text data
CN117235270A (zh) * 2023-11-16 2023-12-15 中国人民解放军国防科技大学 Text classification method, apparatus and computer device based on a confidence confusion matrix
CN117473339A (zh) * 2023-12-28 2024-01-30 智者四海(北京)技术有限公司 Content review method, apparatus, electronic device and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378826B (zh) * 2021-08-11 2021-12-07 腾讯科技(深圳)有限公司 Data processing method, apparatus, device and storage medium
CN115470292B (zh) * 2022-08-22 2023-10-10 深圳市沃享科技有限公司 Blockchain consensus method, apparatus, electronic device and readable storage medium
CN116383724B (zh) * 2023-02-16 2023-12-05 北京数美时代科技有限公司 Single-domain label vector extraction method, apparatus, electronic device and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992887A (zh) * 2017-11-28 2018-05-04 东软集团股份有限公司 Classifier generation method, classification method, apparatus, electronic device and storage medium
CN110019794A (zh) * 2017-11-07 2019-07-16 腾讯科技(北京)有限公司 Text resource classification method and apparatus, storage medium and electronic apparatus
CN110309302A (zh) * 2019-05-17 2019-10-08 江苏大学 Imbalanced text classification method and system combining SVM and semi-supervised clustering
US10460257B2 (en) * 2016-09-08 2019-10-29 Conduent Business Services, Llc Method and system for training a target domain classifier to label text segments

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389270B (zh) * 2017-08-09 2022-11-04 菜鸟智能物流控股有限公司 Logistics object determination method, apparatus and machine-readable medium
CN108108766B (zh) * 2017-12-28 2021-10-29 东南大学 Driving behavior recognition method and system based on multi-sensor data fusion
US10832003B2 (en) * 2018-08-26 2020-11-10 CloudMinds Technology, Inc. Method and system for intent classification
CN110377727B (zh) * 2019-06-06 2022-06-17 深思考人工智能机器人科技(北京)有限公司 Multi-label text classification method and apparatus based on multi-task learning
CN110765267A (zh) * 2019-10-12 2020-02-07 大连理工大学 Dynamic incomplete data classification method based on multi-task learning
CN111444952B (zh) * 2020-03-24 2024-02-20 腾讯科技(深圳)有限公司 Sample recognition model generation method, apparatus, computer device and storage medium
CN112256880A (zh) * 2020-11-11 2021-01-22 腾讯科技(深圳)有限公司 Text recognition method and apparatus, storage medium and electronic device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10460257B2 (en) * 2016-09-08 2019-10-29 Conduent Business Services, Llc Method and system for training a target domain classifier to label text segments
CN110019794A (zh) * 2017-11-07 2019-07-16 腾讯科技(北京)有限公司 Text resource classification method and apparatus, storage medium and electronic apparatus
CN107992887A (zh) * 2017-11-28 2018-05-04 东软集团股份有限公司 Classifier generation method, classification method, apparatus, electronic device and storage medium
CN110309302A (zh) * 2019-05-17 2019-10-08 江苏大学 Imbalanced text classification method and system combining SVM and semi-supervised clustering

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049836A (zh) * 2022-08-16 2022-09-13 平安科技(深圳)有限公司 Image segmentation method, apparatus, device and storage medium
CN115409104A (zh) * 2022-08-25 2022-11-29 贝壳找房(北京)科技有限公司 Method, apparatus, device, medium and program product for identifying object types
CN115168594A (zh) * 2022-09-08 2022-10-11 北京星天地信息科技有限公司 Police incident information processing method and apparatus, electronic device and storage medium
CN115827875A (zh) * 2023-01-09 2023-03-21 无锡容智技术有限公司 Processing terminal search method for text data
CN115827875B (zh) * 2023-01-09 2023-04-25 无锡容智技术有限公司 Processing terminal search method for text data
CN117235270A (zh) * 2023-11-16 2023-12-15 中国人民解放军国防科技大学 Text classification method, apparatus and computer device based on a confidence confusion matrix
CN117235270B (zh) * 2023-11-16 2024-02-02 中国人民解放军国防科技大学 Text classification method, apparatus and computer device based on a confidence confusion matrix
CN117473339A (zh) * 2023-12-28 2024-01-30 智者四海(北京)技术有限公司 Content review method, apparatus, electronic device and storage medium
CN117473339B (zh) * 2023-12-28 2024-04-30 智者四海(北京)技术有限公司 Content review method, apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN112883190A (zh) 2021-06-01

Similar Documents

Publication Publication Date Title
WO2022160449A1 (fr) Text classification method and apparatus, electronic device and storage medium
WO2022121171A1 (fr) Similar text matching method and apparatus, electronic device, and computer storage medium
WO2022141861A1 (fr) Emotion classification method and apparatus, electronic device, and storage medium
CN111460797B (zh) Keyword extraction method and apparatus, electronic device and readable storage medium
CN113449187A (zh) Product recommendation method, apparatus, device and storage medium based on dual profiles
CN112883730B (zh) Similar text matching method and apparatus, electronic device and storage medium
CN112906377A (zh) Question answering method and apparatus based on entity restriction, electronic device and storage medium
CN114491047A (zh) Multi-label text classification method and apparatus, electronic device and storage medium
CN113887941A (zh) Business process generation method and apparatus, electronic device and medium
CN111522782A (zh) File data writing method and apparatus, and computer-readable storage medium
CN113313211A (zh) Text classification method and apparatus, electronic device and storage medium
CN116578696A (zh) Text summary generation method, apparatus, device and storage medium
WO2022141838A1 (fr) Model confidence analysis method and apparatus, electronic device, and computer storage medium
CN116226315A (zh) Sensitive information detection method and apparatus based on artificial intelligence, and related device
WO2022141860A1 (fr) Text deduplication method and apparatus, electronic device, and computer-readable storage medium
WO2022222228A1 (fr) Method and apparatus for recognizing bad text information, electronic device, and storage medium
CN115146064A (zh) Intent recognition model optimization method, apparatus, device and storage medium
WO2022141867A1 (fr) Speech recognition method and apparatus, electronic device, and readable storage medium
CN113343102A (zh) Data recommendation method and apparatus based on feature screening, electronic device and medium
CN113011164A (zh) Data quality detection method and apparatus, electronic device and medium
WO2022227170A1 (fr) Method and apparatus for generating cross-language word vectors, electronic device, and storage medium
CN113342941B (zh) Text search method and apparatus, electronic device and computer-readable storage medium
CN115146627B (zh) Entity recognition method and apparatus, electronic device and storage medium
CN113592606B (zh) Product recommendation method, apparatus, device and storage medium based on multiple decisions
CN112528183B (zh) Web page component layout method and apparatus based on big data, electronic device and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21922048

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21922048

Country of ref document: EP

Kind code of ref document: A1