CN113836303A - Text type identification method and device, computer equipment and medium

Info

Publication number
CN113836303A
Authority
CN
China
Prior art keywords
text
spliced
category
value
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111131337.2A
Other languages
Chinese (zh)
Inventor
黄振宇
王磊
吴文哲
王媛
王晶璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111131337.2A priority Critical patent/CN113836303A/en
Publication of CN113836303A publication Critical patent/CN113836303A/en
Priority to PCT/CN2022/070967 priority patent/WO2023045184A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Abstract

The invention relates to the field of artificial intelligence, and discloses a text type identification method, a text type identification device, a computer device and a medium, wherein the method comprises the following steps: acquiring a target text to be identified; splicing the target text with each text in the standard set to generate a first spliced text set; inputting each text in the first spliced text set into a pre-trained text type recognition model one by one, and outputting a predicted value of each text in the first spliced text set; the pre-trained text type recognition model is generated by training based on the spliced texts in the second spliced text set, and the second spliced text set is generated by splicing each text in the training set and the test set with each text in the standard set; and determining the category of the target text based on the predicted value of each text in the first spliced text set. According to the method and the device, the training set, the test set and the standard set are spliced to establish the correlation, and the model is trained by using the data with the correlation, so that the identification result of the model is more accurate.

Description

Text type identification method and device, computer equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a text type identification method, a text type identification device, computer equipment and a medium.
Background
In recent years, with the continuous development of informatization, online data has grown explosively, and every industry generates a wide variety of data every day. As a result, more and more data-processing scenarios involve text classification or categorization, which can increase the value that can be extracted from the data. With the development of technologies such as deep learning and reinforcement learning, researchers are increasingly eager to enable machines to accurately identify the categories of different description texts.
For a machine to accurately identify the category of a text, deep learning on natural language is indispensable. Deep learning mostly relies on deep neural network models, that is, a deep neural network model is trained on natural language text. However, in existing multi-class problems the recognition accuracy of a neural network is often affected by the number of categories of the text data: the more categories there are, the lower the recognition accuracy.
Disclosure of Invention
In view of the above, it is necessary to provide a text type recognition method, apparatus, computer device, and medium for solving the problem of low accuracy of machine understanding of natural language.
A text category identification method comprises the following steps: acquiring a target text to be identified; splicing the target text with each text in the standard set to generate a first spliced text set; inputting each text in the first spliced text set into a pre-trained text type recognition model one by one, and outputting a predicted value of each text in the first spliced text set; the pre-trained text type recognition model is generated by training based on the spliced texts in the second spliced text set, and the second spliced text set is generated by splicing each text in the training set and the test set with each text in the standard set; and determining the category of the target text based on the predicted value of each text in the first spliced text set.
In one embodiment, the obtaining mode of the target text to be recognized at least comprises obtaining from a test set; before the target text to be recognized is obtained, the method further comprises the following steps: collecting a plurality of description texts from a text library; receiving a labeling instruction for each description text in a plurality of description texts, and generating a plurality of labeling texts after labeling based on each description text; and dividing the plurality of marked texts into a training set, a test set and a standard set according to a preset percentage.
In one embodiment, determining the category of the target text based on the predicted value of each text in the first spliced text set comprises: acquiring a threshold value of each of a plurality of preset categories; counting, for each category, a counting result according to the predicted value of each text in the first spliced text set and the threshold value of that category, to generate a counting result sequence; obtaining the maximum counting result from the counting result sequence; dividing the maximum counting result by the total number of standard texts of the category corresponding to the maximum counting result, to generate the confidence of the target text; and, when the confidence is greater than a preset value, determining the category corresponding to the confidence as the category of the target text.
In one embodiment, counting the counting result of each category according to the predicted value of each text in the first spliced text set and the threshold value of each category, and generating a counting result sequence, comprises: judging, one by one, whether the predicted value of each text in the first spliced text set is greater than the threshold value of each category; if yes, incrementing the counter of that category by one; if not, keeping the counter of that category unchanged; wherein the initial value of each counter is 0; and, when the predicted value of every text in the first spliced text set has been judged, determining the final value of each category's counter as the counting result of that category.
In one embodiment, the generation of the pre-trained text category recognition model comprises the following steps: splicing each text in the training set and the test set with each text in the standard set to generate a second spliced text set; creating a text category identification model; inputting each spliced text in the second spliced text set into a text type identification model, and outputting a loss value of the model; and when the loss value is smaller than a preset loss threshold value, generating a pre-trained text category recognition model.
In one embodiment, inputting each spliced text in the second spliced text set into the text category identification model and outputting a loss value of the model comprises: inputting each spliced text in the second spliced text set into the text category identification model, and outputting a first semantic vector and a second semantic vector; calculating a category similarity value of each spliced text in the second spliced text set according to the first semantic vector and the second semantic vector; determining a label value of each spliced text in the second spliced text set; taking the difference between the category similarity value of each spliced text and the corresponding label value to generate a loss value of the model; and outputting the loss value of the model.
In one embodiment, the method further comprises: when the loss value is greater than or equal to the preset loss threshold, back-propagating the loss value, and continuing to perform the step of inputting each spliced text in the second spliced text set into the text category identification model until the loss value is smaller than the preset loss threshold.
A text category identification apparatus, the apparatus comprising: the text acquisition module is used for acquiring a target text to be identified; the spliced text set generating module is used for splicing the target text and each text in the standard set to generate a first spliced text set; the text input module is used for inputting each text in the first spliced text set into a pre-trained text type recognition model one by one and outputting a predicted value of each text in the first spliced text set; the pre-trained text type recognition model is generated by training based on the spliced texts in the second spliced text set, and the second spliced text set is generated by splicing each text in the training set and the test set with each text in the standard set; and the category determining module is used for determining the category of the target text based on the predicted value of each text in the first spliced text set.
A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the text category identification method described above.
A medium having computer-readable instructions stored thereon, which, when executed by one or more processors, cause the one or more processors to perform the steps of the text category identification method described above.
According to the text type identification method, the text type identification device and the medium, the target text to be identified is firstly obtained, then the target text and each text in the standard set are spliced to generate a first spliced text set, then each text in the first spliced text set is input into a pre-trained text type identification model one by one, and the predicted value of each text in the first spliced text set is output; the pre-trained text type recognition model is generated by training based on the spliced texts in the second spliced text set, the second spliced text set is generated by splicing each text in the training set and the test set with each text in the standard set, and finally the type of the target text is determined based on the predicted value of each text in the first spliced text set. According to the method and the device, the training set, the test set and the standard set are spliced to establish the correlation, and the model is trained by using the data with the correlation, so that the identification result of the model is more accurate.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a diagram of an implementation environment of a text category identification method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a method for text category identification provided in an embodiment of the present application;
FIG. 4 is a diagram of a text category identification model provided in one embodiment of the present application;
FIG. 5 is a diagram illustrating text category identification concepts provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of an apparatus for recognizing text type according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
FIG. 1 is a diagram of an implementation environment of the text category identification method provided in an embodiment. As shown in FIG. 1, the implementation environment includes a server 110 and a client 120.
The server 110 may specifically be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms, for example, a server device on which a topic extraction model is deployed. When text category identification is needed, the server 110 obtains a target text to be identified from the client 120 and splices it with each text in the standard set to generate a first spliced text set. The server 110 inputs each text in the first spliced text set into a pre-trained text category identification model one by one and outputs a predicted value of each text in the first spliced text set. The server 110 queries a problem set associated with a thematic result of the description information and sends the problem set to the client 120, so that the client 120 displays the problem set on a display interface. The server 110 determines the category of the target text based on the predicted value of each text in the first spliced text set and sends the category to the client 120 for display.
It should be noted that the client 120 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The server 110 and the client 120 may be connected through bluetooth, USB (Universal Serial Bus), or other communication connection methods, which is not limited herein.
FIG. 2 is a diagram showing an internal configuration of a computer device according to an embodiment. As shown in fig. 2, the computer device includes a processor, a medium, a memory, and a network interface connected through a system bus. The computer device medium stores an operating system, a database and computer readable instructions, the database can store control information sequences, and the computer readable instructions can make a processor realize a text category identification method when being executed by the processor. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, may cause the processor to perform a text category identification method. The network interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 2 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components. Wherein the medium is a readable storage medium.
The text type identification method provided by the embodiment of the present application will be described in detail below with reference to FIG. 3 to FIG. 5. The method may be implemented by a computer program that is executable on a text category identification device based on the von Neumann architecture. The computer program may be integrated into an application or may run as a separate tool application.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Referring to fig. 3, a flowchart of a text type identification method is provided in an embodiment of the present application.
As shown in fig. 3, the method of the embodiment of the present application may include the following steps:
S101, acquiring a target text to be identified;
The target text is a language expression. A language is a set of communication instructions expressed according to common processing rules and transmitted visually, audibly or by touch; here it specifically refers to natural languages used for human communication, such as Chinese and English. Text is the written form of language, usually a sentence or a combination of sentences with complete and systematic meaning; a text can be a sentence, a paragraph or a chapter. The ways of acquiring the target text to be identified include at least acquisition from a test set.
Generally, a text is a word composed of several characters, a sentence composed of several words, or a paragraph composed of several sentences. A user can describe his or her ideas in text, which turns a complicated idea into an instruction that others can easily understand. Different ways of expression can make complex ideas plain and easy to understand, which makes communication easier. The one or more natural-language units contained in the target text may be referred to as sentences; alternatively, the text may be divided into sentences according to its punctuation, that is, content ending with a period, question mark, exclamation mark, comma or the like is taken as one sentence.
In the present application, the target text may be text input by the user to the user terminal. The text of the input terminal may be a language text acquired from the internet, that is, acquired in an actual application scenario, or a language text acquired from a test set, that is, acquired in a model training scenario, and the generation of the target text has various ways, which is not limited herein.
In the embodiment of the application, before the target text to be recognized is obtained, a plurality of description texts are collected from a text library; receiving a labeling instruction for each description text in a plurality of description texts, and generating a plurality of labeling texts after labeling based on each description text; and dividing the plurality of marked texts into a training set, a test set and a standard set according to a preset percentage.
Specifically, a plurality of description texts are collected from a text library, and each description text is then labeled, for example, text A + label 1, text B + label 2. The labeled texts are divided into three sets (training set, test set and standard set); a verification set may be added, for four sets in total. The proportion of each set can be set freely, but the training set is generally far larger than the test set and the verification set (if one is actually needed), and the standard set is generally the smallest. A suitable ratio is training set : test set : verification set : standard set = 8:2:2:0.2. A minimum size also needs to be set: for example, the standard text set should contain at least 20 to 30 sentences; alternatively, text diversity can be measured with methods such as semantic representation and a small number of highly diverse texts selected, so that the texts in the set cover each category to a certain degree.
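The split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_labelled_texts`, the default minimum of 20 standard texts and the plain random shuffle are hypothetical choices, and a random shuffle does not by itself guarantee the category coverage mentioned in the text.

```python
import random

def split_labelled_texts(samples, ratios=(8, 2, 2, 0.2), min_standard=20, seed=0):
    """Shuffle (text, label) pairs and split them into training, test,
    verification and standard sets in the given proportions."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    total = sum(ratios)
    # Standard set: proportional share, but never below the assumed minimum size.
    n_standard = min(n, max(min_standard, round(n * ratios[3] / total)))
    rest = n - n_standard
    body = ratios[0] + ratios[1] + ratios[2]
    n_test = round(rest * ratios[1] / body)
    n_verify = round(rest * ratios[2] / body)
    n_train = rest - n_test - n_verify
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    verify = shuffled[n_train + n_test:n_train + n_test + n_verify]
    standard = shuffled[n - n_standard:]
    return train, test, verify, standard

# Hypothetical data: 1220 labelled texts spread over 5 categories.
samples = [(f"text {i}", i % 5) for i in range(1220)]
train, test, verify, standard = split_labelled_texts(samples)
```

With 1220 samples the 8:2:2:0.2 ratio gives 800, 200, 200 and 20 texts respectively; in practice the standard set would more likely be hand-picked for diversity than sampled at random.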
In one possible implementation, after the model training is finished, a target text to be recognized may be determined from the test set.
In another possible implementation manner, after the model is trained and deployed in an actual application scene, online data information is acquired, and the data information is determined as a target text to be recognized.
S102, splicing the target text with each text in the standard set to generate a first spliced text set;
Generally, after text splicing, a label is automatically attached to each spliced text, and this label is 0. It should be noted that 0 is used only as a placeholder and does not mean that the text A' differs in category from the text D and the text E.
In a possible implementation manner, the target text to be identified is spliced with each standard text in the standard set, forming one spliced text per standard text. Assuming that the target text to be identified is A' and the standard set consists of text D (label 1) and text E (label 2), the first spliced text set is: text A', text D, 0; text A', text E, 0.
According to the present application, splicing the target text to be identified with the texts in the standard set yields a plurality of texts to be predicted, which greatly increases the number of predictions made for the target text and makes the subsequently calculated confidence more accurate.
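As a minimal sketch of step S102, the splicing can be represented as building one tuple per standard text. The tuple layout `(target, standard_text, placeholder)` and the helper name `splice_with_standard_set` are assumptions made for illustration; the application does not specify a data structure.

```python
PLACEHOLDER_LABEL = 0  # placeholder only; carries no information about the category

def splice_with_standard_set(target_text, standard_set):
    """Pair the target text with every standard text.

    standard_set is a list of (text, category_label) tuples; the category
    label of the standard text is not copied into the spliced text."""
    return [(target_text, std_text, PLACEHOLDER_LABEL)
            for std_text, _category in standard_set]

standard_set = [("text D", 1), ("text E", 2)]
first_spliced = splice_with_standard_set("text A'", standard_set)
```

Here `first_spliced` holds the two entries of the example above, one for text D and one for text E.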
S103, inputting each text in the first spliced text set into a pre-trained text category identification model one by one, and outputting a predicted value of each text in the first spliced text set;
The pre-trained text category identification model is generated by training on the spliced texts in the second spliced text set, and the second spliced text set is generated by splicing each text in the training set and the test set with each text in the standard set;
In the embodiment of the application, the pre-trained text category identification model can be generated according to the following steps: first, each text in the training set and the test set is spliced with each text in the standard set to generate a second spliced text set; then, a text category identification model is created; each spliced text in the second spliced text set is input into the text category identification model, and a loss value of the model is output; finally, when the loss value is smaller than a preset loss threshold, the pre-trained text category identification model is generated.
Specifically, the text category identification model is a twin (Siamese) model such as SBERT, which can receive two input texts at the same time, and it is built on a language representation model (such as word2vec, BERT or GPT).
Specifically, when each spliced text in the second spliced text set is input into the text category identification model and the loss value of the model is output, the model first outputs a first semantic vector and a second semantic vector for each spliced text; a category similarity value of each spliced text is then calculated from the first semantic vector and the second semantic vector; a label value of each spliced text in the second spliced text set is determined; finally, the difference between the category similarity value of each spliced text and the corresponding label value is taken to generate the loss value of the model, and the loss value is output.
Specifically, when the loss value is greater than or equal to the preset loss threshold, the loss value is back-propagated, and the step of inputting each spliced text in the second spliced text set into the text category identification model continues to be performed.
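The stopping rule described above (keep updating while the loss is at or above the preset threshold, stop once it falls below) can be sketched with a toy stand-in for the model. Everything here is hypothetical: a single weight `w` replaces the real network, plain gradient descent replaces back-propagation through BERT, and the names `train_until_threshold`, `toy_loss` and `toy_step` are illustrative only.

```python
def train_until_threshold(loss_fn, step_fn, loss_threshold, max_rounds=1000):
    """Repeat the forward pass and the update until the loss drops below the threshold."""
    for round_no in range(1, max_rounds + 1):
        loss = loss_fn()
        if loss < loss_threshold:
            return round_no, loss   # training is considered finished
        step_fn()                   # stands in for back-propagating the loss
    raise RuntimeError("loss threshold not reached within max_rounds")

# Toy model: one weight w whose "similarity" output is w itself,
# trained toward label 1 with a squared-error loss.
state = {"w": 0.0}
LABEL, LEARNING_RATE = 1.0, 0.1

def toy_loss():
    return (state["w"] - LABEL) ** 2

def toy_step():
    gradient = 2 * (state["w"] - LABEL)
    state["w"] -= LEARNING_RATE * gradient

rounds, final_loss = train_until_threshold(toy_loss, toy_step, loss_threshold=0.01)
```

The `max_rounds` guard is an added safety net that the application does not mention; without it a threshold that is set too low would loop forever.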
In a possible implementation manner, when the pre-trained text category identification model is generated, each text in the training set and the test set is first combined with each text in the standard set to obtain text pairs, and the label of each spliced text is set to 0 or 1 according to whether the original labels of the two texts are consistent.
For example, the original texts in the training set and the test set are: text A, label 1; text B, label 2; text C, label 3. The current standard texts of the standard set are: text D, label 1; text E, label 2. The second spliced text set formed by splicing is:
[Figure BDA0003280582110000081: table of the second spliced text set; image not reproduced]
Then, the spliced texts in the second spliced text set are input one by one into the SBERT twin model for training, and the pre-trained text category identification model is generated after training.
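The construction of the second spliced text set can be sketched as a Cartesian pairing of the labelled texts with the standard texts. The convention used here, 1 when the two original labels agree and 0 otherwise, is an assumption chosen so that the label can serve as a similarity target; the source states only that the label is 0 or 1. The helper name `build_pair_training_set` is likewise hypothetical.

```python
def build_pair_training_set(labelled_texts, standard_set):
    """Pair every labelled text with every standard text.

    Assumed convention: pair label 1 if both original labels match, else 0."""
    pairs = []
    for text, label in labelled_texts:
        for std_text, std_label in standard_set:
            pairs.append((text, std_text, 1 if label == std_label else 0))
    return pairs

labelled_texts = [("text A", 1), ("text B", 2), ("text C", 3)]
standard_set = [("text D", 1), ("text E", 2)]
second_spliced = build_pair_training_set(labelled_texts, standard_set)
```

For the three labelled texts and two standard texts of the example this yields six pairs, of which only (text A, text D) and (text B, text E) share a label.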
The basic network structure of the SBERT twin model is shown in FIG. 4, for example. The twin model includes two BERT language models, each followed by a pooling layer that downsamples the vectors output by the language model. During training, suppose text A is sentence A and text B is sentence B. Sentence A and sentence B are input into the BERT language models at the same time, and the models output a vector A and a vector B. Vector A and vector B are respectively input into the pooling layers, which output a vector u and a vector v. The cosine similarity cosine-sim(u, v) is calculated from the vector u and the vector v. Finally, the difference between the cosine similarity and the label value corresponding to text A and text B is taken to generate the loss value of the model, and the loss is back-propagated until the model converges.
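The pooling, cosine similarity and loss steps can be illustrated without a real BERT encoder. In this sketch the token vectors are supplied by hand, mean pooling is assumed for the pooling layer, and the squared difference is assumed as the concrete form of "the difference between the cosine similarity and the label value"; none of these choices is confirmed by the source.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors)
            for i in range(dim)]

def cosine_sim(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pair_loss(tokens_a, tokens_b, label):
    """Squared difference between cosine-sim(u, v) and the pair label."""
    u = mean_pool(tokens_a)
    v = mean_pool(tokens_b)
    return (cosine_sim(u, v) - label) ** 2

# Hand-made stand-ins for the encoder outputs of two sentences.
same_direction = pair_loss([[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0]], label=1)
orthogonal = pair_loss([[1.0, 0.0]], [[0.0, 1.0]], label=1)
```

A pair whose pooled vectors point the same way incurs no loss against label 1, while an orthogonal pair incurs the maximum loss for that label under this formulation.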
In a possible implementation manner, after model training is completed, each text in the first stitched text set is input into a pre-trained text type recognition model one by one, and a predicted value of each text in the first stitched text set is output.
For example, the user has preset 5 categories, each of which has 20 standard texts, and there are 100 standard texts in total; then, after 1 sentence of text to be recognized and 100 standard texts are spliced respectively, the text is input into the model, and the model directly outputs 100 predicted values.
S104, determining the category of the target text based on the predicted value of each text in the first spliced text set.
In a possible implementation manner, when the category of the target text is determined based on the predicted value of each text in the first spliced text set, a threshold value of each of a plurality of preset categories is first obtained. The counting result of each category is then counted according to the predicted value of each text in the first spliced text set and the threshold value of each category, generating a counting result sequence. Next, the maximum counting result is obtained from the counting result sequence and divided by the total number of standard texts of the category corresponding to the maximum counting result, generating the confidence of the target text. Finally, when the confidence is greater than a preset value, the category corresponding to the confidence is determined as the category of the target text.
Specifically, when the counting result of each category is counted and the counting result sequence is generated, it is first judged, one by one, whether the predicted value of each text in the first spliced text set is greater than the threshold value of each category; if yes, the counter of that category is incremented by one; if not, the counter of that category is kept unchanged; the initial value of each counter is 0. When the predicted value of every text in the first spliced text set has been judged, the final value of each category's counter is determined as the counting result of that category.
For example, the predicted values of the texts in the first spliced text set are a, b and c, and the thresholds of 3 preset categories are A, B and C. Comparing a, b and c with A respectively (for example, if a > A the count is incremented by 1, otherwise it is not) yields the counting result of category A; comparing a, b and c with B respectively yields the counting result of category B; and comparing a, b and c with C respectively yields the counting result of category C. If the final counting results of the 3 categories are [10, 19, 5], the second category is considered to have the highest confidence; after its confidence is calculated (19% in this example), the target text to be identified is considered to belong to that category.
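The counting and confidence calculation can be sketched as follows. This sketch adopts one reading of the text as an assumption: each predicted value is compared only with the threshold of the category that its standard text belongs to, so that the confidence (count divided by the number of standard texts of that category) stays between 0 and 1. The function name `classify_by_counts`, the threshold 0.5, the minimum confidence 0.8 and the 20 standard texts per category are all hypothetical values, so the resulting confidence differs from the percentage quoted in the example.

```python
def classify_by_counts(predictions, thresholds, min_confidence):
    """predictions: list of (standard_text_category, predicted_value) pairs.
    thresholds: dict mapping each category to its threshold.
    Returns (category or None, confidence, per-category counts)."""
    counts = {category: 0 for category in thresholds}   # counters start at 0
    totals = {category: 0 for category in thresholds}   # standard texts per category
    for category, value in predictions:
        totals[category] += 1
        if value > thresholds[category]:
            counts[category] += 1
    best = max(counts, key=counts.get)
    confidence = counts[best] / totals[best] if totals[best] else 0.0
    return (best if confidence > min_confidence else None), confidence, counts

# Hypothetical predictions reproducing the counts [10, 19, 5] from the example,
# with an assumed 20 standard texts per category.
predictions = ([("A", 0.9)] * 10 + [("A", 0.1)] * 10 +
               [("B", 0.9)] * 19 + [("B", 0.1)] * 1 +
               [("C", 0.9)] * 5 + [("C", 0.1)] * 15)
thresholds = {"A": 0.5, "B": 0.5, "C": 0.5}
category, confidence, counts = classify_by_counts(predictions, thresholds, min_confidence=0.8)
```

Returning `None` when the confidence does not exceed the preset value is a design choice of this sketch; the application does not say what happens in that case.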
The present method mainly changes the basic logic of traditional text classification. Instead of assigning a specific category according to text features, the problem is modeled as a comparison with the standard texts of each category: the method judges which category's standard texts the target text is closer to, and classification is thereby completed. The difference between the conventional classification method and the present application can be illustrated by FIG. 5: the conventional method finds the edges of the three ellipses A, B and C in FIG. 5, whereas the present application finds the boundaries between A, B and C (the bold black lines).
In the embodiment of the application, a text type recognition device firstly obtains a target text to be recognized, then splices the target text with each text in a standard set to generate a first spliced text set, then inputs each text in the first spliced text set into a pre-trained text type recognition model one by one, and outputs a predicted value of each text in the first spliced text set; the pre-trained text type recognition model is generated by training based on the spliced texts in the second spliced text set, the second spliced text set is generated by splicing each text in the training set and the test set with each text in the standard set, and finally the type of the target text is determined based on the predicted value of each text in the first spliced text set. According to the method and the device, the relevance is established after the training set, the test set and the standard set are spliced, and after the model is trained by utilizing the data with the relevance, the recognition result of the model is more accurate, so that the accuracy of text category recognition is improved.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Referring to fig. 6, a schematic structural diagram of a text category identification apparatus according to an exemplary embodiment of the present invention is shown; the apparatus is applied to a server. The text category identification apparatus may be implemented as all or part of a device in software, hardware or a combination of both. The apparatus 1 comprises a text acquisition module 10, a spliced text set generation module 20, a text input module 30 and a category determination module 40.
The text acquisition module 10 is used for acquiring a target text to be identified;
the spliced text set generating module 20 is configured to splice the target text and each text in the standard set to generate a first spliced text set;
the text input module 30 is configured to input each text in the first stitched text set into a pre-trained text type recognition model one by one, and output a predicted value of each text in the first stitched text set;
the pre-trained text type recognition model is generated by training based on the spliced texts in the second spliced text set, and the second spliced text set is generated by splicing each text in the training set and the test set with each text in the standard set;
and the category determining module 40 is used for determining the category of the target text based on the predicted value of each text in the first spliced text set.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In one embodiment, a computer device is provided, the device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring a target text to be identified; splicing the target text with each text in the standard set to generate a first spliced text set; inputting each text in the first spliced text set into a pre-trained text type recognition model one by one, and outputting a predicted value of each text in the first spliced text set; the pre-trained text type recognition model is generated by training based on the spliced texts in the second spliced text set, and the second spliced text set is generated by splicing each text in the training set and the test set with each text in the standard set; and determining the category of the target text based on the predicted value of each text in the first spliced text set.
In one embodiment, before the processor executes the step of acquiring the target text to be identified, the following operations are further executed: collecting a plurality of description texts from a text library; receiving a labeling instruction for each of the description texts and labeling each description text accordingly to generate a plurality of labeled texts; and dividing the labeled texts into a training set, a test set and a standard set according to preset percentages.
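A minimal sketch of the division step is given below. The 70/20/10 percentages, the shuffling with a fixed seed, and the function name `split_labeled_texts` are assumptions for illustration only; the application states only that the labeled texts are divided according to preset percentages.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def split_labeled_texts(
    labeled_texts: Iterable[T],
    percentages: Tuple[int, int, int] = (70, 20, 10),  # assumed preset values
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Divide labeled texts into a training set, a test set and a standard set."""
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    items = list(labeled_texts)
    # Shuffle with a fixed seed so the split is reproducible (an assumption).
    random.Random(seed).shuffle(items)
    n_train = len(items) * percentages[0] // 100
    n_test = len(items) * percentages[1] // 100
    training_set = items[:n_train]
    test_set = items[n_train:n_train + n_test]
    # Whatever remains forms the standard set.
    standard_set = items[n_train + n_test:]
    return training_set, test_set, standard_set
```

Integer percentages are used so that the set sizes do not depend on floating-point rounding.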
In an embodiment, when the processor determines the category of the target text based on the predicted value of each text in the first spliced text set, the following operations are specifically performed: acquiring a preset threshold for each of a plurality of categories; counting, according to the predicted value of each text in the first spliced text set and the threshold of each category, a counting result for each category to generate a counting result sequence; obtaining the maximum counting result from the counting result sequence; dividing the maximum counting result by the total number of standard texts of the category to which the maximum counting result belongs to generate the confidence of the target text; and, when the confidence is greater than a preset value, determining the category corresponding to the confidence as the category of the target text.
In an embodiment, when the processor counts the counting result of each category according to the predicted value of each text in the first spliced text set and the threshold of each category to generate the counting result sequence, the following operations are specifically performed: judging, one by one, whether the predicted value of each text in the first spliced text set is greater than the threshold of each category; if so, incrementing the count value of that category by one; if not, keeping the count value of that category unchanged, wherein the count value of each category is initially 0; and, when the predicted values of all texts in the first spliced text set have been judged, determining the final count value of each category as the counting result of that category.
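The counting procedure can be sketched as below. The sketch follows the literal reading of the text, in which every predicted value is compared with the threshold of every category; the function name `count_per_category` and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def count_per_category(
    predicted_values: Sequence[float],
    thresholds: Sequence[float],
) -> List[int]:
    """Return the counting result sequence, one count per category."""
    counts = [0] * len(thresholds)  # the count of each category starts at 0
    for value in predicted_values:
        for k, threshold in enumerate(thresholds):
            # Strictly greater than the threshold: add one for this category.
            if value > threshold:
                counts[k] += 1
    return counts


# Hypothetical predicted values and per-category thresholds.
counts = count_per_category([0.9, 0.6, 0.3], [0.5, 0.7, 0.2])
print(counts)
```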
In one embodiment, when the processor generates the pre-trained text category recognition model, the following operations are specifically performed: splicing each text in the training set and the test set with each text in the standard set to generate a second spliced text set; creating a text category recognition model; inputting each spliced text in the second spliced text set into the text category recognition model and outputting a loss value of the model; and, when the loss value is smaller than a preset loss threshold, generating the pre-trained text category recognition model.
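One possible way to assemble the second spliced text set is sketched below. The labeling rule (label value 1 when the two texts share a category, otherwise 0), the `[SEP]` separator and the function name `build_second_spliced_set` are assumptions made for illustration; this passage does not define how the label value is obtained.

```python
from itertools import product
from typing import List, Sequence, Tuple

LabeledText = Tuple[str, str]  # (text, category)


def build_second_spliced_set(
    train_and_test: Sequence[LabeledText],
    standard_set: Sequence[LabeledText],
    separator: str = "[SEP]",  # assumed separator
) -> List[Tuple[str, int]]:
    """Splice every training/test text with every standard text.

    Returns (spliced text, label value) pairs, where the label value is
    assumed to be 1 for a same-category pair and 0 otherwise.
    """
    return [
        (f"{text}{separator}{standard_text}", 1 if category == standard_category else 0)
        for (text, category), (standard_text, standard_category)
        in product(train_and_test, standard_set)
    ]
```

The size of the resulting set is the product of the two input sizes, which is worth keeping in mind when the standard set is large.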
In one embodiment, when the processor inputs each spliced text in the second spliced text set into the text category recognition model and outputs the loss value of the model, the following operations are specifically performed: inputting each spliced text in the second spliced text set into the text category recognition model and outputting a first semantic vector and a second semantic vector; calculating a category similarity value of each spliced text in the second spliced text set according to the first semantic vector and the second semantic vector; determining a label value of each spliced text in the second spliced text set; taking the difference between the category similarity value of each spliced text and its corresponding label value to generate the loss value of the model; and outputting the loss value of the model.
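The loss computation for a single spliced text might look like the sketch below. Cosine similarity is used here as the category similarity value and the absolute difference as the loss; both choices, and the function names, are assumptions, since the text says only that the similarity is calculated from the two semantic vectors and then differenced with the label value.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Treat a zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0


def pair_loss(
    first_vector: Sequence[float],
    second_vector: Sequence[float],
    label_value: float,
) -> float:
    """Absolute difference between the similarity value and the label value."""
    return abs(cosine_similarity(first_vector, second_vector) - label_value)
```

For example, identical vectors with label value 1 give a loss of 0, while orthogonal vectors with label value 1 give a loss of 1.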
In one embodiment, in connection with generating the pre-trained text category recognition model when the loss value is smaller than the preset loss threshold, the processor specifically performs the following operations: when the loss value is greater than or equal to the preset loss threshold, back-propagating the loss value; and continuing to perform the step of inputting each spliced text in the second spliced text set into the text category recognition model.
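The stopping rule described here can be sketched as a generic loop. This is a toy illustration under stated assumptions: the function `train_until_threshold`, the `max_rounds` safeguard, and the one-parameter quadratic loss with a plain gradient step (standing in for back-propagation through a real model) are all hypothetical.

```python
from typing import Callable, Tuple


def train_until_threshold(
    loss_fn: Callable[[float], float],
    step_fn: Callable[[float], float],
    params: float,
    loss_threshold: float,
    max_rounds: int = 1000,  # safeguard added for the sketch
) -> Tuple[float, float]:
    """Repeat forward pass and update until the loss is below the threshold."""
    for _ in range(max_rounds):
        loss = loss_fn(params)
        if loss < loss_threshold:
            # Loss below the preset threshold: training is finished.
            return params, loss
        # Otherwise update the parameters and feed the data again.
        params = step_fn(params)
    raise RuntimeError("loss did not fall below the threshold")


# Toy stand-in: minimise (w - 3)^2 with a gradient step of size 0.1.
final_w, final_loss = train_until_threshold(
    loss_fn=lambda w: (w - 3.0) ** 2,
    step_fn=lambda w: w - 0.1 * 2.0 * (w - 3.0),
    params=0.0,
    loss_threshold=1e-4,
)
```

In a real implementation the update step would be the back-propagation of the loss through the text category recognition model rather than this closed-form gradient.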
In one embodiment, a medium is presented having computer-readable instructions stored thereon which, when executed by one or more processors, cause the one or more processors to perform the steps of: acquiring a target text to be identified; splicing the target text with each text in the standard set to generate a first spliced text set; inputting each text in the first spliced text set into a pre-trained text type recognition model one by one, and outputting a predicted value of each text in the first spliced text set; the pre-trained text type recognition model is generated by training based on the spliced texts in the second spliced text set, and the second spliced text set is generated by splicing each text in the training set and the test set with each text in the standard set; and determining the category of the target text based on the predicted value of each text in the first spliced text set.
In one embodiment, before the processor executes the step of acquiring the target text to be identified, the following operations are further executed: collecting a plurality of description texts from a text library; receiving a labeling instruction for each of the description texts and labeling each description text accordingly to generate a plurality of labeled texts; and dividing the labeled texts into a training set, a test set and a standard set according to preset percentages.
In an embodiment, when the processor determines the category of the target text based on the predicted value of each text in the first spliced text set, the following operations are specifically performed: acquiring a preset threshold for each of a plurality of categories; counting, according to the predicted value of each text in the first spliced text set and the threshold of each category, a counting result for each category to generate a counting result sequence; obtaining the maximum counting result from the counting result sequence; dividing the maximum counting result by the total number of standard texts of the category to which the maximum counting result belongs to generate the confidence of the target text; and, when the confidence is greater than a preset value, determining the category corresponding to the confidence as the category of the target text.
In an embodiment, when the processor counts the counting result of each category according to the predicted value of each text in the first spliced text set and the threshold of each category to generate the counting result sequence, the following operations are specifically performed: judging, one by one, whether the predicted value of each text in the first spliced text set is greater than the threshold of each category; if so, incrementing the count value of that category by one; if not, keeping the count value of that category unchanged, wherein the count value of each category is initially 0; and, when the predicted values of all texts in the first spliced text set have been judged, determining the final count value of each category as the counting result of that category.
In one embodiment, when the processor generates the pre-trained text category recognition model, the following operations are specifically performed: splicing each text in the training set and the test set with each text in the standard set to generate a second spliced text set; creating a text category recognition model; inputting each spliced text in the second spliced text set into the text category recognition model and outputting a loss value of the model; and, when the loss value is smaller than a preset loss threshold, generating the pre-trained text category recognition model.
In one embodiment, when the processor inputs each spliced text in the second spliced text set into the text category recognition model and outputs the loss value of the model, the following operations are specifically performed: inputting each spliced text in the second spliced text set into the text category recognition model and outputting a first semantic vector and a second semantic vector; calculating a category similarity value of each spliced text in the second spliced text set according to the first semantic vector and the second semantic vector; determining a label value of each spliced text in the second spliced text set; taking the difference between the category similarity value of each spliced text and its corresponding label value to generate the loss value of the model; and outputting the loss value of the model.
In one embodiment, in connection with generating the pre-trained text category recognition model when the loss value is smaller than the preset loss threshold, the processor specifically performs the following operations: when the loss value is greater than or equal to the preset loss threshold, back-propagating the loss value; and continuing to perform the step of inputting each spliced text in the second spliced text set into the text category recognition model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable medium and which, when executed, can include the processes of the method embodiments described above. The medium may be a non-volatile medium such as a magnetic disk, an optical disk or a Read-Only Memory (ROM), or may be a Random Access Memory (RAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features is not contradictory, it should be considered to be within the scope of this specification.
The above examples show only some embodiments of the present invention. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A text category identification method, the method comprising:
acquiring a target text to be identified;
splicing the target text with each text in the standard set to generate a first spliced text set;
inputting each text in the first spliced text set into a pre-trained text type recognition model one by one, and outputting a predicted value of each text in the first spliced text set; the pre-trained text type recognition model is generated by training based on spliced texts in a second spliced text set, and the second spliced text set is generated by splicing each text in the training set and the test set with each text in the standard set;
and determining the category of the target text based on the predicted value of each text in the first spliced text set.
2. The method according to claim 1, wherein the target text to be recognized is obtained at least from a test set; before the target text to be recognized is obtained, the method further includes:
collecting a plurality of description texts from a text library;
receiving a labeling instruction for each description text in the plurality of description texts, and labeling each description text accordingly to generate a plurality of labeled texts;
and dividing the plurality of labeled texts into a training set, a test set and a standard set according to preset percentages.
3. The method of claim 1, wherein determining the category of the target text based on the predicted value of each text in the first stitched text set comprises:
acquiring a threshold value of each of a plurality of preset categories;
counting the counting result of each category according to the predicted value of each text in the first spliced text set and the threshold value of each category to generate a counting result sequence;
obtaining a maximum counting result from the counting result sequence;
dividing the maximum counting result by the total number of standard texts of the category to which the maximum counting result belongs to generate a confidence of the target text;
and when the confidence is greater than a preset value, determining the category corresponding to the confidence as the category of the target text.
4. The method according to claim 3, wherein counting the counting result of each category according to the predicted value of each text in the first spliced text set and the threshold value of each category to generate a counting result sequence comprises:
judging, one by one, whether the predicted value of each text in the first spliced text set is greater than the threshold value of each category;
if so, incrementing the count value of that category by one; if not, keeping the count value of that category unchanged; wherein the count value of each category is initially 0;
and when the predicted values of all texts in the first spliced text set have been judged, determining the final count value of each category as the counting result of that category.
5. The method of claim 2, wherein generating a pre-trained text category recognition model comprises:
splicing each text in the training set and the test set with each text in the standard set to generate a second spliced text set;
creating a text category identification model;
inputting each spliced text in the second spliced text set into the text type identification model, and outputting a loss value of the model;
and when the loss value is smaller than a preset loss threshold value, generating a pre-trained text type recognition model.
6. The method of claim 5, wherein inputting each spliced text in the second spliced text set into the text category identification model and outputting a loss value of the model comprises:
inputting each spliced text in the second spliced text set into the text type identification model, and outputting a first semantic vector and a second semantic vector;
calculating a category similarity value of each spliced text in the second spliced text set according to the first semantic vector and the second semantic vector;
determining a label value of each spliced text in the second spliced text set;
taking the difference between the category similarity value of each spliced text and the corresponding label value to generate a loss value of the model;
and outputting the loss value of the model.
7. The method of claim 5, wherein generating a pre-trained text category recognition model when the loss value is less than a preset loss threshold comprises:
when the loss value is greater than or equal to the preset loss threshold value, back-propagating the loss value;
and continuing to execute the step of inputting each spliced text in the second spliced text set into the text category identification model.
8. A text category identification apparatus, characterized in that the apparatus comprises:
the text acquisition module is used for acquiring a target text to be identified;
the spliced text set generating module is used for splicing the target text and each text in the standard set to generate a first spliced text set;
the text input module is used for inputting each text in the first spliced text set into a pre-trained text type recognition model one by one and outputting a predicted value of each text in the first spliced text set;
the pre-trained text type recognition model is generated by training based on spliced texts in a second spliced text set, and the second spliced text set is generated by splicing each text in the training set and the test set with each text in the standard set;
and the category determining module is used for determining the category of the target text based on the predicted value of each text in the first spliced text set.
9. A computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by the processor, cause the processor to carry out the steps of the text category identification method according to any one of claims 1 to 7.
10. A medium having computer-readable instructions stored thereon which, when executed by one or more processors, cause the one or more processors to perform the steps of the text category identification method as claimed in any one of claims 1 to 7.
CN202111131337.2A 2021-09-26 2021-09-26 Text type identification method and device, computer equipment and medium Pending CN113836303A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111131337.2A CN113836303A (en) 2021-09-26 2021-09-26 Text type identification method and device, computer equipment and medium
PCT/CN2022/070967 WO2023045184A1 (en) 2021-09-26 2022-01-10 Text category recognition method and apparatus, computer device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111131337.2A CN113836303A (en) 2021-09-26 2021-09-26 Text type identification method and device, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN113836303A true CN113836303A (en) 2021-12-24

Family

ID=78970196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111131337.2A Pending CN113836303A (en) 2021-09-26 2021-09-26 Text type identification method and device, computer equipment and medium

Country Status (2)

Country Link
CN (1) CN113836303A (en)
WO (1) WO2023045184A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035510A (en) * 2022-08-11 2022-09-09 深圳前海环融联易信息科技服务有限公司 Text recognition model training method, text recognition device, and medium
WO2023045184A1 (en) * 2021-09-26 2023-03-30 平安科技(深圳)有限公司 Text category recognition method and apparatus, computer device, and medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116167336B (en) * 2023-04-22 2023-07-07 拓普思传感器(太仓)有限公司 Sensor data processing method based on cloud computing, cloud server and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382271A (en) * 2020-03-09 2020-07-07 支付宝(杭州)信息技术有限公司 Training method and device of text classification model and text classification method and device
CN112464662A (en) * 2020-12-02 2021-03-09 平安医疗健康管理股份有限公司 Medical phrase matching method, device, equipment and storage medium
CN113157927A (en) * 2021-05-27 2021-07-23 中国平安人寿保险股份有限公司 Text classification method and device, electronic equipment and readable storage medium
CN113326379A (en) * 2021-06-30 2021-08-31 中国平安人寿保险股份有限公司 Text classification prediction method, device, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411563B (en) * 2010-09-26 2015-06-17 阿里巴巴集团控股有限公司 Method, device and system for identifying target words
CN108875743B (en) * 2017-05-15 2022-02-22 创新先进技术有限公司 Text recognition method and device
CN112182217A (en) * 2020-09-28 2021-01-05 云知声智能科技股份有限公司 Method, device, equipment and storage medium for identifying multi-label text categories
CN113360654B (en) * 2021-06-23 2024-04-05 深圳平安综合金融服务有限公司 Text classification method, apparatus, electronic device and readable storage medium
CN113360660A (en) * 2021-07-27 2021-09-07 北京有竹居网络技术有限公司 Text type identification method and device, electronic equipment and storage medium
CN113836303A (en) * 2021-09-26 2021-12-24 平安科技(深圳)有限公司 Text type identification method and device, computer equipment and medium

Also Published As

Publication number Publication date
WO2023045184A1 (en) 2023-03-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40063337; Country of ref document: HK)