CN114138972A - Text type identification method and device

Info

Publication number: CN114138972A
Application number: CN202111440947.0A
Authority: CN
Inventors: 武文杰
Original assignee: Shenzhen Jizhi Digital Technology Co Ltd
Current assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date: 2021-11-30
Filing date: 2021-11-30
Publication date: 2022-03-04
Anticipated expiration: 2041-11-30
Also published as: CN114138972B

Abstract

The disclosure relates to the technical field of artificial intelligence, and provides a text type identification method and device. The method comprises the following steps: in the text to be annotated of each category, determining a standard subcategory corresponding to the text to be annotated of each category according to the standard questions and the first similarity of each similar question; in the text to be labeled of each category, determining a nonstandard sub-category corresponding to the text to be labeled of each category according to the second similarity between any one similar question and each of the other similar questions; determining a category set according to a standard sub-category and a plurality of non-standard sub-categories corresponding to the text to be labeled of each category; when a second text to be labeled is detected, updating the category set according to the second text to be labeled; and when the text to be recognized is detected, determining the category corresponding to the text to be recognized from the category set by using a nearest neighbor algorithm.

Description

Text type identification method and device

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to a text type identification method and device.

Background

In text category identification, the prior art usually labels the text once and performs text identification based on the labeled text. However, in some text recognition scenarios, text categories must be recognized multiple times, or the labeled text must be updated multiple times, to ensure that recognition accuracy does not fall below expectations. The prior art has not yet solved this problem.

In the course of implementing the disclosed concept, the inventors found that there are at least the following technical problems in the related art: the labeled text cannot be updated in real time, so that the accuracy rate of identifying the text category is low.

Disclosure of Invention

In view of this, embodiments of the present disclosure provide a method and an apparatus for identifying a text category, an electronic device, and a computer-readable storage medium, so as to solve the problem in the prior art that the accuracy of identifying a text category is low because a labeled text cannot be updated in real time.

In a first aspect of the embodiments of the present disclosure, a method for recognizing a text category is provided, including: acquiring a first text to be annotated, wherein the first text to be annotated comprises a plurality of categories of texts to be annotated, and the text to be annotated of each category comprises a standard question and a plurality of similar questions; in the text to be annotated of each category, determining a standard subcategory corresponding to the text to be annotated of each category according to the standard questions and the first similarity of each similar question; in the text to be labeled of each category, determining a nonstandard sub-category corresponding to the text to be labeled of each category according to the second similarity between any one similar question and each of the other similar questions; determining a category set according to a standard sub-category and a plurality of non-standard sub-categories corresponding to the text to be labeled of each category; when a second text to be labeled is detected, updating the category set according to the second text to be labeled; and when the text to be recognized is detected, determining the category corresponding to the text to be recognized from the category set by using a nearest neighbor algorithm.

In a second aspect of the embodiments of the present disclosure, there is provided an apparatus for recognizing a text category, including: an acquisition module configured to acquire a first text to be annotated, wherein the first text to be annotated comprises a plurality of categories of texts to be annotated, and each category of texts to be annotated comprises a standard question and a plurality of similar questions; a first determining module configured to determine, in the text to be labeled of each category, a standard subcategory corresponding to the text to be labeled of each category according to the first similarity between the standard question and each similar question; a second determining module configured to determine, in the text to be labeled of each category, a nonstandard sub-category corresponding to the text to be labeled of each category according to the second similarity between any one similar question and each of the other similar questions; a third determining module configured to determine a category set according to the standard sub-category and the plurality of non-standard sub-categories corresponding to the text to be labeled of each category; an updating module configured to update the category set according to a second text to be labeled when the second text to be labeled is detected; and a recognition module configured to determine a category corresponding to the text to be recognized from the category set by using a nearest neighbor algorithm when the text to be recognized is detected.

In a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the above method when executing the computer program.

In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program, which when executed by a processor, implements the steps of the above-mentioned method.

Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects: in the text to be labeled of each category, a standard subcategory corresponding to the text to be labeled of each category is determined according to the first similarity between the standard question and each similar question; in the text to be labeled of each category, a nonstandard sub-category corresponding to the text to be labeled of each category is determined according to the second similarity between any one similar question and each of the other similar questions; a category set is determined according to the standard sub-category and the plurality of non-standard sub-categories corresponding to the text to be labeled of each category; when a second text to be labeled is detected, the category set is updated according to the second text to be labeled; and when the text to be recognized is detected, the category corresponding to the text to be recognized is determined from the category set by using a nearest neighbor algorithm. These technical means solve the problem in the prior art that the accuracy of text category recognition is low because the labeled text cannot be updated in real time, so that the labeled text can be updated in real time and the text category can then be recognized.

Drawings

To more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without inventive efforts.

Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure;

Fig. 2 is a flowchart illustrating a text category identification method according to an embodiment of the present disclosure;

Fig. 3 is a schematic structural diagram of an apparatus for recognizing text categories according to an embodiment of the present disclosure;

Fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.

A text category identification method and apparatus according to an embodiment of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 is a scene schematic diagram of an application scenario of an embodiment of the present disclosure. The application scenario may include terminal devices 1, 2, and 3, server 4, and network 5.

The terminal devices 1, 2, and 3 may be hardware or software. When the terminal devices 1, 2 and 3 are hardware, they may be various electronic devices having a display screen and supporting communication with the server 4, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like; when the terminal devices 1, 2, and 3 are software, they may be installed in the electronic devices as above. The terminal devices 1, 2 and 3 may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not limited by the embodiments of the present disclosure. Further, the terminal devices 1, 2, and 3 may have various applications installed thereon, such as a data processing application, an instant messaging tool, social platform software, a search-type application, a shopping-type application, and the like.

The server 4 may be a server providing various services, for example, a backend server receiving a request sent by a terminal device establishing a communication connection with the server, and the backend server may receive and analyze the request sent by the terminal device and generate a processing result. The server 4 may be one server, may also be a server cluster composed of a plurality of servers, or may also be a cloud computing service center, which is not limited in this disclosure.

The server 4 may be hardware or software. When the server 4 is hardware, it may be various electronic devices that provide various services to the terminal devices 1, 2, and 3. When the server 4 is software, it may be a plurality of software or software modules providing various services for the terminal devices 1, 2, and 3, or may be a single software or software module providing various services for the terminal devices 1, 2, and 3, which is not limited by the embodiment of the present disclosure.

The network 5 may be a wired network connected by coaxial cable, twisted pair, or optical fiber, or may be a wireless network that interconnects various communication devices without wiring, for example, Bluetooth, Near Field Communication (NFC), infrared, and the like, which is not limited in the embodiment of the present disclosure.

A user can establish a communication connection with the server 4 via the network 5 through the terminal devices 1, 2, and 3 to receive or transmit information or the like. It should be noted that the specific types, numbers and combinations of the terminal devices 1, 2 and 3, the server 4 and the network 5 may be adjusted according to the actual requirements of the application scenarios, and the embodiment of the present disclosure does not limit this.

Fig. 2 is a flowchart illustrating a text category identification method according to an embodiment of the present disclosure. The recognition method of the text category of fig. 2 may be performed by the server of fig. 1. As shown in fig. 2, the method for identifying a text category includes:

S201, acquiring a first text to be annotated, wherein the first text to be annotated comprises a plurality of categories of texts to be annotated, and the text to be annotated of each category comprises a standard question and a plurality of similar questions;

S202, determining a standard sub-category corresponding to the text to be labeled of each category according to the standard questions and the first similarity of each similar question in the text to be labeled of each category;

S203, in the text to be labeled of each category, determining a nonstandard sub-category corresponding to the text to be labeled of each category according to the second similarity between any one similar question and each of the other similar questions;

S204, determining a category set according to the standard subcategory and the plurality of non-standard subcategories corresponding to the text to be labeled of each category;

S205, when a second text to be labeled is detected, updating a category set according to the second text to be labeled;

S206, when the text to be recognized is detected, determining the category corresponding to the text to be recognized from the category set by using a nearest neighbor algorithm.

It should be noted that determining the standard sub-category and the non-standard sub-categories corresponding to the text to be labeled of each category, and determining the category set, may be understood as labeling the first text to be labeled, which constitutes first-time recognition of text categories. Because labeling the first text to be labeled is first-time recognition, the standard sub-category, the non-standard sub-categories, and the category set must be determined in that order. Updating the category set according to the second text to be labeled means labeling the second text to be labeled and adding the labeling result to the category set, which constitutes recognition of text categories that is not first-time. Because labeling the second text to be labeled is not first-time recognition, it is not necessary to determine the standard sub-category, the non-standard sub-categories, and the category set in sequence; the second text only needs to be compared with the category set. The similarity may be cosine similarity, text similarity, or the like.
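As one illustration of the similarity measure, cosine similarity between two text representations could be computed as in the minimal sketch below. The function name `cosine_similarity` and the convention of returning 0.0 for a zero vector are assumptions made for this example and are not taken from the disclosure.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)
```

Values close to 1 indicate representations pointing in nearly the same direction; the preset thresholds discussed in the following steps would be compared against this value.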

According to the technical solution provided by the embodiments of the present disclosure, in the text to be labeled of each category, the standard subcategory corresponding to the text to be labeled of each category is determined according to the first similarity between the standard question and each similar question; in the text to be labeled of each category, a nonstandard sub-category corresponding to the text to be labeled of each category is determined according to the second similarity between any one similar question and each of the other similar questions; a category set is determined according to the standard sub-category and the plurality of non-standard sub-categories corresponding to the text to be labeled of each category; when a second text to be labeled is detected, the category set is updated according to the second text to be labeled; and when the text to be recognized is detected, the category corresponding to the text to be recognized is determined from the category set by using a nearest neighbor algorithm. These technical means solve the problem in the prior art that the accuracy of text category recognition is low because the labeled text cannot be updated in real time, so that the labeled text can be updated in real time and the text category can then be recognized.

In step S202, determining, in the text to be labeled of each category, a standard subcategory corresponding to the text to be labeled of each category according to the standard question and the first similarity of each similar question includes, in the text to be labeled of each category: calculating a first similarity between the standard question and each similar question; and when the first similarity is greater than a preset threshold, adding the similar question corresponding to that first similarity to the standard subcategory, and deleting from the text to be annotated the similar questions that have already been added to the standard subcategory.
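Step S202 might be sketched as follows, assuming the standard question and the similar questions have already been encoded as vectors and that the similarity is cosine similarity. The names `build_standard_subcategory` and `cosine`, and the default threshold of 0.7, are hypothetical choices for this example.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_standard_subcategory(
    standard_question: Vector,
    similar_questions: List[Vector],
    threshold: float = 0.7,  # assumed value for the preset threshold
) -> Tuple[List[Vector], List[Vector]]:
    """Split similar questions into standard-subcategory members and the rest."""
    members, remaining = [], []
    for question in similar_questions:
        # First similarity: standard question vs. one similar question.
        if cosine(standard_question, question) > threshold:
            members.append(question)    # joins the standard subcategory
        else:
            remaining.append(question)  # stays in the text to be labeled
    return members, remaining
```

Returning the remaining questions separately mirrors the deletion described above: only those questions are considered when the non-standard sub-categories are formed.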

The standard question of the text to be labeled of a category can be understood as the average value of the text representation of that text, or as the most central part or data of the text representation; the similar questions of the text to be labeled of a category can be understood as the other parts or data in the text representation, excluding the standard question. To obtain the standard question and similar questions of a text, the text can first be passed through a text encoder to obtain its text representation; the most central part or data of the text representation is then used as the standard question of the text, and the other parts or data are used as its similar questions.

The category set in the embodiment of the present disclosure includes one standard sub-category and multiple non-standard sub-categories of text to be labeled of each category, and actually represents the text to be labeled of each category as multiple vectors, and each sub-category corresponds to one vector. Because the embodiment of the present disclosure represents the text to be labeled of each category as a plurality of vectors, the text recognition according to the category set is more accurate.

In step S203, determining, in the text to be labeled of each category, a non-standard sub-category corresponding to the text to be labeled of each category according to the second similarity between any one similar question and each of the other similar questions includes, in the text to be labeled of each category: calculating a second similarity between any one similar question and each of the other similar questions; and when the second similarity is greater than a preset threshold, adding the similar question corresponding to that second similarity to the non-standard subcategory corresponding to the one similar question, and deleting from the text to be labeled the similar questions that have already been added to a non-standard subcategory.

Because the standard sub-category is determined before the non-standard sub-categories, the text to be annotated at this step is the text that remains after the similar questions already added to the standard sub-category have been deleted. For example, suppose that in one category, after the standard subcategory is determined, 10 similar questions remain in the text to be annotated. The second similarity between the 1st similar question and each of the other 9 similar questions is calculated; if the second similarities with the 5th and 7th similar questions are greater than the preset threshold, the 5th and 7th similar questions are added to the non-standard subcategory corresponding to the 1st similar question. Then the second similarity between the 2nd similar question and the other 9 similar questions is calculated to determine the non-standard sub-category corresponding to the 2nd similar question ... Each similar question thus corresponds to a non-standard sub-category. A non-standard sub-category may be understood as a queue that is initially empty.
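The grouping in step S203 could look like the sketch below. It assumes vector inputs and cosine similarity, and it makes two interpretive assumptions that the disclosure does not state outright: each seed question is itself a member of its own non-standard sub-category, and a question already placed in a sub-category is never used as a seed. All names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_nonstandard_subcategories(
    similar_questions: List[Vector], threshold: float = 0.7
) -> List[List[Vector]]:
    """Greedily group the remaining similar questions around seed questions."""
    remaining = list(similar_questions)
    subcategories = []
    while remaining:
        seed = remaining.pop(0)  # next ungrouped question acts as the seed
        group = [seed]           # assumption: the seed belongs to its own group
        still_ungrouped = []
        for question in remaining:
            # Second similarity: seed vs. each other ungrouped question.
            if cosine(seed, question) > threshold:
                group.append(question)
            else:
                still_ungrouped.append(question)
        remaining = still_ungrouped  # grouped questions are "deleted"
        subcategories.append(group)
    return subcategories
```

With this reading, a question that matches no other question ends up alone in its own sub-category.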

In step S204, determining a category set according to the standard sub-category and the multiple non-standard sub-categories corresponding to the text to be labeled of each category includes, in the standard subcategory and the plurality of non-standard subcategories corresponding to the text to be annotated of each category: calculating the arithmetic mean of all similar questions in each nonstandard sub-category to obtain the nonstandard sub-category representation corresponding to each nonstandard sub-category; calculating the arithmetic mean of the standard question and all similar questions in the standard sub-category to obtain the standard sub-category representation corresponding to the standard sub-category; calculating the arithmetic mean of all nonstandard sub-category representations and the standard sub-category representation to obtain the parent category representation corresponding to the text to be labeled of each category; and determining a category set according to the plurality of non-standard sub-category representations, the standard sub-category representation, and the parent category representation corresponding to the text to be labeled of each category.

The parent category representation corresponding to the text to be labeled of each category is the category representation of that category. Each category is a parent category, hence the name parent category representation; several sub-categories are distinguished under each category, and the text representation of each sub-category is called a sub-category representation. In the embodiments of the present disclosure, the non-standard sub-category representation may be obtained by first obtaining the vectors corresponding to all similar questions in each nonstandard sub-category according to known text representation techniques, and then calculating the arithmetic mean of those vectors and using it as the representation of that nonstandard sub-category. Alternatively, the weighted sum of all similar questions in each nonstandard sub-category may be calculated and averaged, and the averaged value used as the representation of that nonstandard sub-category. The standard subcategory representation and the parent category representation are computed in a similar way.
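The three arithmetic means of step S204 can be sketched as below, assuming every question is already a fixed-length vector. The names `mean_vector` and `category_representations`, and the dictionary layout of the result, are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Element-wise arithmetic mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def category_representations(
    standard_question: Vector,
    standard_members: List[Vector],
    nonstandard_subcategories: List[List[Vector]],
) -> Dict[str, object]:
    # Standard sub-category: mean of the standard question and its members.
    standard_rep = mean_vector([standard_question] + list(standard_members))
    # Each non-standard sub-category: mean of its similar questions.
    nonstandard_reps = [mean_vector(g) for g in nonstandard_subcategories]
    # Parent category: mean of all sub-category representations.
    parent_rep = mean_vector([standard_rep] + nonstandard_reps)
    return {
        "standard": standard_rep,
        "nonstandard": nonstandard_reps,
        "parent": parent_rep,
    }
```

Note that the parent representation here is a mean of means, as the step describes, so a small sub-category carries the same weight as a large one.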

In step S205, updating the category set according to the second text to be labeled when the second text to be labeled is detected includes: inputting the second text to be labeled into a text encoder to obtain a first text representation corresponding to the second text to be labeled; and calculating a third similarity between the first text representation and each parent category representation in the category set. When the third similarity is greater than a preset threshold: a fourth similarity between the first text representation and the standard subcategory representation corresponding to each parent category representation in the category set is calculated, and when the fourth similarity is greater than a preset threshold, the second text to be labeled is added to the standard subcategory and the standard subcategory representation is updated; when the fourth similarity is smaller than the preset threshold, a fifth similarity between the first text representation and each nonstandard sub-category representation corresponding to each parent category representation in the category set is calculated, and when the fifth similarity is greater than the preset threshold, the second text to be labeled is added to the nonstandard sub-category corresponding to that fifth similarity and the corresponding nonstandard sub-category representation is updated. When the third similarity is smaller than the preset threshold, the first text representation is added to the category set as a new parent category representation.

The text encoder, which may be a BERT model, has been trained and has learned and stored the correspondence between texts to be annotated and text representations.

A third similarity greater than the preset threshold indicates that the second text to be labeled must belong to the parent category corresponding to that third similarity, and therefore belongs either to the standard subcategory of that parent category or to one of its non-standard subcategories. Consequently, when the fourth similarity is smaller than the preset threshold, the fifth similarity must be larger than the preset threshold. The text to be annotated, the parent category, and the subcategory can be understood as text information, while the text representation, the subcategory representation, and the parent category representation can be understood as vector information. A third similarity smaller than the preset threshold indicates that the second text to be labeled does not belong to any parent category in the category set, so the second text to be labeled is added to the category set as a new parent category. The category set comprises, or may be understood to comprise, the parent categories, the sub-categories, the parent category representations, and the sub-category representations, with each parent category representation associated with its parent category and each sub-category representation associated with its sub-category. When the second text to be annotated is added to the standard subcategory and the standard subcategory representation is updated, updating the standard subcategory representation means adding the first text representation to the standard subcategory representation. Further, the second text to be labeled can be added to the category set as a new sub-category of a new parent category, and the new parent category is refined or updated when other texts to be labeled are subsequently detected. If the similarity between a third text to be labeled and the second text to be labeled is greater than the preset threshold but smaller than a second preset threshold, the third text to be labeled can be used as another new sub-category of the new parent category, in addition to the sub-category formed by the second text to be labeled.
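The update logic of step S205 might be sketched as follows. The sketch starts from an already encoded first text representation (the text encoder is out of scope) and assumes cosine similarity. Several choices are assumptions for illustration rather than details from the disclosure: the category layout (`"standard"` and `"nonstandard"` member lists), recomputing each representation as the mean of its members, picking the best-matching parent when several exceed the threshold, storing a new parent's first text as its standard member, and opening a new non-standard sub-category as a fallback if no sub-category matches.

```python
import math
from typing import Dict, List

Vector = List[float]
Category = Dict[str, list]  # {"standard": [vec, ...], "nonstandard": [[vec, ...], ...]}


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors: List[Vector]) -> Vector:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def parent_representation(category: Category) -> Vector:
    reps = [mean_vector(category["standard"])]
    reps += [mean_vector(group) for group in category["nonstandard"]]
    return mean_vector(reps)


def update_category_set(
    category_set: List[Category], text_rep: Vector, threshold: float = 0.7
) -> str:
    # Third similarity: compare with every parent category representation.
    best, best_sim = None, threshold
    for category in category_set:
        sim = cosine(text_rep, parent_representation(category))
        if sim > best_sim:
            best, best_sim = category, sim
    if best is None:  # no parent matched: start a new parent category
        category_set.append({"standard": [text_rep], "nonstandard": []})
        return "new parent category"
    # Fourth similarity: compare with the standard sub-category representation.
    if cosine(text_rep, mean_vector(best["standard"])) > threshold:
        best["standard"].append(text_rep)
        return "standard sub-category"
    # Fifth similarity: compare with each non-standard sub-category representation.
    for group in best["nonstandard"]:
        if cosine(text_rep, mean_vector(group)) > threshold:
            group.append(text_rep)
            return "non-standard sub-category"
    best["nonstandard"].append([text_rep])  # assumed fallback
    return "new non-standard sub-category"
```

Because representations are recomputed from the member lists on each call, appending a member is what "updates" the sub-category and parent representations in this sketch.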

After step S205 is executed, that is, after the category set is updated according to the detected second text to be labeled, the method further includes: when a text recognition instruction is received, acquiring the text to be recognized and the category set; inputting the text to be recognized into a text encoder to obtain a second text representation corresponding to the text to be recognized; calculating a sixth similarity between the second text representation and any one of the following category representations: each parent category representation in the category set, the standard sub-category representation corresponding to each parent category representation, and any one non-standard sub-category representation corresponding to each parent category representation; and when the sixth similarity is greater than the preset threshold, identifying the text to be recognized as the category corresponding to that sixth similarity.

The text recognition method provided by the embodiments of the present disclosure can recognize both the parent category and the sub-category of the text to be recognized, where a sub-category is either a standard sub-category or a non-standard sub-category. The preset thresholds that appear multiple times in the present disclosure may all be the same threshold, for example 0.7, or the preset thresholds in different embodiments may differ. When the sixth similarity is greater than the preset threshold, the text to be recognized is identified as the category corresponding to that sixth similarity, which may be any parent category in the category set or any sub-category under any parent category.
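Threshold-based recognition with the sixth similarity could be sketched as below, assuming the category set has been flattened into a mapping from a label to its representation vector. The label strings, the function name `recognize`, and the rule of returning the highest-scoring label when several exceed the threshold are assumptions for this example.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recognize(
    text_rep: Vector,
    representations: Dict[str, Vector],  # parent and sub-category representations
    threshold: float = 0.7,
) -> Optional[str]:
    """Return the best label whose sixth similarity exceeds the threshold."""
    best_label, best_score = None, threshold
    for label, rep in representations.items():
        score = cosine(text_rep, rep)  # sixth similarity
        if score > best_score:
            best_label, best_score = label, score
    return best_label  # None when nothing exceeds the threshold
```

A call such as `recognize([0.9, 0.1], {"refund": [1.0, 0.0], "shipping": [0.0, 1.0]})` would return the hypothetical label `"refund"`.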

In step S206, when the text to be recognized is detected, determining a category corresponding to the text to be recognized from the category set by using a nearest neighbor algorithm, including: inputting the text to be recognized into a text encoder to obtain a second text representation corresponding to the text to be recognized; and determining the category corresponding to the text to be recognized from the category set according to the second text representation by using a nearest neighbor algorithm.

For example, using a nearest neighbor algorithm, the 50 sub-categories nearest to the second text representation are selected from each parent category in the category set; the arithmetic mean of the 50 selected sub-categories is then calculated for each parent category; and the parent category whose arithmetic mean is nearest to the second text representation is taken as the category corresponding to the text to be recognized.
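The example above can be sketched as follows, assuming each parent category is stored as a list of sub-category representation vectors and that "nearest" means highest cosine similarity. The function name, the mapping layout, and the handling of parents with fewer than `k` sub-categories (all of them are used) are assumptions.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors: List[Vector]) -> Vector:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify_by_nearest_subcategories(
    text_rep: Vector,
    category_set: Dict[str, List[Vector]],  # parent label -> sub-category reps
    k: int = 50,
) -> Optional[str]:
    best_parent, best_score = None, -math.inf
    for parent, sub_reps in category_set.items():
        if not sub_reps:
            continue  # nothing to compare against
        # Keep the k sub-categories nearest to the text representation.
        nearest = sorted(
            sub_reps, key=lambda rep: cosine(text_rep, rep), reverse=True
        )[:k]
        # Score the parent by the mean of its selected sub-categories.
        score = cosine(text_rep, mean_vector(nearest))
        if score > best_score:
            best_parent, best_score = parent, score
    return best_parent
```

Averaging only the nearest sub-categories keeps distant sub-categories of a broad parent from pulling its mean away from the query.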

In an optional embodiment, an early warning can be issued for similar questions in the sub-categories, which helps improve the quality of the data already labeled in the category set. Specifically, when the similarity between a similar question in a standard or non-standard sub-category and the other similar questions is smaller than a certain threshold, the labeling processing or text representation is performed again.
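One possible reading of this early warning is sketched below: a member of a sub-category is flagged when its cosine similarity to the mean of that sub-category falls below a threshold. Comparing against the sub-category mean, rather than pairwise against every other member, is an assumption of this sketch, as are the function names and the threshold value.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors: List[Vector]) -> Vector:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def flag_low_similarity_members(
    members: List[Vector], threshold: float = 0.7
) -> List[int]:
    """Indices of members that look out of place in their sub-category."""
    if not members:
        return []
    centre = mean_vector(members)
    return [
        index
        for index, member in enumerate(members)
        if cosine(member, centre) < threshold  # candidate for re-labeling
    ]
```

The flagged indices could then be handed back for re-labeling or re-encoding.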

All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.

The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.

Fig. 3 is a schematic diagram of an apparatus for recognizing a text category according to an embodiment of the present disclosure. As shown in fig. 3, the text category recognition device includes:

the obtaining module 301 is configured to obtain a first text to be annotated, where the first text to be annotated includes a plurality of categories of texts to be annotated, and each category of texts to be annotated includes a standard question and a plurality of similar questions;

the first determining module 302 is configured to determine a standard subcategory corresponding to the text to be labeled of each category according to the standard questions and the first similarity of each similar question in the text to be labeled of each category;

the second determining module 303 is configured to determine, in the text to be annotated in each category, a non-standard sub-category corresponding to the text to be annotated in each category according to a second similarity between any one similar question and each of the other similar questions;

a third determining module 304, configured to determine a category set according to the standard sub-category and the multiple non-standard sub-categories corresponding to the text to be annotated of each category;

an updating module 305, configured to, when detecting a second text to be labeled, update the category set according to the second text to be labeled;

the recognition module 306 is configured to determine, when a text to be recognized is detected, a category corresponding to the text to be recognized from the category set by using a nearest neighbor algorithm.

Optionally, the first determining module 302 is further configured to, in the text to be annotated of each category: calculate a first similarity between the standard question and each similar question; and when the first similarity is greater than a preset threshold, add the similar question corresponding to that first similarity to the standard sub-category, and delete from the text to be annotated the similar questions that have already been added to the standard sub-category.
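The behavior of the first determining module can be sketched as below. This is only an illustration under stated assumptions: the questions are assumed to be already encoded as vectors, cosine similarity stands in for the unspecified first similarity, and the function name `build_standard_subcategory` and the threshold value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_standard_subcategory(standard, similar, threshold=0.9):
    """Split similar questions into standard sub-category members and the remainder.

    Questions whose similarity to the standard question exceeds the threshold join
    the standard sub-category; the others stay in the pool for later grouping.
    """
    members, remaining = [], []
    for question in similar:
        if cosine(standard, question) > threshold:
            members.append(question)
        else:
            remaining.append(question)
    return members, remaining

standard_question = [1.0, 0.0]
similar_questions = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
members, remaining = build_standard_subcategory(standard_question, similar_questions)
```

Returning the remainder separately mirrors the deletion step: questions already assigned to the standard sub-category are no longer candidates for non-standard sub-categories.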

Optionally, the second determining module 303 is further configured to, in the text to be annotated of each category: calculate a second similarity between any one of the similar questions and each of the other similar questions; and when the second similarity is greater than a preset threshold, add the similar question corresponding to that second similarity to the non-standard sub-category corresponding to the any one similar question, and delete from the text to be annotated the similar questions that have already been added to the non-standard sub-category.
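A greedy grouping consistent with this description might look like the following sketch. It assumes vector-encoded questions and cosine similarity, and it assumes that the "any one similar question" is simply the first question left in the pool; the name `build_nonstandard_subcategories` and the threshold value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_nonstandard_subcategories(remaining, threshold=0.95):
    """Group the remaining similar questions into non-standard sub-categories."""
    pool = list(remaining)
    groups = []
    while pool:
        # Take one remaining question as the seed of a new sub-category.
        seed = pool.pop(0)
        group, rest = [seed], []
        for question in pool:
            if cosine(seed, question) > threshold:
                group.append(question)   # joins the seed's sub-category
            else:
                rest.append(question)    # stays in the pool for a later seed
        pool = rest
        groups.append(group)
    return groups

remaining_questions = [[0.0, 1.0], [0.1, 0.9], [1.0, 1.0], [0.9, 1.1]]
groups = build_nonstandard_subcategories(remaining_questions)
```

In this toy run the four questions fall into two non-standard sub-categories of two questions each.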

Optionally, the third determining module 304 is further configured to, in the standard sub-category and the plurality of non-standard sub-categories corresponding to the text to be annotated of each category: calculate the arithmetic mean of all similar questions in each non-standard sub-category to obtain the non-standard sub-category representation corresponding to each non-standard sub-category; calculate the arithmetic mean of the standard question and all similar questions in the standard sub-category to obtain the standard sub-category representation corresponding to the standard sub-category; calculate the arithmetic mean of all non-standard sub-category representations and the standard sub-category representation to obtain the parent category representation corresponding to the text to be annotated of each category; and determine the category set according to the plurality of non-standard sub-category representations, the standard sub-category representation and the parent category representation corresponding to the text to be annotated of each category.
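The three averaging steps can be illustrated with a short sketch. It assumes that each question is already a vector and that the arithmetic mean is taken element-wise; the function name `build_category_entry` and the dictionary layout are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise arithmetic mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def build_category_entry(standard, standard_members, nonstandard_groups):
    """Compute the representations for one category of text to be annotated."""
    # Standard sub-category: mean of the standard question and its similar questions.
    standard_rep = mean_vector([standard] + standard_members)
    # Each non-standard sub-category: mean of its similar questions.
    nonstandard_reps = [mean_vector(group) for group in nonstandard_groups]
    # Parent category: mean of all sub-category representations.
    parent_rep = mean_vector([standard_rep] + nonstandard_reps)
    return {"parent": parent_rep, "standard": standard_rep, "nonstandard": nonstandard_reps}

entry = build_category_entry(
    standard=[1.0, 0.0],
    standard_members=[[0.0, 1.0]],
    nonstandard_groups=[[[2.0, 2.0], [4.0, 4.0]]],
)
```

Here the standard sub-category representation is [0.5, 0.5], the single non-standard representation is [3.0, 3.0], and the parent category representation is their mean, [1.75, 1.75].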

Optionally, the updating module 305 is further configured to: input the second text to be labeled into a text encoder to obtain a first text representation corresponding to the second text to be labeled; calculate a third similarity between the first text representation and each parent category representation in the category set; when the third similarity is greater than a preset threshold, calculate a fourth similarity between the first text representation and the standard sub-category representation corresponding to that parent category representation, and when the fourth similarity is greater than a preset threshold, add the second text to be labeled to the standard sub-category and update the standard sub-category representation; when the fourth similarity is smaller than the preset threshold, calculate a fifth similarity between the first text representation and each non-standard sub-category representation corresponding to that parent category representation, and when the fifth similarity is greater than a preset threshold, add the second text to be labeled to the non-standard sub-category corresponding to that fifth similarity and update the corresponding non-standard sub-category representation; and when the third similarity is smaller than the preset threshold, add the first text representation to the category set as a new parent category representation.
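The cascade of third, fourth and fifth similarities can be sketched as follows. This is an illustration only: the text encoder is left out and the input is assumed to be an already encoded vector, cosine similarity stands in for all three similarities, and a single threshold is used. Two branches are assumptions that the description does not cover and are marked as such in the comments. All names (`update_category_set`, `make_sub`, `add_to_sub`) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    """Element-wise arithmetic mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def make_sub(members):
    """A sub-category: its member vectors plus their mean as the representation."""
    return {"members": list(members), "rep": mean_vector(members)}

def add_to_sub(sub, vec):
    """Add a vector to a sub-category and refresh its representation."""
    sub["members"].append(vec)
    sub["rep"] = mean_vector(sub["members"])

def update_category_set(category_set, vec, threshold=0.8):
    """Place one newly encoded text into category_set (a list of entries), in place."""
    for entry in category_set:
        if cosine(vec, entry["parent"]) <= threshold:      # third similarity
            continue
        if cosine(vec, entry["standard"]["rep"]) > threshold:  # fourth similarity
            add_to_sub(entry["standard"], vec)
            return "standard"
        for sub in entry["nonstandard"]:
            if cosine(vec, sub["rep"]) > threshold:        # fifth similarity
                add_to_sub(sub, vec)
                return "nonstandard"
        # Assumption: the description is silent on this case; open a new sub-category.
        entry["nonstandard"].append(make_sub([vec]))
        return "new_nonstandard"
    # No parent category is similar enough: the text representation becomes a new parent.
    # Assumption: the new text also seeds that parent's standard sub-category.
    category_set.append({"parent": list(vec), "standard": make_sub([vec]), "nonstandard": []})
    return "new_parent"

category_set = [{
    "parent": [0.9, 0.3],
    "standard": make_sub([[1.0, 0.0]]),
    "nonstandard": [make_sub([[0.8, 0.6]])],
}]
first = update_category_set(category_set, [0.95, 0.05])
second = update_category_set(category_set, [0.0, 1.0])
third = update_category_set(category_set, [0.7, 0.7])
```

In this toy run the first text joins the standard sub-category, the second opens a new parent category, and the third joins the existing non-standard sub-category. The sketch does not recompute the parent category representation after an update, because the description only mentions updating the sub-category representation.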

Optionally, the recognition module 306 is further configured to, when receiving a text recognition instruction, obtain the text to be recognized and the category set; inputting the text to be recognized into a text encoder to obtain a second text representation corresponding to the text to be recognized; calculating a sixth similarity of the second textual representation to any one of the following category representations: each parent category representation in the category set, the standard sub-category representation corresponding to each parent category representation, and any one non-standard sub-category representation corresponding to each parent category representation; and when the sixth similarity is greater than the preset threshold, identifying the text to be recognized as the category corresponding to the sixth similarity greater than the preset threshold.
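The threshold-based recognition can be sketched as below. It is a minimal illustration under stated assumptions: the text encoder is omitted and the query is an already encoded vector, cosine similarity stands in for the sixth similarity, and when several categories exceed the threshold the sketch picks the one with the highest similarity, which is a choice the description does not specify. The name `recognize` and the dictionary layout are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recognize(query, category_set, threshold=0.8):
    """Return the category whose best-matching representation exceeds the threshold, else None."""
    best_name, best_score = None, threshold
    for name, entry in category_set.items():
        # Parent, standard and non-standard representations are all candidates.
        reps = [entry["parent"], entry["standard"], *entry["nonstandard"]]
        score = max(cosine(query, rep) for rep in reps)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

category_set = {
    "refund": {"parent": [0.9, 0.3], "standard": [1.0, 0.0], "nonstandard": [[0.8, 0.6]]},
    "shipping": {"parent": [0.1, 0.9], "standard": [0.0, 1.0], "nonstandard": []},
}
matched = recognize([0.0, 1.0], category_set)
unmatched = recognize([-1.0, 0.0], category_set)
```

A return value of `None` indicates that no representation in the category set was similar enough to the text to be recognized.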

Optionally, the recognition module 306 is further configured to input the text to be recognized into a text encoder, so as to obtain a second text representation corresponding to the text to be recognized; and determining the category corresponding to the text to be recognized from the category set according to the second text representation by using a nearest neighbor algorithm.

Optionally, the updating module 305 is further configured to issue an early warning for similar questions in the sub-categories, which helps improve the quality of the data already labeled in the category set. Specifically, when the similarity between a similar question in a standard or non-standard sub-category and the other similar questions is smaller than a certain threshold, that similar question is labeled again or its text representation is computed again.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure.

Fig. 4 is a schematic diagram of an electronic device 4 provided by an embodiment of the present disclosure. As shown in Fig. 4, the electronic device 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps in the various method embodiments described above are implemented when the processor 401 executes the computer program 403. Alternatively, the processor 401 implements the functions of the respective modules/units in the above-described apparatus embodiments when executing the computer program 403.

Illustratively, the computer program 403 may be partitioned into one or more modules/units, which are stored in the memory 402 and executed by the processor 401 to accomplish the present disclosure. One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 403 in the electronic device 4.

The electronic device 4 may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another electronic device. The electronic device 4 may include, but is not limited to, a processor 401 and a memory 402. Those skilled in the art will appreciate that Fig. 4 is merely an example of the electronic device 4 and does not constitute a limitation of the electronic device 4, which may include more or fewer components than those shown, or combine certain components, or use different components; for example, the electronic device may also include input-output devices, network access devices, buses, etc.

The Processor 401 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or a memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device 4. Further, the memory 402 may also include both internal storage units of the electronic device 4 and external storage devices. The memory 402 is used for storing computer programs and other programs and data required by the electronic device. The memory 402 may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other ways. For example, the apparatus/electronic device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the methods in the above embodiments of the present disclosure may be implemented by a computer program instructing related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the steps of the above method embodiments may be implemented. The computer program may comprise computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer readable medium may be appropriately added to or reduced as required by legislation and patent practice in the jurisdiction; for example, in some jurisdictions, computer readable media may not include electrical carrier signals or telecommunications signals in accordance with legislation and patent practice.

The above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present disclosure, and are intended to be included within the scope of the present disclosure.

Claims

1. A method for recognizing text categories is characterized by comprising the following steps:

acquiring a first text to be annotated, wherein the first text to be annotated comprises a plurality of categories of texts to be annotated, and the text to be annotated of each category comprises a standard question and a plurality of similar questions;

in the text to be labeled of each category, determining a standard subcategory corresponding to the text to be labeled of each category according to the standard question and the first similarity of each similar question;

in the text to be labeled of each category, determining a nonstandard sub-category corresponding to the text to be labeled of each category according to a second similarity between any one of the similar questions and each of the other similar questions;

determining a category set according to the standard subcategory and the non-standard subcategories corresponding to the text to be labeled of each category;

when a second text to be labeled is detected, updating the category set according to the second text to be labeled;

and when the text to be recognized is detected, determining the category corresponding to the text to be recognized from the category set by using a nearest neighbor algorithm.

2. The method according to claim 1, wherein the determining, in the text to be labeled in each category, a standard sub-category corresponding to the text to be labeled in each category according to the standard question and a first similarity of each similar question comprises:

in the text to be labeled of each category:

calculating a first similarity between the standard question and each similar question;

and when the first similarity is greater than a preset threshold value, adding the similar questions corresponding to the first similarity greater than the preset threshold value into a standard subcategory, and deleting the similar questions which are already added into the standard subcategory in the text to be annotated.

3. The method according to claim 1, wherein the determining, in the text to be labeled in each category, a non-standard sub-category corresponding to the text to be labeled in each category according to a second similarity between any one of the similar questions and each of the other similar questions comprises:

in the text to be labeled of each category:

calculating a second similarity between any one of the similar questions and each of the other similar questions;

and when the second similarity is larger than a preset threshold value, adding the similar questions corresponding to the second similarity larger than the preset threshold value into the non-standard subcategory corresponding to any one of the similar questions, and deleting the similar questions which are already added into the non-standard subcategory in the text to be labeled.

4. The method according to claim 1, wherein the determining a category set according to the standard sub-category and the non-standard sub-categories corresponding to the text to be labeled of each category comprises:

in the standard sub-category and the plurality of non-standard sub-categories corresponding to the text to be labeled of each category: calculating the arithmetic mean value of all the similar questions in each non-standard sub-category to obtain the non-standard sub-category representation corresponding to each non-standard sub-category, calculating the arithmetic mean value of the standard question and all the similar questions in the standard sub-category to obtain the standard sub-category representation corresponding to the standard sub-category, and calculating the arithmetic mean value of all the non-standard sub-category representations and the standard sub-category representation to obtain the parent category representation corresponding to the text to be labeled of each category;

and determining the category set according to the plurality of non-standard sub-category representations, the standard sub-category representation and the parent category representation corresponding to the text to be labeled of each category.

5. The method according to claim 1, wherein the updating the category set according to the second text to be labeled when the second text to be labeled is detected comprises:

inputting the second text to be labeled into a text encoder to obtain a first text representation corresponding to the second text to be labeled;

calculating a third similarity between the first text representation and each parent category representation in the category set, and when the third similarity is greater than a preset threshold:

calculating a fourth similarity between the first text representation and the standard sub-category representation corresponding to each parent category representation in the category set, and when the fourth similarity is greater than a preset threshold, adding the second text to be labeled to the standard sub-category and updating the standard sub-category representation;

when the fourth similarity is smaller than a preset threshold, calculating a fifth similarity between the first text representation and each non-standard sub-category representation corresponding to each parent category representation in the category set, and when the fifth similarity is greater than a preset threshold, adding the second text to be labeled to the non-standard sub-category corresponding to the fifth similarity greater than the preset threshold, and updating the non-standard sub-category representation corresponding to the fifth similarity greater than the preset threshold;

and when the third similarity is smaller than a preset threshold, adding the first text representation to the category set as a new parent category representation.

6. The method according to claim 1, wherein when detecting a second text to be labeled, after updating the category set according to the second text to be labeled, the method further comprises:

when a text recognition instruction is received, acquiring the text to be recognized and the category set;

inputting the text to be recognized into a text encoder to obtain a second text representation corresponding to the text to be recognized;

calculating a sixth similarity of the second textual representation to any one of the following category representations:

each parent category representation in the category set, the standard sub-category representation corresponding to each parent category representation, and any one of the non-standard sub-category representations corresponding to each parent category representation;

and when the sixth similarity is greater than a preset threshold value, identifying the text to be recognized as a category corresponding to the sixth similarity greater than the preset threshold value.

7. The method according to claim 1, wherein when the text to be recognized is detected, determining the category corresponding to the text to be recognized from the category set by using a nearest neighbor algorithm comprises:

and determining the category corresponding to the text to be recognized from the category set according to the second text representation by using a nearest neighbor algorithm.

8. An apparatus for recognizing a text category, comprising:

an acquisition module configured to acquire a first text to be annotated, wherein the first text to be annotated comprises a plurality of categories of texts to be annotated, and each category of texts to be annotated comprises a standard question and a plurality of similar questions;

the first determining module is configured to determine a standard subcategory corresponding to the text to be labeled of each category according to the standard questions and the first similarity of each similar question in the text to be labeled of each category;

the second determining module is configured to determine a nonstandard sub-category corresponding to the text to be labeled of each category according to a second similarity between any one of the similar questions and each of the other similar questions in the text to be labeled of each category;

the third determination module is configured to determine a category set according to the standard sub-category and the plurality of non-standard sub-categories corresponding to the text to be labeled of each category;

the updating module is configured to update the category set according to a second text to be labeled when the second text to be labeled is detected;

and the recognition module is configured to determine a category corresponding to the text to be recognized from the category set by using a nearest neighbor algorithm when the text to be recognized is detected.

9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.