CN115080733A - Long text recognition method, device, server and computer readable storage medium - Google Patents

Long text recognition method, device, server and computer readable storage medium

Info

Publication number
CN115080733A
CN115080733A
Authority
CN
China
Prior art keywords
text
long text
classification model
long
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210095207.6A
Other languages
Chinese (zh)
Inventor
聂镭
齐凯杰
王竹欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Original Assignee
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority to CN202210095207.6A
Publication of CN115080733A
Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application is applicable to the technical field of text processing, and provides a long text recognition method, apparatus, server and computer-readable storage medium, wherein the method comprises the following steps: acquiring a text to be recognized; determining a long text in the text to be recognized; and identifying the target type of the long text according to a voting mechanism. The method and apparatus thereby solve the technical problem in the prior art that text information may be lost, causing the text category to be judged incorrectly, and achieve the effect of improving the recognition accuracy of long texts.

Description

Long text recognition method, device, server and computer readable storage medium
Technical Field
The present application belongs to the technical field of text processing, and in particular, to a long text recognition method, apparatus, server, and computer-readable storage medium.
Background
Text classification is one of the important tasks of Natural Language Processing (NLP); the technology can extract important labels from text, realize public opinion monitoring of text, and grasp public opinion hotspots and tendencies. Many methods exist for text classification, among which the Bidirectional Encoder Representations from Transformers (BERT) pre-trained characterization model shows better results on a number of NLP tasks. However, for the sake of computational and operational efficiency, the pre-trained model limits its input length: the maximum length is 512 tokens, which must also include the flag bits [CLS] and [SEP], so the effective maximum text length is 510. When the text length exceeds 510, the text must be truncated, so text information may be lost, resulting in erroneous judgment of the text category.
Disclosure of Invention
The embodiments of the present application provide a long text recognition method, apparatus, server and computer-readable storage medium, which can solve the technical problem in the prior art that the classification and recognition of long texts may lose text information and thereby cause the text category to be judged incorrectly.
In a first aspect, an embodiment of the present application provides a long text recognition method, including:
acquiring a text to be recognized;
determining a long text in the text to be recognized;
the target type of the long text is identified according to a voting mechanism.
In a possible implementation manner of the first aspect, identifying the target type of the long text according to a voting mechanism includes:
splitting the long text to obtain short sentences;
inputting the short sentence into a classification model trained in advance to obtain a prediction type;
and according to a preset voting mechanism, judging the target type of the long text based on the prediction type corresponding to the short sentence.
In a possible implementation manner of the first aspect, before the short sentence is input into the pre-trained classification model to obtain the prediction type, the method further includes:
and training the classification model.
In a possible implementation manner of the first aspect, training the classification model includes:
acquiring sample data;
constructing a classification model according to the sample data;
and training the classification model according to the sample data.
In a second aspect, an embodiment of the present application provides a long text recognition apparatus, including:
the acquisition module is used for acquiring a text to be recognized;
the determining module is used for determining a long text in the text to be recognized;
and the identification module is used for identifying the target type of the long text according to the voting mechanism.
In one possible implementation manner of the second aspect, the identification module includes:
the splitting unit is used for splitting the long text to obtain a short sentence;
the prediction unit is used for inputting the short sentence into a classification model trained in advance to obtain a prediction type;
and the judging unit is used for judging the target type of the long text based on the prediction type corresponding to the short sentence according to a preset voting mechanism.
In a possible implementation manner of the second aspect, the apparatus further includes:
and the training module is used for training the classification model.
In one possible implementation manner of the second aspect, the training module includes:
an acquisition unit configured to acquire sample data;
the construction unit is used for constructing a classification model according to the sample data;
and the training unit is used for training the classification model according to the sample data.
In a third aspect, an embodiment of the present application provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method according to the first aspect.
Compared with the prior art, the embodiment of the application has the advantages that:
in the embodiment of the application, the text to be recognized is obtained, the long text in the text to be recognized is determined, the target type of the long text is recognized according to the voting mechanism, the technical problem that text information may be lost to cause wrong judgment on the text type in the prior art is solved, and the effect of improving the recognition accuracy of the long text is achieved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a long text recognition method according to an embodiment of the present application;
FIG. 2 is a block diagram of a long text recognition apparatus according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a server provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a model input for text classification provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The technical solutions provided in the embodiments of the present application will be described below with specific embodiments.
Referring to fig. 1, which is a schematic flowchart of a long text recognition method provided in an embodiment of the present application, by way of example and not limitation, the method may be applied to a server and may include the following steps:
and step S101, acquiring a text to be recognized.
It will be appreciated that the source of the text to be recognized may be an open source data set.
And step S102, determining a long text in the text to be recognized.
It is understood that, in the embodiment of the present application, a text whose length is greater than 510 is determined to be a long text. When the text length is less than 510, type prediction is performed on the text directly with the classification model to obtain the corresponding type of the text.
And step S103, identifying the target type of the long text according to a voting mechanism.
In a specific application, identifying the target type of the long text according to the voting mechanism includes the following steps:
step S201, splitting the long text to obtain a short sentence.
It will be appreciated that when the text length is above 510, the text is split on sentence-ending punctuation (the Chinese full stop "。") to obtain a plurality of short sentences. The short sentences are then merged from front to back so that each merged result stays below 510: if the combined length is less than 510, the merge succeeds and the next short sentence is merged in turn; if the combined length is greater than 510, the merge fails, the previous merge result is kept, and a new merge is started from the current short sentence, and so on until all short sentences have been merged. For example, suppose clause 1 has length 200, clause 2 has length 180, clause 3 has length 300, clause 4 has length 120, clause 5 has length 400 and clause 6 has length 600. Merging clause 1 and clause 2 gives a length of 380, which is less than 510, so merging continues; adding clause 3 gives 680, which is greater than 510, so that merge fails, only the merged result of clause 1 and clause 2 is kept, and clause 3 becomes the start of a new merge, with the remaining clauses processed by the same logic. The final split result is [clause 1 + clause 2, clause 3 + clause 4, clause 5, clause 6], as in the sketch below.
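The following is a minimal sketch of this greedy splitting-and-merging logic, assuming Python, plain character counts as the length measure, and Chinese sentence-ending punctuation as the split points; none of these details are fixed by the embodiment itself.

    import re

    def split_long_text(text, max_len=510):
        # Split on sentence-ending punctuation, keeping the punctuation
        # attached to the clause that precedes it.
        clauses = [c for c in re.split(r'(?<=[。！？])', text) if c]
        chunks, current = [], ''
        for clause in clauses:
            if current and len(current) + len(clause) < max_len:
                current += clause            # merge succeeded; try the next clause
            else:
                if current:
                    chunks.append(current)   # keep the previous merge result
                current = clause             # start a new merge from this clause
        if current:
            chunks.append(current)
        # Note: a single clause longer than max_len (clause 6 in the example
        # above) still becomes a chunk of its own and exceeds the limit.
        return chunks

With the clause lengths from the example above, this returns the four chunks [clause 1 + clause 2, clause 3 + clause 4, clause 5, clause 6].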
Step S202, short sentences are input into a classification model trained in advance to obtain a prediction type.
It can be understood that the long text obtained in the previous step has been split into a plurality of short sentences; classification model prediction is performed on each short sentence separately to obtain the classification result of each short sentence.
In a possible implementation manner, before the short sentence is input to a classification model trained in advance and a prediction type is obtained, the method further includes:
and training a classification model.
In a specific application, the training of the classification model comprises the following steps:
in step S301, sample data is acquired.
The step of obtaining the sample data comprises data preprocessing, data labeling and data segmentation.
Specifically, data preprocessing refers to performing various checks on the data to correct wrong characters or redundant characters in the text, so that the text has higher usability; for example, removing redundant spaces, carriage returns, and the like.
Data labeling refers to manually labeling the text obtained from data preprocessing to support subsequent model training. In the labeling process, the accuracy of the labels needs to be ensured, so cross-labeling is adopted to control the quality of the labels.
After the data labeling is finished, the data sets required for model training are determined, comprising a training set, a validation set and a test set, and the data is segmented in the ratio 8:1:1, i.e., 80% forms the training set while the validation set and the test set each account for 10%, as in the sketch below.
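A minimal sketch of this 8:1:1 segmentation in Python, assuming the labeled data is held as a list of (text, label) pairs; the shuffling and the fixed seed are implementation choices, not requirements of the embodiment.

    import random

    def split_dataset(samples, seed=42):
        """Shuffle and split labeled samples into train/validation/test at 8:1:1."""
        rng = random.Random(seed)
        samples = list(samples)
        rng.shuffle(samples)
        n_train = int(len(samples) * 0.8)
        n_val = int(len(samples) * 0.1)
        return (samples[:n_train],
                samples[n_train:n_train + n_val],
                samples[n_train + n_val:])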
Step S302, a classification model is constructed according to the sample data.
Constructing the classification model according to the sample data comprises determining the version of the classification model, determining the input and output of the classification model, and determining the loss function and evaluation function of the classification model.
Specifically, determining a version of the classification model includes:
the classification model supports multiple languages, mainly including the top 100 languages (except Thai) with the largest expectation of Wikipedia. The multilingual model also contains Chinese, but if the model training is limited to Chinese, then directly employing the Chinese model may yield better results. The selection of the pre-training model version is mainly selected according to the language type of the task and the purpose of the task, and the comparison of results of different versions in practical application can be performed, so that the corresponding version is selected.
Determining inputs and outputs of a classification model, comprising:
for the task of text classification, the BERT model inserts a [ CLS ] symbol in front of a text, and uses an output vector corresponding to the symbol as a semantic representation of the whole text for text classification, as shown in fig. 4, a model input diagram for text classification provided by the implementation of the present application is shown, and according to fig. 4, the input of the model is to use the sum of a word vector, a segment vector and a position vector as a model input. The word vector is a one-dimensional vector converted from each word in the text by inquiring the word vector table; the segment vector is the prediction task of the next sentence in the BERT, so that two sentences can be spliced, the upper sentence and the lower sentence are provided, the upper sentence is provided with an upper sentence segment vector, the lower sentence is provided with a lower sentence segment vector, namely A and B in the graph, in addition, the tail of the sentence is provided with an added SEP tail symbol, and the head of the spliced two sentences is provided with a CLS symbol; the position vector is a vector that represents a position and is artificially added because the transform model cannot remember a time sequence. The sum of the three components together constitutes the input of the model. The output of the model is different according to different tasks and the required output result is different, and the classification model is mainly used for text classification in the embodiment of the application, so that the output of the classification model is the text type.
Determining a loss function and an evaluation function of the classification model, comprising:
the loss function is also called an objective function, that is, a performance function in the training process of the classification model, and there are many loss functions, and based on the role of the classification model in the embodiments of the present application, the embodiments of the present application may select the following formula: also known as multi-class log loss. And the evaluation function verifies the accuracy of the classification model in the training process.
Step S303, training a classification model according to the sample data.
In a specific application, after the input and output of the classification model and the structure of the model are determined, the model is trained on the labeled data. Since data preprocessing guarantees that no text exceeds 510 in length, texts whose length is less than 510 need to be padded so that every input has the same length.
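A minimal fine-tuning step with the Hugging Face transformers library is sketched below; the bert-base-chinese checkpoint, the label count, the learning rate and the placeholder data are assumptions for demonstration, and the maximum length of 512 corresponds to 510 text characters plus [CLS] and [SEP].

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
    model = BertForSequenceClassification.from_pretrained(
        'bert-base-chinese', num_labels=5)            # hypothetical label count
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    texts = ['第一条训练文本。', '第二条训练文本。']      # placeholder labeled data
    labels = torch.tensor([0, 3])
    # padding='max_length' pads shorter texts so every input has the same
    # length, as described above; truncation guards against over-long inputs.
    enc = tokenizer(texts, padding='max_length', truncation=True,
                    max_length=512, return_tensors='pt')
    out = model(**enc, labels=labels)    # out.loss is the multi-class log loss
    out.loss.backward()
    optimizer.step()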
And 203, judging the target type of the long text based on the prediction type corresponding to the short sentence according to a preset voting mechanism.
It can be understood that after the classification result of each short sentence is obtained, each result is cast as a vote for its corresponding category, that is, the count of that category is increased by 1; the category counts are then accumulated, and the category name with the largest accumulated count is determined as the text type. When the accumulated category counts produced by the voting mechanism contain several equal maxima, the category predicted for the initial short sentence is used as the judgment of the target type of the long text, as in the sketch below.
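A minimal sketch of this voting mechanism in Python, assuming the per-clause predictions arrive in the clause order of the long text:

    from collections import Counter

    def vote_target_type(clause_predictions):
        # Majority vote over the predicted types of the clauses; ties fall
        # back to the prediction for the initial clause, as described above.
        counts = Counter(clause_predictions)
        top = max(counts.values())
        winners = [label for label, c in counts.items() if c == top]
        return winners[0] if len(winners) == 1 else clause_predictions[0]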
Preferably, since text has a certain timeliness, the model needs to be continuously updated and iterated to support the classification of subsequent texts. Therefore, new training data is added to the existing model for further training. As the text information changes, the above steps are repeated so that the results keep meeting practical application requirements.
In the embodiment of the application, the text to be recognized is obtained, the long text in the text to be recognized is determined, the target type of the long text is recognized according to the voting mechanism, the technical problem that text information may be lost to cause wrong judgment on the text type in the prior art is solved, and the effect of improving the recognition accuracy of the long text is achieved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 2 shows a block diagram of a long text recognition apparatus provided in the embodiment of the present application, which corresponds to the long text recognition method described in the foregoing embodiment, and only shows portions related to the embodiment of the present application for convenience of description.
Referring to fig. 2, the apparatus includes:
the acquiring module 21 is used for acquiring a text to be recognized;
the determining module 22 is configured to determine a long text in the text to be recognized;
and the identification module 23 is used for identifying the target type of the long text according to a voting mechanism.
In one possible implementation, the identification module includes:
the splitting unit is used for splitting the long text to obtain a short sentence;
the prediction unit is used for inputting the short sentence into a classification model trained in advance to obtain a prediction type;
and the judging unit is used for judging the target type of the long text based on the prediction type corresponding to the short sentence according to a preset voting mechanism.
In one possible implementation, the apparatus further includes:
and the training module is used for training the classification model.
In one possible implementation, the training module includes:
an acquisition unit configured to acquire sample data;
the construction unit is used for constructing a classification model according to the sample data;
and the training unit is used for training the classification model according to the sample data.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
Fig. 3 is a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 3, the server 3 of this embodiment includes: at least one processor 30, a memory 31 and a computer program 32 stored in the memory 31 and executable on the at least one processor 30, the processor 30 implementing the steps of any of the various method embodiments described above when executing the computer program 32.
The server 3 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The server may include, but is not limited to, the processor 30 and the memory 31. Those skilled in the art will appreciate that fig. 3 is merely an example of the server 3 and does not constitute a limitation of the server 3, which may include more or fewer components than shown, or a combination of some components, or different components, such as input and output devices, network access devices, etc.
The Processor 30 may be a Central Processing Unit (CPU); the Processor 30 may also be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 31 may in some embodiments be an internal storage unit of the server 3, such as a hard disk or a memory of the server 3. The memory 31 may also be an external storage device of the server 3 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the server 3. Further, the memory 31 may also include both an internal storage unit and an external storage device of the server 3. The memory 31 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 31 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer readable medium may include at least: any entity or device capable of carrying the computer program code to the server, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random-Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In certain jurisdictions, computer-readable media may not be electrical carrier signals or telecommunications signals in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A long text recognition method, comprising:
acquiring a text to be recognized;
determining a long text in the text to be recognized;
the target type of the long text is identified according to a voting mechanism.
2. The long text recognition method of claim 1, wherein identifying the target type of the long text according to the voting mechanism comprises:
splitting the long text to obtain short sentences;
inputting the short sentence into a classification model trained in advance to obtain a prediction type;
and according to a preset voting mechanism, judging the target type of the long text based on the prediction type corresponding to the short sentence.
3. The long text recognition method of claim 2, wherein before inputting the short sentence into the pre-trained classification model to obtain the prediction type, the method further comprises:
and training the classification model.
4. The long text recognition method of claim 3, wherein training the classification model comprises:
acquiring sample data;
constructing a classification model according to the sample data;
and training the classification model according to the sample data.
5. A long text recognition device, comprising:
the acquisition module is used for acquiring a text to be recognized;
the determining module is used for determining a long text in the text to be recognized;
and the identification module is used for identifying the target type of the long text according to the voting mechanism.
6. The long text recognition apparatus of claim 5, wherein the recognition module comprises:
the splitting unit is used for splitting the long text to obtain a short sentence;
the prediction unit is used for inputting the short sentence into a classification model trained in advance to obtain a prediction type;
and the judging unit is used for judging the target type of the long text based on the prediction type corresponding to the short sentence according to a preset voting mechanism.
7. The long text recognition apparatus of claim 6, wherein the apparatus further comprises:
and the training module is used for training the classification model.
8. The long text recognition apparatus of claim 7, wherein the training module comprises:
an acquisition unit configured to acquire sample data;
the construction unit is used for constructing a classification model according to the sample data;
and the training unit is used for training the classification model according to the sample data.
9. A server comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 4.
CN202210095207.6A 2022-08-01 2022-08-01 Long text recognition method, device, server and computer readable storage medium Pending CN115080733A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210095207.6A CN115080733A (en) 2022-08-01 2022-08-01 Long text recognition method, device, server and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210095207.6A CN115080733A (en) 2022-08-01 2022-08-01 Long text recognition method, device, server and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115080733A 2022-09-20

Family

ID=83246338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210095207.6A Pending CN115080733A (en) 2022-08-01 2022-08-01 Long text recognition method, device, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115080733A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination