CN111143551A - Text preprocessing method, classification method, device and equipment - Google Patents


Info

Publication number
CN111143551A
CN111143551A
Authority
CN
China
Prior art keywords
text, length, characters, processed, specified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911228510.3A
Other languages
Chinese (zh)
Inventor
刘凡
张格皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911228510.3A
Publication of CN111143551A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification


Abstract

Embodiments of this specification provide a text preprocessing method, a classification method, a device, and equipment. After a text to be processed is obtained, it is judged whether the length of the text exceeds a specified length. If so, a plurality of characters are intercepted from the text using at least one specified character of the text as a position reference, and the intercepted characters are spliced into a new text whose length equals the specified length; the new text is then used to train a preset language model. By truncating and splicing a long text in this way, key characters representing its core content can be extracted and assembled into a new text whose length meets the language model's requirement. Training the language model on such new texts lets the model handle long texts and learn their core content, which improves model performance and gives the trained language model higher accuracy when classifying texts.

Description

Text preprocessing method, classification method, device and equipment
Technical Field
The present specification relates to the field of artificial intelligence technologies, and in particular, to a text preprocessing method, a text classification method, and corresponding apparatuses and devices.
Background
Text classification is used in many fields. For example, owing to the openness and rapid spread of the internet, even a small piece of negative public opinion can seriously damage an enterprise's image, so monitoring online public opinion and producing analysis reports is essential. Many public opinion analysis platforms collect comments, articles, news, and the like from the network and then classify them to distinguish negative comments from positive ones. Because many texts on the network are long and contain a large number of words, current machine learning algorithms, limited by machine memory and hardware configuration, cannot train a classification model on the entire content of a long text. A long text therefore usually needs to be preprocessed before it is input into a language model for training or classification. In the related art, training a language model on long texts either incurs high manual maintenance costs or yields a model whose performance is not ideal, with low accuracy when classifying texts. There is thus a need for improved long-text preprocessing and classification methods that suit the better-performing language models and raise classification accuracy.
Disclosure of Invention
Based on the above, the present specification provides a text preprocessing method, a classification method, a device and an apparatus.
According to a first aspect of embodiments herein, there is provided a text preprocessing method, the method including:
acquiring a text to be processed;
judging whether the length of the text to be processed is larger than a specified length;
if so, intercepting a plurality of characters from the text to be processed by taking at least one specified character of the text to be processed as a position reference;
splicing the intercepted characters into a new text, and training a preset language model through the new text, wherein the length of the new text is equal to the specified length, and the specified length is determined based on the text length supported by the language model.
According to a second aspect of embodiments of the present specification, there is provided a text classification method, the method comprising:
acquiring a text to be classified;
judging whether the length of the text to be classified is larger than a specified length;
if so, intercepting a plurality of characters from the text to be classified by taking at least one specified character of the text to be classified as a position reference;
splicing the intercepted characters into a new text, wherein the length of the new text is equal to the specified length, and the specified length is determined based on the text length supported by a preset language model;
classifying the new text by the language model.
According to a third aspect of embodiments herein, there is provided a text preprocessing apparatus, the apparatus including:
the acquisition module is used for acquiring a text to be processed;
the judging module is used for judging whether the length of the text to be processed is greater than the specified length;
the intercepting module is used for intercepting, if the length of the text to be processed is greater than the specified length, a plurality of characters from the text to be processed by taking at least one specified character of the text to be processed as a position reference;
and the splicing module is used for splicing the intercepted characters into a new text so as to train a preset language model through the new text, wherein the length of the new text is equal to the specified length, and the specified length is determined based on the text length supported by the language model.
According to a fourth aspect of embodiments herein, there is provided a text classification apparatus, the apparatus comprising:
the acquisition module is used for acquiring texts to be classified;
the judging module is used for judging whether the length of the text to be classified is greater than the specified length;
the intercepting module is used for intercepting, if the length of the text to be classified is greater than the specified length, a plurality of characters from the text to be classified by taking at least one specified character of the text to be classified as a position reference;
the splicing module is used for splicing the intercepted characters into a new text, wherein the length of the new text is equal to the specified length, and the specified length is determined based on the text length supported by a preset language model;
and the classification module is used for classifying the new text through the language model.
According to a fifth aspect of the embodiments of the present specification, there is provided a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the text preprocessing method and/or the text classification method according to any one of the embodiments when executing the program.
By applying the solutions of the embodiments of this specification, after a text to be processed is obtained, it can be judged whether its length exceeds the specified length. If so, a plurality of characters are intercepted from the text using at least one specified character as a position reference, the intercepted characters are spliced into a new text whose length equals the specified length, and the new text is then used to train the preset language model. Truncating and splicing a long text in this way extracts the key characters that represent its core content and yields a new text that satisfies the language model's length requirement. Training the language model on such texts solves the problem that a long text's word count exceeds what the model supports, lets the model learn the key characters representing the text's core content, improves model performance, and gives the trained language model higher accuracy when classifying texts.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
Fig. 1 is a flowchart of a text preprocessing method according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a text truncation and concatenation according to an embodiment of the present specification.
Fig. 3 is a flowchart of a text classification method according to an embodiment of the present disclosure.
Fig. 4 is a text length distribution statistical table according to an embodiment of the present specification.
Fig. 5 is a block diagram of a logical structure of a text preprocessing apparatus according to an embodiment of the present disclosure.
Fig. 6 is a block diagram of a logical structure of a text classification apparatus according to an embodiment of the present specification.
FIG. 7 is a block diagram of a computer device for implementing the methods of the present description, in accordance with one embodiment of the present description.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present specification, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Text classification is applied in many fields. For example, owing to the openness and rapid spread of the internet, even a small piece of negative public opinion can seriously damage the image of an enterprise or a product, so monitoring online public opinion and obtaining analysis reports is essential. Many public opinion analysis platforms collect comments, articles, news, and other texts from the network and classify them to distinguish negative comments from positive ones. Many texts on the network are long and have large word counts; some microblog posts, for instance, exceed 3,000 characters. Current machine learning algorithms are limited by machine memory and hardware configuration, and many algorithm models cap the number of words in an input text, so they cannot learn the full content of a long text and may miss important information. For example, the BERT (Bidirectional Encoder Representations from Transformers) model is currently one of the best-performing language understanding models, but it limits the number of words in an input text and supports at most 512 characters, so its advantages cannot be exploited when training on long texts.
At present, for long-text classification, some techniques first segment the long text into words, remove high-frequency and low-frequency words, and manually assign the remaining words to categories. Another technique segments the long text into words, computes TF-IDF (term frequency-inverse document frequency) feature vectors, and inputs them into the XGBoost algorithm for classification training; this approach discards the sequential relationships between words, and the many unimportant features mixed into a long text act as noise that interferes with the core features, harming the accuracy of the final result. Based on this, an embodiment of this specification first provides a text preprocessing method that intercepts the relatively core content from a long text, splices it into a new text, and then trains the language model. This effectively avoids the problem that a long text cannot be trained because of the language model's word-count limit, and because the core content is intercepted according to the structural characteristics of texts and people's habits of logical expression, it can also improve the training effect.
Specifically, as shown in fig. 1, the method may include the following steps:
s102, acquiring a text to be processed;
s104, judging whether the length of the text to be processed is larger than a specified length;
s106, if the number of the characters is larger than the preset number, intercepting a plurality of characters from the text to be processed by taking at least one appointed character of the text to be processed as a position reference;
and S108, splicing the intercepted characters into a new text, and training a preset language model through the new text, wherein the length of the new text is equal to the specified length, and the specified length is determined based on the text length supported by the language model.
The text preprocessing method provided by the embodiment of the specification can be used for various devices capable of preprocessing texts, such as a mobile phone, a tablet computer, a notebook computer, a desktop computer or a cloud server.
At present, texts can be analyzed and classified by trained language models. Because machine learning algorithms are constrained by the memory and other hardware configurations of the devices they run on, the various language models used for text analysis and classification impose a word-count limit on input texts and cannot learn long texts directly. In practical application scenarios, however, the texts to be classified are often long texts with large word counts. To let a word-count-limited language model learn from long texts, in the embodiments of this specification a long text may first be truncated and spliced to obtain a new text that both meets the language model's requirement and contains the text's core idea.
The text to be processed in the embodiments of the present specification may be any text used for training a language model, such as comments, articles, and news from the network, and it may be in any language, for example Chinese or English.
When the text to be processed is obtained, whether its length is greater than a specified length may be judged. The specified length is determined based on the text length supported by the preset language model. Because the specified length is the target length of the new text generated by truncating and splicing the text to be processed, it can be smaller than or equal to the text length supported by the language model. For example, if the language model supports texts of 512 characters, the specified length may be 512 characters or less, such as 512 or 256 characters; a value that gives a better training effect can be chosen according to the actual usage scenario.
If the length of the text to be processed is greater than the specified length, the text can be truncated and spliced to generate a new text that meets the language model's requirement. Specifically, at least one specified character of the text to be processed is used as a position reference to intercept a plurality of characters from the text, and the intercepted characters are then spliced into a new text whose length equals the specified length. So that the trained language model achieves a good effect, the new text obtained by truncation and splicing should represent the core idea of the text to be processed, that is, its key content, allowing the language model to learn the text's key information.
The specified character is used to locate the core content of the text to be processed within the text; there may be one or more such characters, and they can be determined from the distribution characteristics of text structures and from people's habits of logical expression. For example, Chinese articles are commonly organized in "overview-details-overview", "overview-details", or "details-overview" structures, so an article's core ideas and main points tend to be concentrated at its beginning or end. A segment of characters can therefore be intercepted at the beginning of the text, at its end, or at both, and the intercepted segments can then be input into the language model to train it. Thus, in some embodiments, the specified character may be one or both of the first character and the last character of the text. Moreover, when expressing their own views in an article, people habitually use summarizing words such as "in general", "in summary", or "finally", so a segment of characters following such words is also likely to represent the article's core idea; in some embodiments these word characters can likewise serve as a position reference, and a segment of characters before or after them is intercepted from the text to be processed. There may be one or more specified characters, and the intercepted characters may be one or more segments of the text to be processed.
After the plurality of characters are intercepted using the specified characters, they can be spliced into a new text. If they are several segments intercepted with a single specified character as the position reference, they can directly form the new text; if they are several segments intercepted with multiple specified characters as position references, the segments are spliced together to form the new text. The splicing order may follow the order in which the segments appear in the text, with the earliest segment placed first and the latest segment last; it may equally be the reverse order or a random order, which this application does not limit.
Since most of an article's core viewpoints are concentrated at its beginning or end, in some embodiments, as shown in fig. 2, a first number of characters may be intercepted starting from the first character of the text and proceeding towards the following characters, yielding text A, and a second number of characters may be intercepted ending at the last character of the text and proceeding towards the preceding characters, yielding text B. Texts A and B are then spliced into a new text A+B or B+A whose length equals the specified length, and the new text is used to train the language model. In some embodiments the first number and the second number may be equal, for example the segments cut from the beginning and the end of the text have the same length. In other embodiments they may differ, for example more characters are intercepted at the beginning and fewer at the end, or vice versa. For instance, if the specified length is 512 characters, the first and second numbers may both be 256 characters, or one may be 200 characters and the other 312.
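As an illustrative sketch of this head-plus-tail rule (the function name, type hints, and 128/128 defaults are assumptions for illustration, not requirements of the embodiment):

def head_tail_truncate(text: str, first_n: int = 128, second_n: int = 128) -> str:
    """Intercept the first `first_n` characters (text A) and the last
    `second_n` characters (text B) of a long text, then splice them as A + B.
    Assumes len(text) > first_n + second_n, i.e. the text is "long"."""
    head = text[:first_n]    # text A: first_n characters from the start position
    tail = text[-second_n:]  # text B: second_n characters back from the end position
    return head + tail       # A + B; the embodiment equally allows B + A

Calling head_tail_truncate(text, 256, 256) realizes the 512-character example above; the 128/128 defaults match the 256-character BERT embodiment described later.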
When a text is used to train a language model, it is represented as a vector so that it can be converted into a machine-recognizable form. For a long text, truncation and splicing yields a new text that contains the text's core idea and whose length equals the specified length, and this new text is used to train the model. A short text, however, may be far shorter than the specified length. Because the vector dimensions of long and short texts must be consistent, the vector of a short text has few meaningful characters and many of its dimensions are filled with 0, carrying no actual meaning, which makes the final effect of the language model less than ideal. Therefore, in some embodiments, if the length of the text to be processed is judged to be smaller than the specified length, a plurality of characters may be copied from the text and spliced onto it until the spliced text's length equals the specified length; the copied characters may be part or all of the text to be processed. For example, if the specified length is 256 characters and text A to be processed has only 128 characters, text A can be copied to obtain a text A', and A' spliced onto A to produce a new 256-character text, which is then input into the language model for training. If the specified length is 256 characters and text A has 200 characters, the first 56 characters, the last 56 characters, or 56 characters from the middle of the text may be copied and spliced onto it. In some embodiments the spliced length may merely approach the specified length: if the specified length is 256 characters and text A is 250 characters long, the copy-splicing may be skipped. In some embodiments the copy-splicing operation may also be applied only to short texts below a certain threshold; for example, with a specified length of 256, copy-splicing is performed when the text is shorter than 128 characters and skipped otherwise. Copy-splicing short texts in this way prevents the vectorized short text from containing excessive useless information, improves the training effect, and gives the model higher accuracy.
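A minimal sketch of this copy-splicing rule for short texts, assuming copies are always taken from the beginning of the text (the embodiment also allows copying from the end or the middle; the function name is illustrative):

def pad_by_copying(text: str, specified_length: int = 256) -> str:
    """Cyclically copy a short text onto itself until the result is exactly
    `specified_length` characters long; longer texts are returned unchanged
    (they go through truncation instead)."""
    if not text:
        return text  # nothing to copy from
    pieces = [text]
    remaining = specified_length - len(text)
    while remaining > 0:
        pieces.append(text[:remaining])  # a full copy, or a leading slice on the last pass
        remaining -= len(pieces[-1])
    return "".join(pieces)

With a 128-character input this reproduces the A + A' example above, and with a 200-character input it splices on the first 56 characters.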
A text usually contains characters such as numbers, letters, and punctuation that increase its length without contributing much to its meaning. Therefore, in some embodiments, after the text to be processed is obtained, specified characters may first be deleted from it, where these specified characters are characters with no substantive meaning or little influence on the text's meaning. In some embodiments the specified characters may be letters, numbers, punctuation marks, emoticons, and space characters. After these characters are deleted, the text to be processed is truncated and spliced, which reduces redundant information in the text.
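A sketch of this cleaning step (the function name and, in particular, the Unicode ranges used for emoticons and CJK punctuation are assumptions that a production system would tune):

import re

# One character class covering the character types the embodiment deletes.
_NOISE = re.compile(
    "["
    "A-Za-z0-9"                            # letters and digits
    "!-/:-@\\[-`{-~"                       # ASCII punctuation
    "\u3000-\u303F\uFF00-\uFF65"           # CJK punctuation and full-width forms
    "\\s"                                  # space characters, tabs, newlines
    "\U0001F300-\U0001FAFF\u2600-\u27BF"   # common emoticon / emoji blocks (assumption)
    "]"
)

def strip_noise(text: str) -> str:
    """Delete letters, numbers, punctuation, emoticons, and space characters."""
    return _NOISE.sub("", text)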
In some embodiments, the preset language model may be the BERT model, with the specified length less than 512 characters. BERT is a pre-trained model released by Google and trained on large corpora such as Wikipedia; the pre-trained BERT representation can be fine-tuned with just one additional output layer, creating state-of-the-art models for many tasks without modifying the model architecture. BERT is a language model with a very good effect, but because of its limit on the number of text words it supports texts of at most 512 characters, which restricts its application to long texts. In the embodiments of the present specification, truncating and splicing the long text into a new text shorter than 512 characters satisfies the BERT model's word-count requirement, and because the new text contains the core content of the text to be processed, the BERT model can learn the text's key information. In some embodiments, if the language model is the BERT model, the specified length may be 256 characters. A large number of experiments show that a workable choice is to intercept 128 consecutive characters forward from the first character of the text to be processed and 128 consecutive characters backward from the last character. Truncating and splicing the text to be processed in this way before training greatly improves the BERT model's performance and the accuracy of the trained model when classifying texts.
In addition, an embodiment of the present specification further provides a text classification method, as shown in fig. 3, the method includes:
s302, obtaining a text to be classified;
s304, judging whether the length of the text to be classified is larger than a specified length;
s306, if the number of the characters is larger than the preset number, intercepting a plurality of characters from the text to be processed by taking at least one appointed character of the text to be classified as a position reference;
s308, splicing the intercepted characters into a new text, wherein the length of the new text is equal to the specified length, and the specified length is determined based on the text length supported by a preset language model;
s3010, classifying the new texts through the language model.
After the text to be classified is obtained, whether its length is greater than the specified length can be judged. If so, a plurality of characters can be intercepted from the text to be classified by taking at least one specified character of the text to be classified as a position reference, the intercepted characters are spliced into a new text, and the new text is input into a pre-trained language model for classification. The specified character is used to locate the core content of the text; there may be one or more such characters, and they can be determined from the distribution characteristics of text structures and people's habits of logical expression. For example, Chinese articles are commonly organized in "overview-details-overview", "overview-details", or "details-overview" structures, so an article's core ideas and main points tend to be concentrated at its beginning or end; a segment of characters can be intercepted at the beginning, at the end, or at both, and input into the language model. Thus, in some embodiments, the specified character may be one or both of the first character and the last character of the text. Likewise, when expressing their views, people habitually use summarizing words such as "in general", "to sum up", "in summary", or "finally", so a segment of characters following such words is also likely to represent the article's core idea; in some embodiments these word characters can serve as a position reference, and a segment of characters after them is intercepted. There may be one or more specified characters, and the intercepted characters may be one or more segments of the text to be classified.
In some embodiments, the specified character includes a first character of the text to be classified and/or a last character of the text to be classified.
In some embodiments, if the length of the text to be classified is smaller than the specified length, copying a plurality of characters from the text to be classified and splicing the text to be classified until the length of the spliced text is equal to the specified length.
In some embodiments, intercepting a plurality of characters from the text to be classified with at least one specified character of the text to be classified as a position reference comprises:
taking the first character of the text to be classified as an initial position, and intercepting a first number of characters in the direction of a next character;
and intercepting a second number of characters in the direction of a previous character by taking the last character of the text to be classified as a termination position.
In certain embodiments, the first number is equal to the second number.
In some embodiments, the language model is a BERT model, and the specified length is less than 512 characters in length.
In some embodiments, the specified length is 256 characters; intercepting a plurality of characters from the text to be classified with at least one specified character of the text to be classified as a position reference then comprises:
taking the first character of the text to be classified as a starting position, and intercepting 128 characters in the direction of the next character;
and taking the last character of the text to be classified as a termination position, and cutting 128 characters in the direction of the previous character.
In some embodiments, before determining whether the length of the text to be classified is greater than a specified length, the method further includes:
and deleting the specified characters in the text to be classified.
In some embodiments, the designated characters include one or more of: letters, numbers, punctuation, emoticons, and space bars.
In order to further explain the text preprocessing method and the text classification method provided in the embodiments of the present specification, a specific embodiment is explained below.
In order to carry out public opinion monitoring, various texts such as articles, comments, news and the like in the network can be analyzed and classified, and some texts with negative comments and positive comments can be screened out.
The BERT model is currently a mainstream pre-trained model with a good language understanding effect. However, because of machine memory and hardware-configuration limits, the BERT model caps the word count of its input and only supports texts shorter than 512 characters. Statistics over texts such as articles, news, comments, and microblog posts from the network yield the word-count distribution shown in fig. 4, where "range" on the left denotes the word-count intervals and "count" on the right denotes the number of texts in each interval. The figure shows that a large share of texts exceed 512 characters, while the number of short texts under 128 characters is also considerable.
In order to improve the classification effect of the model, the text can be preprocessed and then trained and classified.
The training process of the BERT model comprises the following steps:
(1) Acquire a training text and remove the letters, numbers, punctuation marks, emoticons, and space characters in it.
(2) Count the length of the training text after the characters are removed, counting at the character level.
(3) Judge whether the length of the training text is greater than 256 characters. If so, take the 128 characters at the beginning of the text and the 128 characters at its end, splice them into a new text, and input the truncated-and-spliced new text into the BERT model for training. If the length of the training text A is less than 256 characters, copy it into a text A', so that there are two texts A and A'; connect A and A' end to end into a new text, and if the new text is still shorter than 256 characters, continue the cyclic copying until the length equals 256 characters. Input the copied-and-spliced new text into the BERT model for training. A combined code sketch of steps (1) to (3) is given below.
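Putting steps (1) to (3) together, a hedged end-to-end sketch that reuses the helper functions sketched earlier and assumes the Hugging Face transformers library with a Chinese BERT checkpoint; none of these names come from this specification:

from transformers import BertTokenizer  # assumed tooling; the patent names no library

SPEC_LEN = 256  # the specified length of this embodiment

def preprocess(raw_text: str) -> str:
    """Steps (1)-(3): strip noise characters, then cut a long text to
    128 head + 128 tail characters or copy-pad a short one to SPEC_LEN."""
    text = strip_noise(raw_text)                  # step (1); step (2) is the len() below
    if len(text) > SPEC_LEN:                      # step (3), long-text branch
        return head_tail_truncate(text, 128, 128)
    return pad_by_copying(text, SPEC_LEN)         # step (3), short-text branch

training_texts = ["第一条训练文本……", "第二条训练文本……"]  # stand-in corpus; real texts and labels assumed elsewhere
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
batch = tokenizer([preprocess(t) for t in training_texts],
                  max_length=SPEC_LEN + 2,  # room for [CLS] and [SEP]
                  padding="max_length", truncation=True, return_tensors="pt")
# `batch` (input_ids, attention_mask, ...) is what BERT fine-tuning consumes.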
Classifying texts with the trained BERT model:
(1) Acquire a text to be classified and remove the letters, numbers, punctuation marks, emoticons, and space characters in it.
(2) Count the length of the text to be classified after the characters are removed, counting at the character level.
(3) Judge whether the length of the text to be classified is greater than 256 characters. If so, take the 128 characters at the beginning of the text and the 128 characters at its end, splice them into a new text, and input the truncated-and-spliced new text into the BERT model for classification. If the length of the text B to be classified is less than 256 characters, copy it into a text B', so that there are two texts B and B'; connect B and B' end to end into a new text, and if the new text is still shorter than 256 characters, continue the cyclic copying until the length equals 256 characters. Input the copied-and-spliced new text into the BERT model for classification. A code sketch of the classification flow is given below.
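Classification applies the same preprocessing before the fine-tuned model predicts a label. Continuing the previous sketch under the same assumptions (the checkpoint path and the two-label negative/positive scheme are illustrative):

import torch
from transformers import BertForSequenceClassification  # assumed tooling

# Hypothetical fine-tuned checkpoint; two labels: negative vs. positive opinion.
model = BertForSequenceClassification.from_pretrained("path/to/finetuned-bert",
                                                      num_labels=2)
model.eval()

def classify(raw_text: str) -> int:
    """Apply exactly the training-time preprocessing, then predict a label."""
    enc = tokenizer(preprocess(raw_text), max_length=SPEC_LEN + 2,
                    padding="max_length", truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return int(logits.argmax(dim=-1))  # 0 or 1 under the assumed label scheme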
By the method, the performance of the model can be greatly improved, so that the trained BERT model has higher accuracy in text classification.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination is described; as long as a combination of technical features contains no conflict or contradiction, it falls within the scope disclosed in the present specification.
As shown in fig. 5, a text preprocessing apparatus according to an embodiment of the present disclosure may include:
an obtaining module 51, configured to obtain a text to be processed;
the judging module 52 is configured to judge whether the length of the text to be processed is greater than a specified length;
an intercepting module 53, configured to intercept, if the length of the text to be processed is greater than the specified length, a plurality of characters from the text to be processed with at least one specified character of the text to be processed as a position reference;
a splicing module 54, configured to splice the intercepted characters into a new text, so as to train a preset language model through the new text, where a length of the new text is equal to the specified length, and the specified length is determined based on a text length supported by the language model.
In some embodiments, the specified character comprises a first character of the text to be processed and/or a last character of the text to be processed.
In some embodiments, if the length of the text to be processed is smaller than the specified length, copying a plurality of characters from the text to be processed and splicing the text to be processed until the length of the spliced text is equal to the specified length.
In some embodiments, intercepting a plurality of characters from the text to be processed with at least one specified character of the text to be processed as a position reference comprises:
taking the first character of the text to be processed as an initial position, and intercepting a first number of characters in the direction of a next character;
and intercepting a second number of characters in the direction of a previous character by taking the last character of the text to be processed as a termination position.
In certain embodiments, the first number is equal to the second number.
In some embodiments, the language model is a BERT model, and the specified length is less than 512 characters in length.
In some embodiments, the specified length is 256 characters; intercepting a plurality of characters from the text to be processed with at least one specified character of the text to be processed as a position reference then comprises:
taking the first character of the text to be processed as a starting position, and intercepting 128 characters in the direction of the next character;
and taking the last character of the text to be processed as a termination position, and cutting 128 characters in the direction of the previous character.
In some embodiments, before determining whether the length of the text to be processed is greater than a specified length, the method further includes:
and deleting the specified characters in the text to be processed.
In some embodiments, the designated characters include one or more of: letters, numbers, punctuation, emoticons, and space bars.
As shown in fig. 6, a text classification apparatus according to an embodiment of the present specification may include:
the acquiring module 61 is used for acquiring texts to be classified;
a judging module 62, configured to judge whether the length of the text to be classified is greater than a specified length;
an intercepting module 63, configured to intercept, if the length of the text to be classified is greater than the specified length, a plurality of characters from the text to be classified with at least one specified character of the text to be classified as a position reference;
a splicing module 64, configured to splice the intercepted characters into a new text, wherein the length of the new text is equal to the specified length, and the specified length is determined based on the text length supported by a preset language model;
a classification module 65 for classifying the new text by the language model.
In some embodiments, the specified character comprises a first character of the text to be classified and/or a last character of the text to be classified.
In some embodiments, if the length of the text to be classified is smaller than the specified length, copying a plurality of characters from the text to be classified and splicing the text to be classified until the length of the spliced text is equal to the specified length.
In some embodiments, intercepting a plurality of characters from the text to be classified with at least one specified character of the text to be classified as a position reference comprises:
taking the first character of the text to be classified as an initial position, and intercepting a first number of characters in the direction of a next character;
and intercepting a second number of characters in the direction of a previous character by taking the last character of the text to be classified as a termination position.
In certain embodiments, the first number is equal to the second number.
In some embodiments, the language model is a BERT model, and the specified length is less than 512 characters in length.
In some embodiments, the specified length is 256 characters; intercepting a plurality of characters from the text to be classified with at least one specified character of the text to be classified as a position reference then comprises:
taking the first character of the text to be classified as a starting position, and intercepting 128 characters in the direction of the next character;
and taking the last character of the text to be classified as a termination position, and cutting 128 characters in the direction of the previous character.
In some embodiments, before determining whether the length of the text to be classified is greater than a specified length, the method further includes:
and deleting the specified characters in the text to be classified.
In some embodiments, the designated characters include one or more of: letters, numbers, punctuation, emoticons, and space bars.
The specific details of the implementation process of the functions and actions of each module in the device are referred to the implementation process of the corresponding step in the method, and are not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
The device embodiments of this specification can be applied to computer equipment, such as a server or an intelligent terminal. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, as a logical device, it is formed by the processor of the equipment on which it resides reading the corresponding computer program instructions from non-volatile memory into memory and running them. In terms of hardware, fig. 7 shows a hardware structure diagram of the computer equipment on which the apparatus of this specification resides; besides the processor 702, memory 704, network interface 706, and non-volatile memory 708 shown in fig. 7, the server or electronic device hosting the apparatus in an embodiment may also include other hardware according to the actual function of the computer equipment, which is not described again. The non-volatile memory 708 stores a computer program, and the processor 702 implements any one of the text preprocessing and text classification methods described above when executing the computer program.
Accordingly, the embodiments of the present specification also provide a computer storage medium, in which a program is stored, and the program, when executed by a processor, implements the method in any of the above embodiments.
Embodiments of the present description may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having program code embodied therein. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of the storage medium of the computer include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by a computing device.
Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. The embodiments of the present specification are intended to cover any variations, uses, or adaptations of the embodiments of the specification following, in general, the principles of the embodiments of the specification and including such departures from the present disclosure as come within known or customary practice in the art to which the embodiments of the specification pertain. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the embodiments being indicated by the following claims.
It is to be understood that the embodiments of the present specification are not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the embodiments of the present specification is limited only by the appended claims.
The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (13)

1. A method of text pre-processing, the method comprising:
acquiring a text to be processed;
judging whether the length of the text to be processed is larger than a specified length;
if so, intercepting a plurality of characters from the text to be processed by taking at least one specified character of the text to be processed as a position reference;
splicing the intercepted characters into a new text, and training a preset language model through the new text, wherein the length of the new text is equal to the specified length, and the specified length is determined based on the text length supported by the language model.
2. The text preprocessing method according to claim 1, wherein if the length of the text to be processed is smaller than the specified length, copying a plurality of characters from the text to be processed and splicing the text to be processed until the spliced text length is equal to the specified length.
3. The text preprocessing method according to any one of claims 1-2, wherein the specified character comprises a first character of the text to be processed and/or a last character of the text to be processed.
4. The text preprocessing method of claim 3, wherein intercepting a plurality of characters from the text to be processed with at least one specified character of the text to be processed as a positional reference comprises:
taking the first character of the text to be processed as an initial position, and intercepting a first number of characters in the direction of a next character;
and intercepting a second number of characters in the direction of a previous character by taking the last character of the text to be processed as a termination position.
5. The text pre-processing method of claim 4, the first number being equal to the second number.
6. The text pre-processing method of claim 1, the language model being a BERT model, the specified length being less than 512 characters in length.
7. The text preprocessing method of claim 6, wherein the specified length is 256 characters, and intercepting a plurality of characters from the text to be processed with at least one specified character of the text to be processed as a position reference comprises:
taking the first character of the text to be processed as a starting position, and intercepting 128 characters in the direction of the next character;
and taking the last character of the text to be processed as a termination position, and cutting 128 characters in the direction of the previous character.
8. The method for preprocessing the text according to claim 1, before determining whether the length of the text to be processed is larger than a specified length, further comprising:
and deleting the specified characters in the text to be processed.
9. The text pre-processing method of claim 8, the specified characters comprising one or more of: letters, numbers, punctuation, emoticons, and space bars.
10. A method of text classification, the method comprising:
acquiring a text to be classified;
judging whether the length of the text to be classified is larger than a specified length;
if so, intercepting a plurality of characters from the text to be classified by taking at least one specified character of the text to be classified as a position reference;
splicing the intercepted characters into a new text, wherein the length of the new text is equal to the specified length, and the specified length is determined based on the text length supported by a preset language model;
classifying the new text by the language model.
11. A text pre-processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring a text to be processed;
the judging module is used for judging whether the length of the text to be processed is greater than the specified length;
the intercepting module is used for intercepting, if the length of the text to be processed is greater than the specified length, a plurality of characters from the text to be processed by taking at least one specified character of the text to be processed as a position reference;
and the splicing module is used for splicing the intercepted characters into a new text so as to train a preset language model through the new text, wherein the length of the new text is equal to the specified length, and the specified length is determined based on the text length supported by the language model.
12. An apparatus for text classification, the apparatus comprising:
the acquisition module is used for acquiring texts to be classified;
the judging module is used for judging whether the length of the text to be classified is greater than the specified length;
the intercepting module is used for intercepting, if the length of the text to be classified is greater than the specified length, a plurality of characters from the text to be classified by taking at least one specified character of the text to be classified as a position reference;
the splicing module is used for splicing the intercepted characters into a new text, wherein the length of the new text is equal to the specified length, and the specified length is determined based on the text length supported by a preset language model;
and the classification module is used for classifying the new text through the language model.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 10 when executing the program.
CN201911228510.3A 2019-12-04 2019-12-04 Text preprocessing method, classification method, device and equipment Pending CN111143551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911228510.3A CN111143551A (en) 2019-12-04 2019-12-04 Text preprocessing method, classification method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911228510.3A CN111143551A (en) 2019-12-04 2019-12-04 Text preprocessing method, classification method, device and equipment

Publications (1)

Publication Number Publication Date
CN111143551A (en) 2020-05-12

Family

ID=70517555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911228510.3A Pending CN111143551A (en) 2019-12-04 2019-12-04 Text preprocessing method, classification method, device and equipment

Country Status (1)

Country Link
CN (1) CN111143551A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6279018B1 (en) * 1998-12-21 2001-08-21 Kudrollis Software Inventions Pvt. Ltd. Abbreviating and compacting text to cope with display space constraint in computer software
US20040122979A1 (en) * 2002-12-19 2004-06-24 International Business Machines Corporation Compression and abbreviation for fixed length messaging
CN106484810A (en) * 2016-09-23 2017-03-08 广州视源电子科技股份有限公司 A kind of recommendation method and system of multimedia programming
CN108632896A (en) * 2017-03-15 2018-10-09 华为技术有限公司 A kind of data transmission method and relevant device
CN108091323A (en) * 2017-12-19 2018-05-29 想象科技(北京)有限公司 For identifying the method and apparatus of emotion from voice
CN108038107A (en) * 2017-12-22 2018-05-15 东软集团股份有限公司 Sentence sensibility classification method, device and its equipment based on convolutional neural networks
CN109508448A (en) * 2018-07-17 2019-03-22 网易传媒科技(北京)有限公司 Short information method, medium, device are generated based on long article and calculate equipment
CN109918497A (en) * 2018-12-21 2019-06-21 厦门市美亚柏科信息股份有限公司 A kind of file classification method, device and storage medium based on improvement textCNN model
CN110032733A (en) * 2019-03-12 2019-07-19 中国科学院计算技术研究所 A kind of rumour detection method and system for news long text
CN110096591A (en) * 2019-04-04 2019-08-06 平安科技(深圳)有限公司 Long text classification method, device, computer equipment and storage medium based on bag of words
CN110209819A (en) * 2019-06-05 2019-09-06 江苏满运软件科技有限公司 File classification method, device, equipment and medium
CN110427482A (en) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 A kind of abstracting method and relevant device of object content
CN110532563A (en) * 2019-09-02 2019-12-03 苏州美能华智能科技有限公司 The detection method and device of crucial paragraph in text

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738437A (en) * 2020-07-17 2020-10-02 支付宝(杭州)信息技术有限公司 Training method, text generation device and electronic equipment
CN111738437B (en) * 2020-07-17 2020-11-20 支付宝(杭州)信息技术有限公司 Training method, text generation device and electronic equipment
CN112328674A (en) * 2020-11-17 2021-02-05 深圳力维智联技术有限公司 Cross-data-format model conversion acceleration method and device
CN112328674B (en) * 2020-11-17 2024-05-14 深圳力维智联技术有限公司 Cross-data format model conversion acceleration method and device
CN112861045A (en) * 2021-02-20 2021-05-28 北京金山云网络技术有限公司 Method and device for displaying file, storage medium and electronic device
CN114510911A (en) * 2022-02-16 2022-05-17 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN115033678A (en) * 2022-08-09 2022-09-09 北京聆心智能科技有限公司 Dialogue model training method, device and equipment
CN116541527A (en) * 2023-07-05 2023-08-04 国网北京市电力公司 Document classification method based on model integration and data expansion
CN116541527B (en) * 2023-07-05 2023-09-29 国网北京市电力公司 Document classification method based on model integration and data expansion

Similar Documents

Publication Publication Date Title
CN111143551A (en) Text preprocessing method, classification method, device and equipment
CN104735468B (en) A kind of method and system that image is synthesized to new video based on semantic analysis
US20180157636A1 (en) Methods and systems for language-agnostic machine learning in natural language processing using feature extraction
US10437912B2 (en) Sorting and displaying documents according to sentiment level in an online community
CN110046637B (en) Training method, device and equipment for contract paragraph annotation model
US20210397639A1 (en) Clustering topics for data visualization
US20200364216A1 (en) Method, apparatus and storage medium for updating model parameter
CN109978139B (en) Method, system, electronic device and storage medium for automatically generating description of picture
CN111310464A (en) Word vector acquisition model generation method and device and word vector acquisition method and device
CN112364664B (en) Training of intention recognition model, intention recognition method, device and storage medium
CN109918658A (en) A kind of method and system obtaining target vocabulary from text
CN110489559A (en) A kind of file classification method, device and storage medium
CN113032001B (en) Intelligent contract classification method and device
CN116150327A (en) Text processing method and device
CN112765976A (en) Text similarity calculation method, device and equipment and storage medium
EP4322066A1 (en) Method and apparatus for generating training data
CN109522921A (en) Statement similarity method of discrimination and equipment
CN111783453B (en) Text emotion information processing method and device
US11120204B2 (en) Comment-based article augmentation
CN111625579B (en) Information processing method, device and system
CN110879868A (en) Consultant scheme generation method, device, system, electronic equipment and medium
CN112446206A (en) Menu title generation method and device
CN114239590B (en) Data processing method and device
US20240126791A1 (en) Method and system for long-form answer extraction based on combination of sentence index generation techniques
CN117909505B (en) Event argument extraction method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200512