CN111078878B - Text processing method, device, equipment and computer readable storage medium - Google Patents

Publication number
CN111078878B
Authority
CN
China
Prior art keywords
text
classified
classifier
information
user
Legal status
Active
Application number
CN201911239505.2A
Other languages
Chinese (zh)
Other versions
CN111078878A (en)
Inventor
石逸轩
戴明洋
潘剑飞
周俊
罗程亮
许金泉
姚远
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201911239505.2A
Publication of CN111078878A
Application granted
Publication of CN111078878B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure discloses a text processing method, a device, equipment and a computer readable storage medium, relating to the field of text processing. The specific implementation scheme is as follows: acquiring data to be classified input by a terminal device, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified; obtaining user characteristics corresponding to the user according to the user identification, and carrying out vectorization processing on the text to be classified and the user characteristics to obtain vector information to be processed; processing the vector information to be processed by adopting a preset feature extraction model to obtain feature information corresponding to the vector information to be processed; and classifying the characteristic information through a cascade classifier to obtain the category information corresponding to the text to be classified. Therefore, factors of user characteristics can be considered in the classification process, and the accuracy of text classification is improved.

Description

Text processing method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a text processing technique.
Background
When analyzing content generated by a user, a class of problems is often encountered, requiring hierarchical subject classification of the text content generated by the user. In practical applications, this task is applied in many business scenarios, such as post classification, question answering, advertisement delivery, search result organization, etc.
In order to realize classification of content data, a classification tree structure is generally constructed in advance in the prior art, different classification models are respectively constructed aiming at leaf nodes of the tree structure, and each classification model is adopted to classify the content data.
However, text content produced by users generally differs greatly from standard natural language: the wording is more casual and the out-of-vocabulary (OOV) phenomenon is more severe, so the meaning of the text depends more heavily on user information. Therefore, when content data is classified by the above method, such content data cannot be classified accurately.
Disclosure of Invention
The present disclosure provides a text processing method, apparatus, device, and computer readable storage medium for solving the problem that when the existing text processing method classifies content data, such content data cannot be accurately classified.
In a first aspect, an embodiment of the present disclosure provides a text processing method, including:
acquiring data to be classified input by a terminal device, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified;
obtaining user characteristics corresponding to the user according to the user identification, and carrying out vectorization processing on the text to be classified and the user characteristics to obtain vector information to be processed;
processing the vector information to be processed by adopting a preset feature extraction model to obtain feature information corresponding to the vector information to be processed;
and classifying the characteristic information through a cascade classifier to obtain the category information corresponding to the text to be classified.
According to the text processing method provided by the embodiment, the user characteristics used for representing the conventional characteristics when the user publishes the text information are added in the characteristic extraction process, so that factors of the user characteristics can be considered in the classification process, and the accuracy of text classification is improved.
In one possible design, after the obtaining the data to be classified input by the terminal device, the method further includes:
word segmentation, punctuation mark removal and coding treatment are carried out on the text to be classified, and a preprocessed text to be classified is obtained;
Accordingly, the vectorizing the text to be classified and the user features includes:
and vectorizing the preprocessed text to be classified and the user features.
According to the text processing method provided by the embodiment, the user characteristics used for representing the conventional characteristics when the user publishes the text information are added in the characteristic extraction process, so that factors of the user characteristics can be considered in the classification process, and the accuracy of text classification is improved.
In one possible design, the vectorizing the text to be classified and the user features includes:
and vectorizing the text to be classified and the user features through the Embedding.
According to the text processing method, vectorization processing is carried out on the text to be classified and the user characteristics in an Embedding mode, so that basic granularity vector representation of the text to be classified can be accurately obtained.
In one possible design, the cascade classifier includes a multi-layer classifier, and the classifying the feature information by the cascade classifier includes:
and sequentially inputting the characteristic information and the classification result output by the classifier of the upper layer into the classifier of the lower layer, and taking the result output by the classifier of the last layer as the class information corresponding to the text to be classified.
According to the text processing method provided by this embodiment, the output of the upper-layer classifier is input into the lower-layer classifier together with the feature information, so that the lower-layer classifier can reclassify the feature information within the subcategories under that classification result, which effectively improves classification efficiency and accuracy.
In one possible design, the sequentially inputting the feature information and the classification result output by the previous layer of classifier into the next layer of classifier, and taking the result output by the last layer of classifier as the category information corresponding to the text to be classified includes:
inputting the characteristic information into a preset first layer classifier to obtain a first class identifier corresponding to the characteristic information;
inputting the characteristic information and the first class identifier into a preset second-layer classifier, wherein the second classifier is used for classifying the characteristic information under the sub-class of the first class identifier to obtain a second class identifier corresponding to the characteristic information, and associating the first class identifier and the second class identifier to obtain a target class identifier;
and judging whether other subcategories are included under the second category identification, if so, inputting the target category identification and the characteristic information into a classifier of the next layer to perform classification operation until the category information output by the classifier does not include other subcategories.
According to the text processing method provided by this embodiment, the output of the upper-layer classifier is input into the lower-layer classifier together with the feature information, so that the lower-layer classifier can reclassify the feature information within the subcategories under that classification result, which effectively improves classification efficiency and accuracy.
In one possible design, after the classifying operation is performed on the feature information by using the cascade classifier to obtain the category information corresponding to the text to be classified, the method further includes:
and storing the text to be classified into a storage path corresponding to the category information according to the category information corresponding to the text to be classified.
According to the text processing method, the text to be classified is stored in the storage path corresponding to the category information according to the category information corresponding to the text to be classified, so that the text to be classified can be conveniently applied after being classified.
In a second aspect, an embodiment of the present disclosure provides a text processing apparatus, including:
an acquisition module, configured to acquire data to be classified input by a terminal device, wherein the data to be classified comprises a text to be classified and an identifier of the user who input the text to be classified;
a vectorization processing module, configured to obtain user features corresponding to the user according to the user identifier, and to vectorize the text to be classified and the user features to obtain vector information to be processed;
a feature extraction module, configured to process the vector information to be processed with a preset feature extraction model to obtain feature information corresponding to the vector information to be processed;
and a classification module, configured to classify the feature information through a cascade classifier to obtain category information corresponding to the text to be classified.
In one possible design, the apparatus further comprises:
the preprocessing module is used for carrying out word segmentation, punctuation mark removal and coding processing on the text to be classified to obtain preprocessed text to be classified;
accordingly, the vectorization processing module is configured to:
and vectorizing the preprocessed text to be classified and the user features.
In one possible design, the vectorization processing module is configured to:
and vectorizing the text to be classified and the user features through the Embedding.
In one possible design, the cascade classifier includes a multi-layer classifier, and the classification module is configured to:
sequentially input the feature information and the classification result output by the upper-layer classifier into the lower-layer classifier, and take the result output by the last-layer classifier as the category information corresponding to the text to be classified.
In one possible design, the classification module is configured to:
inputting the characteristic information into a preset first layer classifier to obtain a first class identifier corresponding to the characteristic information;
inputting the characteristic information and the first class identifier into a preset second-layer classifier, wherein the second classifier is used for classifying the characteristic information under the sub-class of the first class identifier to obtain a second class identifier corresponding to the characteristic information, and associating the first class identifier and the second class identifier to obtain a target class identifier;
and judging whether other subcategories are included under the second category identification, if so, inputting the target category identification and the characteristic information into a classifier of the next layer to perform classification operation until the category information output by the classifier does not include other subcategories.
In one possible design, the apparatus further comprises:
and the processing module is used for storing the text to be classified into a storage path corresponding to the category information according to the category information corresponding to the text to be classified.
In a third aspect, an embodiment of the present disclosure provides a text processing apparatus, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of the first aspect.
In a fifth aspect, an embodiment of the present disclosure provides a text processing method, including:
obtaining data to be classified, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified;
obtaining user characteristics corresponding to the user according to the user identification, and carrying out vectorization processing on the text to be classified and the user characteristics to obtain vector information to be processed;
processing the vector information to be processed to obtain characteristic information corresponding to the vector information to be processed;
and classifying the feature information to obtain the category information corresponding to the text to be classified.
According to the text processing method, device, equipment and computer readable storage medium, the user characteristics used for representing the conventional characteristics when the user publishes text information are added in the characteristic extraction process, so that factors of the user characteristics can be considered in the classification process, and the accuracy of text classification is improved.
Other effects of the above alternative will be described below in connection with specific embodiments.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a diagram of the system architecture on which the present disclosure is based;
FIG. 2 is a flow chart of a text processing method according to a first embodiment of the present disclosure;
FIG. 3 is a category hierarchy provided by an embodiment of the present disclosure;
FIG. 4 is a flow chart of a text processing method according to a second embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a text processing device according to a third embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a text processing device according to a fourth embodiment of the present disclosure;
FIG. 7 is a flow chart of a text processing method according to a fifth embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to solve the problem that when the existing text processing method classifies content data, the content data cannot be accurately classified, the disclosure provides a text processing method, a text processing device, text processing equipment and a computer readable storage medium. Because the existing text processing method does not consider personalized information of the user, the classification result is inaccurate, and therefore, in order to improve the accuracy of the classification result, the user characteristic information can be added in the characteristic extraction process.
It should be noted that the text processing method, apparatus, device and computer readable storage medium provided in the present disclosure can be applied to any scene of text classification.
FIG. 1 is a diagram of the system architecture on which the present disclosure is based. As shown in FIG. 1, the system architecture includes at least a plurality of terminal devices 1 and a text processing device 2, where the text processing device 2 may be written in a language such as C/C++, Java, Shell, or Python, and the terminal device 1 may be, for example, a desktop computer or a tablet computer. The terminal devices 1 are communicatively connected to the text processing device 2, so that the text processing device 2 can exchange information with the terminal devices 1.
Fig. 2 is a flow chart of a text processing method according to an embodiment of the disclosure, as shown in fig. 2, where the method includes:
Step 101, obtaining data to be classified input by a terminal device, wherein the data to be classified comprises a text to be classified and an identifier of the user who input the text to be classified.
This embodiment is executed by a text processing device, which is communicatively connected to the terminal device and can therefore exchange information with it. The terminal device can acquire the data that needs to be classified. Specifically, a user can publish text content on the terminal device, and after receiving the text content the terminal device can send it to the text processing device in real time for classification. Optionally, the text processing device may instead fetch the text content published by the user from the terminal device at a preset period and then classify it. In either case, the text processing device obtains the data to be classified from the terminal device.
It should be noted that, since the text content produced by the user generally has a larger difference from the natural language, the language used is more random and depends on the user information, in order to improve the accuracy of classifying the text to be classified, the text to be classified can also carry the identification of the user who publishes the text to be classified.
Step 102, obtaining user characteristics corresponding to the user according to the user identification, and carrying out vectorization processing on the text to be classified and the user characteristics to obtain vector information to be processed.
In this embodiment, in order to achieve the acquisition of the user features, a database including a large number of user features may be pre-established, where the user features can characterize the usage features of the user when publishing text information, and so on. Accordingly, after the user identifier is obtained, the user feature corresponding to the user can be obtained from the database according to the user identifier.
After the text to be classified and the user characteristics are obtained, the feature extraction operation can be performed on the text to be classified and the user characteristics. Accordingly, before extracting the features, in order to facilitate model processing, vectorization processing can be performed on the text to be classified and the user features, so as to obtain the text to be classified and the vector information to be processed corresponding to the user features.
Specifically, based on the above embodiment, step 102 specifically includes:
and vectorizing the text to be classified and the user features through the Embedding.
In this embodiment, the text to be classified and the user features may be vectorized by means of Embedding to obtain a vector representation of the text to be classified at a basic granularity. The basic granularity may be character granularity or word granularity. That is, the text to be classified may first be segmented into words and each resulting word vectorized to obtain a word vector for each word, or the text may be vectorized directly, without word segmentation, to obtain character vectors. The present disclosure is not limited in this regard.
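The vectorization step can be illustrated with a minimal sketch. It is not the patented implementation: the hash-based `embed` function stands in for a learned embedding table, and `USER_FEATURES`, `DIM`, and `vectorize` are hypothetical names introduced only for this example. Whitespace splitting is assumed for word granularity; Chinese text would need a real word segmenter.

```python
import hashlib

DIM = 4  # embedding width; an arbitrary choice for this sketch

# Hypothetical user-feature store keyed by user identifier.
USER_FEATURES = {"u42": [0.9, 0.1]}


def embed(token, dim=DIM):
    """Deterministic stand-in for a learned embedding table lookup."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def vectorize(text, user_id, granularity="word"):
    """Return (unit vectors, user feature vector) for one input.

    granularity="word": split on whitespace (assumed segmenter).
    granularity="char": one vector per non-space character.
    """
    if granularity == "word":
        units = text.split()
    else:
        units = [c for c in text if not c.isspace()]
    user_vec = USER_FEATURES.get(user_id, [0.0, 0.0])  # unknown user: zeros
    return [embed(u) for u in units], user_vec
```

Calling `vectorize("full screen phone", "u42")` yields three word vectors plus the stored user vector; passing `granularity="char"` yields one vector per character instead.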
Step 103, processing the vector information to be processed by adopting a preset feature extraction model to obtain feature information corresponding to the vector information to be processed.
In this embodiment, after the vector information to be processed corresponding to the text to be classified and the user features is obtained, its feature information may be extracted. Specifically, the vector information to be processed can be processed with a preset feature extraction model to obtain the corresponding feature information. Any model capable of feature extraction may be used, for example a CNN, RNN, LSTM, or Transformer; the present disclosure does not limit this.
As one implementation, since different network models have different strengths, after the data to be classified is received, the characteristics of the text to be classified may first be determined, and a different network model used for feature extraction depending on those characteristics. For example, a CNN is good at extracting relations within neighboring text windows; an LSTM can capture dependency information in long sentences; the Transformer is suited to Seq2Seq tasks, and the BERT model, which adopts a bidirectional Transformer structure, has made breakthrough progress on many NLP tasks.
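As a rough illustration of choosing a model from the characteristics of the text, the sketch below selects an extractor family by token count. The length heuristic, the threshold of 50, and the name `choose_extractor` are assumptions made for this example; the disclosure does not specify which text characteristics are examined or how.

```python
def choose_extractor(tokens, long_text_threshold=50):
    """Pick a feature-extractor family from a simple property of the text.

    Assumption for illustration only: length is the deciding
    characteristic, and 50 tokens is an arbitrary cut-off.
    """
    if len(tokens) >= long_text_threshold:
        return "LSTM"  # long sentences: long-range dependencies matter
    return "CNN"       # short texts: neighboring-window relations dominate
```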
Step 104, classifying the feature information through a cascade classifier to obtain the category information corresponding to the text to be classified.
In this embodiment, FIG. 3 shows a category hierarchy provided by an embodiment of the present disclosure. As shown in FIG. 3, the classification process involves multiple category levels: for example, under the first-level category "technology" there are second-level categories (digital, internet, mathematics, physics, etc.), third-level categories (television, mobile phone, etc.), and fourth-level categories (full-screen mobile phone, non-full-screen mobile phone, etc.). Therefore, to classify the feature information accurately, a preset cascade classifier can be used to classify the feature information and obtain the category information corresponding to the text to be classified.
Further, on the basis of any of the above embodiments, after step 104, the method further includes:
and storing the text to be classified into a storage path corresponding to the category information according to the category information corresponding to the text to be classified.
In this embodiment, after the text to be classified is subjected to the classification operation, it may be stored in a storage path corresponding to the category information. When the text information corresponding to a certain category is called, all the text information can be directly obtained from the storage path corresponding to the category.
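The storage step can be sketched as follows, assuming that the storage path is a directory whose nesting mirrors the category levels. The function names `store_by_category` and `texts_in_category`, the `.txt` suffix, and the directory layout are hypothetical choices for this example.

```python
from pathlib import Path


def store_by_category(text, category_path, root, name):
    """Write `text` under root/<level 1>/<level 2>/... and return the file path."""
    target_dir = Path(root).joinpath(*category_path)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_text(text, encoding="utf-8")
    return target


def texts_in_category(category_path, root):
    """Read back every stored .txt file for one category, in name order."""
    folder = Path(root).joinpath(*category_path)
    return [p.read_text(encoding="utf-8") for p in sorted(folder.glob("*.txt"))]
```

With this layout, retrieving all texts of a category is a single directory read, which matches the retrieval behavior described above.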
According to the text processing method provided by the embodiment, the user characteristics used for representing the conventional characteristics when the user publishes the text information are added in the characteristic extraction process, so that factors of the user characteristics can be considered in the classification process, and the accuracy of text classification is improved.
Further, on the basis of any one of the above embodiments, after step 101, the method further includes:
word segmentation, punctuation mark removal and coding treatment are carried out on the text to be classified, and a preprocessed text to be classified is obtained;
accordingly, step 102 specifically includes:
and vectorizing the preprocessed text to be classified and the user features.
In this embodiment, in order to improve the classification efficiency of the text to be classified, the text to be classified may be first preprocessed before performing the classification operation. Specifically, the text to be classified after pretreatment can be obtained by performing word segmentation, punctuation removal, encoding treatment and the like on the text to be classified. Accordingly, vectorization processing can be performed on the user characteristics and the preprocessed text to be classified to obtain vector information to be processed.
According to the text processing method, the text to be classified is preprocessed before being classified, so that useless characters and the like in the text to be classified can be removed, and the classification efficiency of the text to be classified is improved.
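A minimal sketch of the three preprocessing operations is given below. It assumes whitespace splitting as the word segmenter (Chinese text would need a dedicated segmenter) and interprets the encoding step as mapping tokens to integer ids; the function name `preprocess` and the growing `vocab` dictionary are hypothetical.

```python
import re


def preprocess(text, vocab):
    """Strip punctuation, segment into tokens, and encode tokens as integer ids."""
    # Punctuation removal: keep word characters and whitespace only.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    # Word segmentation: whitespace splitting stands in for a real segmenter.
    tokens = cleaned.lower().split()
    # Encoding: map each token to an id, assigning a new id on first sight.
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]


vocab = {}
ids = preprocess("New phone, who dis?!", vocab)
```

Reusing the same `vocab` across calls keeps ids stable, so a later text containing "phone" maps to the same id.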
Further, on the basis of any one of the foregoing embodiments, the cascade classifier includes a multi-layer classifier, and step 104 specifically includes:
and sequentially inputting the characteristic information and the classification result output by the classifier of the upper layer into the classifier of the lower layer, and taking the result output by the classifier of the last layer as the class information corresponding to the text to be classified.
In this embodiment, since the classification process involves multiple category levels, a cascade classifier comprising multiple classifier layers may be used to classify the feature information. Specifically, to improve classification efficiency and accuracy, the classification result output by the upper layer may be input into the lower-layer classifier together with the feature information, so that the lower-layer classifier can reclassify the feature information within the subcategories under that classification result. For example, if the classification result output by the first-layer classifier is "technology", the "technology" label and the feature information can be input into the next-layer classifier, which then classifies the feature information among the subcategories under "technology", such as digital, internet, mathematics, and physics. These steps are executed for each classifier layer, and the classification result output by the last-layer classifier is taken as the category information corresponding to the text to be classified. If the current classifier is the first classifier in the cascade, only the feature information is input into it; if the current classifier is the n-th classifier (n > 1), the classification result of the previous classifier and the feature information are both input into it.
It should be noted that in the prior art, a classification tree structure is generally constructed in advance, different classification models are constructed for the leaf nodes of the tree structure, and each classification model is used to classify content data. However, classifying content data in this way requires training a separate model for every subcategory at every level in order to solve the subcategory classification problem. If the topic tree is deep, it is difficult to train models covering every subcategory, which seriously affects classification efficiency.
According to the text processing method provided by this embodiment, the output of the upper-layer classifier is input into the lower-layer classifier together with the feature information, so that the lower-layer classifier can reclassify the feature information within the subcategories under that classification result, which effectively improves classification efficiency and accuracy.
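The layer-by-layer data flow can be sketched as below. The two layer functions are toy stand-ins for trained classifiers, and the feature keys `tech_score` and `device_score`, like the function names, are hypothetical; only the wiring (features alone into the first layer, features plus the previous result into every later layer) follows the description.

```python
def run_cascade(features, layers):
    """Feed the features, plus the previous layer's result, into each layer in turn."""
    result = None
    for index, layer in enumerate(layers):
        if index == 0:
            result = layer(features)          # first layer: features only
        else:
            result = layer(features, result)  # later layers: features + parent result
    return result


def layer1(features):
    """Toy first-layer classifier (hypothetical scoring rule)."""
    return "technology" if features["tech_score"] > 0.5 else "other"


def layer2(features, parent):
    """Toy second-layer classifier restricted to the parent's subcategories."""
    children = {"technology": ["digital", "internet"], "other": ["misc"]}[parent]
    return children[0] if features["device_score"] > 0.5 else children[-1]
```

Because `layer2` only looks at the subcategories of the label it receives, one model per level is enough in this sketch, rather than one model per subcategory.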
Fig. 4 is a schematic flow chart of a text processing method according to a second embodiment of the present disclosure, where, based on any one of the embodiments described above, as shown in fig. 4, the steps of sequentially inputting the feature information and the classification result output by the previous layer of classifier into the next layer of classifier, and taking the result output by the last layer of classifier as the classification information corresponding to the text to be classified include:
Step 201, inputting the characteristic information into a preset first layer classifier to obtain a first class identifier corresponding to the characteristic information;
Step 202, inputting the feature information and the first category identifier into a preset second-layer classifier, wherein the second-layer classifier is used for classifying the feature information within the subcategories under the first category identifier to obtain a second category identifier corresponding to the feature information, and associating the first category identifier and the second category identifier to obtain a target category identifier;
Step 203, judging whether other subcategories are included under the second category identifier, and if so, inputting the target category identifier and the feature information into the classifier of the next layer to perform the classification operation, until the category information output by the classifier does not include other subcategories.
In this embodiment, after the feature information is obtained, it may be input into a preset first-layer classifier to obtain a first category identifier corresponding to the feature information. The first category identifier and the feature information are then input into a second-layer classifier together, so that the second-layer classifier can classify the feature information among the subcategories under the first category identifier and obtain a second category identifier corresponding to the feature information. The first category identifier is associated with the second category identifier to obtain the target category identifier. It is then determined whether the second category identifier has further subcategories; if so, a subsequent classifier continues to classify the feature information, and if not, the second category identifier is taken as the category information corresponding to the text to be classified.
For example, suppose the category information output by the first-layer classifier is "science and technology", which has a plurality of subcategories. The "science and technology" label and the feature information are input into the second-layer classifier together, and the second-layer classifier classifies the feature information among the subcategories under "science and technology", such as digital, internet, mathematics and physics, to obtain the classification result "digital". The feature information is then classified among the subcategories under "digital" to obtain the classification result "mobile phone", and classified again among the subcategories under "mobile phone" to obtain "full-screen mobile phone". Correspondingly, the category identifiers are associated to obtain the final category information "science and technology-digital-mobile phone-full-screen mobile phone".
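The layer-by-layer procedure of steps 201 to 203 can be sketched in Python as a walk down a category tree. This is only a minimal sketch under stated assumptions: the tree contents beyond the example categories ("root", "sports", "camera"), the names `CATEGORY_TREE`, `cascade_classify` and `toy_layer_classifier`, and the idea of passing the candidate subcategories to each layer are all hypothetical, and the toy classifier stands in for whatever trained model each layer would actually use.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical category tree (parent -> subcategories), loosely following the
# example in the text; "root", "sports" and "camera" are made up.
CATEGORY_TREE: Dict[str, List[str]] = {
    "root": ["science and technology", "sports"],
    "science and technology": ["digital", "internet", "mathematics", "physics"],
    "digital": ["mobile phone", "camera"],
    "mobile phone": ["full-screen mobile phone"],
}

# A layer classifier sees the features, the path chosen so far, and the
# candidate subcategories, and returns one of the candidates.
LayerClassifier = Callable[[Sequence[float], List[str], List[str]], str]


def cascade_classify(
    features: Sequence[float],
    classifiers: Sequence[LayerClassifier],
    tree: Dict[str, List[str]] = CATEGORY_TREE,
    root: str = "root",
) -> List[str]:
    """Walk down the tree one layer at a time and return the category path."""
    path: List[str] = []
    parent = root
    for classify in classifiers:
        candidates = tree.get(parent, [])
        if not candidates:  # no further subcategories: stop (step 203)
            break
        label = classify(features, path, candidates)
        if label not in candidates:
            raise ValueError(f"{label!r} is not a subcategory of {parent!r}")
        path.append(label)  # associate the identifiers into one target path
        parent = label
    return path


def toy_layer_classifier(features, path, candidates):
    """Stand-in for a trained model: one feature value per layer picks a candidate."""
    return candidates[int(features[len(path)]) % len(candidates)]


path = cascade_classify([0, 0, 0, 0], [toy_layer_classifier] * 5)
print("-".join(path))
```

Because each layer only chooses among the subcategories of the label chosen above it, the number of candidates per decision stays small even when the tree is deep, which is the efficiency argument made in the text.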
According to the text processing method provided by this embodiment, the output result of the previous-layer classifier is input into the next-layer classifier together with the feature information, so that the next-layer classifier can reclassify the feature information among the subcategories under that classification result, which effectively improves classification efficiency and classification accuracy.
Fig. 5 is a schematic structural diagram of a text processing device according to a third embodiment of the present disclosure. As shown in Fig. 5, the text processing device 30 includes: an acquisition module 31, a vectorization processing module 32, a feature extraction module 33 and a classification module 34. The acquisition module 31 is configured to acquire data to be classified input by a terminal device, where the data to be classified includes a text to be classified and an identifier of a user who inputs the text to be classified; the vectorization processing module 32 is configured to obtain a user feature corresponding to the user according to the identifier of the user, and perform vectorization processing on the text to be classified and the user feature to obtain vector information to be processed; the feature extraction module 33 is configured to process the vector information to be processed by using a preset feature extraction model, so as to obtain feature information corresponding to the vector information to be processed; and the classification module 34 is configured to perform a classification operation on the feature information through a cascade classifier to obtain category information corresponding to the text to be classified.
Further, on the basis of the third embodiment, the apparatus further includes:
a preprocessing module, configured to perform word segmentation, punctuation removal and encoding on the text to be classified to obtain a preprocessed text to be classified;
correspondingly, the vectorization processing module is configured to:
vectorize the preprocessed text to be classified and the user features.
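The preprocessing described above (word segmentation, punctuation removal, encoding) might look like the following Python sketch. Several things here are assumptions rather than anything the disclosure specifies: whitespace splitting is used as a stand-in for a real word segmenter (Chinese text would need a dedicated one), "encoding" is interpreted as mapping tokens to integer IDs, and the function names are hypothetical.

```python
import unicodedata
from typing import Dict, List


def segment(text: str) -> List[str]:
    """Stand-in word segmentation: split on whitespace (an assumption)."""
    return text.split()


def strip_punctuation(tokens: List[str]) -> List[str]:
    """Drop every character whose Unicode category is punctuation (P*)."""
    cleaned = []
    for token in tokens:
        kept = "".join(
            ch for ch in token if not unicodedata.category(ch).startswith("P")
        )
        if kept:  # discard tokens that were punctuation only
            cleaned.append(kept)
    return cleaned


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to integer IDs, growing the vocabulary as needed (0 is unused)."""
    ids = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
        ids.append(vocab[token])
    return ids


def preprocess(text: str, vocab: Dict[str, int]) -> List[int]:
    """Segmentation, then punctuation removal, then encoding."""
    return encode(strip_punctuation(segment(text.lower())), vocab)


vocab: Dict[str, int] = {}
print(preprocess("Hello, world! Hello.", vocab))
```

Using the Unicode category rather than an ASCII list means full-width punctuation such as "，" and "。" is removed as well.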
Further, on the basis of any one of the foregoing embodiments, the vectorization processing module is configured to:
vectorize the text to be classified and the user features through Embedding.
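As a rough illustration of Embedding-based vectorization, the Python sketch below looks up one vector per text token and one per user feature, mean-pools each group, and concatenates the two results into the vector information to be processed. The pooling and concatenation strategy, the random (untrained) embedding rows, and the names `EmbeddingTable`, `mean_pool` and `vectorize` are all assumptions made for the example; a real system would use learned embeddings.

```python
import random
from typing import Dict, List


class EmbeddingTable:
    """Toy embedding table: each new key gets a fixed pseudo-random vector."""

    def __init__(self, dim: int, seed: int = 0) -> None:
        self.dim = dim
        self._rng = random.Random(seed)
        self._rows: Dict[str, List[float]] = {}

    def lookup(self, key: str) -> List[float]:
        if key not in self._rows:
            self._rows[key] = [self._rng.uniform(-1.0, 1.0) for _ in range(self.dim)]
        return self._rows[key]


def mean_pool(vectors: List[List[float]], dim: int) -> List[float]:
    """Average a list of vectors; an empty list pools to the zero vector."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def vectorize(
    tokens: List[str],
    user_features: List[str],
    text_table: EmbeddingTable,
    user_table: EmbeddingTable,
) -> List[float]:
    """Embed text tokens and user features, then concatenate the pooled vectors."""
    text_vec = mean_pool([text_table.lookup(t) for t in tokens], text_table.dim)
    user_vec = mean_pool([user_table.lookup(f) for f in user_features], user_table.dim)
    return text_vec + user_vec


text_table = EmbeddingTable(dim=8, seed=1)
user_table = EmbeddingTable(dim=4, seed=2)
vector = vectorize(["full", "screen", "phone"], ["posts:digital"], text_table, user_table)
print(len(vector))
```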
Further, on the basis of any one of the foregoing embodiments, the cascade classifier includes a multi-layer classifier, and the classification module is configured to:
sequentially input the feature information and the classification result output by the previous-layer classifier into the next-layer classifier, and take the result output by the last-layer classifier as the category information corresponding to the text to be classified.
Further, on the basis of any one of the above embodiments, the classification module is configured to:
input the feature information into a preset first-layer classifier to obtain a first category identifier corresponding to the feature information;
input the feature information and the first category identifier into a preset second-layer classifier, wherein the second-layer classifier is used for classifying the feature information among the subcategories of the first category identifier to obtain a second category identifier corresponding to the feature information, and associate the first category identifier with the second category identifier to obtain a target category identifier;
judge whether the second category identifier has further subcategories, and if so, input the target category identifier and the feature information into the next-layer classifier for classification, until the category information output by a classifier has no further subcategories.
Further, on the basis of any one of the foregoing embodiments, the apparatus further includes:
a processing module, configured to store the text to be classified into a storage path corresponding to the category information, according to the category information corresponding to the text to be classified.
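One possible reading of "a storage path corresponding to the category information" is a directory per category level. The Python sketch below follows that reading; the directory layout, the file naming and the function name `store_by_category` are assumptions, not something the disclosure prescribes.

```python
from pathlib import Path
from typing import Sequence


def store_by_category(
    text: str, category_path: Sequence[str], base_dir: str, filename: str
) -> Path:
    """Write the text under base_dir/<level 1>/<level 2>/... and return the file path."""
    for part in category_path:
        # Guard against category names that would escape the base directory.
        if part in ("", ".", "..") or "/" in part or "\\" in part:
            raise ValueError(f"unsafe category name: {part!r}")
    target_dir = Path(base_dir).joinpath(*category_path)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_text(text, encoding="utf-8")
    return target
```

For instance, calling `store_by_category(text, ["science and technology", "digital"], "classified", "doc1.txt")` would place the file at `classified/science and technology/digital/doc1.txt`, so texts of the same category end up side by side.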
According to an embodiment of the present disclosure, the present disclosure also provides a text processing apparatus and a readable storage medium.
Fig. 6 is a schematic structural diagram of a text processing device according to a fourth embodiment of the present disclosure; it is a block diagram of a text processing device for the text processing method according to an embodiment of the present disclosure. The text processing device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The text processing device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 6, the text processing device includes: one or more processors 601, a memory 602, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the text processing device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple text processing devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 601 is taken as an example in Fig. 6.
The memory 602 is the non-transitory computer-readable storage medium provided by the present disclosure. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the text processing method provided by the present disclosure. The non-transitory computer-readable storage medium of the present disclosure stores computer instructions for causing a computer to perform the text processing method provided by the present disclosure.
As a non-transitory computer-readable storage medium, the memory 602 may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the text processing method in the embodiments of the present disclosure (e.g., the acquisition module 31, the vectorization processing module 32, the feature extraction module 33, and the classification module 34 shown in Fig. 5). By running the non-transitory software programs, instructions, and modules stored in the memory 602, the processor 601 executes the various functional applications and data processing of the server, i.e., implements the text processing method in the above-described method embodiments.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for a function; the storage data area may store data created according to the use of a text processing device for text processing, or the like. In addition, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 602 may optionally include memory located remotely from processor 601, which may be connected to a text processing device for text processing via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The text processing device for the text processing method may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or in other manners; connection by a bus is taken as an example in Fig. 6.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the text processing device; examples include a touch screen, keypad, mouse, trackpad, touchpad, pointing stick, one or more mouse buttons, trackball, joystick, and similar input devices. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Fig. 7 is a flow chart of a text processing method provided in a fifth embodiment of the present disclosure. As shown in Fig. 7, the method includes:
Step 501, obtaining data to be classified, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified;
Step 502, obtaining user features corresponding to the user according to the identifier of the user, and vectorizing the text to be classified and the user features to obtain vector information to be processed;
Step 503, processing the vector information to be processed to obtain feature information corresponding to the vector information to be processed;
Step 504, classifying the feature information to obtain the category information corresponding to the text to be classified.
According to the text processing method provided by this embodiment, user features representing the user's habitual characteristics when publishing text information are added during feature extraction, so that the user features are taken into account during classification, which improves the accuracy of text classification.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
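Steps 501 to 504 can be read as a four-stage pipeline in which the user features enter at the vectorization stage. The Python sketch below wires the stages together with toy stand-ins; the names (`DataToClassify`, `process`, `USER_STORE`), the two-dimensional user features and the toy components are all hypothetical, and the stand-ins bear no relation to the models an actual implementation would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DataToClassify:
    text: str      # the text to be classified
    user_id: str   # identifier of the user who input the text


def process(
    data: DataToClassify,
    user_store: Dict[str, List[float]],
    vectorize: Callable[[str, List[float]], List[float]],
    extract_features: Callable[[List[float]], List[float]],
    classify: Callable[[List[float]], str],
) -> str:
    # Step 502: look up user features by identifier, then vectorize text + user.
    user_features = user_store.get(data.user_id, [0.0, 0.0])
    vector = vectorize(data.text, user_features)
    # Step 503: turn the vector information into feature information.
    features = extract_features(vector)
    # Step 504: classify the feature information.
    return classify(features)


# Toy stand-ins (assumptions): user features are [food affinity, tech affinity].
USER_STORE = {"user-1": [1.0, 0.0], "user-2": [0.0, 1.0]}


def toy_vectorize(text: str, user_features: List[float]) -> List[float]:
    return [float(len(text))] + list(user_features)


def toy_extract(vector: List[float]) -> List[float]:
    return vector  # identity stand-in for a feature extraction model


def toy_classify(features: List[float]) -> str:
    return "food" if features[1] > features[2] else "technology"


for user in ("user-1", "user-2"):
    data = DataToClassify(text="apple", user_id=user)
    print(user, process(data, USER_STORE, toy_vectorize, toy_extract, toy_classify))
```

In this toy setup the same ambiguous text is routed to different categories for the two users, which is the kind of effect the paragraph above attributes to including user features.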
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (15)

1. A text processing method, comprising:
acquiring data to be classified input by a terminal device, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified;
obtaining user characteristics corresponding to the user according to the user identification, and carrying out vectorization processing on the text to be classified and the user characteristics to obtain vector information to be processed;
processing the vector information to be processed by adopting a preset feature extraction model to obtain feature information corresponding to the vector information to be processed;
and classifying the characteristic information through a cascade classifier to obtain the category information corresponding to the text to be classified.
2. The method according to claim 1, wherein after the acquiring of the data to be classified input by the terminal device, the method further comprises:
performing word segmentation, punctuation removal and encoding on the text to be classified to obtain a preprocessed text to be classified;
accordingly, the vectorizing the text to be classified and the user features includes:
and vectorizing the preprocessed text to be classified and the user features.
3. The method of claim 1, wherein the vectorizing the text to be classified and the user features comprises:
and vectorizing the text to be classified and the user features through the Embedding.
4. A method according to any one of claims 1-3, wherein the cascade classifier comprises a multi-layer classifier, and the classifying the feature information by the cascade classifier comprises:
and sequentially inputting the characteristic information and the classification result output by the classifier of the upper layer into the classifier of the lower layer, and taking the result output by the classifier of the last layer as the class information corresponding to the text to be classified.
5. The method according to claim 4, wherein the sequentially inputting the feature information and the classification result output by the previous classifier into the next classifier, and taking the result output by the last classifier as the classification information corresponding to the text to be classified includes:
inputting the characteristic information into a preset first layer classifier to obtain a first class identifier corresponding to the characteristic information;
inputting the characteristic information and the first class identifier into a preset second-layer classifier, wherein the second-layer classifier is used for classifying the characteristic information under the sub-class of the first class identifier to obtain a second class identifier corresponding to the characteristic information, and associating the first class identifier and the second class identifier to obtain a target class identifier;
and judging whether other subcategories are included under the second category identification, if so, inputting the target category identification and the characteristic information into a classifier of the next layer to perform classification operation until the category information output by the classifier does not include other subcategories.
6. A method according to any one of claims 1 to 3, wherein after the classifying operation is performed on the feature information by using a cascade classifier to obtain the category information corresponding to the text to be classified, the method further includes:
and storing the text to be classified into a storage path corresponding to the category information according to the category information corresponding to the text to be classified.
7. A text processing apparatus, comprising:
an acquisition module, configured to acquire data to be classified input by a terminal device, wherein the data to be classified comprises a text to be classified and an identifier of a user who inputs the text to be classified;
the vectorization processing module is used for acquiring user characteristics corresponding to the user according to the user identification, vectorizing the text to be classified and the user characteristics, and acquiring vector information to be processed;
the feature extraction module is used for processing the vector information to be processed by adopting a preset feature extraction model to obtain feature information corresponding to the vector information to be processed;
and the classification module is used for performing classification operation on the characteristic information through a cascade classifier to obtain category information corresponding to the text to be classified.
8. The apparatus of claim 7, wherein the apparatus further comprises:
the preprocessing module is used for carrying out word segmentation, punctuation mark removal and coding processing on the text to be classified to obtain preprocessed text to be classified;
accordingly, the vectorization processing module is configured to:
and vectorizing the preprocessed text to be classified and the user features.
9. The apparatus of claim 7, wherein the vectorization processing module is to:
and vectorizing the text to be classified and the user features through the Embedding.
10. The apparatus of any one of claims 7-9, wherein the cascade classifier includes a multi-layer classifier therein, the classification module being configured to:
and sequentially inputting the characteristic information and the classification result output by the classifier of the upper layer into the classifier of the lower layer, and taking the result output by the classifier of the last layer as the class information corresponding to the text to be classified.
11. The apparatus of claim 10, wherein the classification module is configured to:
inputting the characteristic information into a preset first layer classifier to obtain a first class identifier corresponding to the characteristic information;
inputting the characteristic information and the first class identifier into a preset second-layer classifier, wherein the second-layer classifier is used for classifying the characteristic information under the sub-class of the first class identifier to obtain a second class identifier corresponding to the characteristic information, and associating the first class identifier and the second class identifier to obtain a target class identifier;
and judging whether other subcategories are included under the second category identification, if so, inputting the target category identification and the characteristic information into a classifier of the next layer to perform classification operation until the category information output by the classifier does not include other subcategories.
12. The apparatus according to any one of claims 7-9, wherein the apparatus further comprises:
and the processing module is used for storing the text to be classified into a storage path corresponding to the category information according to the category information corresponding to the text to be classified.
13. A text processing apparatus, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A text processing method, comprising:
obtaining data to be classified, wherein the data to be classified comprises a text to be classified and an identifier of a user inputting the text to be classified;
obtaining user characteristics corresponding to the user according to the user identification, and carrying out vectorization processing on the text to be classified and the user characteristics to obtain vector information to be processed;
processing the vector information to be processed to obtain characteristic information corresponding to the vector information to be processed;
and classifying the characteristic information to obtain the category information corresponding to the text to be classified.
CN201911239505.2A 2019-12-06 2019-12-06 Text processing method, device, equipment and computer readable storage medium Active CN111078878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911239505.2A CN111078878B (en) 2019-12-06 2019-12-06 Text processing method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911239505.2A CN111078878B (en) 2019-12-06 2019-12-06 Text processing method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111078878A CN111078878A (en) 2020-04-28
CN111078878B true CN111078878B (en) 2023-07-04

Family

ID=70313132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911239505.2A Active CN111078878B (en) 2019-12-06 2019-12-06 Text processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111078878B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant