CN113486147A - Text processing method and device, electronic equipment and computer readable medium - Google Patents


Info

Publication number
CN113486147A
CN113486147A
Authority
CN
China
Prior art keywords
text
semantic information
title
category
information vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110767855.7A
Other languages
Chinese (zh)
Inventor
罗奕康
刘海
聂砂
贾国琛
师文宝
戴菀庭
崔震
张士存
Current Assignee
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN202110767855.7A priority Critical patent/CN113486147A/en
Publication of CN113486147A publication Critical patent/CN113486147A/en
Pending legal-status Critical Current

Classifications

    • G06F 40/30 Semantic analysis (G06F 40/00 Handling natural language data)
    • G06F 16/3344 Query execution using natural language analysis (G06F 16/00 Information retrieval)
    • G06F 16/353 Clustering; Classification into predefined classes (G06F 16/00 Information retrieval)
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting (G06F 18/00 Pattern recognition)
    • G06F 40/258 Heading extraction; Automatic titling; Numbering (G06F 40/00 Handling natural language data)
    • G06N 3/045 Combinations of networks (G06N 3/00 Computing arrangements based on biological models)


Abstract

The application discloses a text processing method and apparatus, an electronic device, and a computer-readable medium, relating to the field of artificial intelligence and, in particular, to machine learning, deep learning, and natural language processing. The method comprises: receiving a text classification request and acquiring the text title and label categories of the text corresponding to the request; calling a multilayer neural network model to extract semantic information from the text title and generating a text title semantic information vector based on the extracted semantic information; and calling the tag semantic information vectors corresponding to the label categories, generating a topic identifier and a category identifier for the text based on the tag semantic information vectors, the text title semantic information vector, and a classifier, and processing the text based on the topic identifier and category identifier. Classification over multi-dimensional, multi-level labels can thus be completed with a single multilayer neural network model, so that the text is processed based on the classification while saving time and labor.

Description

Text processing method and device, electronic equipment and computer readable medium
Technical Field
The present application relates to the field of artificial intelligence technologies and, in particular, to machine learning, deep learning, and natural language processing, and more particularly to a text processing method, apparatus, electronic device, and computer-readable medium.
Background
At present, processing a text requires multi-label classification. The classification may have multiple levels, and when each top-level category contains multiple subcategories, the data must be processed separately for each, and training a model for each requires a large amount of time and labor.
In the process of implementing the present application, the inventors found that the prior art has at least the following problem:
when a text undergoes multi-label classification, and the classification has multiple levels with each top-level category containing multiple subcategories, the data must be processed separately, and training a model requires a large amount of time and labor.
Disclosure of Invention
In view of the above, embodiments of the present application provide a text processing method, apparatus, electronic device, and computer-readable medium, which can solve the prior-art problem that, when a text undergoes multi-label classification with multiple levels and each level containing multiple subcategories, processing the data and training a model take a large amount of time and labor.
To achieve the above object, according to an aspect of an embodiment of the present application, there is provided a text processing method including:
receiving a text classification request, and acquiring a text title and a label category of a text corresponding to the text classification request;
calling a multilayer neural network model to extract semantic information of the text title, and further generating a text title semantic information vector based on the extracted semantic information;
and calling a tag semantic information vector corresponding to the tag category, and further generating a subject identifier and a category identifier of the text based on the tag semantic information vector, the text title semantic information vector and the classifier so as to process the text based on the subject identifier and the category identifier.
Optionally, performing semantic information extraction on the text header, and further generating a text header semantic information vector based on the extracted semantic information, including:
and inputting the text title into a language model to output text title semantic information corresponding to the text title, wherein the language model is used for representing the corresponding relation between the text and the semantic information.
Optionally, generating a text header semantic information vector based on the extracted semantic information includes:
determining a title identifier corresponding to the text title;
and inputting the extracted semantic information into a model corresponding to the title identifier in the multilayer neural network model to output a corresponding text title semantic information vector.
Optionally, performing semantic information extraction on the text header includes:
determining the preset character length corresponding to the label type;
determining the length of a character corresponding to the text title;
determining the length of a target character according to the preset character length and the character length corresponding to the text title;
and expanding the character length corresponding to the text title to the target character length, and further extracting semantic information of the text title of the target character length.
Optionally, invoking a tag semantic information vector corresponding to the tag category, including:
and calling a label semantic information vector corresponding to the label category corresponding to the target character length.
Optionally, generating a topic identifier and a category identifier of the corresponding text based on the tag semantic information vector, the text title semantic information vector, and the classifier, including:
and multiplying the label semantic information vector by the title semantic information vector to generate a vector matrix, and then inputting the vector matrix into a classifier to generate a theme identifier and a category identifier of the corresponding text.
Optionally, before invoking the multi-layer neural network model, the method further comprises:
acquiring an initial multilayer neural network model;
acquiring a training sample set, wherein the training sample set comprises a multi-dimensional text title, a multi-level label category, a subject identifier corresponding to a labeled text title and a category identifier corresponding to a labeled text title;
and taking the multi-dimensional text title and the multi-level label category as the input of the multilayer neural network model, taking the subject identification corresponding to the labeled text title and the category identification corresponding to the labeled text title as the expected output, and training the initial multilayer neural network model to obtain the multilayer neural network model.
In addition, the present application also provides a text processing apparatus, including:
the receiving unit is configured to receive the text classification request and acquire a text title and a label category of a text corresponding to the text classification request;
the text header semantic information vector generating unit is configured to call the multilayer neural network model to extract semantic information of the text header and further generate a text header semantic information vector based on the extracted semantic information;
and the text classification unit is configured to call the tag semantic information vector corresponding to the tag category, and further generate a subject identifier and a category identifier of the corresponding text based on the tag semantic information vector, the text title semantic information vector and the classifier so as to process the text based on the subject identifier and the category identifier.
Optionally, the text title semantic information vector generating unit is further configured to:
and inputting the text title into a language model to output text title semantic information corresponding to the text title, wherein the language model is used for representing the corresponding relation between the text and the semantic information.
Optionally, the text title semantic information vector generating unit is further configured to:
determining a title identifier corresponding to the text title;
and inputting the extracted semantic information into a model corresponding to the title identifier in the multilayer neural network model to output a corresponding text title semantic information vector.
Optionally, the text title semantic information vector generating unit is further configured to:
determining the preset character length corresponding to the label type;
determining the length of a character corresponding to the text title;
determining the length of a target character according to the preset character length and the character length corresponding to the text title;
and expanding the character length corresponding to the text title to the target character length, and further extracting semantic information of the text title of the target character length.
Optionally, the text classification unit is further configured to:
and calling a label semantic information vector corresponding to the label category corresponding to the target character length.
Optionally, the text classification unit is further configured to:
and multiplying the label semantic information vector by the title semantic information vector to generate a vector matrix, and inputting the vector matrix into a classifier to generate a theme identifier and a category identifier of the corresponding text.
Optionally, the text processing apparatus further comprises a training unit configured to:
acquiring an initial multilayer neural network model;
acquiring a training sample set, wherein the training sample set comprises a multi-dimensional text title, a multi-level label category, a subject identifier corresponding to a labeled text title and a category identifier corresponding to a labeled text title;
and taking the multi-dimensional text title and the multi-level label category as the input of the multilayer neural network model, taking the subject identification corresponding to the labeled text title and the category identification corresponding to the labeled text title as the expected output, and training the initial multilayer neural network model to obtain the multilayer neural network model.
In addition, the present application also provides a text processing electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement a text processing method as described above.
In addition, the present application also provides a computer readable medium, on which a computer program is stored, which when executed by a processor implements the text processing method as described above.
One embodiment of the above invention has the following advantages or benefits: a text classification request is received, and the text title and label categories of the corresponding text are acquired; a multilayer neural network model is called to extract semantic information from the text title and generate a text title semantic information vector from the extracted semantic information; and the tag semantic information vectors corresponding to the label categories are called, so that a topic identifier and a category identifier for the text are generated based on the tag semantic information vectors, the text title semantic information vector, and the classifier, and the text is processed based on those identifiers. By inputting the text title and label categories into the multilayer neural network model, both the text title semantic information vector and the tag semantic information vectors can be generated, and the topic and category identifiers are then produced by the classifier. Classification over multi-dimensional, multi-level labels can thus be completed with a single multilayer neural network model, so that the text is processed based on the classification while saving time and labor.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a further understanding of the application and are not to be construed as limiting the application. Wherein:
fig. 1 is a schematic diagram of a main flow of a text processing method according to a first embodiment of the present application;
fig. 2 is a schematic diagram of a main flow of a text processing method according to a second embodiment of the present application;
fig. 3 is a schematic view of an application scenario of a text processing method according to a third embodiment of the present application;
FIG. 4 is a schematic diagram of the main elements of a text processing apparatus according to an embodiment of the present application;
FIG. 5 is an exemplary system architecture diagram to which embodiments of the present application may be applied;
fig. 6 is a schematic structural diagram of a computer system suitable for implementing the terminal device or the server according to the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are likewise omitted for clarity and conciseness.
Fig. 1 is a schematic diagram of a main flow of a text processing method according to a first embodiment of the present application, and as shown in fig. 1, the text processing method includes:
step S101, receiving a text classification request, and acquiring a text title and a label category of a text corresponding to the text classification request.
In this embodiment, an executing entity of the text processing method (for example, a server) may receive the text classification request through a wired or wireless connection. The text classification request may include one or more of: a request to classify the text hierarchically; a request to classify the text into subcategories at each level; and a request to classify the text along multiple dimensions, i.e., under multiple different label systems.
For example, suppose the AB institute issues a policy document whose topic needs to be classified; the policy may correspond to more than one topic, so this is a multi-label classification problem. The topic taxonomy may also contain labels at multiple levels. The executing entity may first determine that the policy corresponds to the top-level class "general government affairs" (it may also correspond to multiple topics such as "AB institute" and "general government affairs"); the second step is then to determine which labels under "general government affairs" apply, which is a multi-level classification problem. Multi-dimensional means that the same policy must be assigned labels under several different label systems. For example, suppose a business needs to classify a policy document to determine which topics the policy belongs to and which industries its audience covers. The dimensions may then include theme, industry, and audience, each belonging to a different label system.
After receiving the text classification request, the executing entity may obtain the text corresponding to the request, and then obtain the text title and label categories of that text. The text title may be "the regulations of the law of C law in B and country C", and there may be 22 label categories in total, e.g. { "integrated government", "finance, auditing", "AB institute", … }.
And step S102, calling a multilayer neural network model to extract semantic information of the text title, and further generating a text title semantic information vector based on the extracted semantic information.
In the multilayer neural network model of this embodiment, the model is decomposed and the weight information of each part (an A module, a B module, and a C module, which may be stored separately) is saved. The model takes the article title and label data as input. Any pre-trained language model may be used for semantic feature extraction (the A module); for example, large-scale pre-trained language models such as BERT or RoBERTa can be used. The executing entity may then pass the extracted semantic information through the B module (on the left side) and the C module (on the right side) of the multilayer neural network model to obtain vectors containing the semantic information (for example, the CLS-token vector in the BERT model). Specifically, the B module on the left extracts the semantic vector of the title, and the C module on the right extracts the semantic vector corresponding to each label. This yields a 1 × 768 title vector and one vector per label; assuming the input label data belongs to the level below the topic label "finance, and audit", which contains 7 subclasses, a 7 × 768 label matrix is obtained. The title vector and label vectors are then fed into the multilayer neural network: the title-side feature matrix is 1 × 768 and the label-side feature matrix is 7 × 768, and multiplying the two matrices finally yields a 1 × 7 vector holding the classification feature for each label category.
To achieve multi-label classification, a sigmoid function is used in the last layer of the model (the sigmoid function is commonly used as a neural-network activation function; it maps a variable into the range 0 to 1), so that each of the 7 label dimensions undergoes one binary classification. The multi-label problem is thereby converted into several binary classification problems, and the initial multilayer neural network model is trained with binary cross-entropy as the loss function to obtain the final multilayer neural network model.
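The matrix-and-sigmoid step above can be sketched in NumPy. The vectors below are random stand-ins for the B-module and C-module outputs (the patent does not publish weights), and the 1/√768 scaling is added only to keep this toy example numerically stable:

```python
import numpy as np

def sigmoid(x):
    # Maps each logit into (0, 1) so every label gets an independent binary score
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Random stand-ins for the module outputs: a 1 x 768 title vector (B module)
# and 7 x 768 label vectors for the 7 subclasses (C module).
title_vec = rng.standard_normal((1, 768))
label_vecs = rng.standard_normal((7, 768))

# Multiplying the two matrices yields one classification logit per label;
# the 1/sqrt(768) scaling is only for numerical stability of this sketch.
logits = (title_vec @ label_vecs.T) / np.sqrt(768)   # shape (1, 7)
probs = sigmoid(logits)

# Binary cross-entropy against a multi-hot target treats the 7 labels as
# 7 independent binary classification problems.
target = np.array([[0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]])
bce = -np.mean(target * np.log(probs) + (1.0 - target) * np.log(1.0 - probs))
```

Because sigmoid scores each label independently, any number of the 7 labels can fire for one title, which is exactly what the multi-label setting requires.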
For example, as mentioned above, the C module may be stored separately, and during application the executing entity may compute in advance the tag semantic information vectors that the label categories produce after passing through the C module. To predict the first-level topic classification of the "F tax law enforcement regulations" policy against the 22 first-level topic labels { "general government", "finance, auditing", "C institution", … }, the 22 labels can be passed through the A module and C module ahead of time, so the tag semantic information matrix (of size 22 × 768) can be computed and stored in advance. In this way, during text processing only the text title needs to be computed, passing through the A module and B module (on the left side of the model) to obtain the text title semantic information vector (1 × 768). The label classification is then obtained by multiplying the text title semantic information vector (which may be a vector matrix) with the tag semantic information vectors (which may be a vector matrix) and applying the activation function. The B module and the C module (on the right side of the model) have the same structure but different weights. Thus, after the pre-trained language features are obtained, the titles and labels are fed through MLP (multilayer perceptron) structures with different weights, so that the model automatically learns which features should be extracted from label category data of different dimensions for similarity judgment, and the tag semantic information vectors and the text title semantic information vector are obtained accurately.
By using only one model structure, the multi-dimensional, multi-level classification problem can be handled. Model inference is faster: the model weights are stored separately and the pre-computed label vectors are reused, which improves the speed of text classification. On top of this, the activation function of the network output layer is optimized according to actual service requirements, finally yielding a multi-level, multi-dimensional, adaptable multi-label text classification model. With one set of models and a single round of fused training, several problems can be solved at once, saving training time and reducing labor cost.
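The pre-computation trick described above can be sketched as follows; the label matrix here is a random placeholder for the cached A + C module output, and `classify_title` is a hypothetical helper name, not part of the patent:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 768

# Placeholder for the pre-computed label-side matrix: 22 first-level topic
# labels, each already passed through the A and C modules offline (22 x 768),
# stored once and reused for every incoming title.
label_matrix = rng.standard_normal((22, DIM))

def classify_title(title_vec, label_matrix):
    """At inference time only the title runs through A + B; the cached label
    matrix is reused, so a single matmul scores all 22 labels at once."""
    logits = (title_vec @ label_matrix.T) / np.sqrt(DIM)  # (1, 768) -> (1, 22)
    return 1.0 / (1.0 + np.exp(-logits))                  # sigmoid per label

title_vec = rng.standard_normal((1, DIM))  # stand-in for the A + B output
scores = classify_title(title_vec, label_matrix)
```

Since the label side never changes between requests, each classification costs only one title encoding plus one 1 × 768 by 768 × 22 multiplication, which is the source of the claimed inference speedup.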
In this embodiment, extracting semantic information from a text title and generating a text title semantic information vector based on it includes: inputting the text title into a language model (for example, a large-scale pre-trained language model such as BERT or RoBERTa) to output the text title semantic information corresponding to the title, where the language model represents the correspondence between text and semantic information.
In this embodiment, extracting semantic information from a text title includes:
determining the preset character length corresponding to the label type; determining the length of a character corresponding to the text title; determining the length of a target character according to the preset character length and the character length corresponding to the text title; and expanding the character length corresponding to the text title to the target character length, and further extracting semantic information of the text title of the target character length.
That is, the least common multiple of the character lengths corresponding to the label category and the text title is taken as the target character length, and the text title is then extended to the target character length so that it can be input into the language model for semantic information extraction. For example, when a text title is input into the model, the input lengths are kept consistent by padding the title (a padding operation that makes the input lengths of the text titles uniform; by default the padding value is 0). Consider two titles in the input sample: "notification about the energy-saving new energy vehicle and vessel enjoyment vehicle and vessel tax preferential policy" and "notification about speeding up the propulsion of the renewable energy power generation subsidy project list review related work". By unifying the title length to a target length such as 40 (i.e., a sequence length of 40), the processed titles become: "a notice 0000000000000000000 about the energy-saving new energy vehicle and vessel enjoying the vehicle and vessel tax preferential policy" and "a notice 000000000000 about the accelerated propulsion renewable energy power generation subsidy project list review related work". Semantic information extraction is then performed on the text title at the target length, here a sequence length of 40.
Correspondingly, the data on the label side also needs to be padded to a sequence length of 40 to ensure consistent vector dimensions and consistent dimensions after the matrix operations, thereby improving the accuracy of text classification.
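The padding operation described above can be sketched on token-id sequences; the sample titles and the helper name `pad_to_length` are illustrative, not from the patent:

```python
def pad_to_length(token_ids, target_len=40, pad_id=0):
    """Pad (or truncate) a token-id sequence to a fixed length, mirroring the
    0-padding applied above to both the title side and the label side."""
    if len(token_ids) >= target_len:
        return token_ids[:target_len]
    return token_ids + [pad_id] * (target_len - len(token_ids))

# Hypothetical tokenized titles of different lengths
title_a = list(range(1, 22))   # 21 tokens
title_b = list(range(1, 29))   # 28 tokens

padded_a = pad_to_length(title_a)
padded_b = pad_to_length(title_b)
# Both now have sequence length 40 and can be batched into one tensor.
```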
In this embodiment, invoking a tag semantic information vector corresponding to a tag category includes:
a tag semantic information vector corresponding to a tag class of a corresponding target word length (e.g., consistent with the extended text length, e.g., also 40 sequence lengths) is called. The execution main body can calculate and store the label semantic information vectors corresponding to the label categories with different sequence lengths in advance, and then the execution main body can call the stored label semantic vectors with the same or similar length to the target characters according to the determined length of the target characters to be directly used, so that the speed of classifying the texts is improved.
In some optional implementations of this embodiment, the executing entity may also estimate the longest text title and the longest label category in advance, take the least common multiple of their corresponding sequence lengths as the target sequence length for the label categories, extend the label categories to that target sequence length, and compute the corresponding tag semantic information vectors ahead of time. When the input text title changes, the multilayer neural network model only needs to be called to recompute the text title semantic vector for the changed title; the pre-stored tag semantic information vectors can then be invoked directly for text classification.
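The length selection and caching scheme above can be sketched as follows. The concrete lengths, the cache structure, and the helper names are assumptions for illustration; `dummy_compute` stands in for the real A + C module pass:

```python
from math import lcm

# Hypothetical longest observed sequence lengths for titles and label categories
longest_title_len = 40
longest_label_len = 16

# The target sequence length is their least common multiple, as described above
target_len = lcm(longest_title_len, longest_label_len)

# Cache of pre-computed label-side vectors, keyed by label set and length
label_vector_cache = {}
call_count = 0

def get_label_vectors(label_set_id, seq_len, compute_fn):
    """Return cached label vectors, running the expensive model pass
    (compute_fn) only on a cache miss."""
    key = (label_set_id, seq_len)
    if key not in label_vector_cache:
        label_vector_cache[key] = compute_fn(label_set_id, seq_len)
    return label_vector_cache[key]

def dummy_compute(label_set_id, seq_len):
    # Stand-in for passing the padded labels through the A and C modules
    global call_count
    call_count += 1
    return [0.0] * seq_len

vecs_first = get_label_vectors("topic-L1", target_len, dummy_compute)
vecs_again = get_label_vectors("topic-L1", target_len, dummy_compute)  # cache hit
```

The second lookup returns the stored vectors without recomputation, which is what lets a changed title be classified by re-encoding the title alone.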
In this embodiment, generating a text header semantic information vector based on the extracted semantic information includes:
determining a title identifier corresponding to the text title; and inputting the extracted semantic information into the model corresponding to the title identifier within the multilayer neural network model (that is, inputting the semantic information corresponding to the text title into the B module of the multilayer neural network corresponding to the predetermined title identifier; the B module corresponds to determined model weights and can be used directly) to output the corresponding text title semantic information vector.
In this embodiment, before invoking the multi-layer neural network model, the text processing method further includes:
acquiring an initial multilayer neural network model;
acquiring a training sample set, wherein the training sample set comprises a multi-dimensional text title, a multi-level label category, a subject identifier corresponding to a labeled text title and a category identifier corresponding to a labeled text title;
and taking the multi-dimensional text title and the multi-level label category as the input of the multilayer neural network model, taking the subject identification corresponding to the labeled text title and the category identification corresponding to the labeled text title as the expected output, and training the initial multilayer neural network model to obtain the multilayer neural network model.
Specifically, for preparation of the training sample set: the subject categories acquired from the XX news station include 22 major categories and 100 minor categories. In order to process the category labels of each hierarchy uniformly, the labels of the respective hierarchies are processed as follows. In this process, according to a preset label system, L1 denotes the primary label classification and L2 denotes the secondary labels; L1 and L2 are just one example, and the classification standard follows the national standard. The labels within L1 are at the same level, and L2 is the level below L1: L1 = {"comprehensive government", "finance, auditing", "A institution", ...}; L2 = {"finance", "tax", ...} (the second-level labels under "finance, auditing"). The same processing is performed for another dimension, the industry label classification, for example: H1 = {"agriculture, forestry, fishery, animal husbandry", "mining", ...}; H2 = {"coal mining and washing industry", "oil and gas mining industry", ...} (the second-level labels under "mining"). This completes the label data processing for the two dimensions "subject" and "industry". For the input of the model, the text title and the corresponding label categories are used as input. For example, under the subject classification labels, the "Regulations on implementation of tax Law" belongs to both the "finance, auditing (finance)" label and the "finance, auditing (tax)" label, so 2 training samples are generated:
1) X1: "Regulations on implementation of tax Law"; X2 = {"comprehensive government", "finance, auditing", ...}; Y: [0, 1, 0, ...] (a 1 denotes that the category at the corresponding position in X2 is hit).
2) X1: "Regulations on implementation of tax Law"; X2 = {"finance", "tax", ...}; Y: [1, 1, 0, ...].
Here 1) corresponds to the first-level classification data under the subject label, and 2) corresponds to the second-level classification data under the subject label.
The same processing is performed under the industry label, for example:
1) X1: "Notice of rectification in the mining industry"; X2 = {"agriculture, forestry, fishery, animal husbandry", "mining", ...}; Y: [0, 1, 0, ...].
2) X1: "Notice of rectification in the mining industry"; X2 = {"coal mining and washing industry", "oil and gas mining industry", ...}; Y: [1, 1, 0, ...].
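The sample construction above can be sketched in a few lines. This is an assumed helper, not from the patent: one sample is emitted per label level, and the multi-hot target Y marks which candidate categories in X2 the title hits.

```python
# Hedged sketch of training-sample construction: for each label level, pair
# the title with that level's candidate labels and a multi-hot target vector.
# `make_samples` and its argument shape are illustrative assumptions.
def make_samples(title, levels):
    """levels: list of (candidate_labels, hit_labels) pairs, one per level."""
    samples = []
    for candidates, hits in levels:
        # Y has a 1 at each position of X2 whose category the title hits.
        y = [1 if label in hits else 0 for label in candidates]
        samples.append({"X1": title, "X2": list(candidates), "Y": y})
    return samples


samples = make_samples(
    "Regulations on implementation of tax Law",
    [
        # First level under the subject label: only "finance, auditing" is hit.
        (["comprehensive government", "finance, auditing"], {"finance, auditing"}),
        # Second level: both "finance" and "tax" are hit.
        (["finance", "tax"], {"finance", "tax"}),
    ],
)
```

Run against the example in the text, this yields the two samples with Y = [0, 1, ...] and Y = [1, 1, ...] shown above.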
When the model is trained, the execution body can input multi-level and multi-dimensional label data into the model simultaneously for training, to fit and optimize the model.
Specifically, in the initial multilayer neural network model (hereinafter referred to as the model), the article title and the label data are used as input and passed through a pretrained language model for semantic feature extraction (module A; any pretrained language model can be used here, for example large-scale pretrained language models such as BERT or RoBERTa), from which vectors containing semantic information can be extracted (such as the vector of the CLS token in the BERT model). The left side (module B) yields the text title semantic information vector, and the right side (module C) yields the label semantic information vector for each label. At this point the execution body obtains a 1 × 768 vector for the title and one vector per tag; assuming the input tag data is the level below the subject label "finance, auditing", which contains 7 subclasses, a 7 × 768 tag matrix is obtained. The text title semantic information vector and the label semantic information vectors are then input into a multilayer neural network; after this processing, the title-side feature vector is still 1 × 768 and the label-side feature matrix is still 7 × 768. Finally, the two matrices are multiplied to obtain a 1 × 7 vector corresponding to the classification features of each label category. To realize the goal of multi-label classification, a sigmoid function is used in the last layer of the model (the sigmoid function is often used as an activation function of a neural network, mapping a variable to between 0 and 1), so that each of the 7 label dimensions undergoes binary classification; the multi-label problem is thus converted into multiple binary classifications, and the model is trained with binary cross-entropy as the loss function.
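The shapes and the final sigmoid/cross-entropy step can be sketched numerically. This is an illustrative sketch only: random arrays stand in for the module-B/module-C features, which in the patent come from the pretrained language model plus the MLP.

```python
# Shape-level sketch of the described forward pass, with random stand-ins
# for the title-side (module B) and tag-side (module C) features.
import numpy as np

rng = np.random.default_rng(0)
title_vec = rng.standard_normal((1, 768))  # module B output: 1 x 768
tag_vecs = rng.standard_normal((7, 768))   # module C output: 7 tags x 768

# Multiply the two feature matrices: one classification score per tag.
logits = title_vec @ tag_vecs.T            # shape (1, 7)

# Sigmoid maps each score into (0, 1), turning the multi-label problem
# into 7 independent binary classifications.
probs = 1.0 / (1.0 + np.exp(-logits))

# Binary cross-entropy against a multi-hot target, as used for training.
y = np.array([[0, 1, 0, 0, 1, 0, 0]], dtype=float)
eps = 1e-9
bce = -np.mean(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps))
```

In a real training loop the same computation would be expressed in an autodiff framework so the loss can be backpropagated through modules B and C.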
For the title input on the left side of the model, padding is performed on the title (a padding operation that makes the title input lengths consistent; 0 is padded by default) to ensure a consistent input length. Take two titles in the input sample: "Notice of the vehicle and vessel tax preferential policy for energy-saving new energy vehicles and vessels" and "Notice on accelerating the related work of auditing the renewable energy power generation subsidy project list". The title length is unified to 40, and the processed titles become: "Notice of the vehicle and vessel tax preferential policy for energy-saving new energy vehicles and vessels0000000000000000000" and "Notice on accelerating the related work of auditing the renewable energy power generation subsidy project list000000000000". The data on the label side also needs padding: in training, labels of multiple levels and multiple dimensions are input into the model at the same time, so to keep the vector dimensions consistent, and the dimensions after the matrix operations consistent, padding is performed on the label side. All label sets are padded to a consistent length (assuming that, across the various classification label systems, the maximum number of labels is 10), and the sequence length of each label category itself is padded to 8. When the initial multilayer neural network model computes the loss and gradient descent at the end, it needs to record which parts were padded (i.e., record how the label categories were respectively padded); the padded parts do not participate in computing the loss and do not undergo gradient descent.
An example follows. Extension of the tag categories, before processing: Tags1 = {"comprehensive government", "finance, auditing", "AB institute organization", ...}, 8 entries; Tags2 = {"finance", "tax", ...}, 9 entries; Tags3 = {"agriculture, forestry, fishery, animal husbandry", "mining", ...}, 7 entries. After processing: Tags1 = {"comprehensive government0000", "finance, auditing", "AB institute organization0", ..., "00000000"}, 10 sequences of length 8; Tags2 = {"finance000000", "tax000000", ..., "00000000"}, 10 sequences of length 8; Tags3 = {"agriculture, forestry, fishery, animal husbandry", "mining00000", ..., "00000000"}, 10 sequences of length 8. All the data are input into the model together for training, and the MLP (the multilayer neural network part) in the model fits the multi-level and multi-dimensional data, which improves the robustness of the model.
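The padding scheme above can be sketched as two small helpers. This is an assumed sketch: titles are padded to length 40, each tag string to length 8, each tag set to 10 entries, and a mask records which entries are padding so they can be excluded from the loss, as the text describes. The helper names and return shapes are illustrative.

```python
# Hedged sketch of the padding scheme: "0" is the pad character, mirroring
# the examples in the text. Helper names are assumptions, not from the patent.
def pad_title(title: str, length: int = 40) -> str:
    """Pad a title with '0' to a fixed input length."""
    return title + "0" * max(0, length - len(title))


def pad_tags(tags, n_tags: int = 10, tag_len: int = 8):
    """Pad each tag string to tag_len and the tag set to n_tags entries.

    Returns (padded_tags, mask); mask == 0 marks padded entries, which are
    excluded from the loss and from gradient descent.
    """
    padded = [t + "0" * max(0, tag_len - len(t)) for t in tags]
    mask = [1] * len(padded) + [0] * (n_tags - len(padded))
    padded = padded + ["0" * tag_len] * (n_tags - len(padded))
    return padded, mask
```

Multiplying the per-entry loss by the mask before averaging implements the "padded parts do not participate in the loss" rule.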
Notably, the model does not depend on BERT as its pretrained language model, and alternatives are possible. The embodiments of the present application need only one pretrained language model to extract the semantic information of the labels and titles, and other pretrained semantic models can be substituted to extract that semantic information.
Step S103, calling a tag semantic information vector corresponding to the tag category, and further generating a subject identifier and a category identifier corresponding to the text based on the tag semantic information vector, the text title semantic information vector and the classifier so as to process the text based on the subject identifier and the category identifier.
In this embodiment, the tag semantic information vector and the text title semantic information vector may be input into a classifier, and a topic identifier and a category identifier of a corresponding text are generated, so that the text is processed based on the topic identifier and the category identifier.
In this embodiment, by receiving a text classification request, the text title and the label category of the text corresponding to the text classification request are obtained; a multilayer neural network model is called to extract semantic information from the text title, and a text title semantic information vector is further generated based on the extracted semantic information; and the tag semantic information vector corresponding to the tag category is called, and the subject identifier and category identifier of the text are further generated based on the tag semantic information vector, the title semantic information vector and the classifier, so that the text is processed based on the subject identifier and the category identifier. Since the text title and the label category are input into the multilayer neural network model, the text title semantic information vector and the label semantic information vector can both be generated, and the subject identifier and category identifier corresponding to the text are generated based on the classifier. Classification over multi-dimensional and multi-level labels can thus be completed with only one multilayer neural network model, and the text is processed based on the classification, saving time and labor.
Fig. 2 is a schematic main flow diagram of a text processing method according to a second embodiment of the present application, and as shown in fig. 2, the text processing method includes:
step S201, receiving a text classification request, and acquiring a text title and a label category of a text corresponding to the text classification request.
Step S202, calling a multilayer neural network model to extract semantic information of the text title, and further generating a text title semantic information vector based on the extracted semantic information.
Step S203, calling a tag semantic information vector corresponding to the tag category, and further generating a subject identifier and a category identifier corresponding to the text based on the tag semantic information vector, the text title semantic information vector and the classifier so as to process the text based on the subject identifier and the category identifier.
The principle of step S201 to step S203 is similar to that of step S101 to step S103, and is not described here again.
Specifically, step S203 may also be implemented by step S2031:
step S2031, multiplying the tag semantic information vector (for example, a 7 × 768 vector matrix) with the title semantic information vector (for example, a 1 × 768 vector matrix) to generate a vector matrix, and inputting the vector matrix into the classifier to generate the classification identifier of the corresponding text.
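At inference time, step S2031 amounts to one matrix multiplication followed by per-tag thresholding. The sketch below is illustrative: the function name and the 0.5 threshold are assumptions, and the 2-dimensional vectors in the usage note stand in for the 768-dimensional features.

```python
# Illustrative sketch of step S2031 at inference time: multiply the 1 x d
# title vector with the n x d tag matrix, apply a sigmoid, and keep the tag
# categories whose probability clears a threshold (0.5 assumed here).
import numpy as np


def classify(title_vec, tag_vecs, tag_names, threshold=0.5):
    scores = 1.0 / (1.0 + np.exp(-(title_vec @ tag_vecs.T)))  # shape (1, n)
    return [name for name, p in zip(tag_names, scores[0]) if p >= threshold]
```

For example, with a title vector aligned with the first of two tag vectors and opposed to the second, only the first tag name is returned.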
Fig. 3 is a schematic view of an application scenario of a text processing method according to a third embodiment of the present application. The text processing method is applied to multi-label classification of a text, where the classification can have a plurality of levels and the major category of each level comprises a plurality of minor categories; in addition, classification of the text in multiple dimensions, that is, scenes classified under multiple different label systems, needs to be completed. As shown in fig. 3, the execution subject (for example, a server) 304 receives a text classification request 301, and obtains the text title 303 and the tag category 302 of the text corresponding to the text classification request 301. The execution subject 304 invokes a multilayer neural network model 305 to extract semantic information 306 from the text title 303, thereby generating a text title semantic information vector 307 based on the extracted semantic information 306. The execution subject 304 calls the tag semantic information vector 308 corresponding to the tag category 302, and generates the subject identifier and the category identifier of the corresponding text based on the tag semantic information vector 308, the text title semantic information vector 307, and the classifier 309, so as to process the text based on the subject identifier and the category identifier.
The text processing method has universality: one set of model structure handles the multi-dimensional problem, so the method can be used not only for the classification of text subjects but also for the classification of text audiences and text industries. In this design, the weight parameters are stored in the model in blocks, which can optimize online inference time, improve the performance of the online system, and save time and labor.
Fig. 4 is a schematic diagram of main blocks of a text processing apparatus according to an embodiment of the present application. As shown in fig. 4, the text processing apparatus includes a receiving unit 401, a text title semantic information vector generating unit 402, and a text classifying unit 403.
The receiving unit 401 is configured to receive a text classification request, and obtain a text title and a tag category of a text corresponding to the text classification request.
A text header semantic information vector generating unit 402 configured to invoke the multi-layer neural network model to perform semantic information extraction on the text header, and further generate a text header semantic information vector based on the extracted semantic information.
The text classification unit 403 is configured to call a tag semantic information vector corresponding to a tag category, and further generate a subject identifier and a category identifier of a corresponding text based on the tag semantic information vector, the text title semantic information vector, and the classifier, so as to process the text based on the subject identifier and the category identifier.
In some embodiments, the text title semantic information vector generation unit 402 is further configured to: and inputting the text title into a language model to output text title semantic information corresponding to the text title, wherein the language model is used for representing the corresponding relation between the text and the semantic information.
In some embodiments, the text title semantic information vector generation unit 402 is further configured to: determining a title identifier corresponding to the text title; and inputting the extracted semantic information into a model corresponding to the title identifier in the multilayer neural network model to output a corresponding text title semantic information vector.
In some embodiments, the text title semantic information vector generation unit 402 is further configured to: determining the preset character length corresponding to the label type; determining the length of a character corresponding to the text title; determining the length of a target character according to the preset character length and the character length corresponding to the text title; and expanding the character length corresponding to the text title to the target character length, and further extracting semantic information of the text title of the target character length.
In some embodiments, text classification unit 403 is further configured to: and calling a label semantic information vector corresponding to the label category corresponding to the target character length.
In some embodiments, text classification unit 403 is further configured to: and multiplying the label semantic information vector by the title semantic information vector to generate a vector matrix, and inputting the vector matrix into a classifier to generate a classification identifier of the corresponding text.
In some embodiments, the text processing apparatus further comprises a training unit configured to: acquiring an initial multilayer neural network model; acquiring a training sample set, wherein the training sample set comprises a multi-dimensional text title, a multi-level label category, a subject identifier corresponding to a labeled text title and a category identifier corresponding to a labeled text title; and taking the multi-dimensional text title and the multi-level label category as the input of the multilayer neural network model, taking the subject identification corresponding to the labeled text title and the category identification corresponding to the labeled text title as the expected output, and training the initial multilayer neural network model to obtain the multilayer neural network model.
In the present application, the text processing method and the text processing apparatus correspond to each other in their specific implementations, and therefore the repeated content is not described again.
Fig. 5 shows an exemplary system architecture 500 to which the text processing method or the text processing apparatus of the embodiments of the present application can be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505 (i.e., an execution agent). The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 501, 502, 503 may be various electronic devices having text processing screens and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 505 may be a server providing various services, for example a background management server (i.e., the execution body, by way of example only) that provides support for text classification requests submitted by users using the terminal devices 501, 502, 503. The background management server (i.e., the execution body) can receive a text classification request and acquire the text title and label category of the text corresponding to the text classification request; call a multilayer neural network model to extract semantic information from the text title, and further generate a text title semantic information vector based on the extracted semantic information; and call the tag semantic information vector corresponding to the tag category, and further generate the subject identifier and category identifier of the corresponding text based on the tag semantic information vector, the text title semantic information vector, and the classifier, so as to process the text based on the subject identifier and the category identifier. The classification of multi-dimensional and multi-level labels can be completed through only one multilayer neural network model, so that the text is processed based on the classification, saving time and labor.
It should be noted that the text processing method provided in the embodiment of the present application is generally executed by the server 505, and accordingly, the text processing apparatus is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing a terminal device of an embodiment of the present application. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the computer system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, as well as a speaker and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed in the storage section 608 as necessary.
In particular, according to embodiments disclosed herein, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments disclosed herein include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the present application are executed when the computer program is executed by the Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a receiving unit, a text header semantic information vector generating unit, and a text classifying unit. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs, and when the one or more programs are executed by the device, the device receives a text classification request and acquires a text title and a label category of a text corresponding to the text classification request; calling a multilayer neural network model to extract semantic information of the text title, and further generating a text title semantic information vector based on the extracted semantic information; and calling a tag semantic information vector corresponding to the tag category, and further generating a subject identifier and a category identifier of the corresponding text based on the tag semantic information vector, the text title semantic information vector and the classifier so as to process the text based on the subject identifier and the category identifier. The classification of multi-dimensional and multi-level labels can be completed only through one multi-level neural network model, so that the text is processed based on classification, and time and labor are saved.
According to the technical scheme of the embodiments of the present application, the classification of multi-dimensional and multi-level labels can be completed through only one multilayer neural network model, so that the text is processed based on the classification, saving time and labor.
The above-described embodiments should not be construed as limiting the scope of the present application. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A method of text processing, comprising:
receiving a text classification request, and acquiring a text title and a label category of a text corresponding to the text classification request;
calling a multilayer neural network model to extract semantic information of the text title, and further generating a text title semantic information vector based on the extracted semantic information;
and calling a tag semantic information vector corresponding to the tag category, and further generating a subject identifier and a category identifier corresponding to the text based on the tag semantic information vector, the text title semantic information vector and a classifier so as to process the text based on the subject identifier and the category identifier.
2. The method of claim 1, wherein extracting semantic information from the text header and generating a text header semantic information vector based on the extracted semantic information comprises:
and inputting the text title into a language model to output text title semantic information corresponding to the text title, wherein the language model is used for representing the corresponding relation between the text and the semantic information.
3. The method of claim 1, wherein generating a text header semantic information vector based on the extracted semantic information comprises:
determining a title identifier corresponding to the text title;
and inputting the extracted semantic information into a model corresponding to the title identifier in the multilayer neural network model to output a corresponding text title semantic information vector.
4. The method of claim 1, wherein the extracting semantic information from the text header comprises:
determining the preset character length corresponding to the label type;
determining the character length corresponding to the text title;
determining a target character length according to the preset character length and the character length corresponding to the text title;
and expanding the character length corresponding to the text title to the target character length, and further extracting semantic information of the text title of the target character length.
5. The method of claim 4, wherein the invoking of the tag semantic information vector corresponding to the tag category comprises:
and calling a label semantic information vector corresponding to the label category corresponding to the target character length.
6. The method of claim 1, wherein generating a topic identification and a category identification corresponding to the text based on the tag semantic information vector, the text header semantic information vector, and a classifier comprises:
and multiplying the label semantic information vector and the title semantic information vector to generate a vector matrix, and inputting the vector matrix into a classifier to generate a theme identifier and a category identifier corresponding to the text.
7. The method of claim 1, wherein prior to said invoking the multi-layer neural network model, the method further comprises:
acquiring an initial multilayer neural network model;
acquiring a training sample set, wherein the training sample set comprises a multi-dimensional text title, a multi-level label category, a theme identifier corresponding to the labeled text title and a category identifier corresponding to the labeled text title;
and taking the multi-dimensional text title and the multi-level label category as the input of the multilayer neural network model, taking the subject identification corresponding to the labeled text title and the category identification corresponding to the labeled text title as the expected output, and training the initial multilayer neural network model to obtain the multilayer neural network model.
8. A text processing apparatus, comprising:
a receiving unit configured to receive a text classification request and acquire a text title and a tag category of the text corresponding to the text classification request;
a text title semantic information vector generating unit configured to invoke a multi-layer neural network model to extract semantic information from the text title and to generate a text title semantic information vector based on the extracted semantic information; and
a text classification unit configured to invoke a tag semantic information vector corresponding to the tag category, and to generate a topic identifier and a category identifier corresponding to the text based on the tag semantic information vector, the text title semantic information vector, and a classifier, so as to process the text based on the topic identifier and the category identifier.
9. The apparatus of claim 8, wherein the text title semantic information vector generating unit is further configured to:
input the text title into a language model to output text title semantic information corresponding to the text title, wherein the language model characterizes the correspondence between text and semantic information.
10. The apparatus of claim 8, wherein the text title semantic information vector generating unit is further configured to:
determine a title identifier corresponding to the text title; and
input the extracted semantic information into the model corresponding to the title identifier within the multi-layer neural network model to output the corresponding text title semantic information vector.
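The routing described in claim 10 amounts to a dispatch table keyed by the title identifier. A minimal sketch with hypothetical helper callables (none of these names come from the patent):

```python
def generate_title_vector(title, models, extract_semantics, identify_title):
    """Claim 10 sketch: route the extracted semantic information to the
    sub-model matching the title's identifier.

    `models` maps a title identifier to a callable sub-model within the
    multi-layer neural network; `extract_semantics` and `identify_title`
    are hypothetical stand-ins for the extraction and identification steps.
    """
    semantic_info = extract_semantics(title)  # semantic extraction step
    title_id = identify_title(title)          # determine the title identifier
    return models[title_id](semantic_info)    # sub-model emits the title vector
```

Keeping one sub-model per title identifier lets titles of different kinds (for example, different lengths or dimensions) be vectorized by a model specialized for that kind.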
11. An electronic device for text processing, comprising:
one or more processors; and
a storage device storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
12. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
CN202110767855.7A 2021-07-07 2021-07-07 Text processing method and device, electronic equipment and computer readable medium Pending CN113486147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110767855.7A CN113486147A (en) 2021-07-07 2021-07-07 Text processing method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110767855.7A CN113486147A (en) 2021-07-07 2021-07-07 Text processing method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN113486147A true CN113486147A (en) 2021-10-08

Family

ID=77941831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110767855.7A Pending CN113486147A (en) 2021-07-07 2021-07-07 Text processing method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN113486147A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114547313A (en) * 2022-04-22 2022-05-27 阿里巴巴达摩院(杭州)科技有限公司 Resource type identification method and device

Similar Documents

Publication Publication Date Title
US9495345B2 (en) Methods and systems for modeling complex taxonomies with natural language understanding
CN110705301B (en) Entity relationship extraction method and device, storage medium and electronic equipment
CN110019742B (en) Method and device for processing information
CN111325022B (en) Method and device for identifying hierarchical address
US11423307B2 (en) Taxonomy construction via graph-based cross-domain knowledge transfer
CN108984650A (en) Computer readable recording medium and computer equipment
CN107291775B (en) Method and device for generating repairing linguistic data of error sample
CN107798622B (en) Method and device for identifying user intention
CN114330474B (en) Data processing method, device, computer equipment and storage medium
CN110659657A (en) Method and device for training model
CN112463968A (en) Text classification method and device and electronic equipment
CN112528654A (en) Natural language processing method and device and electronic equipment
CN111639162A (en) Information interaction method and device, electronic equipment and storage medium
CN109982272B (en) Fraud short message identification method and device
CN112686035A (en) Method and device for vectorizing unknown words
CN113486147A (en) Text processing method and device, electronic equipment and computer readable medium
CN110807097A (en) Method and device for analyzing data
CN110852057A (en) Method and device for calculating text similarity
CN115935958A (en) Resume processing method and device, storage medium and electronic equipment
CN113837216B (en) Data classification method, training device, medium and electronic equipment
CN109858745A (en) Transcription platform matching process and device
CN116127060A (en) Text classification method and system based on prompt words
CN113886543A (en) Method, apparatus, medium, and program product for generating an intent recognition model
CN113688232A (en) Method and device for classifying bidding texts, storage medium and terminal
CN113806536A (en) Text classification method and device, equipment, medium and product thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination