CN113609866A - Text marking method, device, equipment and storage medium - Google Patents

Text marking method, device, equipment and storage medium Download PDF

Info

Publication number
CN113609866A
CN113609866A (application CN202110920440.9A)
Authority
CN
China
Prior art keywords
text
character
label
sequence
key information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110920440.9A
Other languages
Chinese (zh)
Inventor
铁瑞雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tenpay Payment Technology Co Ltd
Original Assignee
Tenpay Payment Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tenpay Payment Technology Co Ltd filed Critical Tenpay Payment Technology Co Ltd
Priority to CN202110920440.9A priority Critical patent/CN113609866A/en
Publication of CN113609866A publication Critical patent/CN113609866A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text marking method, apparatus, device, and storage medium, belonging to the technical field of artificial intelligence. The method comprises the following steps: acquiring a target text; splicing a text type marker onto the target text to obtain a character sequence corresponding to the target text; extracting semantic features from the character sequence to obtain, for each character in the sequence, a probability distribution over role labels; determining, according to the role label probability distributions, the key information of the target text and the text type corresponding to the text type marker; and generating mark information for the target text based on the key information and the text type. In this technical scheme, semantic features are extracted from the character sequence formed after the classification marker is added, yielding for each character a probability distribution over key information types and text types; the key information and the text type of the text are then determined simultaneously, which effectively improves the efficiency of text marking and reduces its complexity.

Description

Text marking method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text labeling method, apparatus, device, and storage medium.
Background
With the rapid development of internet information technology, the volume of text data has grown rapidly, and how to manage massive amounts of text data has become a key concern in the industry.
In the related art, the text classification task and the information extraction task are generally performed separately. For text classification, a keyword-triggering approach is usually adopted, or a common classification model is used to classify a complaint text into its corresponding category. For information extraction, keywords in the text are often extracted through a keyword extraction model, or the text is mapped to a corresponding topic tag through a topic model, or the corresponding entity information in the text is extracted through a named entity recognition model.
In the related art, the amount of information extracted from text is limited and the process is highly complex.
Disclosure of Invention
The embodiments of the present application provide a text marking method, apparatus, device, and storage medium, which can determine the text type and the key information of a text simultaneously, increasing the information content of the text mark information and reducing the complexity of text marking.
According to an aspect of an embodiment of the present application, there is provided a text labeling method, including:
acquiring a target text;
splicing the text type marker to the target text to obtain a character sequence corresponding to the target text;
performing semantic feature extraction processing on the character sequence to obtain role label probability distribution corresponding to each character in the character sequence, wherein the role label probability distribution is used for representing the probability of each character corresponding to each role label, and each role label is used for representing key information attributes and text type attributes of the character;
determining key information of the target text and a text type corresponding to the text type marker according to the role label probability distribution;
and generating mark information of the target text based on the key information and the text type.
According to an aspect of an embodiment of the present application, there is provided a text marking apparatus, including:
the text acquisition module is used for acquiring a target text;
the mark splicing module is used for splicing the text type mark to the target text to obtain a character sequence corresponding to the target text;
a label probability determination module, configured to perform semantic feature extraction processing on the character sequence to obtain role label probability distribution corresponding to each character in the character sequence, where the role label probability distribution is used to represent probabilities of each character corresponding to each role label, and each role label is used to represent a key information attribute and a text type attribute of a character;
the information determining module is used for determining key information of the target text and a text type corresponding to the text type marker according to the role label probability distribution;
and the information marking module is used for generating marking information of the target text based on the key information and the text type.
According to an aspect of embodiments of the present application, there is provided a computer device comprising a processor and a memory, wherein at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the above text tagging method.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions that is loaded and executed by a processor to implement the above-mentioned text labeling method.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the text labeling method described above.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
the method comprises the steps of adding text classification marks on an original text, extracting semantic features of a character sequence formed after the text classification marks are added, obtaining role label probability distribution corresponding to each character, wherein the probability distribution is used for predicting whether each character is key information or not and predicting a text type corresponding to the text classification marks, finally determining the key information and the text type of a target text at the same time according to the role label probability distribution, and using the key information and the text type as structured mark information of the target text, combing unstructured text data through structured labels, effectively improving the efficiency of text marking, improving the information content of mark information and reducing the complexity of text marking.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of an application execution environment provided by one embodiment of the present application;
FIG. 2 is a flow diagram of a text tagging method provided by one embodiment of the present application;
FIG. 3 is a flow diagram of a text tagging method provided by one embodiment of the present application;
FIG. 4 illustrates a schematic diagram of determining character embedding characteristics by a BERT model;
FIG. 5 is a diagram illustrating the basic structure of the Transformer model within the BERT model;
FIG. 6 illustrates a schematic diagram of determining a role label probability distribution based on a Conditional Random Field (CRF) model;
FIG. 7 is a schematic diagram illustrating a network structure of a complaint text label model;
FIG. 8 is a flow chart of a text labeling method provided by another embodiment of the present application;
FIG. 9 is a block diagram of a text marking apparatus provided in one embodiment of the present application;
fig. 10 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The text labeling method provided by the embodiments of the present application involves artificial intelligence technology and blockchain technology, which are briefly described below to facilitate understanding by those skilled in the art.
Artificial Intelligence (AI) comprises the theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, spanning both hardware-level and software-level technologies. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Deep learning: the concept of deep learning stems from the study of artificial neural networks. A multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning forms a more abstract class or feature of high-level representation properties by combining low-level features to discover a distributed feature representation of the data.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
The method provided by the embodiments of the present application may relate to the field of cloud technology, for example, to the field of Big Data. Big data refers to data sets that cannot be captured, managed, and processed by conventional software tools within a reasonable time frame; it is a massive, fast-growing, and diversified information asset that requires new processing modes to yield stronger decision-making power, insight, and process optimization capability. With the advent of the cloud era, big data has attracted more and more attention, and it requires special techniques to effectively process large amounts of data within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
The method provided by the embodiments of the present application may further involve a blockchain; that is, the method may be implemented based on a blockchain, the data involved in the method may be stored based on a blockchain, or the execution subject of the method may be located in a blockchain. Blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The blockchain underlying platform can comprise processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for the identity information management of all blockchain participants, including the generation and maintenance of public and private keys (account management), key management, and maintenance of the correspondence between users' real identities and blockchain addresses (authority management); under authorization, it supervises and audits the transactions of certain real identities and provides rule configuration for risk control (risk-control audit). The basic services module is deployed on all blockchain node devices and is used to verify the validity of service requests and record valid requests to storage after consensus is reached; for a new service request, the basic services module first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger (network communication) after encryption, and records it in storage. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution; developers can define contract logic through a programming language, publish it to the blockchain (contract registration), and, according to the logic of the contract terms, trigger execution via key invocation or other events to complete the contract logic, while also providing functions for upgrading and canceling contracts. The operation monitoring module is mainly responsible for deployment, configuration modification, contract settings, and cloud adaptation during product release, as well as visual output of real-time states during product operation, such as alarms, monitoring network conditions, and monitoring the health status of node devices.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of an application execution environment according to an embodiment of the present application is shown. The application execution environment may include: a terminal 10 and a server 20.
The terminal 10 may be an electronic device such as a mobile phone, a tablet computer, a game console, an e-book reader, a multimedia playback device, a wearable device, or a PC (Personal Computer). A client of an application program may be installed in the terminal 10.
In the embodiment of the present application, the application program may be any application program that generates text data. For example, the application may be a financial application, a news application, a social application, an interactive entertainment application, a browser application, a shopping application, a content sharing application, a Virtual Reality (VR) application, an Augmented Reality (AR) application, and the like, which is not limited in this embodiment.
The server 20 is used to provide background services for clients of applications in the terminal 10. For example, the server 20 may be a backend server for the application described above. The server 20 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform. Optionally, the server 20 provides background services for applications in multiple terminals 10 simultaneously.
Alternatively, the terminal 10 and the server 20 may communicate with each other through the network 30. The terminal 10 and the server 20 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
Before describing the method embodiments provided in the present application, a brief description is given to the application scenarios, related terms, or terms that may be involved in the method embodiments of the present application, so as to facilitate understanding by those skilled in the art of the present application.
Named Entity Recognition (NER), also called "proper name recognition", refers to the recognition of entities with specific meaning in text, mainly including names of people, places, and organizations, and other proper nouns.
Information extraction is a technique for extracting specific information from text data.
BERT (Bidirectional Encoder Representations from Transformers): a Transformer-based model for pre-training language representations. A general language understanding model is trained on text corpora, and the resulting BERT model can then assist in performing Natural Language Processing (NLP) tasks.
Conditional Random Field (CRF): a probabilistic graphical model commonly used to label or analyze sequence data, such as natural language text or biological sequences. A conditional random field is a conditional probability distribution model P(Y|X) representing a Markov random field over a set of output random variables Y given a set of input random variables X; that is, a CRF assumes that the output random variables constitute a Markov random field. It can be regarded as a generalization of the maximum entropy Markov model to labeling problems.
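For illustration, finding the highest-scoring label path under a linear-chain CRF is typically done with the Viterbi algorithm. The toy sketch below is not from the patent; the function name and score format are illustrative, with `emissions[t][y]` the score of label y at position t and `transitions[(p, y)]` the pairwise score:

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path for a linear-chain CRF.

    emissions: list of {label: score} dicts, one per sequence position.
    transitions: {(prev_label, label): score}; missing pairs score 0.
    """
    labels = list(emissions[0].keys())
    score = {y: emissions[0][y] for y in labels}  # best score ending in y
    backptr = []                                  # backpointers per step
    for t in range(1, len(emissions)):
        new_score, ptr = {}, {}
        for y in labels:
            best_prev = max(
                labels,
                key=lambda p: score[p] + transitions.get((p, y), 0.0),
            )
            new_score[y] = (score[best_prev]
                            + transitions.get((best_prev, y), 0.0)
                            + emissions[t][y])
            ptr[y] = best_prev
        score, backptr = new_score, backptr + [ptr]
    # Trace back from the best final label
    best_last = max(labels, key=lambda y: score[y])
    path = [best_last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Real CRF layers operate on log-potentials produced by the encoder; this sketch only shows the dynamic-programming shape of the decoding step.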
Long Short-Term Memory (LSTM): a recurrent neural network suited to capturing contextual information across positions in a sequence and to sequence prediction.
Bi-directional Long Short-Term Memory (BiLSTM): composed of a forward LSTM and a backward LSTM.
Referring to fig. 2, a flowchart of a text labeling method according to an embodiment of the present application is shown. The method can be applied to a computer device, which refers to an electronic device with data calculation and processing capabilities, for example, the execution subject of each step can be the server 20 in the application program running environment shown in fig. 1. The method can comprise the following steps (210-250).
Step 210, obtaining a target text.
The target text in the embodiment of the present application may be comment text, dialog text, message text, and the like, and the source and the category of the text are not limited in the embodiment of the present application.
In some practical business scenarios, complaint data is valuable: it reflects users' direct feedback to an enterprise and is an important source of clues for discovering enterprise risks in time. Complaint data mostly exists in the form of text, complaint texts generated in financial applications being typical. Therefore, the target text in the embodiments of the present application may be such a complaint text.
Optionally, the target text is a text that has undergone data cleaning. In some application scenarios, the original text data often has a large data volume, considerable text noise, overly long descriptions, and complex semantic expression. Therefore, the original text can be obtained first and then subjected to data cleaning to obtain the target text. Optionally, the data cleaning corresponds to a data preprocessing stage, in which the target text is cleaned; cleaning operations include, but are not limited to, denoising, punctuation cleaning, correction of wrongly written characters, emoticon recognition and removal, and the like. Optionally, the denoising operation includes removing illegal characters, stop words, and the like.
In one example, the raw text is complaint text uploaded by the user. The content of the original text is 'limited-level movie and television app payment without shipment, and small-video friend-making and strange'. After the data is cleaned, the obtained target text is 'limit level movie and television app payment without shipment, and small video profit-making fraud'. Obviously, redundant punctuation marks in the original text are deleted, and wrongly written characters are corrected, so that a high-quality target text is obtained, text noise is effectively removed, and the text quality is improved.
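As a minimal sketch of the cleaning stage (the function name and regex whitelist are illustrative assumptions, not the patent's implementation; the actual pipeline also covers stop words, wrongly written characters, and emoticons):

```python
import re

def clean_text(raw: str) -> str:
    """Minimal cleaning: collapse repeated punctuation, drop illegal chars."""
    # Collapse runs of the same punctuation mark into a single occurrence
    text = re.sub(r"([!?.,~])\1+", r"\1", raw)
    # Keep only word characters, CJK characters, spaces, and basic punctuation
    text = re.sub(r"[^\w\u4e00-\u9fff !?.,~]", "", text)
    return text.strip()
```

For instance, `clean_text("no shipment!!!???")` yields `"no shipment!?"`, removing the redundant punctuation much as the example above describes.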
Step 220, splicing the text type markers to the target text to obtain a character sequence corresponding to the target text.
In order to extract the key information of the target text and judge its text type, a text type marker can be added to the target text as its classification marker. Optionally, the text type marker may be denoted "[CTL]". The text type marker may be spliced to the end of the target text and is used for predicting the text type of the target text.
In this embodiment, the target text spliced with the text type marker may be input into a pre-trained machine learning model, so that the model predicts the role label of each word in the target text, that is, the type of each word, and also predicts the type of the text type marker [CTL]; the type predicted for the text type marker [CTL] is the final type of the target text.
Optionally, the text type marker is a placeholder. Optionally, the text type marker is a character having a preset encoding.
Each word in the target text has a corresponding encoded representation in the computer system, and the character sequence is the sequence of character codes obtained after the text type marker is spliced onto the target text. Optionally, the last character in the character sequence is the text type marker.
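The splicing step can be sketched as follows (a simplified illustration; the function name is assumed, and a real implementation would further map each character to its vocabulary id before feeding the model):

```python
CTL = "[CTL]"  # text type marker, as named in the description above

def build_char_sequence(target_text: str) -> list[str]:
    """Split the target text into characters and splice [CTL] onto the end."""
    chars = list(target_text)
    chars.append(CTL)  # the marker's predicted label carries the text type
    return chars
```

The marker occupies the last position of the sequence, consistent with the description that the text type marker is spliced to the end of the target text.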
Step 230, performing semantic feature extraction processing on the character sequence to obtain the role label probability distribution corresponding to each character in the character sequence.
The role label probability distribution is used for representing the probability of each character corresponding to each role label, and each role label is used for representing the key information attribute and the text type attribute of the character.
The role labels include, but are not limited to, key information labels indicating the key information attribute of a character, text type labels indicating the text type attribute of a character, and other labels indicating that a character is not of interest to the current task.
In one possible embodiment, the role labels are divided according to the BIO (Begin, Inside, Outside) labeling format. Optionally, the key information labels include a key-information first-word label and a key-information middle-word label. Optionally, the text type labels include first-word labels and middle-word labels corresponding to the various text types. In one text type division, the text types may include valid complaint text and invalid complaint text; correspondingly, the text type labels comprise a valid-complaint-text first-word label, a valid-complaint-text middle-word label, an invalid-complaint-text first-word label, and an invalid-complaint-text middle-word label. The embodiments of the present application limit neither the sequence labeling format nor the type and number of role labels; the label format and the required role labels can be determined according to the actual situation.
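One hypothetical BIO-style label inventory matching this description might look as follows (the exact label strings are illustrative assumptions, not taken from the patent):

```python
# B-/I- prefixes mark first and middle words, per the BIO scheme above
KEY_INFO_LABELS = ["B-KW", "I-KW"]             # key information first/middle word
TEXT_TYPE_LABELS = ["B-VALID", "I-VALID",      # valid complaint text
                    "B-INVALID", "I-INVALID"]  # invalid complaint text
OTHER_LABEL = "O"                              # characters the task ignores

ROLE_LABELS = KEY_INFO_LABELS + TEXT_TYPE_LABELS + [OTHER_LABEL]
LABEL_TO_ID = {label: i for i, label in enumerate(ROLE_LABELS)}
```

The integer mapping is what a sequence-labeling model would actually predict over; the string forms are for human-readable decoding.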
In some practical service scenarios, it is desirable on the one hand to classify texts and on the other hand to extract the key information in the texts as text labels. Through the role label probability distribution corresponding to each character, the probability that each character belongs to each text type and the probability that each character belongs to the key information can be determined, so that the text type and the key information of the text can be determined simultaneously. Both requirements are met at once without running two separate computation tasks, which effectively reduces the complexity of text marking and improves its efficiency.
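Once a role label has been chosen for each position, recovering the key information and the text type together is straightforward. A sketch under an assumed BIO-style scheme in which B-KW/I-KW mark key information and the final position (the text type marker) carries a type label such as B-VALID:

```python
def decode_labels(chars: list[str], labels: list[str]) -> tuple[list[str], str]:
    """Recover key phrases and the text type from a predicted label path.

    Assumes the last position is the text type marker and its label's
    suffix (e.g. "B-VALID" -> "VALID") names the text type.
    """
    phrases, current = [], ""
    for ch, lab in zip(chars[:-1], labels[:-1]):  # skip the marker position
        if lab == "B-KW":
            if current:
                phrases.append(current)
            current = ch
        elif lab == "I-KW" and current:
            current += ch
        else:
            if current:
                phrases.append(current)
            current = ""
    if current:
        phrases.append(current)
    text_type = labels[-1].split("-", 1)[-1]
    return phrases, text_type
```

This shows how a single label sequence yields both outputs at once, which is the point of the joint formulation.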
In an exemplary embodiment, as shown in fig. 3, which shows a flowchart of a text labeling method provided in an embodiment of the present application, the step 230 can be implemented as the following step 231.
Step 231, inputting the character sequence into a text classification extraction model for semantic feature extraction processing, to obtain the role label probability distribution corresponding to each character in the character sequence.
The text classification extraction model is a machine learning model obtained by joint training on a text classification task and an information extraction task, using sample texts as training samples and taking the text type of each sample text and the key information in the sample text as label information.
Optionally, the text type identifier corresponding to the text type of the sample text may be spliced onto the sample text to serve as the text type tag corresponding to the sample text. Optionally, a key information tag is added to each word in the sample text. For example, if a word in the sample text is the first word of the key information, that word is tagged with the key information first-word tag.
Meanwhile, taking the text type label and the key information label corresponding to the sample text as label information, the machine learning model is jointly trained to obtain the text classification extraction model, so that it can perform the text classification task and the information extraction task simultaneously: determining the text type while extracting the key information in the text. In one example, following the idea of sequence labeling models, the key information tag is denoted KW. For example, the target text is a complaint text: "Restricted-level movie app takes payment but does not ship, small video profit-making fraud", with the KW-labeled content being "restricted-level movie" and "small video fraud".
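The character-level labeling of a sample text can be sketched as follows. This helper is illustrative, not from the embodiment itself; the tag names and the convention that the appended text type marker receives the type label are assumptions.

```python
def build_char_labels(text, key_spans, type_label):
    """Assign a BIO-style role label to every character of a sample text.

    key_spans: list of (start, end) character offsets of key information.
    type_label: "T" (valid complaint) or "F" (invalid complaint); it labels
    the text type marker appended to the text, not the body characters.
    """
    labels = ["O"] * len(text)
    for start, end in key_spans:
        labels[start] = "B-KW"                 # key information first word
        for i in range(start + 1, end):
            labels[i] = "I-KW"                 # key information middle word
    # The appended text type marker carries the text type label.
    labels.append("B-" + type_label)
    return labels
```

Applied to a six-character text whose first three characters are key information, this yields one B-KW, two I-KW, three O labels, plus the type label for the marker.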
Optionally, the text classification extraction model may extract key information phrases from the text based on named entity recognition technology, extending the extracted "entity" to phrase length. In addition, by jointly extracting the key information and the text type, the model implicitly introduces prior knowledge and improves the accuracy of both the text classification task and the information extraction task.
The text classification extraction model may be a language processing model based on BERT + CRF, on BiLSTM + CRF, or on BERT + BiLSTM + CRF. The text classification task and the information extraction task can also be performed as two independent tasks, in which case a pipeline arrangement may be used.
In the embodiment of the application, a labeling scheme for the sample text is designed based on the sequence-to-sequence (Seq2Seq) idea: a text type marker is added to the original text, so that the model predicts not only the key information label corresponding to each character but also the text type label corresponding to the text type marker, achieving joint training on the text classification task and the information extraction task. In some application scenarios, the text classification extraction model performs particularly well on the task of labeling complaint texts.
In one possible implementation, the target text is a complaint text and the BERT + CRF framework is modified: a complaint classification marker "[CTL]", i.e., the text type marker, is appended to the original text, and the text type label of that marker is predicted alongside the key information label of each character, so that the trained model can simultaneously judge the complaint type of a complaint text and extract the complaint key information from it.
In some embodiments, text labeling is performed only by extracting keywords, so the labeling result depends on the segmentation quality of the tokenizer, and a general keyword extraction algorithm cannot achieve good results in a specific service scenario. In contrast, the key information label corresponding to the sample text is label information annotated at the character level: it does not depend on a tokenizer and can be adjusted according to the service scenario. The text classification extraction model can therefore output role label probability distributions at the character level, is suitable for scenarios with complex semantics, can output a phrase rather than a single word as key information, and effectively improves the accuracy of the text classification task and the key information extraction task. In other embodiments, text labeling is performed based on a topic model, which suffers from an obvious long-tail effect, so most labels are inaccurate; the role labels here are annotated clearly and precisely, which effectively avoids the long-tail effect. In still other embodiments, classification is performed through a named entity model, but named entity models suit entities with obvious boundaries, such as persons and addresses.
In an exemplary embodiment, the implementation of the above step 231 is as follows.
Firstly, embedding processing is carried out on the character sequence to obtain a character characteristic sequence.
For a target character in the character sequence, a word embedding feature, a sentence embedding feature, and a position embedding feature of the target character are determined.
In one possible implementation, the word embedding feature, the sentence embedding feature, and the position embedding feature corresponding to the target character may be obtained by performing the word embedding process, the sentence embedding process, and the position embedding process on the target character, respectively.
The word embedding feature may be a word embedding feature vector corresponding to the character, representing the character's value in a feature space; it is typically low-dimensional and dense. The sentence embedding feature characterizes the sentence in which the character is located and may be a sentence embedding feature vector of that sentence. Optionally, different characters in the same sentence share the same sentence embedding feature. The position embedding feature characterizes the position of the character in the character sequence, reflecting the positional difference between characters.
And obtaining the character embedding characteristics of the target character based on the word embedding characteristics, the sentence embedding characteristics and the position embedding characteristics.
Optionally, the word embedding feature, the sentence embedding feature and the position embedding feature of the target character are superposed to obtain the character embedding feature of the target character. The character embedding feature thus represents the character from the word, sentence and position dimensions, fully reflecting the feature information carried by each character.
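The superposition of the three embeddings can be sketched as a simple element-wise sum. Plain Python lists stand in for tensors here; this is a minimal illustration, not a real embedding layer.

```python
def character_embedding(token_emb, segment_emb, position_emb):
    """Superpose word, sentence and position embeddings element-wise,
    as described above, to obtain the character embedding feature."""
    return [t + s + p for t, s, p in zip(token_emb, segment_emb, position_emb)]
```

All three input vectors must share the same dimensionality, which is why BERT-style models use one hidden size for every embedding type.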
And arranging the character embedding features of the characters to obtain a character feature sequence.
And arranging the character embedding features of the characters according to the position sequence of the characters in the character sequence to obtain the character feature sequence.
In one example, as shown in FIG. 4, a schematic diagram illustrates how character embedding features are determined by a BERT model. The BERT model performs word segmentation on the input text and processes each segmentation result as one unit, obtaining a character embedding feature for each. During feature extraction, the BERT model also adds special tokens to the segmentation results; illustratively, [CLS] is a classification token and [SEP] is a sentence-separator token. Taking "my dog is cute, he likes playing" as the input text, segmentation yields "my", "dog", "is", "cute", "he", "likes", "play", "##ing"; after the [CLS] and [SEP] tokens are added, the character sequence is "[CLS]", "my", "dog", "is", "cute", "[SEP]", "he", "likes", "play", "##ing", "[SEP]". Token embedding, segment embedding and position embedding are then applied to each character in the sequence, finally yielding the word embedding feature, sentence embedding feature and position embedding feature corresponding to each character.
For the character sequence {"[CLS]", "my", "dog", "is", "cute", "[SEP]", "he", "likes", "play", "##ing", "[SEP]"}, the token embedding features corresponding to the characters are E_[CLS], E_my, E_dog, E_is, E_cute, E_[SEP], E_he, E_likes, E_play, E_##ing, E_[SEP]; the sentence embedding features are E_A, E_A, E_A, E_A, E_A, E_A, E_B, E_B, E_B, E_B, E_B; and the position embedding features are E_0, E_1, E_2, E_3, E_4, E_5, E_6, E_7, E_8, E_9, E_10. Optionally, the word embedding feature, sentence embedding feature and position embedding feature of a character may be added to obtain the character embedding feature of that character.
And then, performing bidirectional semantic feature extraction processing on the character feature sequence to obtain role label probability distribution corresponding to each character in the character sequence.
And performing bidirectional semantic feature extraction processing on the character feature sequence to obtain first role label probability distribution of each character corresponding to each key information label and second role label probability distribution of each character corresponding to each text type label.
The role label probability distribution comprises a first role label probability distribution and a second role label probability distribution, the key information labels comprise key information first character labels and key information middle character labels, and the text type labels comprise labels corresponding to at least one text type.
In an exemplary embodiment, the character feature sequence is input into the Transformer model within the BERT model for bidirectional semantic feature extraction, obtaining the first role label probability distribution of each character over the key information labels and the second role label probability distribution of each character over the text type labels.
In the above Transformer model, based on the multi-head attention mechanism, correlations between the embedding features of the characters can be mined and quantized into numerical feature values in each feature dimension, and the weight of each character's embedding feature in each feature dimension is determined, thereby determining the probability distribution of each character over the role labels. Since the role labels comprise both key information labels and text type labels, the role label probability distribution can be decomposed into the first role label probability distribution and the second role label probability distribution.
In one example, as shown in fig. 5, a schematic diagram exemplarily shows the basic structure of the Transformer model in a BERT model. The Transformer is an architecture that can replace the traditional recurrent and convolutional neural networks in machine learning. Its structure divides into an encoder on the left and a decoder on the right; internally it uses multi-head attention built from self-attention, together with residual addition and normalization (Add & Norm), and finally passes through a linear layer and an activation layer, where the activation function is Softmax. The Transformer learns different features from different dimensions, and position information is added through positional encoding. Such a model can extract high-order semantic features of the input corpus.
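The self-attention operation at the heart of the Transformer can be sketched in a few lines. This is a minimal pure-Python illustration: it omits the learned Q/K/V projection matrices and the multiple parallel heads of a real Transformer, using the input vectors directly as queries, keys and values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(X):
    """Scaled dot-product self-attention over a list of feature vectors X.
    Each output vector is a weighted mixture of all input vectors, with
    weights given by softmax of the scaled dot-product similarities."""
    d = len(X[0])
    out = []
    for q in X:
        # Similarity of this character to every character, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out
```

Because every output position attends to every input position, the representation of each character incorporates bidirectional context, which is what "bidirectional semantic feature extraction" refers to above.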
And 240, determining key information of the target text and the text type corresponding to the text type marker according to the probability distribution of the role label.
The key information may be a short phrase that characterizes the textual description, acting similarly to a keyword.
In one possible implementation, the key information is at the phrase level rather than the word level, so its information content is richer.
In an exemplary embodiment, the role labels include a key information label and a text type label. Accordingly, as shown in FIG. 3, the step 240 includes the following substeps (241-244).
And 241, determining label transition probability between adjacent characters in each character based on the character label probability distribution.
The label transition probability characterizes how likely it is that the role labels of two adjacent characters form a correct label combination. For example, the combination in which the preceding character is a key information first word and the following character is a key information middle word is more plausible than the combination in which the preceding character is a key information middle word and the following character is a key information first word.
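Such transition constraints can be sketched as a lookup table of scores. The tag names and the numeric values below are invented for illustration; in a CRF layer these would be learned parameters, not hand-set constants.

```python
# Illustrative transition scores between role labels (higher = more plausible).
TRANSITIONS = {
    ("B-KW", "I-KW"): 2.0,   # first word followed by a middle word: plausible
    ("I-KW", "I-KW"): 1.5,   # key information spans can be several characters
    ("I-KW", "B-KW"): -1.0,  # middle word followed by a first word: unlikely
    ("O", "B-KW"): 1.0,      # a span may start after an outside character
    ("O", "I-KW"): -2.0,     # a middle word cannot start a span
}

def transition_score(prev_label, next_label):
    """Score of placing next_label immediately after prev_label."""
    return TRANSITIONS.get((prev_label, next_label), 0.0)
```

The asymmetry of the table is the point: it rewards label sequences that respect the B-before-I structure of the labeling scheme and penalizes sequences that violate it.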
And 242, determining a role label sequence corresponding to the character sequence according to the label transition probability.
The role label sequence comprises role labels corresponding to the characters.
When the characters in the character sequence take different role labels, different role label combinations arise, forming different role label sequences. Since each character has a probability value for each role label, different role label sequences have different joint probabilities. The joint probability of a given role label sequence can be computed with the help of the label transition probabilities, and the role label sequence corresponding to the character sequence is then determined from the joint probabilities of the candidate sequences.
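The search for the best-scoring role label sequence is typically done with Viterbi decoding, as in a CRF layer. The sketch below works in score (log) space rather than raw probabilities and takes the transition function as a parameter; it is an illustration of the decoding idea, not the patent's exact implementation.

```python
def viterbi(emissions, labels, transition_score):
    """Find the role label sequence maximizing the joint score: the sum of
    per-character emission scores plus pairwise label transition scores.

    emissions: one {label: score} dict per character.
    transition_score: function (prev_label, next_label) -> score.
    """
    # best[i][label] = (best score of any path ending in label at i, backpointer)
    best = [{lab: (emissions[0][lab], None) for lab in labels}]
    for i in range(1, len(emissions)):
        row = {}
        for lab in labels:
            score, prev = max(
                (best[i - 1][p][0] + transition_score(p, lab) + emissions[i][lab], p)
                for p in labels
            )
            row[lab] = (score, prev)
        best.append(row)
    # Backtrack from the best final label.
    lab = max(labels, key=lambda l: best[-1][l][0])
    path = [lab]
    for i in range(len(emissions) - 1, 0, -1):
        lab = best[i][lab][1]
        path.append(lab)
    return list(reversed(path))
```

Note how the transition term can override the per-character emission scores: a character whose own best label would violate the label scheme is pulled onto a globally consistent sequence.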
And 243, identifying characters corresponding to each key information label in the role label sequence to obtain key information of the target text.
In one example, the character sequence is "restricted-level movie app takes payment but does not ship, small video profit-making fraud [CTL]", and the characters corresponding to key information tags in the role label sequence are the individual characters of "restricted-level movie" and "small video fraud"; that is, the key information of the target text is determined to be "restricted-level movie small video fraud". Further, since the characters beginning "restricted" and "small" specifically carry the key information first-word tag, the key information of the target text is the two phrases "restricted-level movie" and "small video fraud".
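Recovering the key information phrases from a decoded role label sequence amounts to grouping B/I runs. The helper below is an illustrative sketch using assumed tag names B-KW / I-KW.

```python
def extract_key_info(chars, labels):
    """Collect the phrases whose characters carry key information labels,
    starting a new phrase at every first-word (B-KW) tag."""
    phrases, current = [], ""
    for ch, lab in zip(chars, labels):
        if lab == "B-KW":
            if current:
                phrases.append(current)   # close the previous phrase
            current = ch
        elif lab == "I-KW":
            current += ch                 # extend the current phrase
        else:
            if current:
                phrases.append(current)
                current = ""
    if current:
        phrases.append(current)
    return phrases
```

This is why the first-word/middle-word distinction matters: without it, two adjacent key information phrases could not be separated.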
Step 244, identify the text type tag corresponding to the text type marker in the role tag sequence, and determine the text type of the target text.
In one example, the character sequence is "restricted-level movie app takes payment but does not ship, small video profit-making fraud [CTL]", and the character corresponding to the text type tag in the role label sequence is "[CTL]"; the text type of the target text can thus be determined from the specific text type tag assigned to [CTL]. For example, if that tag is a valid complaint text tag, the target text is determined to be a valid complaint text and can be recorded as complaint type: true (valid).
In one example, as shown in FIG. 6, a schematic diagram illustrates determining the role label sequence based on a Conditional Random Field (CRF) model. As above, the role label probability distribution corresponding to each character (token) can be output by the BERT model, e.g., the probability that a character belongs to B-KW, I-KW, ..., I-T, where B-KW denotes the key information first-word tag, I-KW the key information middle-word tag, and I-T the valid complaint text middle-word tag. The probability that a character corresponds to B-KW represents the likelihood that it is the first character of the key information; the probability for I-KW, the likelihood that it is a middle character of the key information; and the probability for I-T, the likelihood that it is a middle character of the valid complaint text marker. The role label probability distribution of each character (token) is input into the CRF model, which adds label transition constraints and finally outputs the optimal label prediction sequence for the characters; for any character, its role label probability distribution ultimately determines its role label. For example, after passing through the CRF layer, one character is determined to carry the B-KW label, the next the I-KW label, and the last the I-T label. The last character is usually the text type marker; if its label is I-T, the text type is a valid complaint text. A valid complaint middle-word label I-T can appear here because at least two [CTL] characters are spliced to the end of the text.
If only one [CTL] character is concatenated to the end of the text, then since the marker consists of a single [CTL] character with no corresponding middle word, the last [CTL] character is determined to carry the valid complaint first-word tag B-T or the invalid complaint first-word tag B-F, rather than an I-T tag.
And step 250, generating marking information of the target text based on the key information and the text type.
And combining words in the key information to serve as a keyword label of the target text.
By extracting the key information as the text keyword tag, the core information of the text can be rapidly known, and the type of the text can be assisted to be judged.
And taking the text type as a type label of the target text.
Generating label information based on the keyword label and the type label.
In one example, the target text is "restricted-level movie app takes payment but does not ship, small video profit-making fraud", and the structured markup information finally output for the target text is {"complaint type": true, "key information": ["restricted-level movie", "small video fraud"]}.
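Assembling the structured markup information from the decoded labels can be sketched as below. The output keys mirror the example above; the mapping from type labels ("B-T"/"I-T" for valid complaints) to a boolean is an assumed convention.

```python
def build_markup(key_phrases, type_label):
    """Assemble structured markup information from the extracted key
    information phrases and the text type label of the [CTL] marker."""
    return {
        # Valid complaint text is assumed to be labeled B-T or I-T.
        "complaint type": type_label in ("B-T", "I-T"),
        "key information": key_phrases,
    }
```

The result is a plain dictionary, i.e., the structured form that lets unstructured complaint text be filtered, aggregated, or fed to downstream review tools.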
In an exemplary embodiment, according to the identified key information, a subtype of a text type in which the target text is located may be further determined. For example, according to the extracted "limited-level movie small video fraud", it can be determined that the target text belongs to pornographic complaint texts. Furthermore, word cloud pictures with different dimensions can be drawn according to the key information. For example, after key information of all complaint texts of a certain merchant is collected, a word cloud graph corresponding to the merchant can be determined through a word cloud graph drawing technology, and the complaint overview of the merchant can be visually shown through the word cloud graph.
In one example, as shown in FIG. 7, a network architecture diagram of a complaint text labeling model is illustrated. The complaint text labeling model is a text labeling model based on BERT and CRF, used for labeling complaint texts. First, a text type marker is appended to the complaint text to form the character sequence {[CLS], Tok1, Tok2, …, TokN}, which is input to the BERT layer, where [CLS] is the classification token, Tok1 and Tok2 denote the first and second characters of the sequence, and TokN denotes the appended text type marker. Inside the BERT layer, the character features corresponding to the characters of the sequence {[CLS], Tok1, Tok2, …, TokN} are determined as E_[CLS], E_1, E_2, …, E_N. Then, the Transformer model in the BERT layer performs bidirectional semantic feature extraction on the character features to obtain the high-level semantic recognition results C, T_1, T_2, …, T_N corresponding to the characters, where each high-level semantic recognition result is the probability distribution of the character over the role labels. Illustratively, the role labels include the key information first-word label B-KW, the key information middle-word label I-KW, the valid complaint first-word label B-T, the valid complaint middle-word label I-T, the invalid complaint first-word label B-F, the invalid complaint middle-word label I-F, and the other-word label O. Finally, the high-level semantic recognition result of each character, i.e., its role label probability distribution, is input to the CRF layer, which adds label transition probability constraints and outputs the optimal role label sequence.
For example, the role label corresponding to character Tok1 is B-KW, that of character Tok2 is I-KW, and that of the text type marker TokN is I-T. That is, Tok1 and Tok2 are the first character and a middle character of the key information, respectively, and the text type of the complaint text is valid complaint text.
In summary, in the technical scheme provided by the embodiment of the application, a text type marker is added to the target text, and the resulting character sequence is input into the text classification extraction model for semantic feature extraction, obtaining the role label probability distribution corresponding to each character. This probability distribution is used both to predict whether each character is key information and to predict the text type corresponding to the text type marker. The optimal label sequence for the target text is then determined from the role label probability distribution based on the conditional random field model, and the key information and text type of the target text are determined from the characters corresponding to the labels of that sequence, serving as the structured markup information of the target text. Through the text classification extraction model, unstructured text data can be organized with structured labels, helping auditors quickly judge the key information contained in a complaint text and greatly saving manual audit time. Meanwhile, high-level semantic knowledge is learned by the deep learning model itself, so no large-scale keyword library or feature library needs to be maintained.
Compared with a keyword-triggering approach, the accuracy and coverage of the text labeling scheme based on the text classification extraction model are significantly improved. In addition, adding the text type marker to the text realizes joint training of the text classification task and the information extraction task; compared with running the two tasks separately, semantic information is shared between the tasks and prior knowledge is implicitly introduced, improving the overall effect.
Referring to fig. 8, a flowchart of a text labeling method according to another embodiment of the present application is shown. The method can be applied to a computer device, which refers to an electronic device with data calculation and processing capabilities, for example, the execution subject of each step can be the server 20 in the application program running environment shown in fig. 1. The method can include the following steps (801-813).
Step 801, acquiring a target text.
And step 802, splicing the text type markers to the target text to obtain a character sequence corresponding to the target text.
Step 803, for the target character in the character sequence, determining the word embedding feature, sentence embedding feature and position embedding feature of the target character.
And step 804, obtaining the character embedding characteristics of the target character based on the word embedding characteristics, the sentence embedding characteristics and the position embedding characteristics.
Step 805, arranging the character embedding features of each character to obtain a character feature sequence.
Step 806, performing bidirectional semantic feature extraction processing on the character feature sequence to obtain a first role label probability distribution of each character corresponding to each key information label and a second role label probability distribution of each character corresponding to each text type label.
Step 807, determine label transition probabilities between adjacent characters in each character based on the first role label probability distribution and the second role label probability distribution.
And 808, determining a role label sequence corresponding to the character sequence according to the label transition probability.
The role label sequence comprises role labels corresponding to all characters;
and step 809, identifying characters corresponding to each key information label in the role label sequence to obtain key information of the target text.
Step 810, identifying a text type label corresponding to the text type marker in the role label sequence, and determining the text type of the target text.
And step 811, using the word combination in the key information as the keyword tag of the target text.
Step 812, the text type is used as the type label of the target text.
Step 813, generating structured markup information of the target text based on the keyword tag and the type tag.
Descriptions of the steps in this embodiment have been described in the above embodiment, and are not repeated here.
In summary, in the technical scheme provided by the embodiment of the application, a text type marker is added to the original text, and semantic feature extraction is performed on the resulting character sequence to obtain the role label probability distribution corresponding to each character. The probability distribution is used not only to predict whether each character is key information but also to predict the text type corresponding to the text type marker. Finally, the key information and the text type of the target text are determined simultaneously from the role label probability distribution and used as the structured markup information of the target text. Organizing unstructured text data with structured markup effectively improves the efficiency of text labeling, increases the information content of the markup information, and reduces the complexity of text labeling.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 9, a block diagram of a text labeling apparatus according to an embodiment of the present application is shown. The device has the function of realizing the text marking method, and the function can be realized by hardware or hardware executing corresponding software. The device can be a computer device and can also be arranged in the computer device. The apparatus 900 may include: a text acquisition module 910, a tag concatenation module 920, a tag probability determination module 930, an information determination module 940, and an information tagging module 950.
A text obtaining module 910, configured to obtain a target text.
And a tag splicing module 920, configured to splice the text type tag to the target text, so as to obtain a character sequence corresponding to the target text.
A label probability determining module 930, configured to perform semantic feature extraction processing on the character sequence to obtain role label probability distribution corresponding to each character in the character sequence, where the role label probability distribution is used to represent probabilities of each character corresponding to each role label, and each role label is used to represent a key information attribute and a text type attribute of a character.
An information determining module 940, configured to determine, according to the role label probability distribution, key information of the target text and a text type corresponding to the text type marker.
An information tagging module 950, configured to generate tagging information of the target text based on the key information and the text type.
In an exemplary embodiment, the tag probability determination module 930 includes: and a label probability determination unit.
And the label probability determining unit is used for inputting the character sequence into a text classification extraction model to perform semantic feature extraction processing to obtain role label probability distribution corresponding to each character in the character sequence.
The text classification extraction model is a machine learning model obtained by joint training on a text classification task and an information extraction task, using sample texts as training samples and taking the text type of each sample text and the key information in the sample text as label information.
In an exemplary embodiment, the tag probability determination unit includes: a character feature determination subunit and a bidirectional semantic feature extraction subunit.
And the character characteristic determining subunit is used for embedding the character sequence to obtain a character characteristic sequence.
And the bidirectional semantic feature extraction subunit is used for performing bidirectional semantic feature extraction processing on the character feature sequence to obtain role label probability distribution corresponding to each character in the character sequence.
In an exemplary embodiment, the bidirectional semantic feature extraction subunit is specifically configured to:
and performing bidirectional semantic feature extraction processing on the character feature sequence to obtain first role label probability distribution of each character corresponding to each key information label and second role label probability distribution of each character corresponding to each text type label.
The role label probability distribution comprises the first role label probability distribution and the second role label probability distribution, the key information labels comprise key information first character labels and key information middle character labels, and the text type labels comprise labels corresponding to at least one text type.
In an exemplary embodiment, the character feature determination subunit is specifically configured to:
for a target character in the character sequence, determining a word embedding feature, a sentence embedding feature and a position embedding feature of the target character;
obtaining a character embedding feature of the target character based on the word embedding feature, the sentence embedding feature and the position embedding feature;
and arranging the character embedding features of the characters to obtain the character feature sequence.
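The three embeddings can simply be summed element-wise, in the style of BERT-like encoders. A toy sketch with hypothetical lookup tables (not the patent's actual parameters):

```python
def embed_sequence(char_ids, word_table, sentence_id, sentence_table, position_table):
    """For each character, sum its word embedding, the sentence embedding of
    the segment it belongs to, and the embedding of its position, then
    arrange the results in order to form the character feature sequence."""
    features = []
    for pos, cid in enumerate(char_ids):
        word = word_table[cid]
        sent = sentence_table[sentence_id]
        position = position_table[pos]
        features.append([w + s + p for w, s, p in zip(word, sent, position)])
    return features
```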
In an exemplary embodiment, the information determining module 940 includes: the device comprises a transition probability determining unit, a label sequence determining unit, a key information identifying unit and a text type identifying unit.
And the transition probability determining unit is used for determining the label transition probability between adjacent characters in the character sequence based on the role label probability distribution.
And the label sequence determining unit is used for determining the role label sequence corresponding to the character sequence according to the label transition probability, wherein the role label sequence comprises the role labels corresponding to the characters.
And the key information identification unit is used for identifying characters corresponding to each key information label in the role label sequence to obtain the key information of the target text.
And the text type identification unit is used for identifying a text type label corresponding to the text type marker in the role label sequence and determining the text type of the target text.
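The decoding these units describe resembles CRF-style Viterbi decoding: combine per-character label scores with label transition scores and take the highest-scoring label path. A compact sketch in the log domain (inputs hypothetical):

```python
def viterbi(emissions, transitions):
    """emissions[t][k]: log-probability of label k at position t;
    transitions[i][j]: log transition score from label i to label j.
    Returns the highest-scoring role-label sequence as label indices."""
    num_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step_scores, step_ptrs = [], []
        for j in range(num_labels):
            best_i = max(range(num_labels),
                         key=lambda i: score[i] + transitions[i][j])
            step_ptrs.append(best_i)
            step_scores.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = step_scores
        backpointers.append(step_ptrs)
    best = max(range(num_labels), key=lambda j: score[j])
    path = [best]
    for ptrs in reversed(backpointers):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path
```

With labels such as B-KEY, I-KEY and O, a transition matrix that heavily penalizes O → I-KEY keeps decoded key-information spans well-formed.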
In an exemplary embodiment, the information tagging module 950 includes: a keyword tag determination unit, a type tag determination unit, and a tag information generation unit.
And the keyword tag determining unit is used for taking the word combination in the key information as the keyword tag of the target text.
And the type label determining unit is used for taking the text type as the type label of the target text.
A tag information generating unit configured to generate the tag information based on the keyword tag and the type tag.
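Putting the last two modules together (tag names hypothetical), key-information characters are gathered from the role-label sequence and combined with the predicted text type into the final markup:

```python
def decode_key_info(chars, labels):
    """Collect the character runs tagged as key information
    (B-KEY opens a span, I-KEY continues it)."""
    spans, current = [], ""
    for ch, label in zip(chars, labels):
        if label == "B-KEY":
            if current:
                spans.append(current)
            current = ch
        elif label == "I-KEY" and current:
            current += ch
        else:
            if current:
                spans.append(current)
            current = ""
    if current:
        spans.append(current)
    return spans

def generate_markup(chars, labels, text_type):
    """Combine the keyword labels and the type label into the mark information."""
    return {"keywords": decode_key_info(chars, labels), "type": text_type}
```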
In summary, in the technical solution provided by the embodiments of the present application, a text classification marker is added to the original text, and semantic feature extraction is performed on the character sequence formed after the marker is added, yielding a role label probability distribution for each character. This distribution is used both to predict whether each character is key information and to predict the text type corresponding to the text classification marker. Finally, the key information and the text type of the target text are determined simultaneously from the role label probability distribution and used as structured mark information of the target text. Organizing unstructured text data through structured marks effectively improves the efficiency of text marking, increases the amount of information carried by the mark information, and reduces the complexity of text marking.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Referring to fig. 10, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a server for performing the text marking method described above. Specifically:
the computer device 1000 includes a Central Processing Unit (CPU) 1001, a system memory 1004 including a Random Access Memory (RAM) 1002 and a Read-Only Memory (ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the Central Processing Unit 1001. The computer device 1000 also includes a basic Input/Output (I/O) system 1006, which facilitates the transfer of information between devices within the computer, and a mass storage device 1007 for storing an operating system 1013, application programs 1014, and other program modules 1015.
The basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009, such as a mouse or keyboard, through which a user inputs information. The display 1008 and the input device 1009 are both connected to the central processing unit 1001 via an input/output controller 1010 connected to the system bus 1005. The basic input/output system 1006 may also include the input/output controller 1010 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 1010 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the computer device 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1004 and mass storage device 1007 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1000 may also operate through a remote computer connected via a network, such as the Internet. That is, the computer device 1000 may be connected to the network 1012 through the network interface unit 1011 connected to the system bus 1005, or the network interface unit 1011 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also stores a computer program configured to be executed by one or more processors to implement the text marking method described above.
In an exemplary embodiment, there is also provided a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions which, when executed by a processor, implement the above text tagging method.
Optionally, the computer-readable storage medium may include: a ROM (Read-Only Memory), a RAM (Random Access Memory), an SSD (Solid State Drive), an optical disc, or the like. The Random Access Memory may include ReRAM (Resistive Random Access Memory) and DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the text labeling method described above.
It should be understood that reference to "a plurality" herein means two or more. "And/or" describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it. In addition, the step numbers described herein merely exemplify one possible execution order of the steps; in some other embodiments, the steps may be executed out of the numbered order, for example, two differently numbered steps may be executed simultaneously, or in an order opposite to that shown in the figure, which is not limited by the embodiments of the present application.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of text tagging, the method comprising:
acquiring a target text;
splicing the text type marker to the target text to obtain a character sequence corresponding to the target text;
performing semantic feature extraction processing on the character sequence to obtain role label probability distribution corresponding to each character in the character sequence, wherein the role label probability distribution is used for representing the probability of each character corresponding to each role label, and each role label is used for representing key information attributes and text type attributes of the character;
determining key information of the target text and a text type corresponding to the text type marker according to the role label probability distribution;
and generating mark information of the target text based on the key information and the text type.
2. The method according to claim 1, wherein the performing semantic feature extraction processing on the character sequence to obtain a role label probability distribution corresponding to each character in the character sequence comprises:
inputting the character sequence into a text classification extraction model to perform semantic feature extraction processing to obtain role label probability distribution corresponding to each character in the character sequence;
the text classification extraction model is a machine learning model obtained through joint training on a text classification task and an information extraction task, wherein the text classification extraction model is trained with a sample text as a training sample and with the text type of the sample text and key information in the sample text as label information.
3. The method according to claim 2, wherein the inputting of the character sequence into a text classification extraction model to perform the semantic feature extraction processing to obtain a role label probability distribution corresponding to each character in the character sequence comprises:
embedding the character sequence to obtain a character characteristic sequence;
and performing bidirectional semantic feature extraction processing on the character feature sequence to obtain the probability distribution of the role labels corresponding to the characters in the character sequence.
4. The method according to claim 3, wherein the performing bi-directional semantic feature extraction processing on the character feature sequence to obtain a role label probability distribution corresponding to each character in the character sequence comprises:
performing bidirectional semantic feature extraction processing on the character feature sequence to obtain first role label probability distribution of each character corresponding to each key information label and second role label probability distribution of each character corresponding to each text type label;
the role label probability distribution comprises the first role label probability distribution and the second role label probability distribution, the key information labels comprise key information first character labels and key information middle character labels, and the text type labels comprise labels corresponding to at least one text type.
5. The method according to claim 3, wherein the embedding the character sequence to obtain a character feature sequence comprises:
for a target character in the character sequence, determining a word embedding feature, a sentence embedding feature and a position embedding feature of the target character;
obtaining a character embedding feature of the target character based on the word embedding feature, the sentence embedding feature and the position embedding feature;
and arranging the character embedding features of the characters to obtain the character feature sequence.
6. The method according to any one of claims 1 to 5, wherein the role labels comprise a key information label and a text type label, and the determining the key information of the target text and the text type corresponding to the text type marker according to the role label probability distribution comprises:
determining label transition probability between adjacent characters in the character sequence based on the role label probability distribution;
determining a role label sequence corresponding to the character sequence according to the label transition probability, wherein the role label sequence comprises role labels corresponding to the characters;
identifying characters corresponding to all key information labels in the role label sequence to obtain key information of the target text;
and identifying a text type label corresponding to the text type marker in the role label sequence, and determining the text type of the target text.
7. The method of claim 6, wherein generating the mark information of the target text based on the key information and the text type comprises:
combining words in the key information to serve as a keyword label of the target text;
taking the text type as a type label of the target text;
and generating the mark information based on the keyword label and the type label.
8. A text marking apparatus, characterized in that the apparatus comprises:
the text acquisition module is used for acquiring a target text;
the marker splicing module is used for splicing the text type marker to the target text to obtain a character sequence corresponding to the target text;
a label probability determination module, configured to perform semantic feature extraction processing on the character sequence to obtain role label probability distribution corresponding to each character in the character sequence, where the role label probability distribution is used to represent probabilities of each character corresponding to each role label, and each role label is used to represent a key information attribute and a text type attribute of a character;
the information determining module is used for determining key information of the target text and a text type corresponding to the text type marker according to the role label probability distribution;
and the information marking module is used for generating marking information of the target text based on the key information and the text type.
9. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement a text tagging method according to any one of claims 1 to 7.
10. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions that is loaded and executed by a processor to implement the text tagging method of any one of claims 1 to 7.
CN202110920440.9A 2021-08-11 2021-08-11 Text marking method, device, equipment and storage medium Pending CN113609866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110920440.9A CN113609866A (en) 2021-08-11 2021-08-11 Text marking method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110920440.9A CN113609866A (en) 2021-08-11 2021-08-11 Text marking method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113609866A true CN113609866A (en) 2021-11-05

Family

ID=78340304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110920440.9A Pending CN113609866A (en) 2021-08-11 2021-08-11 Text marking method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113609866A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114792092A (en) * 2022-06-24 2022-07-26 武汉北大高科软件股份有限公司 Text theme extraction method and device based on semantic enhancement
CN114792092B (en) * 2022-06-24 2022-09-13 武汉北大高科软件股份有限公司 Text theme extraction method and device based on semantic enhancement
CN115130432A (en) * 2022-07-13 2022-09-30 平安科技(深圳)有限公司 Text processing method, text processing device, electronic device and storage medium
CN115130432B (en) * 2022-07-13 2023-05-05 平安科技(深圳)有限公司 Text processing method, text processing device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US20230334254A1 (en) Fact checking
CN111506722B (en) Knowledge graph question-answering method, device and equipment based on deep learning technology
Vashishth et al. Dating documents using graph convolution networks
CN110598070B (en) Application type identification method and device, server and storage medium
CN111104512B (en) Game comment processing method and related equipment
Yang et al. Rits: Real-time interactive text steganography based on automatic dialogue model
CN112580352B (en) Keyword extraction method, device and equipment and computer storage medium
CN108319720A (en) Man-machine interaction method, device based on artificial intelligence and computer equipment
CN113704460B (en) Text classification method and device, electronic equipment and storage medium
CN111639291A (en) Content distribution method, content distribution device, electronic equipment and storage medium
CN116702737B (en) Document generation method, device, equipment, storage medium and product
CN113609866A (en) Text marking method, device, equipment and storage medium
CN114281931A (en) Text matching method, device, equipment, medium and computer program product
AU2020205593B2 (en) Natural solution language
CN113128196A (en) Text information processing method and device, storage medium
CN112732949A (en) Service data labeling method and device, computer equipment and storage medium
CN112836502A (en) Implicit causal relationship extraction method for events in financial field
CN111143454B (en) Text output method and device and readable storage medium
CN113362852A (en) User attribute identification method and device
KR102559849B1 (en) Malicious comment filter device and method
CN114637850A (en) Abnormal behavior recognition and model training method, device, equipment and storage medium
CN115023710B (en) Transferable neural architecture for structured data extraction from web documents
Huang et al. Target-Oriented Sentiment Classification with Sequential Cross-Modal Semantic Graph
Windiatmoko et al. Mi-Botway: A deep learning-based intelligent university enquiries chatbot
CN111814487B (en) Semantic understanding method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40055345

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination