CN114490935A - Abnormal text detection method and device, computer readable medium and electronic equipment - Google Patents


Info

Publication number
CN114490935A
CN114490935A (application CN202210073277.1A)
Authority
CN
China
Prior art keywords
abnormal
text
preset
feature
segment
Prior art date
Legal status
Pending
Application number
CN202210073277.1A
Other languages
Chinese (zh)
Inventor
岳天驰 (Yue Tianchi)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210073277.1A
Publication of CN114490935A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an abnormal text detection method and apparatus, a computer readable medium, and electronic equipment, applicable in scenarios such as artificial intelligence and natural language processing. The method comprises the following steps: acquiring a text to be detected composed of a plurality of words; performing feature extraction on the text to be detected to obtain a feature sequence comprising the context features corresponding to the words in the text; mapping the feature sequence through a plurality of preset models respectively to obtain a processing result for each preset model, where the processing result of a preset model comprises the abnormal probabilities of feature segments in the feature sequence, and the lengths of the feature segments corresponding to the processing results of different preset models differ; and determining the abnormal segment in the text to be detected according to the abnormal probabilities of the feature segments indicated by the processing result of each preset model. Because the text to be detected is examined by models of several granularities, the accuracy and precision of the detection result are improved.

Description

Abnormal text detection method and device, computer readable medium and electronic equipment
Technical Field
The application belongs to the technical field of computers and artificial intelligence, and particularly relates to a method and a device for detecting abnormal texts, a computer readable medium and electronic equipment.
Background
In many cases, text data needs to be revised to resolve anomalies such as wrongly written characters and grammatical or semantic errors. In natural language processing, a common detection approach is to identify abnormal positions in a text through a sequence labeling model: given a text sequence, each element in the sequence is analyzed, the abnormal probability of each element is determined, and the elements with higher abnormal probability are identified as abnormal elements. However, sequence labeling often suffers from positioning offset, i.e. a gap between the identified abnormal element and the real abnormal element, so the detection accuracy of this approach is limited and needs to be improved.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present application and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
The application aims to provide a method and a device for detecting an abnormal text, a computer readable medium and electronic equipment, so as to solve the problem that the positioning accuracy of abnormal segments in the text is low in the related art.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, there is provided a method for detecting an abnormal text, including:
acquiring a text to be detected consisting of a plurality of words;
extracting the characteristics of the text to be detected to obtain a characteristic sequence of the text to be detected, wherein the characteristic sequence comprises context characteristics corresponding to a plurality of characters in the text to be detected;
mapping the characteristic sequence through a plurality of preset models respectively to obtain processing results corresponding to the preset models; the processing result of the preset model comprises the abnormal probability of a characteristic segment in the characteristic sequence, wherein the characteristic segment comprises the context characteristics of at least one word; the lengths of the characteristic segments corresponding to the processing results of different preset models are different;
and determining abnormal fragments in the text to be detected according to the abnormal probability of the characteristic fragments indicated by the processing results of each preset model.
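The steps above can be sketched in a few lines of pure Python. This is an illustrative reading of the claim, not the patent's actual implementation: each "preset model" is represented as a callable keyed by its segment length, and every window of that length in the feature sequence receives an anomaly probability. The names `extract_features`, `score_all`, and `toy_model` are invented for illustration.

```python
def extract_features(text):
    """Stand-in for contextual feature extraction (in practice a
    pre-trained encoder such as BERT); here each character is mapped
    to a dummy scalar feature."""
    return [float(ord(ch)) for ch in text]

def score_all(features, models_by_len):
    """For each preset model, slide a window of its preset segment
    length over the feature sequence and record a
    (start index, segment length, anomaly probability) tuple."""
    scores = []
    for seg_len, model in models_by_len.items():
        for start in range(len(features) - seg_len + 1):
            segment = features[start:start + seg_len]
            scores.append((start, seg_len, model(segment)))
    return scores

def toy_model(segment):
    # Toy scorer: flags segments containing 'x' (ord 120) as anomalous.
    return 0.9 if 120.0 in segment else 0.1

# Two "granularities": single-character and two-character segments.
scores = score_all(extract_features("abcxde"), {1: toy_model, 2: toy_model})
```

With a six-character input and segment lengths 1 and 2, this yields 6 + 5 = 11 scored segments; the abnormal character at index 3 is scored 0.9 by the single-character model.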
According to an aspect of the embodiments of the present application, there is provided an apparatus for detecting an abnormal text, including:
the text acquisition module is used for acquiring a text to be detected consisting of a plurality of words;
the feature extraction module is used for extracting features of the text to be detected to obtain a feature sequence of the text to be detected, wherein the feature sequence comprises context features corresponding to a plurality of characters in the text to be detected;
the mapping processing module is used for respectively mapping the characteristic sequences through a plurality of preset models to obtain processing results corresponding to the preset models; the processing result of the preset model comprises the abnormal probability of different feature segments in the feature sequence, and the feature segments comprise the context features of at least one word; the lengths of the characteristic segments corresponding to the processing results of different preset models are different;
and the abnormal segment determining module is used for determining the abnormal segments in the text to be detected according to the abnormal probability of the characteristic segments indicated by the processing results of the preset models.
In one embodiment of the present application, the apparatus further comprises:
the system comprises a sample data acquisition module, a data processing module and a data processing module, wherein the sample data acquisition module is used for acquiring sample data consisting of a plurality of words, and the words in the sample data have first labels indicating abnormal states;
the second label generation module is used for determining a plurality of sample fragments in the sample data according to each preset fragment length based on a plurality of preset fragment lengths and endowing the sample fragments with second labels indicating abnormal states according to the first labels corresponding to the sample fragments;
and the model training module is used for taking the sample data with the second label corresponding to each preset segment length as a training sample, and training the neural network model through the training sample to obtain the preset model corresponding to each preset segment length.
In an embodiment of the application, the second tag generating module is specifically configured to:
setting a window with the preset segment length as the window width, and taking all words contained in the sample data in the window as sample segments, wherein the window slides from the start bit to the end bit of the sample data according to the set step length.
In one embodiment of the present application, the first tag includes a normal tag and an abnormal tag; the second tag generation module is further configured to:
and generating a second label of the sample segment according to the total quantity of the abnormal labels in the window and the window width.
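One plausible reading of "according to the total quantity of abnormal labels in the window and the window width" is a soft label equal to the share of abnormal characters the window covers. The sketch below (illustrative, not the patent's stated formula) slides a window of a preset length over per-character first labels (1 = abnormal, 0 = normal) and assigns each sample segment such a ratio-based second label.

```python
def second_labels(first_labels, window, step=1):
    """Slide a window of the preset segment length from the start to the
    end of the sample data; each sample segment's second label is the
    fraction of abnormal first labels inside the window (assumed form)."""
    labels = []
    for start in range(0, len(first_labels) - window + 1, step):
        span = first_labels[start:start + window]
        labels.append(sum(span) / window)  # soft label in [0, 1]
    return labels

# Six characters, of which the 4th and 5th carry the abnormal first label:
soft = second_labels([0, 0, 0, 1, 1, 0], window=2)
```

A fully abnormal window gets label 1.0, a half-abnormal window 0.5, and a clean window 0.0, which matches the intuition that longer windows dilute a single abnormal character.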
In one embodiment of the application, in the training process of the neural network model, cross entropy of the neural network model between the predicted value of the training sample and the second label of the training sample is used as a loss function, and model parameters of the neural network model are updated based on the loss function.
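Since the second labels above may be soft values in [0, 1], the cross entropy described here reduces to binary cross entropy between the model's predicted anomaly probability and the label. A minimal sketch of that loss, with a standard clamping epsilon added for numerical safety (the epsilon is my addition, not from the patent):

```python
import math

def bce_loss(pred, target, eps=1e-12):
    """Cross entropy between the neural network model's predicted value
    for a training sample and the sample's second label."""
    pred = min(max(pred, eps), 1 - eps)  # avoid log(0)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))
```

The loss shrinks as the prediction approaches the label, so gradient updates of the model parameters based on it push segment scores toward their second labels.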
In one embodiment of the present application, the feature extraction module includes:
the word segmentation unit is used for performing word segmentation processing on the text to be detected to obtain a plurality of words arranged in sequence, and converting each word in the plurality of words arranged in sequence into a corresponding word label according to a preset dictionary to obtain a word sequence of the text to be detected;
and the feature extraction unit is used for extracting the context features of the character sequence to obtain the feature sequence of the text to be detected.
In an embodiment of the application, the feature extraction unit is specifically configured to:
determining semantic vectors and position vectors corresponding to word labels according to the word labels in the word sequence;
generating a vector to be subjected to feature extraction according to word labels in the word sequence and semantic vectors and position vectors corresponding to the word labels;
and extracting the context features of the vector to be subjected to feature extraction to obtain the feature sequence of the text to be detected.
In one embodiment of the present application, the mapping processing module includes:
the characteristic segment determining unit is used for determining the characteristic segment in the characteristic sequence by a sliding window method according to the preset segment length corresponding to the preset model;
the abnormal feature representation obtaining unit is used for obtaining the abnormal feature representation of the feature segment through the convolution layer of the preset model;
and the abnormal probability obtaining unit is used for mapping the abnormal feature representation through a full connection layer of the preset model to obtain the abnormal probability of the feature segment.
In an embodiment of the application, the abnormal feature representation acquiring unit is specifically configured to:
and fusing all the context characteristics in the characteristic segments according to the model parameters of the convolution layers of the preset model to obtain the abnormal characteristic representation of the characteristic segments.
In one embodiment of the present application, the model parameters of the convolutional layer include a first weight parameter and a first base value parameter; the abnormal feature representation acquiring unit is further configured to:
carrying out weighted summation on all the context characteristics in the characteristic segment through the first weight parameter to obtain weight characteristics;
and superposing the weight characteristic and the first basic value parameter to obtain the abnormal characteristic representation of the characteristic segment.
In one embodiment of the present application, the fully connected layer of the preset model includes a second weight parameter and a second base value parameter; the anomaly probability obtaining unit is specifically configured to:
multiplying the abnormal feature representation by the second weight parameter and then adding the second base value parameter to obtain a feature to be activated;
and processing the feature to be activated through a preset activation function to obtain the abnormal probability of the feature segment.
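The convolution-plus-fully-connected mapping described above can be sketched with scalar parameters. This is a simplified illustration assuming a sigmoid as the preset activation function (the patent does not name one) and 1-d features; `w1`/`b1` stand for the first weight and base value parameters of the convolution layer, `w2`/`b2` for the second ones of the fully connected layer.

```python
import math

def anomaly_probability(segment, w1, b1, w2, b2):
    """Map a feature segment to an anomaly probability:
    1. weighted sum of the segment's context features plus a base
       value (the convolution layer's abnormal feature representation),
    2. multiply by the second weight and add the second base value
       (the fully connected layer), then
    3. squash through an activation function (sigmoid assumed)."""
    feature = sum(w * x for w, x in zip(w1, segment)) + b1  # conv layer
    logit = feature * w2 + b2                               # fully connected
    return 1.0 / (1.0 + math.exp(-logit))                   # activation
```

With zero weights and biases the output is exactly 0.5, i.e. maximal uncertainty, which is a useful sanity check on the mapping.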
In an embodiment of the present application, the abnormal segment determining module is specifically configured to:
determining the maximum abnormal probability in the abnormal probabilities of the characteristic segments indicated by the processing results of the preset models;
and taking a plurality of words indicated by the characteristic segment corresponding to the maximum abnormal probability as abnormal segments in the text to be detected.
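The final selection step reduces to an argmax over all scored feature segments gathered from every preset model, followed by reading the covered words back out of the text. A minimal sketch (the tuple layout is an assumption for illustration):

```python
def pick_abnormal_segment(text, scored_segments):
    """scored_segments: (start index, segment length, anomaly probability)
    tuples pooled from the processing results of all preset models; the
    words covered by the highest-probability segment are returned as the
    abnormal segment of the text."""
    start, length, _ = max(scored_segments, key=lambda s: s[2])
    return text[start:start + length]

# A five-character text whose two-character segment at index 2 scores highest:
segment = pick_abnormal_segment(
    "我是中国人",
    [(0, 1, 0.1), (2, 2, 0.8), (4, 1, 0.3)],
)
```

Because segments of different lengths compete in the same pool, the winning granularity adapts to the anomaly: a single wrong character favors a short segment, a garbled phrase a longer one.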
According to an aspect of the embodiments of the present application, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method for detecting an abnormal text as in the above technical solutions.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the detection method of the abnormal text as in the above technical solution by executing the executable instruction.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instruction from the computer readable storage medium, and the processor executes the computer instruction, so that the computer device executes the detection method of the abnormal text as in the above technical solution.
In the technical solution provided by the embodiments of the application, the feature sequence is processed by a plurality of preset models respectively to obtain processing results that include the abnormal probabilities of feature segments. In other words, the text to be detected is divided into a plurality of segments for anomaly detection rather than being detected directly as a whole sentence, so local features of the text are fully considered in the detection process and the detection precision is improved. Meanwhile, because the lengths of the feature segments corresponding to the processing results of different preset models differ, the text is effectively detected by models of several granularities, which further improves the accuracy and precision of the detection result.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the solution of the present application applies.
Fig. 2 schematically shows a flowchart of a method for detecting an abnormal text according to an embodiment of the present application.
Fig. 3 schematically illustrates a flowchart of feature extraction performed on a text to be detected according to an embodiment of the present application.
Fig. 4 schematically illustrates a schematic diagram for determining a preset segment length according to an embodiment of the present application.
Fig. 5 schematically shows a flowchart of a method for constructing a preset model according to an embodiment of the present application.
Fig. 6 schematically shows a model structure diagram to which the technical solution of the present application is applied.
Fig. 7 schematically shows a flowchart of an application of the technical solution of the present application in one scenario.
Fig. 8 schematically shows a block diagram of a structure of an abnormal text detection apparatus according to an embodiment of the present application.
FIG. 9 schematically illustrates a block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the solution of the present application applies.
As shown in fig. 1, system architecture 100 may include terminal device 110, network 120, and server 130. Terminal device 110 may include a smart phone, a tablet computer, a notebook computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, and so on. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, such as a wired communication link or a wireless communication link.
The system architecture in the embodiments of the present application may have any number of terminal devices, networks, and servers, according to implementation needs. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by both the terminal device 110 and the server 130, which is not particularly limited in this application.
The method for detecting the abnormal text provided by the embodiment of the application is executed by the server 130, and accordingly, a device for detecting the abnormal text is disposed in the server 130. However, it is easily understood by those skilled in the art that the method for detecting an abnormal text provided in the embodiment of the present application may also be executed by the terminal device 110, and accordingly, a device for detecting an abnormal text may also be disposed in the terminal device 110, which is not particularly limited in the exemplary embodiment.
For example, the server 130 obtains a text to be detected composed of a plurality of words, and then performs feature extraction on the text to be detected to obtain a feature sequence of the text to be detected, where the feature sequence includes context features corresponding to the plurality of words in the text to be detected. Next, the server 130 performs mapping processing on the feature sequences through a plurality of preset models respectively to obtain processing results corresponding to the preset models; and the lengths of the characteristic segments corresponding to the processing results of different preset models are different. And in the processing result of a preset model, the abnormal probability of a plurality of characteristic segments in the characteristic sequence is included, and the characteristic segments comprise the context characteristics of at least one word. Finally, the server 130 determines the abnormal segment in the text to be detected according to the abnormal probability of the feature segment indicated by each preset model processing result.
In an embodiment of the application, after determining the abnormal segment in the text to be detected, the server 130 may return the abnormal segment in the text to be detected to the terminal device 110 through the network 120, and the terminal device 110 may mark the abnormal segment in the text to be detected in the display interface, for example, highlight the abnormal segment in the text to be detected, and further may quickly and conveniently learn the abnormal segment in the text to be detected through the display interface of the terminal device 110.
The technical scheme provided by the embodiment of the application can be realized through an artificial intelligence technology, for example, a preset model is generated through the artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that spans a wide range of fields, covering both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
The following describes the method for detecting an abnormal text in detail with reference to specific embodiments.
Fig. 2 schematically illustrates a flowchart of a method for detecting an abnormal text, which may be implemented by a terminal device, such as the terminal device 110 illustrated in fig. 1, according to an embodiment of the present application; the method may also be implemented by a server, such as server 130 shown in FIG. 1. As shown in fig. 2, the method for detecting an abnormal text provided in the embodiment of the present application includes steps 210 to 240, which are specifically as follows:
and step 210, acquiring a text to be detected consisting of a plurality of characters.
Specifically, the text to be detected is composed of a plurality of words, and may be one sentence or a plurality of sentences. The text to be detected is text data which is determined to be abnormal but the abnormal position is unclear, and the abnormal condition of the text data comprises the conditions of wrongly written characters, grammar errors, semantic errors and the like in the text data. The text to be detected may be content obtained from text data, for example, a title or a body sentence recognized as abnormal in an article. The text to be detected may also be abnormal text data obtained by performing speech recognition on the speech data, or abnormal text data obtained by performing character recognition on the image data, which is not limited in the embodiment of the present application. The confirmation that the text data is the abnormal text can be realized through a trained text detection model.
Step 220, performing feature extraction on the text to be detected to obtain a feature sequence of the text to be detected, wherein the feature sequence comprises context features corresponding to a plurality of characters in the text to be detected.
Specifically, feature extraction is performed on the text to be detected, that is, context semantic feature extraction is performed on the text to be detected, the text to be detected is converted from characters into feature vectors, context features corresponding to a plurality of characters in the text to be detected are obtained, and a feature sequence is formed. The context feature of a word includes semantic information of the word in the text to be detected, that is, includes abnormal information of the word in the text to be detected.
In an embodiment of the present application, as shown in fig. 3, the process of extracting features of a text to be detected includes steps 310 to 320, specifically:
and 310, performing word segmentation processing on the text to be detected to obtain a plurality of words arranged in sequence, and converting each word in the plurality of words arranged in sequence into a corresponding word tag according to a preset dictionary to obtain a word sequence of the text to be detected.
Specifically, the word segmentation processing is to segment the words in the text to be detected to obtain a plurality of words arranged in sequence, and the arrangement sequence of the plurality of words is the arrangement sequence of the words in the text to be detected. Because the characters cannot be directly processed, after the characters are segmented to obtain a plurality of characters, each character needs to be converted into a corresponding character label, so that a plurality of character labels which are ordered in sequence, namely, a character sequence of the text to be detected, are obtained. The word label is an identification of the word, corresponding to the ID of the word, and is denoted as Token.
In one embodiment of the present application, a word may be converted into a corresponding word tag according to a preset dictionary. The preset dictionary contains a large number of characters and character labels corresponding to the characters. Traversing a plurality of words arranged in sequence, searching a word which is the same as the word in a preset dictionary aiming at each word, and taking a word label corresponding to the same word as the word label of the word.
In one embodiment of the present application, when the text to be detected includes a plurality of sentences, a sentence head identifier [CLS] may be set at the head of the text to be detected and a sentence tail identifier [SEP] at the tail of each sentence so that the sentences can be recognized. Generally, the head of the text to be detected is the position before its first word. A sentence tail usually carries a punctuation mark, so the punctuation marks in the text can be recognized first and a sentence tail identifier placed at each of them. When a word is followed by neither a punctuation mark nor another word, that word can be regarded as a sentence tail, and a sentence tail identifier is set after it. When the text to be detected has only one sentence, the sentence tail identifier is simply set after the last word tag. The resulting word sequence thus consists of a sentence head identifier, word tags, and sentence tail identifiers.
Illustratively, the text to be detected is "我是中国人，我爱中国" ("I am Chinese, I love China"). The head of the text is before the first character "我", so a sentence head identifier [CLS] is set before the word tag of "我". The comma after "人" marks the tail of the first sentence, and the final character "国" of "我爱中国" is the tail of the second sentence, so two sentence tail identifiers [SEP] need to be set. After conversion, the word sequence is obtained: [CLS] Token1 Token2 Token3 Token4 Token5 [SEP] Token6 Token7 Token8 Token9 [SEP], where the number of each Token only represents the position of the corresponding word in the text to be detected.
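The conversion just described can be illustrated with a toy dictionary (a real system would use a large preset dictionary; the vocabulary and ids below are invented for the example, with 101/102 borrowed from BERT's conventional [CLS]/[SEP] ids):

```python
# Toy preset dictionary mapping characters and markers to word tags.
VOCAB = {"我": 1, "是": 2, "中": 3, "国": 4, "人": 5, "爱": 6,
         "[CLS]": 101, "[SEP]": 102}

def to_word_sequence(text):
    """Segment the text into characters, insert [CLS] at the head and
    [SEP] at every sentence tail, and look each token up in the dictionary."""
    tokens = ["[CLS]"]
    for ch in text:
        if ch in "，。！？":           # a punctuation mark closes a sentence
            tokens.append("[SEP]")
        else:
            tokens.append(ch)
    if tokens[-1] != "[SEP]":          # a final word with no punctuation also ends a sentence
        tokens.append("[SEP]")
    return [VOCAB[t] for t in tokens]

ids = to_word_sequence("我是中国人，我爱中国")
```

The output contains one id per character plus the three markers, matching the [CLS] Token1 ... Token9 [SEP] layout of the example.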
Step 320, performing context feature extraction on the word sequence to obtain a feature sequence of the text to be detected.
Specifically, a context feature is a feature that represents information such as the context and semantics of the position where a word is located, and it therefore naturally reflects abnormal information at that position.
In one embodiment of the present application, context feature extraction may be performed on the word sequence by a pre-trained language model, which may be a BERT (Bidirectional Encoder Representations from Transformers) model.
A pre-trained language model is a model used in natural language processing. Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use daily, and is therefore closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
In an embodiment of the present application, the process of extracting the context features of the word sequence specifically includes: determining the semantic vector and position vector corresponding to each word label in the word sequence; generating the vector to be subjected to feature extraction from the word labels in the word sequence and their corresponding semantic vectors and position vectors; and performing context feature extraction on the vector to be subjected to feature extraction to obtain the feature sequence of the text to be detected.
Specifically, the semantic vector of a word represents information obtained by fusing the global semantic information of the text to be detected with the semantic information of the word. For example, the semantic vector may indicate which sentence the word belongs to (if the text to be detected contains sentence A and sentence B, the semantic vector may indicate whether the word is in sentence A or sentence B) and the type of that sentence (such as title or body). The position vector of a word represents the word's position in the text to be detected; since the same word carries different semantic information at different positions, adding the position vector lets the context feature extraction take position into account and thus be more accurate. Both the semantic vector and the position vector are determined by the pre-trained language model according to the word label and the text to be detected.
After the semantic vector and the position vector corresponding to each word label in the word sequence are determined, each word label is superposed with its corresponding semantic vector and position vector to obtain the vector to be subjected to feature extraction. Illustratively, for the word sequence [ CLS ] Token1 Token2 Token3 Token4 Token5 [ SEP ], let the semantic vectors corresponding to the word labels (in sequence order) be EC E1 E2 E3 E4 E5 ES, and the position vectors (in sequence order) be PC P1 P2 P3 P4 P5 PS; the generated vector to be subjected to feature extraction is then: [ CLS ]+EC+PC, Token1+E1+P1, Token2+E2+P2, Token3+E3+P3, Token4+E4+P4, Token5+E5+P5, [ SEP ]+ES+PS.
Finally, context feature extraction is performed on the vector to be subjected to feature extraction to obtain the feature sequence. The feature sequence contains the context feature of each word; denoting the context feature of the ith word as h_i, the feature sequence corresponding to a text to be detected with n words is h_1 h_2 h_3 … h_i … h_n.
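The superposition of each word label with its semantic vector and position vector may be sketched as below; the embedding tables and their dimensions are illustrative stand-ins (in the application itself these vectors come from the pre-trained BERT model):

```python
import numpy as np

def to_feature_inputs(token_ids, segment_ids, token_emb, segment_emb, pos_emb):
    """Build the vector to be subjected to feature extraction:
    row i is Token_i + E_i + P_i, as in the example above."""
    return np.stack([token_emb[t] + segment_emb[s] + pos_emb[i]
                     for i, (t, s) in enumerate(zip(token_ids, segment_ids))])
```

The stacked rows are what the encoder would then turn into the feature sequence h_1 … h_n.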
Continuing to refer to fig. 2, step 230, performing mapping processing on the feature sequences through a plurality of preset models respectively to obtain processing results corresponding to the preset models; the processing result of the preset model comprises the abnormal probability of a characteristic segment in the characteristic sequence, wherein the characteristic segment comprises the context characteristic of at least one word; the lengths of the characteristic segments corresponding to the processing results of different preset models are different.
Specifically, the preset model is used for predicting the abnormal condition of the feature sequence, and the processing result obtained by mapping the feature sequence by the preset model includes the abnormal probability of a plurality of feature segments, that is, the preset model divides the feature sequence into a plurality of feature segments and then predicts the abnormal probability of each feature segment. A feature segment corresponds to a portion of a sequence of features, such that a feature segment includes at least one word of contextual features.
In the embodiment of the present application, the feature sequences are mapped by a plurality of preset models, and the length of the feature segments in the processing result obtained by each preset model is different, where the length of the feature segment refers to the number of context features (which is equivalent to the number of words corresponding to the feature segment) constituting the feature segment. Illustratively, in the present application, 5 preset models are used to respectively map the feature sequences, the length of the feature segment corresponding to the 1 st preset model is 1, the length of the feature segment corresponding to the 2 nd preset model is 2, the length of the feature segment corresponding to the 3 rd preset model is 3, the length of the feature segment corresponding to the 4 th preset model is 4, and the length of the feature segment corresponding to the 5 th preset model is 5.
In an embodiment of the present application, a processing procedure of the preset model on the feature sequence is as follows: and determining the characteristic segments in the characteristic sequence by a sliding window method according to the preset segment lengths corresponding to the preset models, and mapping the characteristic segments to obtain the abnormal probability of the characteristic segments.
Specifically, the preset model requires that the length of the feature segment into which the feature sequence is divided is preset, that is, the length of the preset segment. The method comprises the steps of taking the length of a preset segment as the width of a window, then sliding the window along a feature sequence, wherein in the window sliding process, the segment of the feature sequence in the window is the feature segment, and therefore, the feature sequence is divided into a plurality of feature segments through a sliding window method. And after the characteristic fragments are obtained through division, mapping the characteristic fragments to obtain the abnormal probability of the characteristic fragments.
During window sliding, the window moves by the set step, measured from the head of the current window to the head of the previous window. Generally, the set step is 1, i.e. the window shifts backwards by the distance of one word each time, and the window head slides from the start to the end of the feature sequence. When the feature sequence is divided with a preset segment length of k (k > 1) and the feature sequence is h_1 h_2 h_3 … h_i … h_n, a feature segment is [h_i : h_(i+k)], meaning the segment runs from the context feature h_i of the ith word to the context feature h_(i+k) of the (i+k)th word, where i takes values from 1 to n. It can be seen that when i = n − k, h_(i+k) is h_n; when i increases further, i + k exceeds n, and h_(i+k) may be replaced by 0. When the preset segment length is 1, the context features in the feature sequence are in effect split one by one, i.e. the text to be detected is divided into single words, and each resulting feature segment is the context feature of one word of the text to be detected. Illustratively, as shown in fig. 4, taking a preset segment length of 2, a feature sequence h_1 h_2 h_3 h_4 h_5, and a set window-sliding step of 1, the feature segments obtained are: [h_1:h_2], [h_2:h_3], [h_3:h_4], [h_4:h_5], [h_5:0].
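The sliding-window division just described, including the zero filling past the last context feature, may be sketched as follows (an illustrative reconstruction, not code from the patent):

```python
import numpy as np

def sliding_segments(features, k, step=1):
    """Divide a feature sequence h_1 .. h_n (rows of `features`)
    into feature segments of width k; positions past h_n are
    filled with zero vectors, as in the [h5:0] example above."""
    n, d = features.shape
    padded = np.vstack([features, np.zeros((k - 1, d))]) if k > 1 else features
    return [padded[i:i + k] for i in range(0, n, step)]
```

With n = 5 and k = 2 this produces the five segments of the fig. 4 example.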
In an embodiment of the present application, the process of mapping the feature segments is as follows: acquiring abnormal feature representation of the feature segment through the convolution layer of the preset model; and mapping the abnormal feature representation through a full connection layer of a preset model to obtain the abnormal probability of the feature segment.
Specifically, the preset model is provided with a convolution layer and a full connection layer: the convolution layer is used to extract the abnormal feature representation of the feature segment, and the full connection layer is used to calculate the abnormal probability from the abnormal feature representation. The convolution layer fuses all context features in the feature segment through its model parameters to obtain the abnormal feature representation. The model parameters of the convolution layer include a first weight parameter and a first bias parameter, obtained during model training; different preset models have different convolution-layer parameters. During fusion, all context features in the feature segment are first weighted and summed through the first weight parameter to obtain a weighted feature; the weighted feature is then superposed with the first bias parameter to obtain the abnormal feature representation of the feature segment. Denote the first weight parameter of the convolution layer of the preset model corresponding to preset segment length k as W1_k, the first bias parameter as b1_k, and all context features in the ith feature segment as [h_i : h_(i+k)]; the abnormal feature representation r_ki of the ith feature segment is then given by the following formula:
r_ki = W1_k [h_i : h_(i+k)] + b1_k
where k denotes the preset segment length, r_ki is the abnormal feature representation of the ith feature segment extracted at preset segment length k, [h_i : h_(i+k)] denotes the ith feature segment, and W1_k, b1_k are the model parameters of the convolution layer in the preset model corresponding to preset segment length k. The preset segment length is equivalent to the convolution kernel width of the convolution layer, so the feature sequence is in effect processed by convolution models of several granularities.
After the abnormal feature representation is obtained, it is mapped through the full connection layer to obtain the abnormal probability of the feature segment. Specifically, the full connection layer has a second weight parameter and a second bias parameter: the abnormal feature representation is multiplied by the second weight parameter and the second bias parameter is added, yielding the feature to be activated; finally, the feature to be activated is processed through the preset activation function of the full connection layer to obtain the abnormal probability of the feature segment. The preset activation function may be a ReLU function, a Sigmoid function, a Softmax function, a Linear function, and so on, selected according to actual requirements. Illustratively, denote the second weight parameter of the full connection layer of the preset model corresponding to preset segment length k as W2_k and the second bias parameter as b2_k; with the Softmax function as the preset activation function, the abnormal probability of the feature segment is given by the following formula:
p_ki = softmax(W2_k r_ki + b2_k)
where p_ki is the abnormal probability of the ith feature segment extracted at preset segment length k, r_ki is the abnormal feature representation of the ith feature segment extracted at preset segment length k, and W2_k, b2_k are the model parameters of the full connection layer in the preset model corresponding to preset segment length k.
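Taken together, the two formulas above describe one preset model's mapping of a feature segment. A minimal numpy sketch, with illustrative parameter shapes (in the application these parameters are learned during training):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def segment_anomaly_prob(segment, W1, b1, W2, b2):
    """r_ki = W1_k [h_i : h_(i+k)] + b1_k   (convolution layer);
    p_ki = softmax(W2_k r_ki + b2_k)        (full connection layer)."""
    r = W1 @ segment.reshape(-1) + b1   # abnormal feature representation
    return softmax(W2 @ r + b2)         # e.g. (p_normal, p_abnormal)
```

The softmax output sums to 1, so the two components can be read as normal and abnormal probabilities.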
In an embodiment of the present application, before the mapping processing is performed on the feature sequence through the preset model, a construction process of the preset model is further included, as shown in fig. 5, the process includes steps 510 to 530, specifically:
step 510, sample data composed of a plurality of words is obtained, and the words in the sample data have first labels indicating abnormal states.
Specifically, the sample data is abnormal text data carrying abnormal-information labels; it is likewise composed of a plurality of words, each having a first label indicating its abnormal state. The abnormal state of a word refers to whether the word is abnormal, and is represented by the value of the first label: for example, a first label of 0 represents that the word is not abnormal (i.e. the word is in a normal state), so this type of first label can be denoted a normal label; a first label of 1 represents that the word is abnormal, so this type of first label can be denoted an abnormal label.
In one embodiment of the present application, for the same sample data, different annotators may produce different first labels for its words. For example, consider the sample data 宠物的地位都人高 (roughly "pets' status is all high", which reads unsmoothly). Annotator 1 may consider the two-character span 都人 unsmooth (i.e. abnormal) and mark both characters as 1, with the remaining characters marked 0; annotator 2 may consider the character 人 redundant and mark it as 1, with the rest marked 0; annotator 3 may consider the character 都 out of order and mark it as 1, with the rest marked 0. Thus, for this sample data, three labeling outcomes are obtained, as shown in the following table:
TABLE 1
Character      宠   物   的   地   位   都   人   高
Annotator 1    0    0    0    0    0    1    1    0
Annotator 2    0    0    0    0    0    0    1    0
Annotator 3    0    0    0    0    0    1    0    0
Step 520, based on the lengths of the preset fragments, determining a plurality of sample fragments in the sample data according to each preset fragment length, and assigning a second label indicating an abnormal state to each sample fragment according to the first label corresponding to the sample fragment.
Specifically, a plurality of preset segment lengths are set, sample data is divided according to each preset segment length to obtain a plurality of sample segments corresponding to the sample data, and a second label is given to each sample segment according to a first label of each word in the sample segments. The second label is calculated according to the first label and is used for indicating the abnormal condition of the sample fragment.
The process of obtaining the plurality of sample segments of the sample data according to a preset segment length is as follows: a window whose width is the preset segment length is set, and all characters of the sample data falling inside the window form a sample segment, the window sliding from the start to the end of the sample data by the set step. That is, the window head slides from the start to the end of the sample data according to the set step, and at each position all characters of the sample data inside the window form one sample segment, thereby yielding a plurality of sample segments. When there are not enough characters left to fill the window, the remainder is filled with 0. Typically, the set step is 1. The process of obtaining sample segments is similar to the process of obtaining feature segments described above. For example, taking a preset segment length of 2 and the sample data in Table 1, the sample segments obtained are: 宠物, 物的, 的地, 地位, 位都, 都人, 人高, 高0.
Because the first label of each character in the sample data takes only the values 0 and 1, training the model directly on the first labels makes the computed loss error-prone. For example, under annotator 2's labeling in Table 1, the first label of 人 is 1; if the model predicts a relatively high abnormal probability for 都, the loss becomes large even though that prediction is reasonable (annotators 1 and 3 both mark 都 as abnormal). When the gradient is then back-propagated to update the parameters, the model is misled into predicting a very small abnormal probability for 都. This confuses the model and thus reduces its prediction accuracy.
In consideration of the above situation, the second label is newly assigned to the sample segment, and the second label of the sample segment is generated according to the total quantity of the abnormal labels in the window and the window width in the window moving process. The second label of the sample fragment comprises two parts: the method comprises normal identification and abnormal identification, wherein the normal identification is used for representing the probability that the sample fragment is in a normal state, the abnormal identification is used for representing the probability that the sample fragment is in an abnormal state, and the sum of the normal identification and the abnormal identification is 1, so that the abnormal identification is determined, and the normal identification is also determined.
In the embodiment of the application, during window movement, the ratio of the total number of abnormal labels in the window to the window width is used as the abnormal identifier in the second label of the sample segment, and the normal identifier is obtained by subtracting the abnormal identifier from 1. When the window width is 1 (i.e. the preset segment length is 1), each sample segment is a single character of the sample data, so the second labels coincide with the first labels. When the window width is 2, under annotator 2's labeling in Table 1, the sample segments and their second labels (written as (abnormal identifier, normal identifier)) are: 宠物: (0,1), 物的: (0,1), 的地: (0,1), 地位: (0,1), 位都: (0,1), 都人: (0.5,0.5), 人高: (0.5,0.5), 高0: (0,1), where 宠物: (0,1) indicates that the abnormal identifier of the sample segment 宠物 is 0 and its normal identifier is 1. When the window width is 3, again under annotator 2's labeling, the sample segments and second labels are: 宠物的: (0,1), 物的地: (0,1), 的地位: (0,1), 地位都: (0,1), 位都人: (0.33,0.67), 都人高: (0.33,0.67), 人高0: (0.33,0.67), 高00: (0,1). It can be seen that the second labels are no longer restricted to two values; they are smoother, which effectively mitigates the influence of errors in the first labels on the model.
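The second-label computation above — abnormal identifier equal to the number of abnormal first labels in the window divided by the window width — may be sketched as follows (an illustrative reconstruction):

```python
def second_labels(first_labels, k, step=1):
    """Return (abnormal identifier, normal identifier) for each
    window of width k over the first labels; positions past the
    end of the sample data are filled with 0 (normal)."""
    padded = list(first_labels) + [0] * (k - 1)
    out = []
    for i in range(0, len(first_labels), step):
        abnormal = sum(padded[i:i + k]) / k   # share of abnormal labels in the window
        out.append((abnormal, 1 - abnormal))
    return out
```

Under annotator 2's labels in Table 1 and a window width of 2, this reproduces the (0.5, 0.5) labels of the two windows containing the abnormal character.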
Step 530, taking the sample data with the second label corresponding to each preset segment length as a training sample, and training the neural network model through the training sample to obtain the preset model corresponding to each preset segment length.
Specifically, after a second label is given to a sample segment in the sample data, the sample data can be used as a training sample to train the neural network model. Through the processing of the steps, sample data with a second label can be obtained for each preset segment length, namely each preset segment length corresponds to one training sample, and in the training process, the training samples corresponding to the preset segment lengths are used for training the neural network model of the preset segment lengths, so that the preset model corresponding to the preset segment lengths is obtained.
In one embodiment of the application, during the training of the neural network model, the cross entropy between the neural network model's predictions on the training samples and the second labels of the training samples is used as the loss function, and the model parameters of the neural network model are updated based on the loss function, where the model parameters are W1_k, b1_k, W2_k, b2_k and so on from the preceding steps.
Specifically, the loss function Loss_k is calculated as follows:
Loss_k = − Σ_i [ y0_ki · log(p0_ki) + y1_ki · log(p1_ki) ]
where y0_ki denotes the normal label of the ith sample segment at preset segment length k, y1_ki denotes the abnormal label of the ith sample segment at preset segment length k, and y0_ki + y1_ki = 1; p0_ki denotes the probability, predicted by the neural network model at preset segment length k, that the ith sample segment is normal, and p1_ki the probability, predicted by the same model, that the ith sample segment is abnormal.
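The loss may be written down directly from the formula; this is an illustrative sketch, and the small epsilon guarding log(0) is an implementation detail not in the original:

```python
import numpy as np

def loss_k(y0, y1, p0, p1, eps=1e-12):
    """Cross entropy between second labels (y0, y1) and predicted
    probabilities (p0, p1) over all sample segments at one preset
    segment length k."""
    y0, y1, p0, p1 = (np.asarray(a, dtype=float) for a in (y0, y1, p0, p1))
    return -np.sum(y0 * np.log(p0 + eps) + y1 * np.log(p1 + eps))
```

A perfect prediction drives the loss to zero, while predicting 0.5 for a hard label costs log 2 per segment.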
In an embodiment of the present application, when training the neural network model corresponding to each preset segment length, an iterative training method may be adopted, that is, each model is trained sequentially according to the sequence from small to large of the preset segment length.
In one embodiment of the present application, the preset model may also be obtained by training using other suitable machine learning models. Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The method specially studies how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
Continuing to refer to fig. 2, step 240, determining abnormal segments in the text to be detected according to the abnormal probabilities of the feature segments indicated by the processing results of the preset models.
Specifically, the processing result of one preset model contains the abnormal probabilities of a plurality of feature segments of the same length, so the processing results of the plurality of preset models together contain abnormal probabilities of feature segments of several lengths. The maximum of these abnormal probabilities is determined, and the words covered by the feature segment with that maximum abnormal probability are taken as the abnormal segment in the text to be detected, thereby determining both the length and the position of the abnormal segment. For example, if the feature segment with the maximum abnormal probability is the ith feature segment at preset segment length k, the abnormal segment in the text to be detected is the segment formed by the ith to the (i+k)th characters.
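Selecting the maximum abnormal probability over the processing results of all preset models may be sketched as below; the function and variable names are assumptions for illustration:

```python
def locate_abnormal_segment(results):
    """results maps each preset segment length k to the list of
    abnormal probabilities of its feature segments (i = 0, 1, ...).
    Returns (k, i, p) of the overall maximum, identifying both the
    length and the position of the abnormal segment."""
    p, k, i = max((p, k, i)
                  for k, probs in results.items()
                  for i, p in enumerate(probs))
    return k, i, p
```

The returned pair (k, i) pinpoints the abnormal span: it starts at word i and covers k words.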
In the technical scheme provided by the embodiment of the application, the feature sequences are processed through the plurality of preset models respectively to obtain processing results, wherein the processing results comprise the abnormal probability of the feature fragments, namely, the text to be detected is divided into the plurality of fragments for abnormal detection, rather than the whole sentence of the text to be detected is directly detected, so that the local features in the text to be detected are fully considered in the abnormal detection process, and the detection precision is improved; meanwhile, due to the fact that the lengths of the characteristic segments corresponding to the processing results of different preset models are different, the detection of the text to be detected is equivalently performed through models with various granularities, and the accuracy and precision of the detection result are further improved.
Fig. 6 schematically shows a model structure diagram to which the technical solution of the present application is applied. As shown in fig. 6, the model structure includes:
a text embedding module (token embedding)610, configured to perform word segmentation on a text to be detected to convert the text to be detected 611 into a word sequence formed by word labels (tokens), which may specifically refer to the related description of step 310 and is not described herein again.
And a vector superposition module (task embedding) 620, configured to superpose the word labels in the word sequence with their corresponding semantic vectors and position vectors to generate the vector to be subjected to feature extraction 621. Reference may be made to the related description of the foregoing step 320, which is not repeated herein.
A BERT MODEL (BERT MODEL)630, which is a pre-training language MODEL, configured to perform context feature extraction on a vector to be feature extracted, and output a feature sequence 631.
A multi-granularity convolution module 640 that includes a convolution model of 5 granularities (grams), where granularity is the size of the convolution kernel, i.e., the preset segment length. In the embodiment of the present application, the particle sizes of the 5 convolution models are: 1. 2, 3, 4 and 5. The convolution model of each granularity performs mapping processing on the output feature sequence 631 respectively to obtain the abnormal probability of the feature segment in the feature sequence, and the length of the feature segment is the same as the granularity of the corresponding convolution model. The abnormal probability of the feature segment is equivalent to a prediction score (score) of the convolution model for the feature segment, and finally, a maximum value (MAX score) is selected from all the scores, so that the abnormal segment in the text 611 to be detected can be determined.
Fig. 7 schematically shows a flowchart of an application of the technical solution of the present application in one scenario. As shown in fig. 7, the process includes:
and S710, acquiring the unordinary text. The discordance text is the abnormal text, and the abnormal segment in the discordance text can be determined through the technical scheme of the application.
And S720, inputting the unordinary text into the unordinary fragment detection model. The obstructed sequence fragment detection model is a model for implementing the technical scheme of the application, namely the obstructed sequence fragment detection model carries out feature extraction on the obstructed sequence fragment to obtain a feature sequence; then, mapping the feature sequences through a plurality of preset models respectively to obtain processing results corresponding to the preset models, wherein the processing results of the preset models comprise the abnormal probability of feature segments in the feature sequences, and the feature segments comprise the context features of at least one word; the lengths of the characteristic segments corresponding to the processing results of different preset models are different. And finally, determining the abnormal fragments in the text to be detected according to the maximum value of the abnormal probability of the characteristic fragments indicated by the processing results of each preset model.
And S730, positioning the obstructed segments according to the model prediction result. And determining the specific position of the unaccustomed segment according to the output result of the unaccustomed segment detection model.
And S740, a machine auditing system. And inputting the positioning result into a machine auditing system, wherein the system can carry out manual auditing.
And S750, highlighting the disconnected segment. Highlighting enables an object to more quickly determine a noncompliant segment in noncompliant text.
It should be noted that although the various steps of the methods in this application are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the shown steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken into multiple step executions, etc.
The following describes embodiments of the apparatus of the present application, which may be used to perform the method for detecting an abnormal text in the above embodiments of the present application. Fig. 8 schematically shows a block diagram of a structure of an abnormal text detection apparatus according to an embodiment of the present application. As shown in fig. 8, the apparatus for detecting an abnormal text according to the embodiment of the present application includes:
a text acquisition module 810, configured to acquire a text to be detected that is composed of multiple words;
the feature extraction module 820 is configured to perform feature extraction on the text to be detected to obtain a feature sequence of the text to be detected, where the feature sequence includes context features corresponding to a plurality of words in the text to be detected;
the mapping processing module 830 is configured to perform mapping processing on the feature sequence through a plurality of preset models respectively to obtain processing results corresponding to the preset models; the processing result of the preset model comprises the abnormal probability of different feature segments in the feature sequence, and the feature segments comprise the context features of at least one word; the lengths of the characteristic segments corresponding to the processing results of different preset models are different;
the abnormal segment determining module 840 is configured to determine an abnormal segment in the text to be detected according to the abnormal probability of the feature segment indicated by each preset model processing result.
In one embodiment of the present application, the apparatus further comprises:
the system comprises a sample data acquisition module, a data processing module and a data processing module, wherein the sample data acquisition module is used for acquiring sample data consisting of a plurality of words, and the words in the sample data have first labels indicating abnormal states;
the second label generation module is used for determining a plurality of sample fragments in the sample data according to each preset fragment length based on a plurality of preset fragment lengths and endowing the sample fragments with second labels indicating abnormal states according to the first labels corresponding to the sample fragments;
and the model training module is used for taking the sample data with the second label corresponding to each preset segment length as a training sample, and training the neural network model through the training sample to obtain the preset model corresponding to each preset segment length.
In an embodiment of the application, the second label generation module is specifically configured to:
setting a window with the preset segment length as the window width, and taking all words contained in the sample data in the window as sample segments, wherein the window slides from the start bit to the end bit of the sample data according to the set step length.
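The windowing step above can be sketched as follows; the function name and the default step size of 1 are assumptions, not taken from the application.

```python
def sliding_window_segments(words, window_width, step=1):
    """Slide a window of the given width from the start to the end of the
    word sequence and collect the words inside each window position as one
    sample segment (illustrative sketch of the windowing described above)."""
    segments = []
    for start in range(0, len(words) - window_width + 1, step):
        segments.append(words[start:start + window_width])
    return segments
```

With a preset segment length of 3 over a five-word sample, this yields the three overlapping segments one would expect from a stride-1 window.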
In one embodiment of the present application, the first label includes a normal label and an abnormal label; the second label generation module is further configured to:
and generating a second label of the sample segment according to the total quantity of the abnormal labels in the window and the window width.
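One plausible reading of the label rule above is that the second label is the ratio of abnormal first labels inside the window to the window width (a soft label); the application leaves the exact rule open, so this helper is an assumption for illustration.

```python
def second_label(first_labels_in_window, window_width, abnormal=1):
    """Derive a segment-level (second) label from the word-level (first)
    labels inside the window, here as the fraction of abnormal labels
    relative to the window width (one possible interpretation)."""
    abnormal_count = sum(1 for label in first_labels_in_window if label == abnormal)
    return abnormal_count / window_width
```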
In one embodiment of the application, in the training process of the neural network model, cross entropy of the neural network model between the predicted value of the training sample and the second label of the training sample is used as a loss function, and model parameters of the neural network model are updated based on the loss function.
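The cross-entropy loss between the predicted abnormal probability and the (possibly soft) second label can be written as the standard binary cross entropy; the clamping constant below is an implementation detail added for numerical stability, not part of the application.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-12):
    """Cross entropy between the model's predicted abnormal probability
    `pred` and the second label `target`, as used for the loss above."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))
```

Model parameters would then be updated by gradient descent on this loss averaged over the training samples.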
In one embodiment of the present application, the feature extraction module 820 includes:
the word segmentation unit is used for performing word segmentation processing on the text to be detected to obtain a plurality of words arranged in sequence, and converting each word in the plurality of words arranged in sequence into a corresponding word label according to a preset dictionary to obtain a word sequence of the text to be detected;
and the feature extraction unit is used for extracting the context feature of the character sequence to obtain the feature sequence of the text to be detected.
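The word-segmentation and dictionary-lookup step of the word segmentation unit can be sketched as follows. The character-level split stands in for the unspecified word segmenter, and the unknown-word id of 0 is an assumption.

```python
def text_to_word_sequence(text, dictionary, unk_id=0):
    """Convert a text to be detected into a sequence of word labels (ids)
    via a preset dictionary, as in the word segmentation step above.
    A one-character-per-token split is assumed for illustration."""
    words = list(text)  # placeholder segmentation
    return [dictionary.get(word, unk_id) for word in words]
```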
In an embodiment of the application, the feature extraction unit is specifically configured to:
determining semantic vectors and position vectors corresponding to word labels according to the word labels in the word sequence;
generating a vector to be subjected to feature extraction according to word labels in the word sequence and semantic vectors and position vectors corresponding to the word labels;
and extracting context features from the vector to be subjected to feature extraction to obtain the feature sequence of the text to be detected.
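How the semantic vector and position vector are combined is not fixed by the description above; element-wise addition, common in Transformer-style encoders, is assumed in this sketch.

```python
def build_input_vectors(word_ids, semantic_table, position_table):
    """For each word label, combine its semantic vector (looked up by word
    id) with the position vector for its slot in the sequence, producing
    the vectors handed to context-feature extraction. Element-wise
    addition is an assumption, not specified by the application."""
    vectors = []
    for pos, wid in enumerate(word_ids):
        sem = semantic_table[wid]   # semantic vector for this word label
        loc = position_table[pos]   # position vector for this slot
        vectors.append([s + p for s, p in zip(sem, loc)])
    return vectors
```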
In one embodiment of the present application, the mapping processing module 830 includes:
the characteristic segment determining unit is used for determining the characteristic segment in the characteristic sequence by a sliding window method according to the preset segment length corresponding to the preset model;
the abnormal feature representation obtaining unit is used for obtaining the abnormal feature representation of the feature segment through the convolution layer of the preset model;
and the abnormal probability obtaining unit is used for mapping the abnormal feature representation through a full connection layer of the preset model to obtain the abnormal probability of the feature segment.
In an embodiment of the application, the abnormal feature representation acquiring unit is specifically configured to:
and fusing all the context characteristics in the characteristic segments according to the model parameters of the convolution layers of the preset model to obtain the abnormal characteristic representation of the characteristic segments.
In one embodiment of the present application, the model parameters of the convolutional layer include a first weight parameter and a first base value parameter; the abnormal feature representation acquiring unit is further configured to:
carrying out weighted summation on all the context characteristics in the characteristic segment through the first weight parameter to obtain weight characteristics;
and superposing the weight characteristic and the first basic value parameter to obtain the abnormal characteristic representation of the characteristic segment.
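The two steps above (weighted summation via the first weight parameter, then superposition of the first basic value parameter) amount to a single convolution tap; a minimal scalar-feature sketch, with illustrative names:

```python
def conv_abnormal_feature(context_features, weights, bias):
    """Fuse all context features inside one feature segment: weighted sum
    with the first weight parameter, then add the first basic value
    parameter, yielding the segment's abnormal feature representation.
    Scalar features are assumed here for brevity."""
    weighted = sum(w * f for w, f in zip(weights, context_features))
    return weighted + bias
```

In practice the context features would be vectors and the weights a convolution kernel, but the arithmetic per output element is the same weighted sum plus bias.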
In one embodiment of the present application, the fully connected layer of the preset model includes a second weight parameter and a second base value parameter; the anomaly probability obtaining unit is specifically configured to:
multiplying the abnormal feature representation by the second weight parameter and adding the result to the second basic value parameter to obtain a feature to be activated;
and processing the feature to be activated through a preset activation function to obtain the abnormal probability of the feature segment.
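The fully connected mapping and activation above reduce, in the scalar case, to an affine transform followed by a squashing function; a sigmoid is assumed for the unspecified preset activation function.

```python
import math


def segment_abnormal_probability(abnormal_feature, weight, bias):
    """Map an abnormal feature representation to an abnormal probability:
    multiply by the second weight parameter, add the second basic value
    parameter (the feature to be activated), then apply an activation.
    Sigmoid is an assumption; the application only says 'preset'."""
    pre_activation = abnormal_feature * weight + bias
    return 1.0 / (1.0 + math.exp(-pre_activation))
```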
In an embodiment of the present application, the abnormal segment determining module 840 is specifically configured to:
determining the maximum abnormal probability in the abnormal probabilities of the characteristic segments indicated by the processing results of the preset models;
and taking a plurality of words indicated by the characteristic segment corresponding to the maximum abnormal probability as abnormal segments in the text to be detected.
The specific details of the abnormal text detection apparatus provided in the embodiments of the present application have been described in detail in the corresponding method embodiments, and are not repeated here.
Fig. 9 schematically shows a computer system configuration block diagram of an electronic device for implementing the embodiment of the present application.
It should be noted that the computer system 900 of the electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit 901 (CPU) that can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory 902 (ROM) or a program loaded from a storage section 908 into a Random Access Memory 903 (RAM). Various programs and data necessary for system operation are also stored in the RAM 903. The CPU 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output interface 905 (I/O interface) is also connected to the bus 904.
The following components are connected to the input/output interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a local area network card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the input/output interface 905 as necessary. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as necessary, so that a computer program read therefrom is installed into the storage section 908 as needed.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The computer program, when executed by the central processor 901, performs various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the application, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A method for detecting an abnormal text, comprising:
acquiring a text to be detected consisting of a plurality of words;
extracting features of the text to be detected to obtain a feature sequence of the text to be detected, wherein the feature sequence comprises context features corresponding to a plurality of characters in the text to be detected;
mapping the characteristic sequence through a plurality of preset models respectively to obtain processing results corresponding to the preset models; the processing result of the preset model comprises the abnormal probability of a characteristic segment in the characteristic sequence, wherein the characteristic segment comprises the context characteristics of at least one word; the lengths of the characteristic segments corresponding to the processing results of different preset models are different;
and determining abnormal fragments in the text to be detected according to the abnormal probability of the characteristic fragments indicated by the processing result of each preset model.
2. The method for detecting the abnormal text according to claim 1, wherein before the feature sequence is mapped by a plurality of preset models respectively to obtain the processing result corresponding to each preset model, the method further comprises:
acquiring sample data consisting of a plurality of words, wherein the words in the sample data have first labels indicating abnormal states;
determining a plurality of sample fragments in the sample data according to each preset fragment length based on a plurality of preset fragment lengths, and endowing the sample fragments with second labels indicating abnormal states according to the first labels corresponding to the sample fragments;
and taking the sample data with the second label corresponding to each preset segment length as a training sample, and training the neural network model through the training sample to obtain the preset model corresponding to each preset segment length.
3. The method according to claim 2, wherein determining a plurality of sample segments in the sample data according to each preset segment length comprises:
setting a window with the preset segment length as the window width, and taking all words contained in the sample data in the window as sample segments, wherein the window slides from the start bit to the end bit of the sample data according to the set step length.
4. The method according to claim 3, wherein the first label includes a normal label and an abnormal label; according to the first label corresponding to the sample segment, endowing the sample segment with a second label indicating an abnormal state, and the method comprises the following steps:
and generating a second label of the sample segment according to the total quantity of the abnormal labels in the window and the window width.
5. The method according to claim 3, wherein in the training process of the neural network model, cross entropy of the neural network model between the predicted values of the training samples and the second labels of the training samples is used as a loss function, and model parameters of the neural network model are updated based on the loss function.
6. The method for detecting the abnormal text according to claim 1, wherein performing feature extraction on the text to be detected to obtain a feature sequence of the text to be detected comprises:
performing word segmentation processing on the text to be detected to obtain a plurality of words arranged in sequence, and converting each word in the plurality of words arranged in sequence into a corresponding word tag according to a preset dictionary to obtain a word sequence of the text to be detected;
and extracting the context characteristics of the word sequence to obtain the characteristic sequence of the text to be detected.
7. The method for detecting the abnormal text according to claim 6, wherein the step of performing context feature extraction on the word sequence to obtain a feature sequence of the text to be detected comprises:
determining semantic vectors and position vectors corresponding to word labels according to the word labels in the word sequence;
generating a vector to be subjected to feature extraction according to word labels in the word sequence and semantic vectors and position vectors corresponding to the word labels;
and extracting context features from the vector to be subjected to feature extraction to obtain the feature sequence of the text to be detected.
8. The method for detecting the abnormal text according to claim 1, wherein the mapping processing is performed on the feature sequence through a plurality of preset models respectively to obtain a processing result corresponding to each preset model, and the method comprises the following steps:
determining a characteristic segment in the characteristic sequence by a sliding window method according to a preset segment length corresponding to the preset model;
acquiring abnormal feature representation of the feature segment through the convolution layer of the preset model;
and mapping the abnormal feature representation through a full connection layer of the preset model to obtain the abnormal probability of the feature segment.
9. The method for detecting the abnormal text according to claim 8, wherein obtaining the abnormal feature representation of the feature segment through the convolution layer of the preset model comprises:
and fusing all the context characteristics in the characteristic segments according to the model parameters of the convolution layers of the preset model to obtain the abnormal characteristic representation of the characteristic segments.
10. The method for detecting an abnormal text according to claim 9, wherein the model parameters of the convolutional layer include a first weight parameter and a first base value parameter; fusing all the context features in the feature segment according to the model parameters of the convolution layer of the preset model to obtain the abnormal feature representation of the feature segment, which comprises the following steps:
carrying out weighted summation on all the context characteristics in the characteristic segment through the first weight parameter to obtain weight characteristics;
and superposing the weight characteristic and the first basic value parameter to obtain the abnormal characteristic representation of the characteristic segment.
11. The method for detecting the abnormal text according to claim 8, wherein the fully connected layer of the preset model comprises a second weight parameter and a second base value parameter; mapping the abnormal feature representation through the full-link layer of the preset model to obtain the abnormal probability of the feature segment, wherein the mapping comprises the following steps:
multiplying the abnormal feature representation by the second weight parameter and adding the result to the second base value parameter to obtain a feature to be activated;
and processing the feature to be activated through a preset activation function to obtain the abnormal probability of the feature segment.
12. The method for detecting the abnormal text according to claim 1, wherein determining the abnormal segment in the text to be detected according to the abnormal probability of the feature segment indicated by each preset model processing result comprises:
determining the maximum abnormal probability in the abnormal probabilities of the characteristic segments indicated by the processing results of the preset models;
and taking a plurality of words indicated by the characteristic segment corresponding to the maximum abnormal probability as abnormal segments in the text to be detected.
13. An apparatus for detecting an abnormal text, comprising:
the text acquisition module is used for acquiring a text to be detected consisting of a plurality of words;
the feature extraction module is used for extracting features of the text to be detected to obtain a feature sequence of the text to be detected, wherein the feature sequence comprises context features corresponding to a plurality of characters in the text to be detected;
the mapping processing module is used for respectively mapping the characteristic sequences through a plurality of preset models to obtain processing results corresponding to the preset models; the processing result of the preset model comprises the abnormal probability of different feature segments in the feature sequence, and the feature segments comprise the context features of at least one word; the lengths of the characteristic segments corresponding to the processing results of different preset models are different;
and the abnormal segment determining module is used for determining the abnormal segments in the text to be detected according to the abnormal probability of the characteristic segments indicated by the processing results of the preset models.
14. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out a method for detecting an abnormal text according to any one of claims 1 to 12.
15. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein execution of the executable instructions by the processor causes the electronic device to perform the method of detecting an abnormal text according to any one of claims 1 to 12.
CN202210073277.1A 2022-01-21 2022-01-21 Abnormal text detection method and device, computer readable medium and electronic equipment Pending CN114490935A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210073277.1A CN114490935A (en) 2022-01-21 2022-01-21 Abnormal text detection method and device, computer readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210073277.1A CN114490935A (en) 2022-01-21 2022-01-21 Abnormal text detection method and device, computer readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114490935A true CN114490935A (en) 2022-05-13

Family

ID=81473081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210073277.1A Pending CN114490935A (en) 2022-01-21 2022-01-21 Abnormal text detection method and device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114490935A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093853A (en) * 2023-10-18 2023-11-21 腾讯科技(深圳)有限公司 Time sequence data processing method and device, computer readable medium and electronic equipment


Similar Documents

Publication Publication Date Title
CN111222317B (en) Sequence labeling method, system and computer equipment
CN110188202B (en) Training method and device of semantic relation recognition model and terminal
CN110287480B (en) Named entity identification method, device, storage medium and terminal equipment
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN113672708B (en) Language model training method, question-answer pair generation method, device and equipment
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN110197279B (en) Transformation model training method, device, equipment and storage medium
CN114676234A (en) Model training method and related equipment
CN112163596B (en) Complex scene text recognition method, system, computer equipment and storage medium
CN113723105A (en) Training method, device and equipment of semantic feature extraction model and storage medium
CN110674642B (en) Semantic relation extraction method for noisy sparse text
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN110472248A (en) A kind of recognition methods of Chinese text name entity
CN114462418B (en) Event detection method, system, intelligent terminal and computer readable storage medium
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN114490935A (en) Abnormal text detection method and device, computer readable medium and electronic equipment
CN113705207A (en) Grammar error recognition method and device
CN113657092B (en) Method, device, equipment and medium for identifying tag
CN114692615B (en) Small sample intention recognition method for small languages
CN112131879A (en) Relationship extraction system, method and device
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN114298032A (en) Text punctuation detection method, computer device and storage medium
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination