CN115859176A - Text processing method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN115859176A
Authority
CN
China
Prior art keywords
word
text
processed
abnormal
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211410457.0A
Other languages
Chinese (zh)
Inventor
郑子彬
周越洲
林昊
蔡倬
王耀南
刘伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Merchants Union Consumer Finance Co Ltd
Sun Yat Sen University
Original Assignee
Merchants Union Consumer Finance Co Ltd
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Merchants Union Consumer Finance Co Ltd and Sun Yat Sen University
Priority to CN202211410457.0A
Publication of CN115859176A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a text processing method, a text processing apparatus, a computer device, a storage medium and a computer program product. The method comprises: acquiring a text to be processed, and performing word vector representation on the text to be processed to obtain word representation vectors; performing semantic feature extraction on the word representation vectors to obtain word sense features, and performing time sequence feature extraction according to the word sequence of the text to be processed and the word sense features to obtain word time sequence features; calculating an attention weight for each word time sequence feature, and weighting each word time sequence feature by its attention weight to obtain word attention features; and performing abnormal text detection based on the word attention features to obtain the abnormal text possibility corresponding to the text to be processed. The method can improve the accuracy of abnormal text detection.

Description

Text processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a text processing method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of internet technology, users can publish all kinds of information on the internet, such as messages, videos, pictures and texts. To maintain information security, information on the internet often needs to undergo anomaly detection, for example abnormal text detection, so as to avoid threats to information security such as information leakage or the spread of false information caused by abnormal texts.
Existing abnormal text detection methods detect abnormalities in a text according to empirical rules, and suffer from low detection accuracy.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a text processing method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product capable of improving the accuracy of abnormal text detection.
In a first aspect, the present application provides a text processing method. The method comprises the following steps:
acquiring a text to be processed, and performing word vector representation based on the text to be processed to obtain each word representation vector;
semantic feature extraction is carried out on the basis of each word representation vector to obtain each word sense feature, and time sequence feature extraction is carried out according to the word sequence of the text to be processed and each word sense feature to obtain each word time sequence feature;
calculating attention weight corresponding to each word time sequence feature, and weighting each word time sequence feature by using the attention weight to obtain each word attention feature;
and detecting abnormal texts based on the attention characteristics of all words to obtain the possibility of the abnormal texts corresponding to the texts to be processed.
In a second aspect, the present application further provides a text processing apparatus. The device comprises:
the vector characterization module is used for acquiring the text to be processed, and performing word vector characterization on the basis of the text to be processed to obtain each word characterization vector;
the feature extraction module is used for extracting semantic features based on the word representation vectors to obtain word sense features, and extracting time sequence features according to the word sequence of the text to be processed and the word sense features to obtain word time sequence features;
the attention module is used for calculating attention weights corresponding to the word time sequence characteristics and weighting the word time sequence characteristics by using the attention weights to obtain the attention characteristics of the words;
and the detection module is used for detecting the abnormal text based on the attention characteristics of all words to obtain the possibility of the abnormal text corresponding to the text to be processed.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
acquiring a text to be processed, and performing word vector representation based on the text to be processed to obtain each word representation vector;
semantic feature extraction is carried out on the basis of each word representation vector to obtain each word sense feature, and time sequence feature extraction is carried out according to the word sequence of the text to be processed and each word sense feature to obtain each word time sequence feature;
calculating attention weight corresponding to each word time sequence feature, and weighting each word time sequence feature by using the attention weight to obtain each word attention feature;
and detecting abnormal texts based on the attention characteristics of all words to obtain the possibility of the abnormal texts corresponding to the texts to be processed.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a text to be processed, and performing word vector representation based on the text to be processed to obtain each word representation vector;
semantic feature extraction is carried out on the basis of the word representation vectors to obtain word sense features, and time sequence feature extraction is carried out according to the word sequence of the text to be processed and the word sense features to obtain word time sequence features;
calculating attention weight corresponding to each word time sequence feature, and weighting each word time sequence feature by using the attention weight to obtain each word attention feature;
and detecting abnormal texts based on the attention characteristics of all words to obtain abnormal text possibility corresponding to the texts to be processed.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
acquiring a text to be processed, and performing word vector representation based on the text to be processed to obtain each word representation vector;
semantic feature extraction is carried out on the basis of each word representation vector to obtain each word sense feature, and time sequence feature extraction is carried out according to the word sequence of the text to be processed and each word sense feature to obtain each word time sequence feature;
calculating attention weight corresponding to each word time sequence feature, and weighting each word time sequence feature by using the attention weight to obtain each word attention feature;
and detecting abnormal texts based on the attention characteristics of all words to obtain the possibility of the abnormal texts corresponding to the texts to be processed.
According to the text processing method, the text processing device, the computer equipment, the storage medium and the computer program product, semantic feature extraction is performed on each word representation vector to obtain each word sense feature, and time sequence feature extraction is performed according to the word sequence of the text to be processed and each word sense feature to obtain each word time sequence feature, which improves the accuracy of the word time sequence features. The attention weight corresponding to each word time sequence feature is then calculated, and each word time sequence feature is weighted by its attention weight to obtain each word attention feature, so that the importance of each feature is reflected. Using the word attention features for abnormal text detection makes the detection result more accurate, thereby improving the accuracy of abnormal text detection.
Drawings
FIG. 1 is a diagram of an application environment of a text processing method in one embodiment;
FIG. 2 is a flow diagram that illustrates a method for text processing in one embodiment;
FIG. 3 is a schematic flow chart illustrating obtaining a text to be processed according to an embodiment;
FIG. 4 is a flow diagram that illustrates abnormal text processing in one embodiment;
FIG. 5 is a flow diagram illustrating abnormal text detection in one embodiment;
FIG. 6 is a diagram illustrating an exemplary structure of an abnormal text detection model;
FIG. 7 is a block diagram showing a configuration of a text processing apparatus according to an embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device in one embodiment;
FIG. 9 is an internal structural diagram of a computer device in another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The text processing method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be located on the cloud or other network server. The server 104 acquires the text to be processed through the terminal 102, and performs word vector representation based on the text to be processed to obtain each word representation vector; the server 104 performs semantic feature extraction based on each word representation vector to obtain each word sense feature, and performs time sequence feature extraction according to the word sequence of the text to be processed and each word sense feature to obtain each word time sequence feature; the server 104 calculates attention weights corresponding to the word time sequence characteristics, and weights the word time sequence characteristics by using the attention weights to obtain the word attention characteristics; the server 104 performs abnormal text detection based on the attention characteristics of each word to obtain the possibility of abnormal text corresponding to the text to be processed. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In one embodiment, as shown in fig. 2, a text processing method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 202, obtaining a text to be processed, and performing word vector representation based on the text to be processed to obtain each word representation vector.
The text to be processed refers to a text on which anomaly detection is to be performed. Word vector representation refers to the process of converting data in text form into data in vector form. A word representation vector refers to the vector that represents a word in the text to be processed.
Specifically, the server responds to a keyword search instruction sent by the terminal, visits each website through each preset website address, searches in each website according to keywords carried by the keyword search instruction to obtain search webpage results corresponding to each website, and obtains search webpage content corresponding to each website as a text to be processed. The server can also search in each pre-stored text according to the keywords carried by the keyword search instruction, and the text containing the keywords is used as the text to be processed. The server can also obtain the text to be processed through the terminal.
And then the server acquires a preset vector conversion algorithm, inputs the text to be processed into the vector conversion algorithm for word vector representation, and obtains word representation vectors corresponding to all words in the text to be processed.
Step 204, performing semantic feature extraction based on the word representation vectors to obtain word sense features, and performing time sequence feature extraction according to the word sequence of the text to be processed and the word sense features to obtain word time sequence features.
A word sense feature refers to a feature that represents the semantic characteristics of a word. The word sequence refers to the semantic order among the words in the text to be processed. A word time sequence feature refers to a feature that characterizes both the sequential and the semantic properties of a word.
Specifically, the server may input each word representation vector to a pre-trained semantic feature extraction network, and extract the word sense feature corresponding to each word representation vector through the network. The server then determines the word sequence of the words from the text to be processed, that is, the word sequence corresponding to the word sense features. The server can then acquire a pre-trained relationship extraction network, input the word sense features and their corresponding word sequence into the relationship extraction network, and extract the dependency relationships among the word sense features through the relationship extraction network, where a dependency relationship refers to a relationship in which words combine to express semantics, and includes dependency relationships between long-distance words. The relationship extraction network, which is a deep learning network, outputs the word time sequence features according to the dependency relationships among the word sense features.
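The two extraction stages can be illustrated with a toy sketch. The passage above does not specify the network architectures, so the following is only a stand-in under stated assumptions: a window-based one-dimensional convolution plays the role of the semantic feature extraction network, and a simple bidirectional recurrent pass plays the role of the relationship extraction network. The kernel weights, the decay factor and all function names are hypothetical, and a real system would use trained networks.

```python
import math

def semantic_features(vectors, kernel=(0.25, 0.5, 0.25)):
    """Window-3 convolution over the word axis with zero padding (hypothetical kernel)."""
    dim = len(vectors[0])
    padded = [[0.0] * dim] + vectors + [[0.0] * dim]
    features = []
    for i in range(len(vectors)):
        window = padded[i:i + 3]
        features.append(
            [sum(k * w[d] for k, w in zip(kernel, window)) for d in range(dim)]
        )
    return features

def recurrent_pass(features, decay=0.5):
    """Simple recurrence h_t = tanh(x_t + decay * h_{t-1}), applied per dimension."""
    state = [0.0] * len(features[0])
    outputs = []
    for x in features:
        state = [math.tanh(xi + decay * si) for xi, si in zip(x, state)]
        outputs.append(state)
    return outputs

def temporal_features(features):
    """Concatenate a forward and a backward pass so each word sees both contexts."""
    forward = recurrent_pass(features)
    backward = recurrent_pass(features[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]

# Three words with 2-dimensional (hypothetical) word representation vectors.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sense = semantic_features(word_vectors)
timing = temporal_features(sense)
```

Each word keeps one feature per stage, and the recurrent state carries information across the whole sequence, which is how dependencies between distant words can influence a word time sequence feature in this sketch.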
Step 206, calculating the attention weight corresponding to each word time sequence feature, and weighting each word time sequence feature by using the attention weight to obtain each word attention feature.
The attention weight refers to the degree of importance of each word time sequence feature. A word attention feature refers to an attention-weighted word time sequence feature.
Specifically, the server calculates an attention weight corresponding to each word time sequence feature by using a preset attention weight calculation parameter, and then weights the corresponding word time sequence feature by using the attention weight corresponding to each word time sequence feature to obtain each word attention feature.
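The weighting step can be sketched as follows. The form of the "preset attention weight calculation parameter" is not given, so this sketch assumes it is a single query vector, scores each word time sequence feature by a dot product with that vector, and normalizes the scores with a softmax. The names attention_weights, apply_attention and query are hypothetical.

```python
import math

def attention_weights(features, query):
    """Softmax over dot-product scores between each feature and the query vector."""
    scores = [sum(f * q for f, q in zip(feature, query)) for feature in features]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def apply_attention(features, weights):
    """Scale every word time sequence feature by its attention weight."""
    return [[w * f for f in feature] for feature, w in zip(features, weights)]

# Hypothetical word time sequence features and attention parameter.
timing_features = [[0.1, 0.2], [0.9, 0.8], [0.3, 0.1]]
query = [1.0, 1.0]
weights = attention_weights(timing_features, query)
attended = apply_attention(timing_features, weights)
```

The weights sum to one, and the feature that aligns best with the query (here the second word) receives the largest weight.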
Step 208, performing abnormal text detection based on the word attention features to obtain the abnormal text possibility corresponding to the text to be processed.
The abnormal text possibility refers to the possible degree of the text to be processed being the abnormal text.
Specifically, the server inputs each word attention characteristic into a preset deep learning network for abnormal text detection, and obtains the possibility of abnormal text corresponding to the text to be processed.
And when the server detects that the possibility of the abnormal text exceeds a preset abnormal threshold, judging that the text to be processed is the abnormal text. The server can preset at least two abnormal threshold values, and corresponding abnormal levels are divided through the abnormal threshold values. For example, when the server detects that the possibility of the abnormal text exceeds a first preset abnormal threshold, the server determines that the text to be processed is a first-level abnormal text, and when the server detects that the possibility of the abnormal text exceeds a second preset abnormal threshold, the server determines that the text to be processed is a second-level abnormal text, and the higher the level is, the more serious the abnormal degree is.
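The multi-threshold grading described above can be sketched as a small function. The threshold values 0.5 and 0.8 are hypothetical placeholders, as is the function name; the text only states that at least two thresholds may be preset and that a higher level means a more serious abnormality.

```python
def abnormal_level(possibility, thresholds=(0.5, 0.8)):
    """Return 0 for a normal text, otherwise the number of thresholds exceeded."""
    level = 0
    for threshold in sorted(thresholds):
        if possibility > threshold:  # "exceeds" is read as strictly greater
            level += 1
    return level

levels = [abnormal_level(p) for p in (0.3, 0.6, 0.9)]
```

With these placeholder thresholds, a possibility of 0.6 is graded as a first-level abnormal text and 0.9 as a second-level abnormal text.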
In the text processing method, the text processing device, the computer equipment, the storage medium and the computer program product, semantic feature extraction is performed on each word representation vector to obtain each word sense feature, and time sequence feature extraction is performed according to the word sequence of the text to be processed and each word sense feature to obtain each word time sequence feature, which improves the accuracy of the word time sequence features. The attention weight corresponding to each word time sequence feature is then calculated, and each word time sequence feature is weighted by its attention weight to obtain each word attention feature, so that the importance of each feature is reflected. Using the word attention features makes the abnormal text detection result more accurate, thereby improving the accuracy of abnormal text detection.
In one embodiment, before obtaining the text to be processed in step 202, the method further includes:
responding to a text search request, wherein the text search request carries keywords, and acquiring communication interfaces corresponding to all information sources;
searching in each information source by using keywords based on the communication interface corresponding to each information source to obtain keyword search information corresponding to each information source;
and searching information based on the keywords corresponding to the information sources to obtain the text to be processed.
The text search request refers to a request for searching a text to be processed. Keywords refer to words used to search for text to be processed. An information source refers to a collection of information used to conduct a text search. The keyword search information refers to information searched according to keywords in an information source.
Specifically, the server acquires the preset communication interfaces corresponding to the information sources, integrates them, and obtains and stores a communication interface set. The server then responds to a text search request sent by the terminal by acquiring the communication interface corresponding to each information source from the communication interface set. The server accesses each information source through its communication interface, searches in each information source according to the keywords carried by the text search request, and acquires the keyword search information returned by each information source for the keywords. The server extracts the text data in the keyword search information corresponding to each information source to obtain the text to be processed. The keyword search information may also include an image; in that case the server can recognize and extract the characters in the image to obtain an image text corresponding to the image, and use the image text as a text to be processed.
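The search flow above can be sketched with an in-memory stand-in for the communication interface set. Everything here is hypothetical: each "interface" is modeled as a plain Python callable that takes a keyword and returns a list of text strings, whereas a real deployment would call each source's actual API. Skipping a source that raises an error is a design choice of this sketch and is not stated in the text.

```python
def collect_texts(keyword, interfaces):
    """Search every information source and gather the returned texts.

    interfaces: mapping of source name -> callable(keyword) -> list of strings.
    """
    collected = []
    for name, search in interfaces.items():
        try:
            results = search(keyword)
        except Exception:
            continue  # assumption: an unreachable source is skipped, not fatal
        for text in results:
            collected.append({"source": name, "text": text})
    return collected

def broken_source(keyword):
    """Hypothetical source that is currently unreachable."""
    raise ConnectionError("source unavailable")

# Hypothetical communication interface set.
interface_set = {
    "forum": lambda kw: [f"forum post about {kw}"],
    "blog": lambda kw: [f"blog entry on {kw}", f"another {kw} blog"],
    "news": broken_source,
}
texts_to_process = collect_texts("loan", interface_set)
```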
In one particular embodiment, the information source may be the Web (World Wide Web). The server obtains the Application Programming Interface (API) of each website by using Mashup technology, and collects the APIs of the websites to obtain an API set. The server then calls a search program to search in each website according to the keywords in the text search request, and each website returns the webpages found by the keyword search, that is, the keyword search information. The server collects the content data of each webpage, and stores the collected data in the form of HTML (HyperText Markup Language) documents to obtain the texts to be processed.
In one embodiment, as shown in fig. 3, a flowchart for acquiring a text to be processed is provided. The server responds to the text search request by accessing each specified website, such as forums, blogs, news websites and the like, through the API of that website and searching according to the keywords. The server then collects the webpage information found at each specified website, aggregates the webpage information from all specified websites through a mashup algorithm, and stores the aggregated webpage information in a database of texts to be processed.
In this embodiment, the communication interface set is obtained by collecting the communication interfaces corresponding to the information sources, so that the communication interface set can subsequently be used to connect quickly to the designated information sources and obtain information, which improves the efficiency of obtaining the text to be processed.
In one embodiment, step 202, performing word vector characterization on the text to be processed to obtain each word characterization vector, includes:
performing standardization processing based on the text to be processed to obtain a standard text to be processed;
filtering interference fields in the standard text to be processed to obtain a filtered text to be processed;
and performing word vector representation based on the filtered text to be processed to obtain each word representation vector.
The normalization process refers to a process of normalizing a text format. The standard text to be processed refers to the text with uniform format. The interference field refers to a field having no semantics. The filtered text to be processed refers to the text with the interference fields filtered out.
Specifically, the server performs standardization processing on the text to be processed according to preset format requirement information, for example, unifying the upper and lower case of English letters in the text to be processed, converting traditional Chinese characters in the text to be processed into simplified characters, and the like, to obtain the standard text to be processed. The server then imports an interference word lexicon, and filters out and deletes from the standard text to be processed each interference field found in the lexicon, to obtain the filtered text to be processed. An interference field may be a stop word, that is, a function word without actual meaning, such as "and", "general", etc.
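The standardization and filtering steps can be sketched as below. This is a simplified illustration: it only unifies letter case and whitespace, and it splits on whitespace, which is an assumption that suits English examples (Chinese text would need a word segmenter). The function names and the small interference word set are hypothetical.

```python
def standardize(text):
    """Unify letter case and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def filter_interference(standard_text, interference_words):
    """Drop every word that appears in the interference word lexicon."""
    return [w for w in standard_text.split() if w not in interference_words]

# Hypothetical interference word lexicon (stop words).
interference_lexicon = {"and", "the", "a"}
standard_text = standardize("The  Loan AND the Offer")
filtered_words = filter_interference(standard_text, interference_lexicon)
```

Here the standard text is "the loan and the offer" and the filtered result keeps only the words "loan" and "offer".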
The server can then input the filtered text to be processed into a pre-trained Word2Vec model (a word embedding model) for vector conversion. The Word2Vec model is a three-layer shallow neural network comprising an input layer, a hidden layer and an output layer, where the hidden layer does not contain an activation function, and the dimensions of the input layer and the output layer are the same. The Word2Vec model outputs a word representation vector corresponding to each word in the filtered text to be processed, and the word representation vector can be a high-dimensional vector. The server then establishes a mapping relation between each word in the text to be processed and its corresponding word representation vector, and generates a word vector mapping table. The server can also directly obtain a pre-stored word vector mapping table, and look up the corresponding word representation vector in the word vector mapping table for each word in the filtered text to be processed.
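Because the hidden layer has no activation function, projecting a one-hot input through it amounts to selecting one row of the input weight matrix, and that row serves as the word representation vector. The sketch below illustrates this and the resulting word vector mapping table. The vocabulary, the weight values and the function names are hypothetical; trained weights would come from an actual Word2Vec model.

```python
def one_hot(index, size):
    """One-hot input vector with the same dimension as the vocabulary."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def hidden_layer(x, weights):
    """Linear projection without an activation function."""
    dim = len(weights[0])
    return [sum(x[i] * weights[i][d] for i in range(len(x))) for d in range(dim)]

# Hypothetical vocabulary and input-to-hidden weights (one row per word).
vocabulary = ["loan", "offer", "free"]
input_weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Word vector mapping table: word -> word representation vector.
mapping_table = {
    word: hidden_layer(one_hot(i, len(vocabulary)), input_weights)
    for i, word in enumerate(vocabulary)
}

def lookup(words, table):
    """Fetch vectors for known words; unknown words are skipped in this sketch."""
    return [table[w] for w in words if w in table]

vectors = lookup(["free", "unknown", "loan"], mapping_table)
```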
In this embodiment, the word representation vectors are obtained after preprocessing the text to be processed, so that the word representation vectors meet the data requirements of abnormal text detection, which improves the accuracy of abnormal text detection.
In one embodiment, as shown in FIG. 4, a flow diagram of abnormal text processing is provided; after obtaining the abnormal text possibility corresponding to the text to be processed in step 208, the method further includes:
step 402, when detecting that the abnormal text possibility exceeds a preset abnormal possibility threshold, determining that the text to be processed is an abnormal text;
step 404, generating abnormal alarm and abnormal text clustering confirmation information based on the abnormal text, and sending the abnormal alarm and abnormal text clustering confirmation information to the management terminal;
step 406, receiving a confirmation result corresponding to the abnormal text clustering confirmation information, and clustering related words in the text to be processed based on the clustering central words to obtain target related words corresponding to the clustering central words;
and step 408, taking the target associated word as an abnormal text clustering result, and returning the abnormal text clustering result to the management terminal.
The preset abnormal possibility threshold is a threshold set in advance for judging whether the text to be processed is an abnormal text. The abnormal alarm refers to an alarm generated when an abnormal text is detected. The abnormal text clustering confirmation information is confirmation information sent to the management terminal when an abnormal text is detected, to confirm whether abnormal text clustering is to be performed. Abnormal text clustering refers to searching for and gathering associated words in the text to be processed according to cluster central words, and can be used to retrieve high-frequency events. A cluster central word refers to a word that represents a current hot event in the text to be processed and is used for quickly identifying abnormal texts. A target associated word refers to a word associated with the cluster central word.
Specifically, when the server detects that the abnormal text possibility exceeds the preset abnormal possibility threshold, the server determines that the text to be processed is an abnormal text, and generates an abnormal alarm and abnormal text clustering confirmation information according to this determination. The server sends the abnormal text clustering confirmation information to the management terminal for display, and waits for the confirmation result corresponding to the abnormal text clustering confirmation information.
When the server receives the confirmation result sent by the management terminal and the result confirms that abnormal text clustering is to be performed, the server uses the keywords as the cluster central words. The server can also count the occurrence frequency of each word in the text to be processed and use the word with the highest occurrence frequency as the cluster central word, or use each word whose occurrence frequency reaches a preset frequency threshold as a cluster central word. The server then clusters the associated words in the text to be processed according to the cluster central words to obtain the target associated words corresponding to the cluster central words, where there are at least two target associated words. The server then takes the target associated words as the abnormal text clustering result and sends the abnormal text clustering result to the management terminal.
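The two frequency-based ways of choosing cluster central words can be sketched with the standard library. The function name and the sample words are hypothetical, and the input is assumed to be the already filtered word list of the text to be processed.

```python
from collections import Counter

def cluster_central_words(words, min_count=None):
    """Pick the most frequent word(s), or every word reaching min_count."""
    counts = Counter(words)
    if not counts:
        return []
    if min_count is None:
        top = max(counts.values())
        return [w for w, c in counts.items() if c == top]
    return [w for w, c in counts.items() if c >= min_count]

words = ["loan", "free", "loan", "offer", "free", "loan"]
most_frequent = cluster_central_words(words)
above_threshold = cluster_central_words(words, min_count=2)
```

With this sample, the highest-frequency rule selects "loan", while a frequency threshold of 2 selects both "loan" and "free".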
In the embodiment, the abnormal alarm and the abnormal text clustering confirmation information are generated according to the abnormal text, and when the confirmation result corresponding to the abnormal text clustering confirmation information is received, the related words are clustered in the text to be processed according to the clustering central words, the target related words corresponding to the clustering central words are obtained, and the abnormal text clustering result is generated, so that the management terminal processes the abnormal text clustering result, and the management efficiency of the abnormal text is improved.
In one embodiment, step 406, performing associated word clustering in the text to be processed based on the cluster central word to obtain the target associated word corresponding to the cluster central word, includes:
performing word vector representation based on the cluster central word to obtain a cluster central word representation vector;
respectively calculating the vector distance between the cluster central word representation vector and each word representation vector, and determining associated word representation vectors among the word representation vectors based on the vector distances;
and obtaining the target associated words based on the associated word representation vectors.
The cluster central word representation vector refers to the vector data obtained after vector conversion of the cluster central word.
Specifically, the server performs word vector representation on the cluster central word to obtain the cluster central word representation vector. The server can also directly take, from among the word representation vectors, the word representation vector corresponding to the keyword and use it as the cluster central word representation vector.
The server then calculates the vector distance between the cluster central word representation vector and each word representation vector, and determines each word representation vector whose vector distance is smaller than a preset vector distance threshold to be an associated word representation vector. The target associated words corresponding to the associated word representation vectors are then obtained from the word vector mapping table.
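The distance-threshold selection can be sketched as follows. The text does not name the distance measure, so Euclidean distance is an assumption here, as are the threshold value, the sample vectors and the function names. The word vector mapping table is modeled as a dictionary from word to vector.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def target_associated_words(central_word, mapping_table, max_distance):
    """Words whose vectors lie closer to the central word than max_distance."""
    center = mapping_table[central_word]
    return [
        word
        for word, vector in mapping_table.items()
        if word != central_word and euclidean(center, vector) < max_distance
    ]

# Hypothetical word vector mapping table.
mapping_table = {
    "loan": [1.0, 1.0],
    "credit": [1.1, 0.9],
    "interest": [0.8, 1.2],
    "weather": [5.0, -3.0],
}
associated = target_associated_words("loan", mapping_table, max_distance=0.5)
```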
In one embodiment, the server may calculate the occurrence frequency and weight of each word in the text to be processed by using the TF-IDF (term frequency-inverse document frequency, a statistical method) algorithm, and output the keywords corresponding to the text to be processed, that is, the cluster central words. The server obtains the cluster central word representation vectors from among the word representation vectors. The server then takes the cluster central word representation vectors as the initial cluster centers for all word representation vectors in the text to be processed, where there can be at least two initial cluster centers, and different initial cluster centers represent different classes.
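A minimal TF-IDF sketch is given below. It assumes a small reference corpus of tokenized texts is available for the document frequencies, and it uses a smoothed inverse document frequency, which is one common variant and not necessarily the one intended in the text. The corpus, the function name and the choice of two keywords are hypothetical.

```python
import math
from collections import Counter

def tf_idf(document, corpus):
    """Score each word of a tokenized document against a tokenized corpus."""
    counts = Counter(document)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / len(document)
        containing = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + containing)) + 1  # smoothed variant
        scores[word] = tf * idf
    return scores

# Hypothetical corpus; the first entry is the text to be processed.
corpus = [
    ["loan", "free", "loan", "now"],
    ["free", "offer"],
    ["free", "now", "call"],
]
scores = tf_idf(corpus[0], corpus)
# Take the two highest-scoring words as cluster central words.
central_words = sorted(scores, key=scores.get, reverse=True)[:2]
```

The word "loan" scores highest because it is frequent in the text and rare in the rest of the corpus, while "free" is penalized for appearing in every document.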
The server may use the K-means algorithm to calculate the vector distance between each word representation vector and each initial clustering center, and assign each word representation vector to the class of its nearest initial clustering center according to those distances, which completes one adjustment of the classes. The server then calculates a new clustering center for each adjusted class and judges whether the clustering criterion has converged. If the clustering criterion has converged, or the clustering centers do not change between two adjacent iterations, the class adjustment is finished and the associated word representation vectors corresponding to each clustering center are obtained. Otherwise, the server continues to adjust the classes by clustering the word representation vectors until the clustering criterion converges or the clustering centers no longer change between two adjacent iterations. The server then obtains each class output by the K-means algorithm and the associated word representation vectors corresponding to each class.
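The iteration described above corresponds to the standard K-means (Lloyd) loop seeded with given centers. The sketch below is a plain-Python illustration under assumptions: Euclidean distance, a stopping rule based on unchanged assignments, and hypothetical toy vectors. It is not the patented implementation.

```python
import math


def kmeans(vectors, initial_centers, max_iter=100):
    """K-means seeded with the given initial centers.

    Returns (labels, centers), where labels[i] is the class index of vectors[i].
    """
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins the class of its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(v, centers[k]))
            for v in vectors
        ]
        if new_labels == labels:
            # Assignments did not change, so the centers will not change either.
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the previous center if a class is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical word representation vectors; two of them act as initial centers.
vectors = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]]
labels, centers = kmeans(vectors, initial_centers=[vectors[0], vectors[2]])
```

Seeding the centers with keyword vectors, as the text describes, replaces the random initialization that K-means usually relies on.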
In this embodiment, the vector distance between the clustering central word representation vector and each word representation vector is calculated, and the associated word representation vectors are determined among the word representation vectors based on the vector distance, which improves the accuracy of associated word clustering.
In one embodiment, after obtaining the abnormal text possibility corresponding to the text to be processed in step 208, the method further includes:
acquiring a preset alarm word, and performing word matching in the text to be processed based on the preset alarm word;
when detecting that a text alarm word matched with a preset alarm word is present in the text to be processed, generating a text alarm based on the text alarm word;
and sending the text alarm to the management terminal.
The preset alarm words are preset sensitive words. A text alarm is an alarm generated when an alarm word is detected. The text alarm words are the words in the text to be processed that are the same as the preset alarm words.
Specifically, the server obtains the preset alarm words and performs word matching in the text to be processed according to them. When the server detects a text alarm word in the text to be processed that matches a preset alarm word, which indicates that the text to be processed contains sensitive information, it generates a text alarm according to the text alarm word and sends the text alarm to the management terminal.
In this embodiment, by setting preset alarm words and performing word matching in the text to be processed according to them, the presence of text alarm words can be detected quickly, which improves the efficiency of monitoring the text to be processed for sensitive information.
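A minimal sketch of the alarm-word matching is shown here. Case-insensitive substring matching, the alarm record layout and the sample words are assumptions for illustration; sending the alarm to the management terminal is left out because the text does not describe the transport.

```python
def find_alarm_words(text, preset_alarm_words):
    """Return the preset alarm words that occur in the text, in preset order."""
    lowered = text.lower()
    return [w for w in preset_alarm_words if w.lower() in lowered]


def build_text_alarm(text, preset_alarm_words):
    """Build a text alarm record, or return None when no alarm word matches."""
    hits = find_alarm_words(text, preset_alarm_words)
    if not hits:
        return None
    # Hypothetical record layout; a real system would define its own schema.
    return {"type": "text_alarm", "alarm_words": hits, "excerpt": text[:80]}


# Hypothetical preset alarm words and sample text.
preset = ["guaranteed", "no credit check", "lottery"]
alarm = build_text_alarm("Guaranteed loan approval, no credit check", preset)
no_alarm = build_text_alarm("Quarterly weather summary", preset)
```

Because this check is a plain lookup, it can run independently of the detection model.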
In one embodiment, as shown in fig. 5, a flow diagram of abnormal text detection is provided. The server obtains the keywords in the text search request, searches for related web pages on each designated website according to the keywords, obtains the information of each web page, and stores it in a database of texts to be processed.
The server performs data preprocessing on the information of each web page to obtain a text to be processed. The text to be processed is then input into the abnormal text detection model for abnormal text detection, yielding the abnormal text possibility corresponding to the text to be processed.
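The preprocessing step can be pictured as normalization followed by filtering of interference fields, which is how the preprocessing unit is described elsewhere in this document. The sketch below is only an illustration: treating HTML tags and URLs as interference fields, using Unicode NFKC folding and lowercasing as the normalization, and the function names are all assumptions.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")      # HTML-like tags (assumed interference field)
URL_RE = re.compile(r"https?://\S+")  # URLs (assumed interference field)
SPACE_RE = re.compile(r"\s+")


def normalize(text):
    """Standardize the text: Unicode NFKC folding and lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()


def filter_interference(text):
    """Drop the assumed interference fields and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess(page):
    """Normalize first, then filter, yielding the text to be processed."""
    return filter_interference(normalize(page))


# "\uff2c" is a full-width "L", which NFKC folds to an ordinary "L".
cleaned = preprocess("<p>\uff2coan  OFFER</p> see https://example.com/x now")
```

For this sample input the result is `"loan offer see now"`.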
In one embodiment, as shown in fig. 6, a schematic structural diagram of an abnormal text detection model is provided. The abnormal text detection model comprises an input layer, a CNN-LSTM model (a spatio-temporal network), an attention layer, an activation layer and a fully connected layer. The server inputs the word representation vectors obtained after data preprocessing into the abnormal text detection model, specifically into the convolution layer of the CNN-LSTM network. The convolution layer performs semantic feature extraction on the word representation vectors and outputs the word sense features. A word sense feature may be a high-level abstract feature hidden in the text data, such as the gender implied by a name.
The word sense features output by the convolution layer are then input into the LSTM hidden layer (long short-term memory network) of the CNN-LSTM network. The long short-term memory network extracts the dependency relationships among the word sense features, performs time sequence feature extraction on the word sense features according to those dependency relationships, and outputs the word time sequence features.
The word time sequence features output by the long short-term memory network are then input into the attention layer, which calculates the weight corresponding to each word time sequence feature, applies the weighting, and outputs the word attention features. The word attention features are input into the activation layer and the fully connected layer in sequence, and the fully connected layer outputs the abnormal text possibility corresponding to the text to be processed.
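The attention weighting step can be sketched in plain Python as follows. The scoring function (a dot product with a query vector), the softmax normalization, and the toy feature values are assumptions; the text states only that a weight is calculated for each word time sequence feature and applied to it. In a trained model the query vector would be a learned parameter.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(timing_features, query):
    """Weight each word's time sequence feature by a normalized score.

    Returns (weights, attended), where attended[t] = weights[t] * timing_features[t].
    """
    # Score each time step by its dot product with the (hypothetical) query vector.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in timing_features]
    weights = softmax(scores)
    attended = [[w * h_i for h_i in h] for w, h in zip(weights, timing_features)]
    return weights, attended


# Hypothetical LSTM outputs for a three-word text, two hidden units each.
timing_features = [[0.1, 0.0], [2.0, 1.0], [0.3, 0.2]]
query = [1.0, 1.0]
weights, attended = attention(timing_features, query)
```

Here the second word receives the largest weight, so its feature dominates what is passed on to the activation and fully connected layers.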
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed sequentially as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides a text processing apparatus for implementing the above-mentioned text processing method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme described in the method, so the specific limitations in one or more embodiments of the text processing device provided below may refer to the limitations on the text processing method in the foregoing, and are not described herein again.
In one embodiment, as shown in fig. 7, there is provided a text processing apparatus 700 comprising: vector characterization module 702, feature extraction module 704, attention module 706, and detection module 708, wherein:
the vector characterization module 702 is configured to obtain a text to be processed, and perform word vector representation based on the text to be processed to obtain each word representation vector;
a feature extraction module 704, configured to perform semantic feature extraction based on each word representation vector to obtain each word sense feature, and perform time sequence feature extraction according to the word sequence of the text to be processed and each word sense feature to obtain each word time sequence feature;
the attention module 706 is configured to calculate an attention weight corresponding to each word timing feature, and weight each word timing feature by using the attention weight to obtain each word attention feature;
the detection module 708 is configured to perform abnormal text detection based on the attention features of each word, and obtain the abnormal text possibility corresponding to the text to be processed.
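The way the four modules chain together can be illustrated with a small composition sketch. The class name, the callable signatures and the toy stand-in functions are hypothetical placeholders; real modules would wrap an embedding lookup, the CNN-LSTM network, the attention layer and the classifier.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class TextProcessingApparatus:
    """Chains the four modules in the order the embodiment lists them."""
    vectorize: Callable[[str], List[Vector]]                 # module 702
    extract_features: Callable[[List[Vector]], List[Vector]]  # module 704
    attend: Callable[[List[Vector]], List[Vector]]           # module 706
    detect: Callable[[List[Vector]], float]                  # module 708

    def process(self, text: str) -> float:
        vectors = self.vectorize(text)
        timing = self.extract_features(vectors)
        attended = self.attend(timing)
        return self.detect(attended)


# Toy stand-ins, for illustration only; none of them is a trained model.
def toy_vectorize(text):
    return [[float(len(word))] for word in text.split()]


def toy_extract(vectors):
    return vectors


def toy_attend(features):
    if not features:
        return []
    w = 1.0 / len(features)  # uniform weights instead of learned attention
    return [[w * x for x in f] for f in features]


def toy_detect(features):
    return min(1.0, sum(sum(f) for f in features) / 10.0)


apparatus = TextProcessingApparatus(toy_vectorize, toy_extract, toy_attend, toy_detect)
score = apparatus.process("aa bbb")
```

Keeping each module behind a callable makes it straightforward to swap a stand-in for a trained component without changing the pipeline.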
In one embodiment, the text processing apparatus 700 further includes:
the text search unit is used for responding to a text search request carrying keywords and acquiring the communication interface corresponding to each information source; searching in each information source by using the keywords based on the communication interface corresponding to that information source to obtain the keyword search information corresponding to each information source; and obtaining the text to be processed based on the keyword search information corresponding to each information source.
In one embodiment, vector characterization module 702 includes:
the preprocessing unit is used for carrying out standardization processing on the text to be processed to obtain a standard text to be processed; filtering interference fields in the standard text to be processed to obtain a filtered text to be processed; and performing word vector representation based on the filtered text to be processed to obtain each word representation vector.
In one embodiment, the text processing apparatus 700 further includes:
the text clustering unit is used for determining the text to be processed to be an abnormal text when detecting that the abnormal text possibility exceeds a preset abnormal possibility threshold; generating an abnormal alarm and abnormal text clustering confirmation information based on the abnormal text, and sending the abnormal alarm and the abnormal text clustering confirmation information to the management terminal; receiving a confirmation result corresponding to the abnormal text clustering confirmation information, and clustering associated words in the text to be processed based on the clustering central words to obtain target associated words corresponding to the clustering central words; and taking the target associated words as an abnormal text clustering result, and returning the abnormal text clustering result to the management terminal.
In one embodiment, the text processing apparatus 700 further includes:
the clustering unit is used for performing word vector representation based on the clustering central words to obtain clustering central word representation vectors; respectively calculating vector distances between the clustering central word representation vectors and the word representation vectors, and determining associated word representation vectors among the word representation vectors based on the vector distances; and obtaining the target associated words based on the associated word representation vectors.
In one embodiment, the text processing apparatus 700 further includes:
the alarm unit is used for acquiring a preset alarm word and performing word matching in the text to be processed based on the preset alarm word; when detecting that a text alarm word matching the preset alarm word is present in the text to be processed, generating a text alarm based on the text alarm word; and sending the text alarm to the management terminal.
The modules in the text processing apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, an Input/Output interface (I/O for short), and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the text to be processed. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a text processing method.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by the processor to implement a text processing method. The display unit of the computer device is used for forming a visible picture and may be a display screen, a projection device or a virtual reality imaging device; the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a track ball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse.
Those skilled in the art will appreciate that the configurations illustrated in fig. 8-9 are merely block diagrams of the parts of the configurations related to aspects of the present application, and do not limit the computer devices to which aspects of the present application may be applied; a particular computer device may include more or fewer components than those illustrated, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant countries and regions.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing based data processing logic devices, etc., without limitation.
The technical features of the above embodiments may be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of the technical features involves no contradiction, it should be considered to be within the scope of the present disclosure.
The above-mentioned embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of text processing, the method comprising:
acquiring a text to be processed, and performing word vector representation based on the text to be processed to obtain each word representation vector;
performing semantic feature extraction based on the word representation vectors to obtain word sense features, and performing time sequence feature extraction according to the word sequence of the text to be processed and the word sense features to obtain word time sequence features;
calculating attention weights corresponding to the word time sequence characteristics, and weighting the word time sequence characteristics by using the attention weights to obtain the word attention characteristics;
and performing abnormal text detection based on the word attention features to obtain the abnormal text possibility corresponding to the text to be processed.
2. The method according to claim 1, further comprising, before the obtaining the text to be processed:
responding to a text search request, wherein the text search request carries keywords, and acquiring communication interfaces corresponding to all information sources;
searching in each information source by using the keywords based on the communication interface corresponding to each information source to obtain keyword search information corresponding to each information source;
and obtaining the text to be processed based on the keyword search information corresponding to each information source.
3. The method of claim 1, wherein the performing word vector representation based on the text to be processed to obtain each word representation vector comprises:
performing standardization processing on the text to be processed to obtain a standard text to be processed;
filtering the interference fields in the standard text to be processed to obtain a filtered text to be processed;
and performing word vector representation on the basis of the filtered text to be processed to obtain each word representation vector.
4. The method according to claim 1, further comprising, after obtaining the abnormal text likelihood corresponding to the text to be processed, the following steps:
when the abnormal text possibility is detected to exceed a preset abnormal possibility threshold, determining that the text to be processed is an abnormal text;
generating abnormal alarm and abnormal text clustering confirmation information based on the abnormal text, and sending the abnormal alarm and the abnormal text clustering confirmation information to a management terminal;
receiving a confirmation result corresponding to the abnormal text clustering confirmation information, and clustering associated words in the text to be processed based on clustering central words to obtain target associated words corresponding to the clustering central words;
and taking the target associated word as an abnormal text clustering result, and returning the abnormal text clustering result to the management terminal.
5. The method according to claim 4, wherein the clustering of the associated words in the text to be processed based on the clustering central words to obtain the target associated words corresponding to the clustering central words comprises:
performing word vector representation based on the clustering central words to obtain clustering central word representation vectors;
respectively calculating the vector distance between the clustering central word representation vector and each word representation vector, and determining associated word representation vectors among the word representation vectors based on the vector distance;
and obtaining the target associated words based on the associated word representation vectors.
6. The method according to claim 1, further comprising, after obtaining the abnormal text likelihood corresponding to the text to be processed, the step of:
acquiring a preset alarm word, and performing word matching in the text to be processed based on the preset alarm word;
when detecting that a text alarm word matched with the preset alarm word is present in the text to be processed, generating a text alarm based on the text alarm word;
and sending the text alarm to a management terminal.
7. A text processing apparatus, characterized in that the apparatus comprises:
the vector characterization module is used for acquiring a text to be processed, and performing word vector representation based on the text to be processed to obtain each word representation vector;
the feature extraction module is used for extracting semantic features based on the word representation vectors to obtain word sense features, and extracting time sequence features according to the word sequence of the text to be processed and the word sense features to obtain word time sequence features;
the attention module is used for calculating attention weights corresponding to the word time sequence characteristics and weighting the word time sequence characteristics by using the attention weights to obtain the word attention characteristics;
and the detection module is used for detecting abnormal texts based on the attention characteristics of the words to obtain the possibility of the abnormal texts corresponding to the texts to be processed.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 6 when executed by a processor.
CN202211410457.0A 2022-11-11 2022-11-11 Text processing method and device, computer equipment and storage medium Pending CN115859176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211410457.0A CN115859176A (en) 2022-11-11 2022-11-11 Text processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211410457.0A CN115859176A (en) 2022-11-11 2022-11-11 Text processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115859176A true CN115859176A (en) 2023-03-28

Family

ID=85663140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211410457.0A Pending CN115859176A (en) 2022-11-11 2022-11-11 Text processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115859176A (en)

Similar Documents

Publication Publication Date Title
US11475143B2 (en) Sensitive data classification
CA3083729C (en) Domain-specific natural language understanding of customer intent in self-help
JP6759844B2 (en) Systems, methods, programs and equipment that associate images with facilities
AU2017221945B2 (en) Method and device of identifying network access behavior, server and storage medium
CA3083723C (en) Method and apparatus for providing personalized self-help experience
US20210248149A1 (en) Computer-Based Systems for Data Entity Matching Detection Based on Latent Similarities in Large Datasets and Methods of Use Thereof
CN106874253A (en) Recognize the method and device of sensitive information
CN110516210B (en) Text similarity calculation method and device
CN114692593B (en) Network information safety monitoring and early warning method
Jayawardhana et al. An ontology-based framework for extracting spatio-temporal influenza data using Twitter
CN108763961B (en) Big data based privacy data grading method and device
CN113722484A (en) Rumor detection method, device, equipment and storage medium based on deep learning
KR20170060958A (en) Method and system for preventing bank fraud
CN108875050B (en) Text-oriented digital evidence-obtaining analysis method and device and computer readable medium
US12056201B2 (en) Systems and methods for automatic and adaptive browser bookmarks
Subramani et al. Text mining and real-time analytics of twitter data: A case study of australian hay fever prediction
Wu et al. Sub-event discovery and retrieval during natural hazards on social media data
US20190318223A1 (en) Methods and Systems for Data Analysis by Text Embeddings
Hendrickson et al. Identifying exceptional descriptions of people using topic modeling and subgroup discovery
CN115859176A (en) Text processing method and device, computer equipment and storage medium
US20220222300A1 (en) Systems and methods for temporal and visual feature driven search utilizing machine learning
CN117278298A (en) Domain name detection method, device, equipment and storage medium based on artificial intelligence
CN116975083A (en) Information searching method, information searching device, computer equipment and storage medium
CN118170905A (en) Method, device, equipment, storage medium and product for constructing article knowledge base
CN116955751A (en) Crawler identification method, crawler identification device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant after: Zhaolian Consumer Finance Co.,Ltd.

Applicant after: SUN YAT-SEN University

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant before: MERCHANTS UNION CONSUMER FINANCE Co.,Ltd.

Country or region before: China

Applicant before: SUN YAT-SEN University
