CN112580329B - Text noise data identification method, device, computer equipment and storage medium - Google Patents

Text noise data identification method, device, computer equipment and storage medium

Info

Publication number
CN112580329B
CN112580329B (application CN201910944355.9A)
Authority
CN
China
Prior art keywords: data, sentence, carrying, training, training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910944355.9A
Other languages
Chinese (zh)
Other versions
CN112580329A (en)
Inventor
韩旭红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201910944355.9A priority Critical patent/CN112580329B/en
Publication of CN112580329A publication Critical patent/CN112580329A/en
Application granted granted Critical
Publication of CN112580329B publication Critical patent/CN112580329B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The text noise data recognition method segments text data into sentences and uses each segmented sentence as the basic unit of processing, converting a complex text-data processing task into a simple sentence-data processing task. Unlike conventional approaches, which apply a dropout mechanism to neurons, the method applies the dropout mechanism to the labeled training data itself.

Description

Text noise data identification method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a computer device, and a storage medium for recognizing text noise data.
Background
Natural language processing is an important direction in the fields of computer science and artificial intelligence; it studies theories and methods that enable effective communication between humans and computers in natural language. Text data processing can be regarded as the basis of natural language processing and is an important part of it.
When text data is analyzed, noise data can have a strongly adverse effect on the analysis, so methods have emerged that identify noise in text data with machine learning or deep learning algorithms. Traditional text noise identification methods mostly label different types of text data, such as sentence or phrase data, and then identify noise by recognizing the labeled data.
Although such methods can identify noise data to a certain extent, they require substantial labeling work, which consumes labor. The computer must also execute a large number of recognition operations, which greatly increases hardware consumption and thus greatly reduces the processing speed of identification. Meanwhile, the large amount of labeled data affects the accuracy of noise identification. Traditional text noise data identification methods therefore suffer from low identification efficiency.
Disclosure of Invention
Based on this, and in view of the low efficiency of existing text noise data recognition, it is necessary to provide an efficient text noise data recognition method, apparatus, computer device, and storage medium.
A text noise data recognition method, the method comprising:
acquiring text data;
dividing sentences of the text data to obtain segmented sentences and extracting position vectors of the segmented sentences;
inputting the segmented sentences into a trained sentence correlation classification model, adding tag data into the segmented sentences to obtain sentence correlation vectors, wherein the sentence correlation vectors are feature vectors which are output by a hidden layer of the trained sentence correlation classification model and used for representing sentence information, and the sentence correlation classification model is obtained by performing dropout processing training on training data carrying tag data by adopting a dropout mechanism;
and splicing the sentence correlation vector and the position vector to obtain a splicing matrix, and carrying out noise prediction on the text data based on the splicing matrix to obtain a noise identification result.
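The claimed steps can be sketched as a minimal pipeline. All helper names here (`split_sentences`, `correlation_model`, `predict_noise`) are hypothetical stand-ins, since the patent does not fix any concrete API; the relative position encoding is likewise only an illustrative assumption.

```python
def identify_noise(text, correlation_model, predict_noise, split_sentences):
    """Hedged sketch of the claimed method: segment, vectorize, splice, predict."""
    sentences = split_sentences(text)                 # sentence segmentation
    # illustrative position encoding: offset from the middle sentence
    positions = [i - len(sentences) // 2 for i in range(len(sentences))]
    results = []
    for sent, pos in zip(sentences, positions):
        corr_vec = correlation_model(sent)            # hidden-layer feature vector (list)
        spliced = corr_vec + [pos]                    # splice correlation + position
        results.append(predict_noise(spliced))        # 0 = non-noise, 1 = noise
    return results
```

With stub components, the pipeline runs end to end and returns one prediction per segmented sentence.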
In one embodiment, sentence processing of text data includes:
dividing the text data into a plurality of sentences by adopting a preset sentence dividing algorithm;
dividing or splicing the segmented sentences according to a preset sentence length threshold value to ensure that the length of the segmented sentences meets the preset sentence length threshold value.
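One plausible reading of this length-control embodiment is the sketch below: over-long sentences are re-split (the later description suggests splitting at commas), and short pieces are spliced forward until the threshold would be exceeded. The function name and the greedy merge strategy are assumptions, not the patent's exact procedure.

```python
def control_length(sentences, max_len):
    """Keep sentence lengths at or below max_len where possible:
    re-split long sentences at commas, then greedily splice short pieces."""
    pieces = []
    for s in sentences:
        if len(s) > max_len:
            pieces.extend(p for p in s.split(",") if p)   # re-split long sentences
        else:
            pieces.append(s)
    merged, buf = [], ""
    for p in pieces:
        if buf and len(buf) + len(p) <= max_len:
            buf += p                                       # splice short piece with the next
        else:
            if buf:
                merged.append(buf)
            buf = p
    if buf:
        merged.append(buf)
    return merged
```

A comma-free sentence longer than `max_len` cannot be shortened by this sketch, so the threshold holds only "in principle", as the description itself says.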
In one embodiment, before inputting the segmented sentence into the trained sentence correlation classification model, the method further comprises:
collecting historical text data, wherein the historical text data carries labeling information;
according to the labeling information, sentence segmentation and labeling are carried out on the historical text data, so that training data carrying label data are obtained;
setting corresponding dropout probability for training data carrying label data;
based on the dropout probability, carrying out dropout processing on training data carrying tag data, and updating the training data;
and training the initial sentence correlation classification model by using the updated training data to obtain a trained sentence correlation classification model.
In one embodiment, according to the labeling information, sentence segmentation and labeling are performed on the historical text data, and obtaining training data carrying label data includes:
dividing the historical text data into a plurality of sentences;
identifying annotation information of the historical text data;
if the labeling information of the historical text data is noise data, marking the labels of the sentences segmented from the historical text data as irrelevant labels, obtaining training data carrying irrelevant labels;
if the labeling information of the historical text data is non-noise data, marking the labels of the sentences segmented from the historical text data as relevant labels, obtaining training data carrying relevant labels.
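The label-propagation step above (document-level annotation inherited by every segmented sentence: noise document, irrelevant sentences; non-noise document, relevant sentences) can be sketched as follows. The data shape and function name are illustrative assumptions.

```python
def label_sentences(documents):
    """documents: list of (sentences, is_noise) pairs, where is_noise is the
    document-level annotation. Each sentence inherits a relevance label:
    0 (irrelevant) for sentences from noise documents,
    1 (relevant) for sentences from non-noise documents."""
    training = []
    for sentences, is_noise in documents:
        label = 0 if is_noise else 1
        training.extend((s, label) for s in sentences)
    return training
```

This is exactly why no per-sentence annotation is needed: only the document-level flag is hand-labeled.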
In one embodiment, setting the corresponding dropout probability for the training data carrying the tag data includes:
respectively inputting training data carrying relevant labels and training data carrying irrelevant labels into an initial sentence correlation classification model;
and setting a first dropout probability for the training data carrying relevant labels by adopting a dropout mechanism, and setting a second dropout probability for the training data carrying irrelevant labels by adopting the dropout mechanism.
In one embodiment, based on the dropout probability, dropout processing is performed on training data carrying tag data, and updating the training data includes:
based on the first dropout probability, randomly discarding part of training data carrying the relevant labels to obtain a first training set;
based on the second dropout probability, randomly discarding part of the training data carrying irrelevant labels to obtain a second training set;
and combining the first training set and the second training set as new training data, inputting it into the initial sentence correlation classification model again, and returning to the step of randomly discarding part of the training data carrying relevant labels based on the first dropout probability, until the number of returns reaches a preset count threshold.
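A minimal sketch of this data-level dropout loop, assuming "dropout probability p" means each example is discarded with probability p and using the 0.4/0.8 values given in a later embodiment; function names and the per-example keep-or-drop reading are assumptions.

```python
import random

def data_dropout(relevant, irrelevant, p_rel=0.4, p_irr=0.8, rng=None):
    """Randomly discard a fraction of each label group (data-level dropout).
    p_rel < p_irr because irrelevant labels, inherited from noise documents,
    are considered less trustworthy in the embodiment."""
    rng = rng or random.Random()
    first = [x for x in relevant if rng.random() >= p_rel]     # first training set
    second = [x for x in irrelevant if rng.random() >= p_irr]  # second training set
    return first + second

def train_with_dropout(relevant, irrelevant, train_step, epochs=3, seed=0):
    """Repeat the discard-and-train cycle until the preset count threshold."""
    rng = random.Random(seed)
    for _ in range(epochs):
        batch = data_dropout(relevant, irrelevant, rng=rng)
        train_step(batch)  # e.g. one fitting pass of the classification model
```

Because a fresh random subset is drawn each cycle, the model sees different training data every time, which is the stated overfitting defense.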
A text noise data recognition device, the device comprising:
the data acquisition module is used for acquiring text data;
the sentence dividing processing module is used for dividing the text data to obtain divided sentences and extracting position vectors of the divided sentences;
the sentence correlation processing module is used for inputting the segmented sentences into a trained sentence correlation classification model to obtain sentence correlation vectors, wherein the sentence correlation vectors are feature vectors which are output by a hidden layer of the trained sentence correlation classification model and used for representing sentence information, and the sentence correlation classification model is obtained by performing dropout processing training on training data carrying tag data by adopting a dropout mechanism;
and the noise prediction module is used for splicing the sentence correlation vector and the position vector to obtain a splicing matrix, and carrying out noise prediction on the text data based on the splicing matrix to obtain a noise identification result.
In one embodiment, the apparatus further comprises:
the model training module is used for collecting historical text data carrying labeling information; performing sentence segmentation and labeling on the historical text data according to the labeling information to obtain training data carrying label data; setting a corresponding dropout probability for the training data carrying label data; performing dropout processing on the training data carrying label data based on the dropout probability and updating the training data; and training an initial sentence correlation classification model with the updated training data to obtain a trained sentence correlation classification model.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring text data;
dividing sentences of the text data to obtain segmented sentences and extracting position vectors of the segmented sentences;
inputting the segmented sentences into a trained sentence correlation classification model, adding tag data into the segmented sentences to obtain sentence correlation vectors, wherein the sentence correlation vectors are feature vectors which are output by a hidden layer of the trained sentence correlation classification model and used for representing sentence information, and the sentence correlation classification model is obtained by performing dropout processing training on training data carrying tag data by adopting a dropout mechanism;
and splicing the sentence correlation vector and the position vector to obtain a splicing matrix, and carrying out noise prediction on the text data based on the splicing matrix to obtain a noise identification result.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring text data;
dividing sentences of the text data to obtain segmented sentences and extracting position vectors of the segmented sentences;
inputting the segmented sentences into a trained sentence correlation classification model, adding tag data into the segmented sentences to obtain sentence correlation vectors, wherein the sentence correlation vectors are feature vectors which are output by a hidden layer of the trained sentence correlation classification model and used for representing sentence information, and the sentence correlation classification model is obtained by performing dropout processing training on training data carrying tag data by adopting a dropout mechanism;
and splicing the sentence correlation vector and the position vector to obtain a splicing matrix, and carrying out noise prediction on the text data based on the splicing matrix to obtain a noise identification result.
According to the text noise data identification method, apparatus, computer device, and storage medium, the text data is segmented into sentences, and the segmented sentences serve as the base points of data processing, converting a complex text-data processing task into a simple sentence-data processing task. Unlike the conventional practice of applying a dropout mechanism to neurons, the dropout mechanism is applied to training data carrying label data, which prevents overfitting during model training. A sentence correlation classification model trained with such data can add corresponding label data to input text, so a large amount of text data does not need to be labeled by hand, saving labor cost while also improving data processing speed. Noise prediction based on the splicing matrix obtained by concatenating sentence correlation vectors and position vectors can further improve the accuracy of noise data identification.
Drawings
FIG. 1 is a diagram of an application environment for a text noise data recognition method in one embodiment;
FIG. 2 is a flow diagram of a method of text noise data identification in one embodiment;
FIG. 3 is a flow diagram of the steps of building a model in one embodiment;
FIG. 4 is a flow chart of a model building step in another embodiment;
FIG. 5 is a block diagram of a text noise data recognition device in one embodiment;
FIG. 6 is a block diagram of a text noise data recognition device in another embodiment;
fig. 7 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The text noise data identification method provided by the application can be applied to the environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. Specifically, a user may upload text data to be processed at the terminal 102 and send a noise data identification request (carrying the text data) to the server 104 through the terminal 102. The server 104 responds to the request by obtaining the text data, segmenting it into sentences, and extracting the position vector of each segmented sentence. It inputs the segmented sentences into a trained sentence correlation classification model (obtained by performing dropout-processing training on training data carrying tag data with a dropout mechanism), adds tag data to the segmented sentences, and obtains sentence correlation vectors (which can be regarded as intermediate results of the model) output by a hidden layer of the trained model for representing sentence information. It then splices the sentence correlation vectors with the position vectors to obtain a splicing matrix and performs noise prediction on the text data based on the splicing matrix, yielding the noise identification result. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device; the server 104 may be implemented as a stand-alone server or a cluster of servers. To describe the text noise data recognition method more clearly, chapter data is used as the example of text data below.
In one embodiment, as shown in fig. 2, a text data noise recognition method is provided, illustrated here as applied to the server in fig. 1, and includes the following steps:
step S200, obtaining text data.
In natural language processing tasks, text data includes word, phrase, sentence, and chapter data. In this embodiment, chapter data is used as the example of text data; in practical application, when a text noise data identification request sent by a terminal is received, the text data (chapter data) to be identified can be obtained from the database. Chapter data organizes information about entities, events, and the like according to a certain structure to convey the intended semantics; it comprises sentences, words, or phrases, and chapter analysis is itself an important part of natural language processing tasks.
Step S400, sentence segmentation is carried out on the text data, so that segmented sentences are obtained, and position vectors of the segmented sentences are extracted.
The position vector of a segmented sentence characterizes the sentence's position in the original text data, for example which line or segment of the text it occupies. After the text data is obtained, it must be preprocessed. In this embodiment the text data is chapter data, and preprocessing includes segmenting the chapter data into sentences with a sentence segmentation algorithm. To improve the accuracy of noise data prediction, the position vector of each segmented sentence is also extracted; it represents the specific line of the segmented sentence in the original chapter data.
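One simple reading of such a position encoding (an embodiment later in the description illustrates it with values running from -6 to 6, with 0 marking the current sentence) is a relative line index. The function below is only a hypothetical sketch of that idea.

```python
def relative_positions(num_sentences, current):
    """Relative position encoding: 0 marks the current sentence,
    negative values precede it, positive values follow it."""
    return [i - current for i in range(num_sentences)]
```

For a 13-sentence chapter with the current sentence in the middle, this reproduces the -6 … 6 pattern from the embodiment.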
In one embodiment, sentence processing of text data includes: and dividing the text data into a plurality of sentences by adopting a preset sentence dividing algorithm, and dividing or splicing the divided sentences according to a preset sentence length threshold value to ensure that the length of the divided sentences meets the preset sentence length threshold value.
Because sentence segmentation may yield sentences that are too complex (too long) for the actual task, this embodiment, unlike previous segmentation processing, adopts a sentence length control strategy. First, a preset sentence segmentation algorithm such as the jentenceend algorithm splits the chapter data into sentences. The sentences are then processed a second time according to their lengths: sentences that are too long are re-split at commas, while a sentence that is currently too short is spliced with the next sentence; in principle, the length of every segmented sentence is kept at or below the preset sentence length threshold. This length control strategy avoids sentences that are too long or too short and improves the effective utilization of sentences in text noise data prediction.
Step S600, inputting the segmented sentences into a trained sentence correlation classification model, adding tag data for the segmented sentences to obtain sentence correlation vectors, wherein the sentence correlation vectors are feature vectors which are output by a hidden layer of the trained sentence correlation classification model and used for representing sentence information, and the sentence correlation classification model is obtained by performing dropout processing training on training data carrying tag data by adopting a dropout mechanism.
The sentence correlation classification model classifies input sentence data as relevant or irrelevant to noise data. In this embodiment it may be an LSTM (Long Short-Term Memory network) + Attention model; it is understood that in other embodiments, other classification models may be used. Specifically, the model adopts supervised learning, i.e., a machine learning approach that induces a function from a labeled training data set, and the trained model can automatically determine for an input sentence whether its class label is relevant to noise data. In addition, a machine learning model with too many parameters and too few training samples easily overfits. Therefore, during training of the sentence correlation classification model, a dropout mechanism is further applied to the labeled training data, so that the input data differs each time, effectively preventing overfitting. In this embodiment, the segmented sentences are input into the trained sentence correlation classification model, the model predicts for each input sentence a label indicating whether it is correlated with noise data, and the sentence correlation vector, a feature vector representing sentence information, is obtained from the output of a hidden (intermediate) layer of the model.
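To make the "hidden-layer feature vector" concrete, here is a toy attention-pooling function over per-token hidden states. It is a stand-in for the hidden-layer output of an LSTM+Attention classifier, not the patented model: the scoring function, and indeed the whole pooling scheme, are assumptions for illustration only.

```python
import numpy as np

def attention_sentence_vector(hidden_states):
    """Toy attention pooling: hidden_states is a (tokens x dims) matrix of
    per-token hidden states; return one sentence-level feature vector."""
    scores = hidden_states.sum(axis=1)        # simplistic attention scoring
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax attention weights
    return weights @ hidden_states            # weighted sum of token states
```

The returned vector plays the role of the "sentence correlation vector" that the method later splices with the position vector.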
Step S800, splicing the sentence correlation vector and the position vector to obtain a splicing matrix, and carrying out noise prediction on the text data based on the splicing matrix to obtain a noise identification result.
In this embodiment, after the sentence correlation vector (the intermediate-layer result) is obtained, it is spliced with the extracted sentence position vector to obtain a splicing matrix, and noise prediction is then performed based on the splicing matrix to obtain the noise identification result. For example, the position vector may be [-6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6], where 0 marks the position of the current sentence. The splicing matrix may be input to a trained noise prediction model, which may be a bidirectional LSTM + Attention model: the input splicing matrix passes through a dense fully connected layer to produce a logits vector, and softmax then converts the logits into a probability distribution (the cross entropy can be computed with softmax_cross_entropy_with_logits, whose input parameter is the logits). The class with the largest probability is taken as the classification label, yielding the noise identification result, which may be a binary 0 or 1: 0 indicates the input chapter data is not noise data, and 1 indicates it is noise data. It is appreciated that in other embodiments, a sigmoid may instead be used to binary-classify the logits vector to obtain the noise identification result.
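The splice-then-classify step can be sketched as below. The dense-layer weights and bias are hypothetical learned parameters; the patent's model is a bidirectional LSTM+Attention, so this single dense layer plus softmax is only a minimal stand-in for its output head.

```python
import numpy as np

def predict_noise(corr_vec, pos_vec, weights, bias):
    """Concatenate sentence-correlation and position vectors, apply a dense
    layer to get logits, then softmax into a two-class distribution
    (0 = non-noise, 1 = noise). weights: (in_dim x 2), bias: (2,)."""
    spliced = np.concatenate([corr_vec, pos_vec])   # the "splicing" step
    logits = spliced @ weights + bias               # dense fully connected layer
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()                         # softmax over two classes
    return int(probs.argmax()), probs               # argmax as classification label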
According to this text data noise recognition method, the text data is segmented into sentences, and the segmented sentences serve as the base points of data processing, converting a complex text-data processing task into a simple sentence-data processing task. Unlike the conventional practice of applying a dropout mechanism to neurons, the dropout mechanism is applied to training data carrying tag data, preventing overfitting during model training. The sentence correlation classification model trained with such data can add corresponding tag data to input text, so a large amount of text data does not need to be labeled by hand, saving labor cost while also improving data processing speed. Noise prediction based on the splicing matrix obtained by concatenating sentence correlation vectors and position vectors can improve the accuracy of noise data identification.
This text data noise recognition scheme has great application value in fields such as comment semantic analysis, sentiment analysis, text retrieval, text clustering, text recommendation, and text management. Identifying noise in text is an upstream task for all of these areas: when noise in the text is identified accurately, downstream text processing tasks receive more reasonable data support and subsequent processing becomes more accurate. For example, when analyzing the semantics of a text, accurately recognizing its noise avoids the adverse effect of noise data on the semantic analysis result.
In one embodiment, before inputting the segmented sentence into the trained sentence correlation classification model, the method further comprises:
step S500, collecting historical text data, wherein the historical text data carries labeling information;
step S520, sentence segmentation and labeling are carried out on the historical text data according to the labeling information, so that training data carrying label data are obtained;
step S540, setting a corresponding dropout probability for the training data carrying label data;
step S560, based on the dropout probability, dropout processing is carried out on the training data carrying the tag data, and the training data is updated;
in step S580, the updated training data is used to train the initial sentence correlation classification model, so as to obtain a trained sentence correlation classification model.
In practical application, historical chapter data carrying labeling information is collected as sample data. The labeling information is added on the premise that it is known whether the chapter data is noise data: a noise chapter is labeled as noise data, and a non-noise chapter as non-noise data. After the labeled historical chapter data is collected, the chapter data is segmented into sentences, and the segmented sentences are labeled according to the chapter labeling information, yielding training data carrying tag data. Because a chapter labeled relevant may contain several irrelevant sentences, while an irrelevant chapter contains essentially no relevant sentences, a deep learning dropout mechanism is introduced to prevent the sentence correlation classification model from overfitting. Unlike the traditional deep learning dropout mechanism, which applies dropout to the neuron nodes of word vectors, this application sets different dropout probabilities on the training data and applies dropout to the data itself, i.e., part of the training data with tag data is randomly discarded and the training data is updated; the loss function of the model is also adjusted according to the dropout probabilities. The initial sentence correlation classification model is then trained with the updated training data until the loss function becomes smaller and smaller, completing the training. In this embodiment, applying dropout to the training data effectively prevents the model from overfitting.
As shown in fig. 4, in one embodiment, performing sentence segmentation and labeling on the historical text data according to the labeling information to obtain training data carrying label data includes: step S522, segmenting the historical text data into a plurality of sentences and identifying the labeling information of the historical text data; if the labeling information of the historical text data is noise data, marking the labels of the sentences segmented from it as irrelevant labels, obtaining training data carrying irrelevant labels; if the labeling information is non-noise data, marking the labels of the sentences segmented from it as relevant labels, obtaining training data carrying relevant labels.
Similarly, the historical text data is segmented into a plurality of sentences with the preset segmentation algorithm, still using the sentence length control strategy: longer sentences are re-split and shorter sentences are spliced with the following sentence, keeping sentence length within the preset threshold. The labeling information of the historical text data is then identified: if it is noise data, the labels of the sentences segmented from it are marked as irrelevant labels, yielding training data carrying irrelevant labels; if it is non-noise data, the labels are marked as relevant labels, yielding training data carrying relevant labels. In this embodiment, because the chapter labeling information serves directly as the sentence labels, no large number of sentence-level annotations needs to be added, improving data processing efficiency.
As shown in fig. 4, in one embodiment, setting the corresponding dropout probability for the training data carrying the label data includes: step S542, respectively inputting the training data carrying relevant labels and the training data carrying irrelevant labels into the initial sentence relevance classification model, setting a first dropout probability for the training data carrying relevant labels using a dropout mechanism, and setting a second dropout probability for the training data carrying irrelevant labels using the dropout mechanism.
In this embodiment, a dropout mechanism performs dropout processing on the input training data (sentences) carrying label data, with a different dropout probability for each label. Specifically, if the input training data carries a relevant label, its dropout probability is set to the first dropout probability; if it carries an irrelevant label, its dropout probability is set to the second dropout probability. In this embodiment the first dropout probability is set to 0.4 and the second to 0.8. Note that, because the irrelevant label corresponds to noise data and the relevant label to non-noise data, the first dropout probability must be smaller than the second. Setting different dropout probabilities for different labels effectively reduces the influence of sentence-label errors on the sentence relevance classification model.
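A minimal sketch of this probability assignment (the constant names and the integer label encoding, 1 for relevant and 0 for irrelevant, are assumptions, not the patent's notation):

```python
FIRST_DROPOUT = 0.4   # relevant labels (sentences from non-noise chapters)
SECOND_DROPOUT = 0.8  # irrelevant labels (sentences from noise chapters)
# This embodiment requires the first probability to stay below the second.
assert FIRST_DROPOUT < SECOND_DROPOUT

def dropout_probability(label):
    """Return the dropout probability for a training sentence.
    Assumed encoding: 1 = relevant, 0 = irrelevant."""
    return FIRST_DROPOUT if label == 1 else SECOND_DROPOUT
```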
In one embodiment, performing dropout processing on the training data carrying label data based on the dropout probability and updating the training data includes: randomly discarding part of the training data carrying relevant labels based on the first dropout probability to obtain a first training set; randomly discarding part of the training data carrying irrelevant labels based on the second dropout probability to obtain a second training set; combining the first training set and the second training set as new training data and inputting it into the initial sentence relevance classification model again; and returning to the step of randomly discarding part of the training data carrying relevant labels based on the first dropout probability until the number of returns reaches a preset count threshold.
Different from the dropout mechanism of traditional deep learning, the method does not stop the activation value of a particular neuron with some probability p, nor randomly (temporarily) delete half of the hidden neurons in the neural network. Instead, when training data carrying relevant labels is input, part of it is randomly discarded based on the first dropout probability and the remaining data is kept, giving a first training set; when training data carrying irrelevant labels is input, part of it is randomly discarded based on the second dropout probability and the remaining data is kept, giving a second training set. The first and second training sets are combined as new training data and input into the initial sentence relevance classification model again, and dropout processing is performed cyclically in this way until the number of iterations (returns) reaches the preset count threshold, at which point the cycle ends and the finally updated training data is obtained. Because the training data is reprocessed by this dropout mechanism on every pass, the data input each time differs, which improves the overall training effect of the model.
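The data-level dropout loop can be sketched as below. Whether each round re-samples from the full training data or from the previous round's survivors is not specified in the text; this illustration re-samples from the full data each round, so every pass sees a different subset (the function names and the label encoding, 1 = relevant and 0 = irrelevant, are assumptions):

```python
import random

def dropout_update(examples, p_first=0.4, p_second=0.8, rng=None):
    """One round of data-level dropout: keep each (sentence, label) pair with
    probability 1 - p_first for relevant labels (1) and 1 - p_second for
    irrelevant labels (0); the survivors form the updated training data."""
    rng = rng or random.Random()
    first_set = [ex for ex in examples if ex[1] == 1 and rng.random() >= p_first]
    second_set = [ex for ex in examples if ex[1] == 0 and rng.random() >= p_second]
    return first_set + second_set  # combined new training data

def iterate_training_data(examples, rounds=5, seed=0):
    """Repeat the dropout step for a preset number of rounds, producing a
    different random subset for each training pass over the model."""
    rng = random.Random(seed)
    return [dropout_update(examples, rng=rng) for _ in range(rounds)]
```

With the embodiment's probabilities, each round keeps roughly 60% of the relevant sentences and roughly 20% of the irrelevant ones, so mislabeled noise sentences rarely influence any single pass.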
It should be understood that, although the steps in the flowcharts of figs. 2-4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to that order and may be executed in other orders. Moreover, at least some of the steps in figs. 2-4 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different moments; these sub-steps or stages likewise need not be performed sequentially, and may be performed in turn or in alternation with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a text noise data recognition apparatus, including: a data acquisition module 510, a sentence segmentation module 520, a sentence relevance processing module 530, and a noise prediction module 540, wherein:
the data acquisition module 510 is configured to acquire text data.
The sentence segmentation module 520 is configured to perform sentence segmentation on the text data to obtain segmented sentences, and to extract position vectors of the segmented sentences.
The sentence relevance processing module 530 is configured to input the segmented sentences into a trained sentence relevance classification model to obtain sentence relevance vectors, where a sentence relevance vector is a feature vector output by a hidden layer of the trained model to characterize sentence information, and the sentence relevance classification model is obtained by training with a dropout mechanism that performs dropout processing on training data carrying label data.
The noise prediction module 540 is configured to splice the sentence correlation vector and the position vector to obtain a splice matrix, and perform noise prediction on the text data based on the splice matrix to obtain a noise recognition result.
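The splicing performed by the noise prediction module can be sketched as follows (a pure-Python illustration; the function name and vector dimensions are assumptions, and a downstream classifier would consume the resulting matrix to produce the noise recognition result):

```python
def splice(sentence_vectors, position_vectors):
    """Concatenate each sentence's relevance vector (the hidden-layer output)
    with its position vector; the rows together form the splice matrix fed to
    the noise predictor."""
    assert len(sentence_vectors) == len(position_vectors)
    return [list(c) + list(p) for c, p in zip(sentence_vectors, position_vectors)]
```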
As shown in fig. 6, in one embodiment, the apparatus further includes a model training module 550, configured to collect historical text data carrying labeling information, perform sentence segmentation and labeling processing on the historical text data according to the labeling information to obtain training data carrying label data, set a corresponding dropout probability for the training data carrying label data, perform dropout processing on the training data carrying label data based on the dropout probability, update the training data, and train the initial sentence relevance classification model with the updated training data to obtain the trained sentence relevance classification model.
In one embodiment, the sentence segmentation module 520 is further configured to segment the text data into a plurality of sentences using a preset sentence segmentation algorithm, and to split or splice the segmented sentences according to a preset sentence length threshold so that the length of each segmented sentence meets the preset sentence length threshold.
As shown in fig. 6, in one embodiment, the model training module 550 further includes a sentence segmentation and labeling unit 552, configured to segment the historical text data into a plurality of sentences and identify the labeling information of the historical text data; if the labeling information of the historical text data is noise data, the labels of the sentences segmented from that text are marked as irrelevant labels, yielding training data carrying irrelevant labels, and if the labeling information is non-noise data, the labels are marked as relevant labels, yielding training data carrying relevant labels.
As shown in fig. 6, in one embodiment, the model training module 550 further includes a probability setting unit 554, configured to input training data carrying a relevant tag and training data carrying an irrelevant tag into the initial sentence relevance classification model respectively, set a first dropout probability for the training data carrying the relevant tag using a dropout mechanism, and set a second dropout probability for the training data carrying the irrelevant tag using the dropout mechanism.
As shown in fig. 6, in one embodiment, the model training module 550 further includes a training data updating unit 556, configured to randomly discard a portion of training data carrying relevant labels based on the first dropout probability, obtain a first training set, randomly discard a portion of training data carrying irrelevant labels based on the second dropout probability, obtain a second training set, combine the first training set and the second training set as new training data, and input the new training data to the initial sentence relevance classification model again, and return to the step of randomly discarding a portion of training data carrying relevant labels based on the first dropout probability until the number of returns reaches a preset number of times threshold.
For specific limitations on the text noise data recognition apparatus, reference may be made to the limitations on the text noise data recognition method above; details are not repeated here. Each module in the above text noise data recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored as software in a memory of the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities and invokes the computer program stored in the memory. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing text data and the like. The network interface of the computer device is used for communicating with external terminals through a network connection. The computer program, when executed by the processor, implements a text noise data recognition method.
It will be appreciated by those skilled in the art that the structure shown in fig. 7 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided that includes at least one processor, at least one memory, and a bus; the processor and the memory communicate with each other through the bus; the processor is operative to invoke program instructions in the memory and, when executing the computer program, implements the steps of the text noise data identification method described above.
In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, implements the steps of the text noise data identification method described above. Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-volatile computer readable storage medium which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments merely represent several implementations of the present application and are described in relative detail, but they are not thereby to be construed as limiting the scope of the invention. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (10)

1. A method for identifying text noise data, the method comprising:
acquiring text data;
dividing sentences of the text data to obtain segmented sentences, and extracting position vectors of the segmented sentences;
inputting the segmented sentences into a trained sentence correlation classification model, and adding tag data for the segmented sentences to obtain sentence correlation vectors, wherein the sentence correlation vectors are feature vectors which are output by a hidden layer of the trained sentence correlation classification model and used for representing sentence information;
splicing the sentence correlation vector and the sentence position vector to obtain a splicing matrix, and carrying out noise prediction on text data based on the splicing matrix to obtain a noise recognition result;
the sentence correlation classification model is trained based on the following ways:
acquiring training data carrying tag data, respectively inputting the training data carrying a relevant tag and the training data carrying an irrelevant tag in the training data into an initial sentence correlation classification model, setting a first dropout probability for the training data carrying the relevant tag by adopting a dropout mechanism, setting a second dropout probability for the training data carrying the irrelevant tag by adopting the dropout mechanism, carrying out dropout processing on the training data carrying the relevant tag and the training data carrying the irrelevant tag based on the first dropout probability and the second dropout probability, updating the training data, and training the initial sentence correlation classification model by adopting the updated training data to obtain the trained sentence correlation classification model.
2. The text noise data recognition method of claim 1, wherein said sentence processing of said text data comprises:
dividing the text data into a plurality of sentences by adopting a preset sentence dividing algorithm;
dividing or splicing the segmented sentences according to a preset sentence length threshold value to ensure that the length of the segmented sentences meets the preset sentence length threshold value.
3. The text noise data identification method of claim 1, wherein the acquiring training data carrying tag data comprises:
collecting historical text data, wherein the historical text data carries labeling information;
and carrying out sentence dividing and labeling processing on the historical text data according to the labeling information to obtain training data carrying label data.
4. The text noise data identification method of claim 3, wherein the sentence segmentation and labeling of the historical text data according to the labeling information to obtain training data carrying label data comprises:
splitting the historical text data into a plurality of sentences;
identifying annotation information of the historical text data;
if the labeling information of the historical text data is noise data, marking the labels of sentences segmented from the historical text data as irrelevant labels to obtain training data carrying irrelevant labels;
if the labeling information of the historical text data is non-noise data, marking the labels of sentences segmented from the historical text data as relevant labels to obtain training data carrying relevant labels.
5. The text noise data identification method of any one of claims 1 to 4, wherein the performing a dropout process on training data carrying a relevant tag and training data carrying an irrelevant tag based on the first dropout probability and the second dropout probability, and updating the training data includes:
based on the first dropout probability, randomly discarding part of training data carrying relevant labels to obtain a first training set;
based on the second dropout probability, training data of a part carrying irrelevant labels are randomly discarded, and a second training set is obtained;
and combining the first training set and the second training set to be used as new training data, inputting the new training data into the initial sentence correlation classification model again, and returning to the step of randomly discarding part of training data carrying the relevant labels based on the first dropout probability until the return times reach a preset time threshold.
6. A text noise data recognition device, the device comprising:
the data acquisition module is used for acquiring text data;
the sentence segmentation module is used for segmenting the text data to obtain segmented sentences and extracting position vectors of the segmented sentences;
the sentence correlation processing module is used for inputting the segmented sentences into a trained sentence correlation classification model to obtain sentence correlation vectors, wherein the sentence correlation vectors are feature vectors which are output by a hidden layer of the trained sentence correlation classification model and used for representing sentence information;
the noise prediction module is used for splicing the sentence correlation vector and the position vector to obtain a splicing matrix, and carrying out noise prediction on text data based on the splicing matrix to obtain a noise identification result;
the model training module is used for acquiring training data carrying label data, respectively inputting the training data carrying a relevant label and the training data carrying an irrelevant label in the training data into an initial sentence correlation classification model, setting a first dropout probability for the training data carrying the relevant label by adopting a dropout mechanism, setting a second dropout probability for the training data carrying the irrelevant label by adopting the dropout mechanism, carrying out dropout processing on the training data carrying the relevant label and the training data carrying the irrelevant label based on the first dropout probability and the second dropout probability, updating the training data, and training the initial sentence correlation classification model by adopting the updated training data to obtain the trained sentence correlation classification model.
7. The text noise data recognition device of claim 6, wherein the model training module is configured to collect historical text data, the historical text data carries labeling information, and perform sentence segmentation and labeling processing on the historical text data according to the labeling information to obtain training data carrying label data.
8. The text noise data recognition device of claim 6, wherein the sentence segmentation module is further configured to segment the text data into a plurality of sentences using a preset sentence segmentation algorithm, and to split or splice the segmented sentences according to a preset sentence length threshold so that the length of each segmented sentence meets the preset sentence length threshold.
9. A computer device comprising at least one processor, at least one memory, and a bus; wherein the processor and the memory complete communication with each other through the bus; the processor is configured to invoke program instructions in the memory to perform the method of any of claims 1 to 5.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
CN201910944355.9A 2019-09-30 2019-09-30 Text noise data identification method, device, computer equipment and storage medium Active CN112580329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910944355.9A CN112580329B (en) 2019-09-30 2019-09-30 Text noise data identification method, device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112580329A CN112580329A (en) 2021-03-30
CN112580329B true CN112580329B (en) 2024-02-20

Family

ID=75116648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910944355.9A Active CN112580329B (en) 2019-09-30 2019-09-30 Text noise data identification method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112580329B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139628B (en) * 2021-06-22 2021-09-17 腾讯科技(深圳)有限公司 Sample image identification method, device and equipment and readable storage medium
CN113283218A (en) * 2021-06-24 2021-08-20 中国平安人寿保险股份有限公司 Semantic text compression method and computer equipment
CN113571052A (en) * 2021-07-22 2021-10-29 湖北亿咖通科技有限公司 Noise extraction and instruction identification method and electronic equipment

Citations (7)

Publication number Priority date Publication date Assignee Title
CN107683469A (en) * 2015-12-30 2018-02-09 中国科学院深圳先进技术研究院 A kind of product classification method and device based on deep learning
CN108009228A (en) * 2017-11-27 2018-05-08 咪咕互动娱乐有限公司 A kind of method to set up of content tab, device and storage medium
CN108628974A (en) * 2018-04-25 2018-10-09 平安科技(深圳)有限公司 Public feelings information sorting technique, device, computer equipment and storage medium
CN109635207A (en) * 2018-12-18 2019-04-16 上海海事大学 A kind of social network user personality prediction technique based on Chinese text analysis
CN109885832A (en) * 2019-02-14 2019-06-14 平安科技(深圳)有限公司 Model training, sentence processing method, device, computer equipment and storage medium
CN110084733A (en) * 2019-04-19 2019-08-02 中国科学院自动化研究所 The embedding grammar and system of text image watermark, extracting method and system
CN110232395A (en) * 2019-03-01 2019-09-13 国网河南省电力公司电力科学研究院 A kind of fault diagnosis method of electric power system based on failure Chinese text

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN107220231A (en) * 2016-03-22 2017-09-29 索尼公司 Electronic equipment and method and training method for natural language processing


Non-Patent Citations (1)

Title
Research on Text Classification Technology Based on Deep Learning; Shi Yixuan; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN112580329A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN110765265B (en) Information classification extraction method and device, computer equipment and storage medium
US10606949B2 (en) Artificial intelligence based method and apparatus for checking text
KR102304673B1 (en) Keyword extraction method, computer device, and storage medium
CN109583325B (en) Face sample picture labeling method and device, computer equipment and storage medium
US11941366B2 (en) Context-based multi-turn dialogue method and storage medium
CN111062215B (en) Named entity recognition method and device based on semi-supervised learning training
CN109858010B (en) Method and device for recognizing new words in field, computer equipment and storage medium
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
CN109992664B (en) Dispute focus label classification method and device, computer equipment and storage medium
CN110147445A (en) Intension recognizing method, device, equipment and storage medium based on text classification
CN112580329B (en) Text noise data identification method, device, computer equipment and storage medium
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN109886554B (en) Illegal behavior discrimination method, device, computer equipment and storage medium
CN112766319A (en) Dialogue intention recognition model training method and device, computer equipment and medium
CN111191032A (en) Corpus expansion method and device, computer equipment and storage medium
CN114647741A (en) Process automatic decision and reasoning method, device, computer equipment and storage medium
CN113449489B (en) Punctuation mark labeling method, punctuation mark labeling device, computer equipment and storage medium
CN115982403B (en) Multi-mode hash retrieval method and device
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN114298035A (en) Text recognition desensitization method and system thereof
CN112100377A (en) Text classification method and device, computer equipment and storage medium
CN112632258A (en) Text data processing method and device, computer equipment and storage medium
CN112860919A (en) Data labeling method, device and equipment based on generative model and storage medium
Yang et al. Bidirectional LSTM-CRF for biomedical named entity recognition
CN113343711B (en) Work order generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant