CN108304387B - Method, device, server group and storage medium for recognizing noise words in text


Publication number
CN108304387B
Authority
CN
China
Prior art keywords
word
text
training
noise
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810195233.XA
Other languages
Chinese (zh)
Other versions
CN108304387A (en
Inventor
金宝宝
杨帆
张成松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN201810195233.XA priority Critical patent/CN108304387B/en
Publication of CN108304387A publication Critical patent/CN108304387A/en
Application granted granted Critical
Publication of CN108304387B publication Critical patent/CN108304387B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The application provides a method, a device, a server group and a storage medium for recognizing noise words in text. The method comprises: obtaining a text to be recognized; sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized; and inputting the word vector set into a pre-established noise word recognition model to obtain the recognition result, output by the model, of the noise words in the text to be recognized. The noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts labeled with noise words. Because the model is trained on training texts labeled with noise words, it can recognize noise words in the text to be recognized.

Description

Method, device, server group and storage medium for recognizing noise words in text
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for identifying noise words in a text, a server group and a storage medium.
Background
Natural language processing is one of the most important sub-fields in the field of artificial intelligence, and is the technical core of current popular translation systems, man-machine conversation systems and question-answering systems. The non-normalization of text generated in the real world is one of the most important factors affecting the performance of natural language processing, and the non-normalization caused by noise words is particularly significant.
A noise word is a word that is not in the stop word range but is meaningless in the current context. Unlike stop words, which are relatively fixed, noise words are not fixed: a word that is a noise word in one text may not be a noise word in another. For example, the number 12 is a meaningless noise word in "12 th school" but not in "12 th middle of the month". This makes noise words difficult to recognize.
Disclosure of Invention
In view of the above, the present invention provides a method, an apparatus, a server group and a storage medium for recognizing a noise word in a text, so as to solve the problem in the prior art that the noise word is difficult to recognize, and the technical solution is as follows:
a method for recognizing noise words in text comprises the following steps:
acquiring a text to be identified;
sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized;
and inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result, output by the noise word recognition model, of the noise words in the text to be recognized, wherein the noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts labeled with noise words.
The text to be recognized comprises target words;
the training text comprises the target word;
the noise word recognition model is trained using, as training samples, a word vector set corresponding to a training text in which the target word is labeled as a noise word and a word vector set corresponding to a training text in which the target word is labeled as a non-noise word;
and the recognition result of the noise word in the text to be recognized is used for indicating whether the target word is the noise word.
The process of pre-establishing the noise word recognition model includes:
acquiring a plurality of texts marked with noise words to form a training text set;
sequentially converting each character in the training texts in the training text set into a word vector to obtain a word vector set corresponding to the training texts, wherein the distance between different word vectors represents the relevance between the characters corresponding to those word vectors;
and training a recurrent neural network by using the word vector set corresponding to the training text as input, and taking the recurrent neural network obtained by training as the noise word recognition model.
Wherein, the sequentially converting each word in the training texts in the training text set into a word vector comprises:
and sequentially processing each word in the training text set into vector data, and converting the vector data into a word vector to obtain a word vector set corresponding to the training text.
The method for recognizing the noise words in the text further comprises the following steps:
acquiring the mapping relation between the vector data corresponding to each character appearing in the training text set and the corresponding word vector;
the sequentially converting each character in the text to be recognized into a word vector comprises:
and sequentially converting each character in the text to be recognized into vector data serving as target vector data, and converting the target vector data into a word vector based on the mapping relation between the vector data corresponding to each character appearing in the training text set and the corresponding word vector.
An apparatus for recognizing noise words in text, comprising: a text to be recognized acquisition module, a text to be recognized conversion module and a noise recognition module;
the text to be recognized acquisition module is used for acquiring a text to be recognized;
the text to be recognized conversion module is used for sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized;
and the noise recognition module is used for inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result, output by the noise word recognition model, of the noise words in the text to be recognized, wherein the noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts labeled with noise words.
The text to be recognized comprises target words;
the training text comprises the target word;
the noise word recognition model is trained using, as training samples, a word vector set corresponding to a training text in which the target word is labeled as a noise word and a word vector set corresponding to a training text in which the target word is labeled as a non-noise word;
and the recognition result of the noise word in the text to be recognized is used for indicating whether the target word is the noise word.
The device for recognizing the noise words in the text further comprises: a training text acquisition module, a training text conversion module and a training module;
the training text acquisition module is used for acquiring a plurality of texts marked with noise words to form a training text set;
the training text conversion module is used for sequentially converting each character in the training text set into a word vector to obtain a word vector set corresponding to the training text, wherein the distance between different word vectors represents the relevance between the corresponding characters;
and the training module is used for training a recurrent neural network by taking the word vector set corresponding to the training text as input, and taking the recurrent neural network obtained by training as the noise word recognition model.
A server group, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to:
acquiring a text to be identified;
sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized;
and inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result, output by the noise word recognition model, of the noise words in the text to be recognized, wherein the noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts labeled with noise words.
A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for recognizing noise words in text as described.
The technical scheme has the following beneficial effects:
the invention provides a method, a device, a server group and a storage medium for recognizing noise words in a text. A text to be recognized is first obtained; each character in the text to be recognized is then sequentially converted into a word vector to obtain a word vector set corresponding to the text to be recognized; finally, the word vector set is input into a pre-established noise word recognition model to obtain the recognition result, output by the model, of the noise words in the text to be recognized. Because the noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts labeled with noise words, it can recognize noise words in the text to be recognized. With the method provided by the invention, the user does not need strong industry knowledge and only needs to label the training texts at the initial stage of training the model, so the method is simple to implement and has high recognition accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for recognizing a noise word in a text according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart illustrating an implementation of a pre-established noise word recognition model according to an embodiment of the present invention;
Fig. 3 is another schematic flow chart of a method for recognizing a noise word in a text according to an embodiment of the present invention;
Fig. 4 is a schematic flow chart illustrating an implementation of a pre-established noise word recognition model according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an apparatus for recognizing noise words in a text according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a server group according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for recognizing a noise word in a text. Referring to Fig. 1, which shows a flow diagram of the recognition method, the method may include:
Step S101: acquiring a text to be recognized.
Step S102: sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized.
Step S103: inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result, output by the noise word recognition model, of the noise words in the text to be recognized.
The noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts labeled with noise words.
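Steps S101 to S103 can be sketched as a small pipeline. The following is only an illustrative sketch: the function name `recognize_noise`, the lookup table `toy_vectors` and the stand-in `toy_model` are hypothetical and do not come from the patent, and the toy model's decision rule is invented purely so that the example runs without a trained network.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def recognize_noise(
    text: str,
    char_to_vector: Dict[str, Vector],
    model: Callable[[Sequence[Vector]], List[bool]],
) -> List[str]:
    """Steps S101-S103: text -> word vector set -> model -> noise characters."""
    # Step S102: convert each character, in order, into its word vector.
    vectors = [char_to_vector[ch] for ch in text]
    # Step S103: the model returns one noise flag per input vector.
    flags = model(vectors)
    return [ch for ch, is_noise in zip(text, flags) if is_noise]


# Hypothetical two-dimensional vectors; a real table would come from training.
toy_vectors = {"1": [1.0, 0.0], "2": [1.0, 0.1], "a": [0.0, 1.0], "b": [0.1, 1.0]}


def toy_model(vectors: Sequence[Vector]) -> List[bool]:
    # Stand-in for the trained recurrent network (assumption): flags a vector
    # when its first component is larger than its second.
    return [v[0] > v[1] for v in vectors]


result = recognize_noise("12ab", toy_vectors, toy_model)
print(result)  # ['1', '2']
```

Passing the model in as a callable keeps the conversion step independent of whichever recurrent network is eventually trained.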
Referring to fig. 2, a schematic flow chart of a possible implementation process of pre-establishing a noise word recognition model in this embodiment is shown, which may include:
Step S201: acquiring a plurality of texts labeled with noise words to form a training text set.
Specifically, a plurality of texts are obtained; the texts may be, but are not limited to being, selected from an existing corpus or crawled from the network by a web crawler. The noise words in each text are then labeled, yielding a plurality of texts labeled with noise words. Each text labeled with noise words is a training text, and these texts together form the training text set. Preferably, texts from different fields are acquired so that the established noise word recognition model can adapt to different application fields.
Step S202: sequentially converting each character in the training texts in the training text set into a word vector to obtain a word vector set corresponding to the training texts.
The distance between different word vectors represents the relevance between the corresponding words. For example, suppose the training text set contains a large number of training texts related to "hospital", such as "first people hospital" and "second center hospital". After vector conversion, the distance between the word vector corresponding to "doctor" and the word vector corresponding to "hospital" is short, i.e. the association between "doctor" and "hospital" is strong. By contrast, "person" and "middle" do not appear together in large numbers, so the distance between the word vector corresponding to "person" and the word vector corresponding to "middle" is long, i.e. the association between "person" and "middle" is weak.
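The relationship between distance and relevance can be illustrated with a short sketch. The text does not say which distance measure is used, so Euclidean distance is an assumption here, and the two-dimensional vectors are hand-picked hypothetical values rather than the output of any trained model.

```python
import math
from typing import List


def euclidean_distance(a: List[float], b: List[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical word vectors, chosen by hand so that strongly associated
# words lie close together and weakly associated words lie far apart.
vectors = {
    "doctor": [0.9, 0.8],
    "hospital": [1.0, 0.7],
    "person": [0.1, 0.9],
    "middle": [0.8, 0.1],
}

close = euclidean_distance(vectors["doctor"], vectors["hospital"])
far = euclidean_distance(vectors["person"], vectors["middle"])
print(close < far)  # True
```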
Specifically, the process of sequentially converting each word in the training texts in the training text set into a word vector may include: and sequentially processing each character in the training texts in the training text set into vector data, and converting the vector data into a word vector to obtain a word vector set corresponding to the training texts.
Step S203: training a recurrent neural network with the word vector set corresponding to the training text as input, and taking the trained recurrent neural network as the noise word recognition model.
The recurrent neural network can be, but is not limited to, neural network models with memory functions such as RNN, LSTM, GRU, and the like.
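A recurrent network reads the word vectors one at a time while carrying a hidden state forward, which is what allows the output for a character to depend on the characters before it. The sketch below shows only the forward pass of a minimal Elman-style network with random, untrained weights; the class name `TinyRNN`, the layer sizes and the choice of one sigmoid output per character are assumptions made for illustration, and training (for example by backpropagation through time) is omitted.

```python
import math
import random
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyRNN:
    """Elman-style recurrent tagger: one noise probability per input vector."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def rand_matrix(rows: int, cols: int) -> List[Vector]:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        self.w_in = rand_matrix(hidden_size, input_size)    # input -> hidden
        self.w_rec = rand_matrix(hidden_size, hidden_size)  # hidden -> hidden
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]

    def forward(self, vectors: List[Vector]) -> List[float]:
        hidden = [0.0] * self.hidden_size
        probabilities = []
        for x in vectors:
            # The new hidden state mixes the current word vector with the
            # previous hidden state, so earlier characters influence later ones.
            hidden = [
                math.tanh(
                    sum(w * xi for w, xi in zip(self.w_in[j], x))
                    + sum(w * hi for w, hi in zip(self.w_rec[j], hidden))
                )
                for j in range(self.hidden_size)
            ]
            score = sum(w * h for w, h in zip(self.w_out, hidden))
            probabilities.append(sigmoid(score))
        return probabilities


rnn = TinyRNN(input_size=2, hidden_size=3)
probs = rnn.forward([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(len(probs))  # 3
```

The first and third inputs are the same vector, yet they generally receive different probabilities because the hidden state differs at each position; this is the context dependence that the 12 example in the background section relies on.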
In the method for recognizing noise words in text provided by the embodiment of the invention, a text to be recognized is first obtained; each character in the text to be recognized is then sequentially converted into a word vector to obtain a word vector set corresponding to the text to be recognized; finally, the word vector set is input into a pre-established noise word recognition model to obtain the recognition result, output by the model, of the noise words in the text to be recognized. The noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts labeled with noise words. The method can directly perform full-text analysis on the text to be recognized and determine whether it contains noise words. The user does not need strong industry knowledge and only needs to label the training texts at the initial stage of training the model, so the method is simple to implement and has high recognition accuracy.
Referring to fig. 3, another flow chart of a method for recognizing a noise word in a text according to an embodiment of the present invention is shown, where the method includes:
Step S301: acquiring a text to be recognized that contains a target word.
Step S302: sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized.
Step S303: inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result, output by the noise word recognition model, indicating whether the target word in the text to be recognized is a noise word.
Specifically, the noise word recognition model is trained using, as training samples, a word vector set corresponding to a training text in which the target word is labeled as a noise word and a word vector set corresponding to a training text in which the target word is labeled as a non-noise word.
Referring to fig. 4, a schematic flow chart of a possible implementation process of pre-establishing a noise word recognition model in this embodiment is shown, which may include:
Step S401: acquiring a plurality of texts that contain the target word and in which the target word has been labeled, to form a training text set.
Specifically, a plurality of texts containing the target word are obtained; the texts may be, but are not limited to being, selected from an existing corpus or crawled from the network by a web crawler. The target word in each text is then labeled as a noise word or a non-noise word, yielding a plurality of texts in which the target word is labeled, and these texts form the training text set. Preferably, the texts containing the target word are drawn from different fields, so that the established noise word recognition model can adapt to different application fields.
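One possible way to hold the labeled training texts is sketched below. The patent does not specify a label format, so the `LabeledText` class and its per-character 0/1 labels are assumptions; the two sample strings are the 12 examples from the background section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledText:
    text: str
    target: str     # the target word being labeled
    is_noise: bool  # True if the target word is a noise word in this text

    def char_labels(self) -> List[int]:
        """Per-character labels: 1 on the target word's characters if it is noise.

        Only the first occurrence of the target word is labeled (a
        simplification made for this sketch).
        """
        labels = [0] * len(self.text)
        start = self.text.find(self.target)
        if self.is_noise and start >= 0:
            for i in range(start, start + len(self.target)):
                labels[i] = 1
        return labels


# The same target word is labeled differently depending on its context.
training_set = [
    LabeledText("12 th school", target="12", is_noise=True),
    LabeledText("12 th middle of the month", target="12", is_noise=False),
]

print(training_set[0].char_labels()[:3])  # [1, 1, 0]
```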
Step S402: sequentially converting each character in the training texts in the training text set into a word vector to obtain a word vector set corresponding to the training texts.
Wherein, the distance between different word vectors represents the relevance between the corresponding words.
Specifically, the process of sequentially converting each word in the training texts in the training text set into a word vector may include: and sequentially processing each character in the training texts in the training text set into vector data, and converting the vector data into a word vector to obtain a word vector set corresponding to the training texts.
In one possible implementation, all characters appearing in the training text set may be one-hot encoded to convert the text data into vector data that a computer can process. One-hot encoding assigns a unique code to each character appearing in the training text set.
Specifically, if there are N distinct characters in the training text set, each character can be represented by an N-1 dimensional vector: all bits of the vector for the first character are 0, the first bit of the vector for the second character is 1, the second bit of the vector for the third character is 1, and so on.
Illustratively, there are two sentences in the training text set: "123 first people hospital" and "liberation one way". There are 12 distinct characters in the training text set, and the codes corresponding to the 12 characters are as follows:
The code corresponding to "1" is: [0,0,0,0,0,0,0,0,0,0,0,0]
The code corresponding to "2" is: [1,0,0,0,0,0,0,0,0,0,0,0]
The code corresponding to "3" is: [0,1,0,0,0,0,0,0,0,0,0,0]
The code corresponding to "second" is: [0,0,1,0,0,0,0,0,0,0,0,0]
The code corresponding to "one" is: [0,0,0,1,0,0,0,0,0,0,0,0]
……
The code corresponding to "put" is: [0,0,0,0,0,0,0,0,0,0,1,0]
The code corresponding to "way" is: [0,0,0,0,0,0,0,0,0,0,0,1]
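The encoding step can be sketched in a few lines. This sketch follows the N-1 dimensional description given in the text, in which the first character is all zeros; a conventional one-hot encoding would instead use N dimensions with exactly one bit set per character. The function name `build_codes`, the ordering of characters by first appearance and the toy alphabet are assumptions.

```python
from typing import Dict, List


def build_codes(texts: List[str]) -> Dict[str, List[int]]:
    """Give each distinct character an (N-1)-dimensional code.

    Characters are numbered in order of first appearance (an assumption);
    the first character is all zeros, the k-th character has bit k-1 set.
    """
    chars: List[str] = []
    for text in texts:
        for ch in text:
            if ch not in chars:
                chars.append(ch)
    dim = len(chars) - 1
    codes: Dict[str, List[int]] = {}
    for index, ch in enumerate(chars):
        code = [0] * dim
        if index > 0:
            code[index - 1] = 1
        codes[ch] = code
    return codes


codes = build_codes(["abc", "ca"])
print(codes)  # {'a': [0, 0], 'b': [1, 0], 'c': [0, 1]}
```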
After each word is processed into vector data, word2vec and the like may be used to convert each vector data into a word vector. Based on the above process, for each training sample, the word vectors corresponding to all the characters contained in the training sample can be determined, and the word vectors corresponding to all the characters contained in the training sample form a word vector set, so that the word vector set corresponding to the training sample can be obtained.
In addition, in order to realize the conversion from the characters in the text to be recognized to the word vectors, the mapping relationship between the vector data corresponding to all the characters appearing in the training text set and the corresponding word vectors can be stored. Specifically, the implementation process of sequentially converting each word in the text to be recognized into the word vector in step S102 may include: and sequentially converting each character in the text to be recognized into vector data serving as target vector data, and converting the target vector data into a word vector based on the corresponding relation between the vector data and the word vector. Specifically, a word vector corresponding to the target vector data is found in the correspondence relationship between the vector data and the word vector.
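The stored mapping can be applied at recognition time as a pair of table lookups. In the sketch below the tables `char_to_code` and `code_to_vector` and all their values are hypothetical. The text does not say how a character that never appeared in the training text set should be handled, so this sketch simply lets the lookup raise `KeyError`.

```python
from typing import Dict, List, Tuple

Code = Tuple[int, ...]
Vector = List[float]

# Hypothetical tables saved at training time.
char_to_code: Dict[str, Code] = {"a": (0, 0), "b": (1, 0), "c": (0, 1)}
code_to_vector: Dict[Code, Vector] = {
    (0, 0): [0.2, -0.1],
    (1, 0): [0.7, 0.3],
    (0, 1): [-0.4, 0.5],
}


def text_to_vectors(text: str) -> List[Vector]:
    """Convert each character to target vector data, then to its word vector."""
    vectors = []
    for ch in text:
        target_code = char_to_code[ch]               # character -> vector data
        vectors.append(code_to_vector[target_code])  # vector data -> word vector
    return vectors


print(text_to_vectors("cab"))  # [[-0.4, 0.5], [0.2, -0.1], [0.7, 0.3]]
```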
Step S403: training a recurrent neural network with the word vector set corresponding to the training text as input, and taking the trained recurrent neural network as the noise word recognition model.
The recurrent neural network can be, but is not limited to, neural network models with memory functions such as RNN, LSTM, GRU, and the like.
The method for recognizing the noise words in the text comprises the steps of firstly obtaining a text to be recognized containing target words, then sequentially converting each character in the text to be recognized into a word vector, obtaining a word vector set corresponding to the text to be recognized, and finally inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result output by the noise word recognition model. The method for recognizing the noise words in the text provided by the embodiment of the invention can analyze the target words in the text to be recognized and determine whether the target words in the text to be recognized are the noise words. The method provided by the embodiment of the invention does not need strong industry knowledge, only needs to label the training text at the initial stage of the training model, so the realization is simple, and the noise word recognition model judges whether the target word is the noise word according to the context environment of the target word, so the recognition accuracy is higher, and in addition, the training text is selected from a plurality of different fields, so the method can be suitable for a plurality of different fields, namely the application range is wider.
An embodiment of the present invention further provides a device for recognizing a noise word in a text. Referring to Fig. 5, which shows a schematic structural diagram of the recognition device, the device may include: a text to be recognized acquisition module 501, a text to be recognized conversion module 502 and a noise recognition module 503. Wherein:
The text to be recognized acquisition module 501 is configured to acquire a text to be recognized.
The text to be recognized conversion module 502 is configured to sequentially convert each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized.
The noise recognition module 503 is configured to input the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result, output by the noise word recognition model, of the noise words in the text to be recognized.
The noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts labeled with noise words.
The device for recognizing the noise words in the text provided by the embodiment of the invention can analyze the text to be recognized by utilizing the pre-established noise word recognition model to determine whether the text to be recognized contains the noise words. The recognition device provided by the embodiment of the invention ensures that a user does not need strong industry knowledge and only needs to label the training text at the initial stage of the training model, so the realization is simple, the recognition accuracy is high, and in addition, the training text is selected from a plurality of different fields, so the method can be suitable for a plurality of different fields, namely the application range is wide.
In a possible implementation manner, the text to be recognized acquired by the text to be recognized acquisition module 501 in the above embodiment includes a target word, and correspondingly, the training text also includes the target word. The noise word recognition model is trained using, as training samples, a word vector set corresponding to a training text in which the target word is labeled as a noise word and a word vector set corresponding to a training text in which the target word is labeled as a non-noise word. The recognition result of the noise word in the text to be recognized output by the noise recognition module 503 is used to indicate whether the target word is a noise word.
In a possible implementation manner, the apparatus for recognizing a noise word in a text provided in the foregoing embodiment may further include: a training text acquisition module, a training text conversion module and a training module. Wherein:
The training text acquisition module is configured to acquire a plurality of texts labeled with noise words to form a training text set.
The training text conversion module is configured to sequentially convert each character in the training text set into a word vector to obtain a word vector set corresponding to the training text.
The distance between different word vectors represents the relevance between the corresponding words.
The training module is configured to train a recurrent neural network with the word vector set corresponding to the training text as input, and to take the trained recurrent neural network as the noise word recognition model.
The training text conversion module is specifically configured to sequentially process each word in a training text set into vector data, convert the vector data into a word vector, and obtain a word vector set corresponding to the training text.
The apparatus for recognizing a noise word in a text provided in the foregoing embodiment may further include a mapping relation acquisition module.
The mapping relation acquisition module is configured to acquire the mapping relation between the vector data corresponding to each character appearing in the training text set and the corresponding word vector.
The text to be recognized conversion module 502 is specifically configured to sequentially convert each character in the text to be recognized into vector data as target vector data, and convert the target vector data into a word vector based on a mapping relationship between the vector data corresponding to each character appearing in the training text set and a corresponding word vector.
An embodiment of the present invention further provides a server group, where the server group may include: a memory 601 and a processor 602.
A memory 601 for storing programs;
a processor 602 for executing the program to:
acquiring a text to be identified;
sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized;
and inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result, output by the noise word recognition model, of the noise words in the text to be recognized, wherein the noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts labeled with noise words.
An embodiment of the present invention further provides a readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for recognizing noise words in a text provided in any of the above embodiments.
The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus, and device may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through communication interfaces, and may be electrical, mechanical, or of another form.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A method for recognizing noise words in a text, comprising:
acquiring a text to be recognized, wherein the text to be recognized comprises a target word;
sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized;
inputting the word vector set corresponding to the text to be recognized into a pre-established and trained noise word recognition model to obtain a recognition result for the noise words in the text to be recognized, wherein the recognition result is output by the noise word recognition model and indicates whether the target word is a noise word; the noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts annotated with noise words, wherein: the training texts comprise the target word; and the noise word recognition model is trained using, as training samples, a word vector set corresponding to a training text in which the target word is annotated as a noise word and a word vector set corresponding to a training text in which the target word is annotated as a non-noise word, so that the model can judge, according to the context of the target word, whether the target word is a noise word;
the process of pre-establishing the noise word recognition model includes:
acquiring a plurality of texts marked with noise words to form a training text set;
sequentially converting each character in the training texts in the training text set into a word vector to obtain a word vector set corresponding to the training texts, wherein the distance between different word vectors represents the relevance between the characters corresponding to the different word vectors;
and training a recurrent neural network by using the word vector set corresponding to the training text as input, and taking the recurrent neural network obtained by training as the noise word recognition model.
2. The method for recognizing the noise word in the text according to claim 1, wherein the sequentially converting each word in the training texts in the training text set into a word vector comprises:
and sequentially processing each word in the training text set into vector data, and converting the vector data into a word vector to obtain a word vector set corresponding to the training text.
3. The method of claim 2, further comprising:
acquiring the mapping relation between vector data corresponding to each character appearing in the training text set and a corresponding character vector;
the sequentially converting each character in the text to be recognized into a word vector comprises:
and sequentially converting each character in the text to be recognized into vector data serving as target vector data, and converting the target vector data into a word vector based on the mapping relation between the vector data corresponding to each character appearing in the training text set and the corresponding word vector.
4. An apparatus for recognizing noise words in text, comprising: the device comprises a text to be recognized acquisition module, a text to be recognized conversion module and a noise recognition module;
the text to be recognized acquiring module is used for acquiring a text to be recognized, wherein the text to be recognized comprises a target word;
the text to be recognized conversion module is used for sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized;
the noise recognition module is used for inputting the word vector set corresponding to the text to be recognized into a pre-established and trained noise word recognition model to obtain a recognition result for the noise words in the text to be recognized, wherein the recognition result is output by the noise word recognition model and indicates whether the target word is a noise word; the noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts annotated with noise words, wherein: the training texts comprise the target word; and the noise word recognition model is trained using, as training samples, a word vector set corresponding to a training text in which the target word is annotated as a noise word and a word vector set corresponding to a training text in which the target word is annotated as a non-noise word, so that the model can judge, according to the context of the target word, whether the target word is a noise word;
further comprising: a training text acquisition module, a training text conversion module and a training module;
the training text acquisition module is used for acquiring a plurality of texts marked with noise words to form a training text set;
the training text conversion module is used for sequentially converting each character in the training text set into a word vector to obtain a word vector set corresponding to the training text, wherein the distance between different word vectors represents the relevance between the corresponding characters;
and the training module is used for training a recurrent neural network by taking the word vector set corresponding to the training text as input, and taking the recurrent neural network obtained by training as the noise word recognition model.
5. A server group, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to:
acquiring a text to be recognized, wherein the text to be recognized comprises a target word;
sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized;
inputting the word vector set corresponding to the text to be recognized into a pre-established and trained noise word recognition model to obtain a recognition result for the noise words in the text to be recognized, wherein the recognition result is output by the noise word recognition model and indicates whether the target word is a noise word; the noise word recognition model is trained using, as training samples, word vector sets corresponding to training texts annotated with noise words, wherein: the training texts comprise the target word; and the noise word recognition model is trained using, as training samples, a word vector set corresponding to a training text in which the target word is annotated as a noise word and a word vector set corresponding to a training text in which the target word is annotated as a non-noise word, so that the model can judge, according to the context of the target word, whether the target word is a noise word;
the process of pre-establishing the noise word recognition model includes:
acquiring a plurality of texts marked with noise words to form a training text set;
sequentially converting each character in the training texts in the training text set into a word vector to obtain a word vector set corresponding to the training texts, wherein the distance between different word vectors represents the relevance between the characters corresponding to the different word vectors;
and training a recurrent neural network by using the word vector set corresponding to the training text as input, and taking the recurrent neural network obtained by training as the noise word recognition model.
6. A readable storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for recognizing a noise word in a text according to any one of claims 1 to 3.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810195233.XA CN108304387B (en) 2018-03-09 2018-03-09 Method, device, server group and storage medium for recognizing noise words in text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810195233.XA CN108304387B (en) 2018-03-09 2018-03-09 Method, device, server group and storage medium for recognizing noise words in text

Publications (2)

Publication Number Publication Date
CN108304387A CN108304387A (en) 2018-07-20
CN108304387B true CN108304387B (en) 2021-06-15

Family

ID=62849431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810195233.XA Active CN108304387B (en) 2018-03-09 2018-03-09 Method, device, server group and storage medium for recognizing noise words in text

Country Status (1)

Country Link
CN (1) CN108304387B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271526A (en) * 2018-08-14 2019-01-25 阿里巴巴集团控股有限公司 Method for text detection, device, electronic equipment and computer readable storage medium
CN109147767A (en) * 2018-08-16 2019-01-04 平安科技(深圳)有限公司 Digit recognition method, device, computer equipment and storage medium in voice
CN110196909B (en) * 2019-05-14 2022-05-31 北京来也网络科技有限公司 Text denoising method and device based on reinforcement learning
US11379660B2 (en) * 2019-06-27 2022-07-05 International Business Machines Corporation Deep learning approach to computing spans
CN111079854B (en) * 2019-12-27 2024-04-23 联想(北京)有限公司 Information identification method, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462378A (en) * 2014-12-09 2015-03-25 北京国双科技有限公司 Data processing method and device for text recognition
CN106815192A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 Model training method and device and sentence emotion identification method and device
CN107122416A (en) * 2017-03-31 2017-09-01 北京大学 A kind of Chinese event abstracting method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7136802B2 (en) * 2002-01-16 2006-11-14 Intel Corporation Method and apparatus for detecting prosodic phrase break in a text to speech (TTS) system
US20170293597A1 (en) * 2016-04-07 2017-10-12 Khalifa University Of Science, Technology And Research Methods and systems for data processing
CN105956179B (en) * 2016-05-30 2020-05-26 上海智臻智能网络科技股份有限公司 Data filtering method and device
CN106202330B (en) * 2016-07-01 2020-02-07 北京小米移动软件有限公司 Junk information judgment method and device
CN106407971A (en) * 2016-09-14 2017-02-15 北京小米移动软件有限公司 Text recognition method and device
CN106919554B (en) * 2016-10-27 2020-06-30 阿里巴巴集团控股有限公司 Method and device for identifying invalid words in document
CN107273362B (en) * 2017-07-04 2020-10-30 联想(北京)有限公司 Data processing method and apparatus thereof

Also Published As

Publication number Publication date
CN108304387A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108304387B (en) Method, device, server group and storage medium for recognizing noise words in text
CN110110041B (en) Wrong word correcting method, wrong word correcting device, computer device and storage medium
CN114757176A (en) Method for obtaining target intention recognition model and intention recognition method
CN111160041A (en) Semantic understanding method and device, electronic equipment and storage medium
CN109492221A (en) A kind of information replying method and wearable device based on semantic analysis
CN112820269A (en) Text-to-speech method, device, electronic equipment and storage medium
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN111930792A (en) Data resource labeling method and device, storage medium and electronic equipment
CN116541493A (en) Interactive response method, device, equipment and storage medium based on intention recognition
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN111597815A (en) Multi-embedded named entity identification method, device, equipment and storage medium
CN109657127B (en) Answer obtaining method, device, server and storage medium
CN112182167B (en) Text matching method and device, terminal equipment and storage medium
CN113626576A (en) Method and device for extracting relational characteristics in remote supervision, terminal and storage medium
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
CN112349294A (en) Voice processing method and device, computer readable medium and electronic equipment
CN111968646A (en) Voice recognition method and device
CN111046674A (en) Semantic understanding method and device, electronic equipment and storage medium
CN116662496A (en) Information extraction method, and method and device for training question-answering processing model
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN113901793A (en) Event extraction method and device combining RPA and AI
CN112530406A (en) Voice synthesis method, voice synthesis device and intelligent equipment
CN113822053A (en) Grammar error detection method and device, electronic equipment and storage medium
CN111814433B (en) Uygur language entity identification method and device and electronic equipment
CN111401070A (en) Word sense similarity determining method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant