CN108304387B - Method, device, server group and storage medium for recognizing noise words in text

Info

Publication number: CN108304387B
Application number: CN201810195233.XA
Authority: CN
Inventors: 金宝宝; 杨帆; 张成松
Original assignee: Lenovo Beijing Ltd
Current assignee: Lenovo Beijing Ltd
Priority date: 2018-03-09
Filing date: 2018-03-09
Publication date: 2021-06-15
Anticipated expiration: 2038-03-09
Also published as: CN108304387A

Abstract

The application provides a method, a device, a server group and a storage medium for recognizing noise words in text. The method comprises: obtaining a text to be recognized; sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized; and inputting the word vector set into a pre-established noise word recognition model to obtain a recognition result, output by the model, of the noise words in the text to be recognized. The noise word recognition model is obtained by training with, as training samples, the word vector sets corresponding to training texts marked with noise words. Because the model is trained on training texts marked with noise words, it can recognize noise words in the text to be recognized.

Description

Method, device, server group and storage medium for recognizing noise words in text

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method and a device for identifying noise words in a text, a server group and a storage medium.

Background

Natural language processing is one of the most important sub-fields in the field of artificial intelligence, and is the technical core of current popular translation systems, man-machine conversation systems and question-answering systems. The non-normalization of text generated in the real world is one of the most important factors affecting the performance of natural language processing, and the non-normalization caused by noise words is particularly significant.

A noise word is a word that is not within the stop-word range but is meaningless in the current context. Unlike stop words, which are relatively fixed, noise words are not fixed: a word that is a noise word in one text may not be a noise word in another. For example, the number 12 is a meaningless noise word in "12 th school" but not in "12 th middle of the month". This makes noise words difficult to recognize.

Disclosure of Invention

In view of the above, the present invention provides a method, an apparatus, a server group and a storage medium for recognizing a noise word in a text, so as to solve the problem in the prior art that the noise word is difficult to recognize, and the technical solution is as follows:

a method for recognizing noise words in text comprises the following steps:

acquiring a text to be recognized;

sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized;

and inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result, output by the noise word recognition model, of the noise words in the text to be recognized, wherein the noise word recognition model is obtained by training with, as training samples, the word vector sets corresponding to training texts marked with noise words.

The text to be recognized comprises target words;

the training text comprises the target word;

the noise word recognition model is obtained by training with, as training samples, the word vector sets corresponding to training texts in which the target word is marked as a noise word and the word vector sets corresponding to training texts in which the target word is marked as a non-noise word;

and the recognition result of the noise word in the text to be recognized is used for indicating whether the target word is the noise word.

The process of pre-establishing the noise word recognition model includes:

acquiring a plurality of texts marked with noise words to form a training text set;

sequentially converting each character in the training texts in the training text set into a word vector to obtain a word vector set corresponding to the training texts, wherein the distance between different word vectors represents the relevance between the characters corresponding to those word vectors;

and training a recurrent neural network by using the word vector set corresponding to the training text as input, and taking the recurrent neural network obtained by training as the noise word recognition model.

Wherein, the sequentially converting each word in the training texts in the training text set into a word vector comprises:

and sequentially processing each word in the training text set into vector data, and converting the vector data into a word vector to obtain a word vector set corresponding to the training text.

The method for recognizing the noise words in the text further comprises the following steps:

acquiring the mapping relation between vector data corresponding to each character appearing in the training text set and a corresponding character vector;

the sequentially converting each character in the text to be recognized into a word vector comprises:

and sequentially converting each character in the text to be recognized into vector data serving as target vector data, and converting the target vector data into a word vector based on the mapping relation between the vector data corresponding to each character appearing in the training text set and the corresponding word vector.

An apparatus for recognizing noise words in text, comprising: a text to be recognized acquisition module, a text to be recognized conversion module and a noise recognition module;

the text to be recognized acquisition module is used for acquiring a text to be recognized;

the text to be recognized conversion module is used for sequentially converting each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized;

and the noise recognition module is used for inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result of the noise words in the text to be recognized, which is output by the noise word recognition model, wherein the noise word recognition model is obtained by taking the word vector set corresponding to the training text marked with the noise words as a training sample for training.

The text to be recognized comprises target words;

the training text comprises the target word;

The device for recognizing the noise words in the text further comprises: a training text acquisition module, a training text conversion module and a training module;

the training text acquisition module is used for acquiring a plurality of texts marked with noise words to form a training text set;

the training text conversion module is used for sequentially converting each character in the training text set into a word vector to obtain a word vector set corresponding to the training text, wherein the distance between different word vectors represents the relevance between the corresponding characters;

and the training module is used for training a recurrent neural network by taking the word vector set corresponding to the training text as input, and taking the recurrent neural network obtained by training as the noise word recognition model.

A server group, comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to:

acquiring a text to be identified;

A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for recognizing noise words in text as described.

The technical scheme has the following beneficial effects:

The invention provides a method, a device, a server group and a storage medium for recognizing noise words in a text. A text to be recognized is first obtained; each character in the text to be recognized is then sequentially converted into a word vector to obtain a word vector set corresponding to the text to be recognized; finally, the word vector set is input into a pre-established noise word recognition model to obtain a recognition result, output by the model, of the noise words in the text to be recognized. Because the noise word recognition model is obtained by training with, as training samples, the word vector sets corresponding to training texts marked with noise words, noise words can be recognized in the text to be recognized through the model. With the method provided by the invention, a user does not need strong industry knowledge and only needs to label the training texts at the initial stage of training the model, so the method is simple to implement and has high recognition accuracy.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flow chart of a method for recognizing a noise word in a text according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of an implementation process of pre-establishing a noise word recognition model according to an embodiment of the present invention;

Fig. 3 is another schematic flow chart of a method for recognizing a noise word in a text according to an embodiment of the present invention;

Fig. 4 is a schematic flow chart of an implementation process of pre-establishing a noise word recognition model according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an apparatus for recognizing noise words in a text according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a server group according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the present invention provides a method for recognizing a noise word in a text, please refer to fig. 1, which shows a flow diagram of the recognition method, and the method may include:

Step S101: acquire a text to be recognized.

Step S102: sequentially convert each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized.

Step S103: input the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result, output by the noise word recognition model, of the noise words in the text to be recognized.

The noise word recognition model is obtained by training with, as training samples, the word vector sets corresponding to training texts marked with noise words.
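Steps S101 to S103 can be sketched as a small pipeline. The following is a minimal illustration only: the lookup table `toy_table`, the stand-in `toy_model` and the per-character true/false output format are all assumptions made for the example, since the embodiment does not fix the form of the recognition result.

```python
from typing import Callable, Dict, List

Vector = List[float]
Model = Callable[[List[Vector]], List[bool]]


def text_to_vectors(text: str, table: Dict[str, Vector]) -> List[Vector]:
    """Step S102: convert each character, in order, into its word vector."""
    return [table[ch] for ch in text]


def recognize_noise(text: str, table: Dict[str, Vector], model: Model) -> List[bool]:
    """Steps S101-S103: take a text, vectorize it, and query the model."""
    vectors = text_to_vectors(text, table)
    return model(vectors)


# Hypothetical word vectors; a real table would come from training.
toy_table: Dict[str, Vector] = {
    "1": [1.0, 0.0],
    "2": [1.0, 0.1],
    "A": [0.0, 1.0],
}


def toy_model(vectors: List[Vector]) -> List[bool]:
    """Stand-in rule, NOT a trained network: flag a character as noise
    when the first component of its vector exceeds the second."""
    return [v[0] > v[1] for v in vectors]


result = recognize_noise("12A", toy_table, toy_model)
```

In practice `toy_model` would be replaced by the trained recurrent neural network described below.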

Referring to fig. 2, a schematic flow chart of a possible implementation process of pre-establishing a noise word recognition model in this embodiment is shown, which may include:

Step S201: acquire a plurality of texts marked with noise words to form a training text set.

Specifically, a plurality of texts are obtained; the obtaining method can be, but is not limited to, selecting texts from an existing corpus or crawling them from the network with a web crawler. The noise words in each text are then labeled, so that a plurality of texts labeled with noise words are obtained. Each text labeled with noise words is a training text, and these texts together form the training text set. Preferably, texts from different fields are acquired so that the established noise word recognition model can adapt to different application fields.
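One possible way to store such labels is one 0/1 flag per character. The sketch below assumes that annotators mark noise words as (start, end) character spans; both this annotation format and the function name `spans_to_labels` are hypothetical and are not specified by the embodiment. The two texts are the examples given in the Background.

```python
from typing import List, Tuple


def spans_to_labels(text: str, noise_spans: List[Tuple[int, int]]) -> List[int]:
    """Turn (start, end) noise-word spans into one 0/1 label per character.

    `end` is exclusive, as in Python slicing.
    """
    labels = [0] * len(text)
    for start, end in noise_spans:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span {(start, end)} is outside the text")
        for i in range(start, end):
            labels[i] = 1
    return labels


# Hypothetical annotated texts: "12" is noise in the first, not in the second.
training_text_set = [
    ("12 th school", [(0, 2)]),
    ("12 th middle of the month", []),
]

labeled = [(text, spans_to_labels(text, spans)) for text, spans in training_text_set]
```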

Step S202: sequentially convert each character in the training texts in the training text set into a word vector to obtain a word vector set corresponding to the training texts.

The distance between different word vectors represents the relevance between the corresponding words. For example, suppose the training text set contains a large number of training texts related to "hospital", such as "first people hospital" and "second center hospital". After vector conversion, the distance between the word vector corresponding to "doctor" and the word vector corresponding to "hospital" is short, i.e. the association between "doctor" and "hospital" is strong. By contrast, "person" and "middle" do not appear together in large numbers, so the distance between the word vector corresponding to "person" and the word vector corresponding to "middle" is long, i.e. the association between "person" and "middle" is weak.
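The relationship between distance and relevance can be illustrated with a small calculation. The embodiment does not name a distance metric, so Euclidean distance is used here as an assumption, and the three-dimensional vectors are invented values chosen only to mirror the example above.

```python
import math
from typing import Dict, List


def euclidean_distance(a: List[float], b: List[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical word vectors, not produced by any real training run.
vectors: Dict[str, List[float]] = {
    "doctor": [0.9, 0.1, 0.0],
    "hospital": [0.8, 0.2, 0.1],
    "person": [0.1, 0.9, 0.0],
    "middle": [0.0, 0.1, 0.9],
}

near = euclidean_distance(vectors["doctor"], vectors["hospital"])
far = euclidean_distance(vectors["person"], vectors["middle"])
```

With these values `near` is smaller than `far`, matching the strong and weak associations described in the example.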

Specifically, the process of sequentially converting each word in the training texts in the training text set into a word vector may include: sequentially processing each character in the training texts in the training text set into vector data, and converting the vector data into a word vector to obtain a word vector set corresponding to the training texts.

Step S203: train a recurrent neural network with the word vector set corresponding to the training text as input, and take the trained recurrent neural network as the noise word recognition model.

The recurrent neural network can be, but is not limited to, neural network models with memory functions such as RNN, LSTM, GRU, and the like.
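To show what "memory" means here, the following sketch runs the forward pass of a plain (Elman-style) recurrent network over a sequence of word vectors. It is an illustration under several assumptions: the weights are arbitrary untrained values, the per-character sigmoid output is one possible output form that the embodiment does not prescribe, and training is not shown.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_forward(inputs: List[Vector], w_in: Matrix, w_rec: Matrix,
                bias: Vector, w_out: Vector, b_out: float) -> List[float]:
    """Return one noise probability per input vector.

    hidden_t = tanh(w_in @ x_t + w_rec @ hidden_{t-1} + bias)
    prob_t   = sigmoid(w_out . hidden_t + b_out)
    """
    hidden = [0.0] * len(bias)
    probs = []
    for x in inputs:
        pre = [a + b + c for a, b, c in
               zip(matvec(w_in, x), matvec(w_rec, hidden), bias)]
        hidden = [math.tanh(p) for p in pre]
        logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


# Arbitrary, untrained example weights (2 inputs, 2 hidden units).
w_in = [[0.5, -0.3], [0.1, 0.8]]
w_rec = [[0.2, 0.0], [0.0, 0.2]]
bias = [0.0, 0.0]
w_out = [1.0, -1.0]
b_out = 0.0

# The first and third inputs are identical but occur in different contexts.
probs = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
                    w_in, w_rec, bias, w_out, b_out)
```

Because the hidden state carries information forward, the first and third positions receive different outputs even though their input vectors are the same, which is the property that lets such a model treat the same word differently in different contexts.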

In the method for recognizing noise words in text provided by the embodiment of the present invention, a text to be recognized is first obtained; each character in the text to be recognized is then sequentially converted into a word vector to obtain a word vector set corresponding to the text to be recognized; finally, the word vector set is input into a pre-established noise word recognition model to obtain the recognition result, output by the model, of the noise words in the text to be recognized. The noise word recognition model is obtained by training with, as training samples, the word vector sets corresponding to training texts marked with noise words. The method can therefore perform full-text analysis directly on the text to be recognized and determine whether it contains noise words. The user does not need strong industry knowledge and only needs to label the training texts at the initial stage of training the model, so the method is simple to implement and has high recognition accuracy.

Referring to fig. 3, another flow chart of a method for recognizing a noise word in a text according to an embodiment of the present invention is shown, where the method includes:

Step S301: acquire a text to be recognized that contains a target word.

Step S302: sequentially convert each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized.

Step S303: input the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result, output by the noise word recognition model, indicating whether the target word in the text to be recognized is a noise word.

Specifically, the noise word recognition model is obtained by training with, as training samples, the word vector sets corresponding to training texts in which the target word is labeled as a noise word and the word vector sets corresponding to training texts in which the target word is labeled as a non-noise word.

Referring to fig. 4, a schematic flow chart of a possible implementation process of pre-establishing a noise word recognition model in this embodiment is shown, which may include:

Step S401: acquire a plurality of texts that contain the target word and in which the target word has been labeled, to form a training text set.

Specifically, a plurality of texts containing the target word are obtained; the obtaining method may be, but is not limited to, selecting texts from an existing corpus or crawling them from a network with a web crawler. The target word in each text is then labeled as either a noise word or a non-noise word, so that a plurality of texts with the target word labeled are obtained, and these texts form the training text set. Preferably, the texts containing the target word are drawn from different fields, so that the established noise word recognition model can adapt to different application fields.
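A training text set of this kind might be represented as follows. This is a sketch only: the `TargetSample` record and the `build_training_set` helper are hypothetical names, and the two texts are the examples from the Background, one with the target word "12" labeled as noise and one with it labeled as non-noise.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TargetSample:
    """One training text with its target word and the label for that word."""
    text: str
    target: str
    is_noise: bool


def build_training_set(target: str,
                       labeled_texts: List[Tuple[str, bool]]) -> List[TargetSample]:
    """Collect labeled texts, rejecting any text that lacks the target word."""
    samples = []
    for text, is_noise in labeled_texts:
        if target not in text:
            raise ValueError(f"text does not contain target word {target!r}")
        samples.append(TargetSample(text=text, target=target, is_noise=is_noise))
    return samples


samples = build_training_set("12", [
    ("12 th school", True),                 # target labeled as a noise word
    ("12 th middle of the month", False),   # target labeled as a non-noise word
])

# Both kinds of label are present, as the embodiment requires.
has_both_labels = {s.is_noise for s in samples} == {True, False}
```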

Step S402: sequentially convert each character in the training texts in the training text set into a word vector to obtain a word vector set corresponding to the training texts.

Wherein, the distance between different word vectors represents the relevance between the corresponding words.

In one possible implementation, all words appearing in the training text set may be one-hot encoded to complete the conversion of the text data into computer-processable vector data. It should be noted that one-hot coding is also called single hot coding, i.e. a unique code is given to each word appearing in the training text set.

Specifically, if N distinct characters appear in the training text set, each character can be represented by an (N-1)-dimensional vector: all bits of the vector for the first character are 0, the first bit of the vector for the second character is 1, the second bit of the vector for the third character is 1, and so on.

Illustratively, suppose the training text set contains two sentences: "123 first people hospital" and "liberation one way". Twelve distinct characters then appear in the training text set, and the codes corresponding to these characters are as follows:

The code corresponding to "1" is: [0,0,0,0,0,0,0,0,0,0,0,0]

The code corresponding to "2" is: [1,0,0,0,0,0,0,0,0,0,0,0]

The code corresponding to "3" is: [0,1,0,0,0,0,0,0,0,0,0,0]

The code corresponding to "second" is: [0,0,1,0,0,0,0,0,0,0,0,0]

The code corresponding to "one" is: [0,0,0,1,0,0,0,0,0,0,0,0]

……

The code corresponding to "put" is: [0,0,0,0,0,0,0,0,0,0,1,0]

The code corresponding to "way" is: [0,0,0,0,0,0,0,0,0,0,0,1]
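The coding rule stated above (N distinct characters, (N-1)-dimensional vectors, all zeros for the first character) can be sketched as follows. The function name `build_codes` and the toy Latin-letter texts are assumptions for illustration; they are not the characters of the example sentences.

```python
from typing import Dict, List


def build_codes(texts: List[str]) -> Dict[str, List[int]]:
    """Assign each distinct character a unique (N-1)-dimensional code.

    Characters are numbered in order of first appearance. The first
    character gets all zeros; the k-th character (k >= 2) gets a single 1
    in position k-1 (counting positions from 1).
    """
    vocab: List[str] = []
    for text in texts:
        for ch in text:
            if ch not in vocab:
                vocab.append(ch)

    dim = max(len(vocab) - 1, 0)
    codes: Dict[str, List[int]] = {}
    for index, ch in enumerate(vocab):
        code = [0] * dim
        if index > 0:
            code[index - 1] = 1
        codes[ch] = code
    return codes


# Toy training text set with four distinct characters: a, b, c, d.
codes = build_codes(["abc", "cad"])
```

Here `codes["a"]` is `[0, 0, 0]`, `codes["b"]` is `[1, 0, 0]`, `codes["c"]` is `[0, 1, 0]` and `codes["d"]` is `[0, 0, 1]`, so every character still receives a distinct code.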

After each word is processed into vector data, word2vec and the like may be used to convert each vector data into a word vector. Based on the above process, for each training sample, the word vectors corresponding to all the characters contained in the training sample can be determined, and the word vectors corresponding to all the characters contained in the training sample form a word vector set, so that the word vector set corresponding to the training sample can be obtained.
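In word2vec-style models, the step from sparse vector data to a dense word vector amounts to multiplying the code by a learned weight matrix; when the code contains a single 1, the product simply selects one row of that matrix. The sketch below shows only this projection. The matrix `weights` holds invented values standing in for trained parameters, and the training procedure itself is not shown.

```python
from typing import List


def project(code: List[int], weights: List[List[float]]) -> List[float]:
    """Multiply a row vector `code` (length n) by `weights` (n x d)."""
    if len(code) != len(weights):
        raise ValueError("code length must equal the number of weight rows")
    dim = len(weights[0])
    return [sum(code[i] * weights[i][j] for i in range(len(code)))
            for j in range(dim)]


# Hypothetical weight matrix: 3 code positions, 2-dimensional word vectors.
weights = [
    [0.2, -0.1],
    [0.7, 0.3],
    [-0.5, 0.9],
]

# A code with its single 1 in the second position selects the second row.
word_vector = project([0, 1, 0], weights)
```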

In addition, in order to realize the conversion from the characters in the text to be recognized to the word vectors, the mapping relationship between the vector data corresponding to all the characters appearing in the training text set and the corresponding word vectors can be stored. Specifically, the implementation process of sequentially converting each word in the text to be recognized into the word vector in step S102 may include: and sequentially converting each character in the text to be recognized into vector data serving as target vector data, and converting the target vector data into a word vector based on the corresponding relation between the vector data and the word vector. Specifically, a word vector corresponding to the target vector data is found in the correspondence relationship between the vector data and the word vector.
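The stored correspondence can be sketched as a dictionary keyed by the vector data. The names `codes`, `mapping` and `lookup_word_vectors` and all numeric values are hypothetical. How characters that never appeared in the training text set should be handled is not stated in the embodiment; this sketch simply lets the lookup raise `KeyError`.

```python
from typing import Dict, List, Tuple

Code = Tuple[int, ...]


def lookup_word_vectors(text: str,
                        codes: Dict[str, List[int]],
                        mapping: Dict[Code, List[float]]) -> List[List[float]]:
    """Convert each character to target vector data, then to a word vector."""
    vectors = []
    for ch in text:
        target_vector_data = tuple(codes[ch])        # character -> vector data
        vectors.append(mapping[target_vector_data])  # vector data -> word vector
    return vectors


# Hypothetical codes and stored mapping obtained during training.
codes = {"a": [0, 0], "b": [1, 0], "c": [0, 1]}
mapping = {
    (0, 0): [0.1, 0.4],
    (1, 0): [0.7, 0.3],
    (0, 1): [-0.5, 0.9],
}

vectors = lookup_word_vectors("cab", codes, mapping)
```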

Step S403: train a recurrent neural network with the word vector set corresponding to the training text as input, and take the trained recurrent neural network as the noise word recognition model.

In the method for recognizing noise words in text provided by this embodiment, a text to be recognized that contains a target word is first obtained; each character in the text to be recognized is then sequentially converted into a word vector to obtain a word vector set corresponding to the text to be recognized; finally, the word vector set is input into a pre-established noise word recognition model to obtain the recognition result output by the model. The method can thus analyze the target word in the text to be recognized and determine whether it is a noise word. The method does not require strong industry knowledge and only requires labeling the training texts at the initial stage of training the model, so it is simple to implement. Because the noise word recognition model judges whether the target word is a noise word according to the context of the target word, the recognition accuracy is higher. In addition, because the training texts are selected from a plurality of different fields, the method is applicable to a plurality of different fields, i.e. its application range is wider.

An embodiment of the present invention further provides a device for recognizing a noise word in a text, please refer to fig. 5, which shows a schematic structural diagram of the recognition device, and the device may include: a text to be recognized acquisition module 501, a text to be recognized conversion module 502 and a noise recognition module 503. Wherein:

The text to be recognized acquisition module 501 is configured to acquire a text to be recognized.

The text to be recognized conversion module 502 is configured to sequentially convert each character in the text to be recognized into a word vector to obtain a word vector set corresponding to the text to be recognized.

The noise recognition module 503 is configured to input the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model to obtain a recognition result, output by the noise word recognition model, of the noise words in the text to be recognized.

The device for recognizing noise words in text provided by the embodiment of the present invention can analyze the text to be recognized with the pre-established noise word recognition model to determine whether the text to be recognized contains noise words. With this device, a user does not need strong industry knowledge and only needs to label the training texts at the initial stage of training the model, so the device is simple to implement and has high recognition accuracy. In addition, because the training texts are selected from a plurality of different fields, the device is applicable to a plurality of different fields, i.e. its application range is wide.

In a possible implementation manner, the text to be recognized acquired by the text to be recognized acquisition module 501 in the above embodiment includes the target word, and correspondingly, the training text also includes the target word. The noise word recognition model is obtained by training with, as training samples, the word vector sets corresponding to training texts in which the target word is marked as a noise word and the word vector sets corresponding to training texts in which the target word is marked as a non-noise word. The recognition result of the noise word in the text to be recognized output by the noise recognition module 503 is used to indicate whether the target word is a noise word.

In a possible implementation manner, the apparatus for recognizing a noise word in a text provided in the foregoing embodiment may further include: a training text acquisition module, a training text conversion module and a training module. Wherein:

The training text acquisition module is configured to acquire a plurality of texts marked with noise words to form a training text set.

The training text conversion module is configured to sequentially convert each character in the training text set into a word vector to obtain a word vector set corresponding to the training text.

The training module is configured to train a recurrent neural network with the word vector set corresponding to the training text as input, and to take the trained recurrent neural network as the noise word recognition model.

The training text conversion module is specifically configured to sequentially process each word in a training text set into vector data, convert the vector data into a word vector, and obtain a word vector set corresponding to the training text.

The apparatus for recognizing a noise word in a text provided in the foregoing embodiment may further include a mapping relation acquisition module.

The mapping relation acquisition module is configured to acquire the mapping relation between the vector data corresponding to each character appearing in the training text set and the corresponding word vector.

The text to be recognized conversion module 502 is specifically configured to sequentially convert each character in the text to be recognized into vector data as target vector data, and convert the target vector data into a word vector based on a mapping relationship between the vector data corresponding to each character appearing in the training text set and a corresponding word vector.

An embodiment of the present invention further provides a server group, where the server group may include: a memory 601 and a processor 602.

A memory 601 for storing programs;

a processor 602 for executing the program to:

acquiring a text to be identified;

The embodiment of the present invention further provides a readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the method for recognizing noise words in a text provided in any of the above embodiments.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus, and device may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for recognizing noise words in a text, comprising:

acquiring a text to be recognized, wherein the text to be recognized comprises a target word;

inputting a word vector set corresponding to the text to be recognized into a pre-established and trained noise word recognition model to obtain a recognition result of a noise word in the text to be recognized, wherein the recognition result of the noise word in the text to be recognized is output by the noise word recognition model and is used for indicating whether the target word is a noise word; the noise word recognition model is obtained by training a training sample by using a word vector set corresponding to a training text marked with a noise word, and comprises the following steps: the training text comprises the target word; the noise recognition model is obtained by training a training sample by using a word vector set corresponding to a training text marked with the target word as a noise word and a word vector set corresponding to a training text marked with the target word as a non-noise word, and can judge whether the target word is the noise word according to the context environment of the target word;

the process of pre-establishing the noise word recognition model includes:

2. The method for recognizing the noise word in the text according to claim 1, wherein the sequentially converting each word in the training texts in the training text set into a word vector comprises:

3. The method of claim 2, further comprising:

4. An apparatus for recognizing noise words in text, comprising: a text to be recognized acquisition module, a text to be recognized conversion module and a noise recognition module;

the text to be recognized acquiring module is used for acquiring a text to be recognized, wherein the text to be recognized comprises a target word;

the noise recognition module is used for inputting a word vector set corresponding to the text to be recognized into a pre-established and trained noise word recognition model to obtain a recognition result of the noise words in the text to be recognized, wherein the recognition result of the noise words in the text to be recognized is output by the noise word recognition model and is used for indicating whether the target word is a noise word; the noise word recognition model is obtained by training a training sample by using a word vector set corresponding to a training text marked with a noise word, and comprises the following steps: the training text comprises the target word; the noise recognition model is obtained by training a training sample by using a word vector set corresponding to a training text marked with the target word as a noise word and a word vector set corresponding to a training text marked with the target word as a non-noise word, and can judge whether the target word is a noise word according to the context environment of the target word;

further comprising: a training text acquisition module, a training text conversion module and a training module;

5. A server group, comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to:

inputting a word vector set corresponding to the text to be recognized into a pre-established and trained noise word recognition model to obtain a recognition result of a noise word in the text to be recognized, wherein the recognition result of the noise word in the text to be recognized is output by the noise word recognition model and is used for indicating whether the target word is a noise word; the noise word recognition model is obtained by training a training sample by using a word vector set corresponding to a training text marked with a noise word, and comprises the following steps: the training text comprises the target word; the noise recognition model is obtained by training a training sample by using a word vector set corresponding to a training text marked with the target word as a noise word and a word vector set corresponding to a training text marked with the target word as a non-noise word, and can judge whether the target word is a noise word according to the context environment of the target word;

the process of pre-establishing the noise word recognition model includes:

6. A readable storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for recognizing a noise word in a text according to any one of claims 1 to 3.