CN113591469A

CN113591469A - Text enhancement method and system based on word interpretation

Info

Publication number: CN113591469A
Application number: CN202110662528.5A
Authority: CN
Inventors: 赵鹏阳; 杨红飞
Original assignee: Hangzhou Firestone Technology Co ltd
Current assignee: Hangzhou Firestone Technology Co ltd
Priority date: 2021-06-15
Filing date: 2021-06-15
Publication date: 2021-11-02

Abstract

The application relates to a text enhancement method and system based on word interpretation. The method comprises: first, acquiring a text to be detected and an interpretation sentence of a target word in that text; second, preprocessing the text, where, for a text classification task that takes the target word as a label, the interpretation sentence of the target word is set as the label, and, for a text classification task that does not take the target word as a label, the interpretation sentence of the target word is added to the text; and finally, training a natural language classification model on the preprocessed text. This addresses the poor training effect and low accuracy that arise in text classification when the text contains words, such as new words, for which the related corpus is insufficient, and improves the accuracy of the model.

Description

Text enhancement method and system based on word interpretation

Technical Field

The present application relates to the field of computers, and more particularly, to a method and system for text enhancement based on word interpretation.

Background

In artificial intelligence applications, natural language processing tasks based on machine learning require a large corpus to train a model. The performance of a natural language processing model therefore depends in part on the corpus. When the corpus is insufficient, the accuracy and recall of the model are unsatisfactory. When the corpus is imbalanced, for example when some labels in a text classification task have far more data than others, the model may pay too much attention to the labels with abundant data, so that accuracy and recall are low for the labels with few samples. It is therefore necessary to enhance the text, that is, to generate additional corpus data from the existing corpus and thereby expand it. Common text data enhancement methods include translation, replacement of non-core words, and text enhancement based on a generative language model.

However, the related art assumes that the text does not involve words with insufficient corpus coverage, such as new words, and it requires training on a large corpus to obtain a reasonably accurate model. When the corpus related to a word is insufficient, it is difficult for the model to perform well.

At present, the related art offers no effective solution to the poor model training effect and low accuracy that arise in text classification when the text contains words, such as new words, for which the related corpus is insufficient.

Disclosure of Invention

The embodiments of the present application provide a text enhancement method and system based on word interpretation, so as to at least solve the problems in the related art of poor model training effect and low accuracy in text classification when the text contains words, such as new words, for which the related corpus is insufficient.

In a first aspect, an embodiment of the present application provides a text enhancement method based on word interpretation, where the method includes:

acquiring a text to be detected, and acquiring an interpretation sentence of a target word in the text to be detected;

preprocessing the text to be detected, setting an interpretation sentence of the target word as a label for a text classification task taking the target word as a label, and adding the interpretation sentence of the target word into the text for a text classification task not taking the target word as a label;

and training the natural language classification model through the preprocessed text.
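The two preprocessing branches above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `preprocess`, the `(text, label)` tuple layout, the `target_is_label` flag and the "X means Y." sentence template are all assumptions made for the example.

```python
from typing import List, Tuple

Sample = Tuple[str, str]  # (text, label); this layout is an assumption


def preprocess(
    samples: List[Sample],
    target_word: str,
    interpretation: str,
    target_is_label: bool,
) -> List[Sample]:
    """Apply one of the two branches, depending on the kind of classification task."""
    processed = []
    for text, label in samples:
        if target_is_label:
            # Task takes the target word as a label:
            # the interpretation sentence is set as the label.
            if label == target_word:
                label = interpretation
        elif target_word in text:
            # Task does not take the target word as a label:
            # the interpretation sentence is added to the end of the text.
            text = f"{text.rstrip()} {target_word} means {interpretation}."
        processed.append((text, label))
    return processed
```

In the first branch the textual label would still need to be turned into a vector before training; the sketch stops at the textual replacement.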

In some embodiments, for the text classification task that takes the target word as a label, setting the interpretation sentence of the target word as the label includes:

setting the interpretation sentence of the target word as the label through average pooling, that is, converting the interpretation sentence of the target word into a label vector with the same dimension as the word vector.
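The averaging operation can be written compactly. The symbols $n$ (number of word segments in the interpretation sentence) and $d$ (word vector dimension) are introduced here for illustration and do not appear in the application.

```latex
% Word vectors of the n segments of the interpretation sentence
v_1, v_2, \dots, v_n \in \mathbb{R}^{d}

% Label vector: mean of the word vectors, taken dimension by dimension
\hat{v} = \frac{1}{n} \sum_{i=1}^{n} v_i ,
\qquad
\hat{v}_j = \frac{1}{n} \sum_{i=1}^{n} v_{i,j} \quad (j = 1, \dots, d)
```

Because the mean is taken separately in each of the $d$ dimensions, $\hat{v}$ lies in $\mathbb{R}^{d}$, the same space as the word vectors.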

In some embodiments, converting the interpretation sentence of the target word into a label vector having the same dimension as the word vector comprises:

performing word segmentation processing on the explanatory sentences of the target words, and acquiring BERT pre-trained word vectors corresponding to the word segments;

and calculating the average value of the word vectors in the same dimension to obtain the label vectors with the same dimension as the word vectors.
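A minimal sketch of these two steps, under stated assumptions: the interpretation sentence is already segmented into five tokens, and a small hand-written lookup table (`TOY_VECTORS`, illustrative values only) stands in for BERT pre-trained word vectors, which in practice would come from a pre-trained model and have far more dimensions.

```python
from typing import Dict, List

# Toy 3-dimensional vectors standing in for BERT pre-trained word vectors.
# The tokens and values are assumptions made for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "toward": [1.0, 0.0, 2.0],
    "technology": [3.0, 2.0, 0.0],
    "fanatical": [2.0, 4.0, 1.0],
    "particle": [4.0, 2.0, 1.0],
    "person": [0.0, 2.0, 1.0],
}


def average_pool(vectors: List[List[float]]) -> List[float]:
    """Average the vectors dimension by dimension; the result keeps their dimension."""
    if not vectors:
        raise ValueError("need at least one word vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all word vectors must have the same dimension")
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def label_vector(segments: List[str], embeddings: Dict[str, List[float]]) -> List[float]:
    """Look up each segment's word vector and average-pool them into one label vector."""
    return average_pool([embeddings[s] for s in segments])


segments = ["toward", "technology", "fanatical", "particle", "person"]
v_hat = label_vector(segments, TOY_VECTORS)
print(v_hat)  # → [2.0, 2.0, 1.0]
```

The label vector has three components, matching the dimension of the toy word vectors.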

In some embodiments, the obtaining an interpretation sentence of a target word in the text to be tested includes:

obtaining the interpretation sentence of the target word from a domain expert and a language expert, or searching a professional knowledge base to obtain the interpretation sentence of the target word.
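One way to picture the two sources of interpretation sentences is the sketch below. The dictionary `KNOWLEDGE_BASE`, the function `get_interpretation` and the `ask_expert` callback are hypothetical names; trying the knowledge base first and falling back to an expert is an assumption, since the application presents the two sources only as alternatives.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a professional knowledge base:
# target word -> interpretation sentence.
KNOWLEDGE_BASE: Dict[str, str] = {
    "geek": "a person who is enthusiastic about technology",
}


def get_interpretation(
    target_word: str,
    knowledge_base: Dict[str, str],
    ask_expert: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return an interpretation sentence for target_word, or None if no source has one."""
    sentence = knowledge_base.get(target_word)
    if sentence is not None:
        return sentence
    if ask_expert is not None:
        # Fall back to a sentence written by a domain or language expert.
        return ask_expert(target_word)
    return None
```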

In a second aspect, an embodiment of the present application provides a system for text enhancement based on word interpretation, the system including:

the acquisition module is used for acquiring a text to be detected and acquiring an interpretation sentence of a target word in the text to be detected;

the preprocessing module is used for preprocessing the text to be detected, setting the interpretation sentence of the target word as a label for a text classification task taking the target word as the label, and adding the interpretation sentence of the target word into the text for a text classification task not taking the target word as the label;

and the training module is used for training the natural language classification model through the preprocessed text.

In some embodiments, the preprocessing module is further configured to set the interpretation sentence of the target word as the label through average pooling, that is, to convert the interpretation sentence of the target word into a label vector with the same dimension as the word vector.

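In some embodiments, the preprocessing module is further configured to perform word segmentation on the interpretation sentence of the target word, obtain the BERT pre-trained word vector corresponding to each word segment, and calculate the average value of the word vectors in each dimension to obtain a label vector with the same dimension as the word vectors.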

In some embodiments, the obtaining module is further configured to obtain an interpretation sentence of the target word by a domain expert and a language expert, or search a professional knowledge base to obtain the interpretation sentence of the target word.

In a third aspect, the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the method for text enhancement based on word interpretation according to the first aspect.

In a fourth aspect, embodiments of the present application provide a storage medium having stored thereon a computer program, which when executed by a processor, implements a method for word interpretation based text enhancement as described in the first aspect above.

Compared with the related art, the text enhancement method based on word interpretation provided by the embodiments of the present application first acquires a text to be tested and an interpretation sentence of a target word in that text; then preprocesses the text, where, for a text classification task that takes the target word as a label, the interpretation sentence of the target word is set as the label, and, for a text classification task that does not take the target word as a label, the interpretation sentence of the target word is added to the text; and finally trains a natural language classification model on the preprocessed text. This addresses the poor model training effect and low accuracy that arise in text classification when the text contains words, such as new words, for which the related corpus is insufficient, and improves the accuracy of the model.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a schematic diagram of an application environment of a method for text enhancement based on word interpretation according to an embodiment of the present application;

FIG. 2 is a flow diagram of a method of text enhancement based on word interpretation according to an embodiment of the present application;

FIG. 3 is a block diagram of a system for text enhancement based on word interpretation according to an embodiment of the present application;

FIG. 4 is an internal structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.

Unless defined otherwise, technical or scientific terms used herein have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. The words "a," "an," "the," and similar words in this application do not denote a limitation of quantity and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The words "connected," "coupled," and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "A plurality" herein means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The terms "first," "second," "third," and the like merely distinguish similar objects and do not denote a particular ordering of the objects.

The method for text enhancement based on word interpretation provided by the present application can be applied to the application environment shown in FIG. 1, which is a schematic diagram of the application environment of the method according to an embodiment of the present application. As shown in FIG. 1, the terminal device 11 communicates with the server 10 via a network. The server 10 first acquires a text to be detected and an interpretation sentence of a target word in that text; then preprocesses the text, where, for a text classification task that takes the target word as a label, the interpretation sentence of the target word is set as the label, and, for a text classification task that does not take the target word as a label, the interpretation sentence of the target word is added to the text; and finally trains a natural language classification model on the preprocessed text, performs classification prediction on text with the trained classification model, and displays the prediction result on the terminal device 11. The terminal device 11 may be, but is not limited to, a personal computer, notebook computer, smartphone or tablet computer, and the server 10 may be implemented as an independent server or as a server cluster formed by a plurality of servers.

The present embodiment provides a method for text enhancement based on word interpretation, and fig. 2 is a flowchart of a method for text enhancement based on word interpretation according to an embodiment of the present application, and as shown in fig. 2, the flowchart includes the following steps:

step S201, acquiring a text to be detected, and acquiring an interpretation sentence of a target word in the text to be detected;

optionally, in this embodiment, the interpretation sentence of the target word in the text to be tested is either a sentence in which domain experts and language professionals explain the target word in natural language, or a sentence obtained by searching a professional knowledge base;

step S202, preprocessing the text to be detected, setting the interpretation sentence of the target word as a label for the text classification task using the target word as the label, and adding the interpretation sentence of the target word into the text for the text classification task not using the target word as the label;

preferably, in this embodiment, for the text classification task that takes the target word as a label, the interpretation sentence of the target word is set as the label through average pooling, that is, the interpretation sentence of the target word is converted into a label vector with the same dimension as the word vector. For example, suppose a text classification task requires training a model to classify news items into three categories, "sports", "economy" and "geek", where "geek" is the target word of the text. Through domain experts and language experts, or by searching a professional knowledge base, "geek" can be interpreted as "a person who is enthusiastic about technology". First, the interpretation sentence "a person who is enthusiastic about technology" of the target word "geek" is set as the label. Next, word segmentation is performed on the interpretation sentence, giving five segments, roughly [ 'toward', 'technology', 'fanatical', (attributive particle), 'person' ]. Then, the BERT pre-trained word vector corresponding to each segment is obtained, giving the word vectors $v_1, v_2, v_3, v_4, v_5$. Finally, the average value of the word vectors is calculated in each dimension to obtain the label vector, that is, $\hat{v} = \operatorname{average}(v_1, v_2, v_3, v_4, v_5)$, where $\hat{v}$ has the same dimension as the word vectors;

further, for the text classification task that does not take the target word as a label, the interpretation sentence of the target word is added to the text. For example, if a text T contains the target word "geek", the sentence "'geek' means a person who is enthusiastic about technology" is appended to the end of T. The interpretation sentence thus associates the target word with its meaning, which facilitates text classification;

according to the method, the target words are explained, so that the problems of classification errors and the like caused by semantic ambiguity can be reduced, the accuracy of text classification can be effectively improved, and the accuracy of the model can be improved;

and step S203, training a natural language classification model through the preprocessed text. Optionally, in this embodiment, the natural language classification model may be trained on the preprocessed text; it should be noted that this embodiment does not limit the natural language classification model to any specific model.
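Because the embodiment leaves the classification model open, the sketch below swaps in a tiny multinomial naive Bayes classifier purely for illustration. The training texts, the labels "sports" and "technology", the English tokenizer and the choice to apply the same preprocessing to the text being predicted are all assumptions of this example.

```python
import math
import re
from collections import Counter, defaultdict
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphabetic tokens (an English-only simplification)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, samples: List[Tuple[str, str]]) -> "NaiveBayesTextClassifier":
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        for text, label in samples:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training carry no evidence
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


TRAIN = [
    ("team scored late goal", "sports"),
    ("player left match early", "sports"),
    ("new technology speeds chips", "technology"),
    ("engineers ship software technology", "technology"),
]
model = NaiveBayesTextClassifier().fit(TRAIN)

# "geek" never occurs in the training data; the appended interpretation sentence
# supplies a word ("technology") that the model has already seen.
augmented = "geek joined club geek means a person who is enthusiastic about technology."
print(model.predict(augmented))  # → technology
```

Without the appended sentence, the text "geek joined club" contains no word from the training vocabulary, so this toy model has no evidence for either class.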

Through steps S201 to S203, this embodiment acquires the interpretation sentence of the target word in the text and either sets it as the label or adds it to the text. This addresses the poor model training effect and low accuracy that arise in text classification when the text contains words, such as new words, for which the related corpus is insufficient, and improves the accuracy of the model.

It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.

This embodiment also provides a system for text enhancement based on word interpretation. The system is used to implement the above embodiments and preferred embodiments, and what has already been described is not repeated. As used hereinafter, the terms "module," "unit," "subunit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 3 is a block diagram of a system for text enhancement based on word interpretation according to an embodiment of the present application, and as shown in fig. 3, the system includes an acquisition module 31, a preprocessing module 32, and a training module 33:

the acquisition module 31 is configured to acquire a text to be tested and an interpretation sentence of a target word in the text to be tested; the preprocessing module 32 is configured to preprocess the text to be tested, setting the interpretation sentence of the target word as the label for a text classification task that takes the target word as a label, and adding the interpretation sentence of the target word to the text for a text classification task that does not take the target word as a label; and the training module 33 is configured to train the natural language classification model through the preprocessed text.
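The division into three modules can be sketched as plain classes wired together. The class and method names below are hypothetical, and each module simply wraps an injected callable, so the sketch shows only the data flow between the modules rather than any particular lookup, preprocessing rule or model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Sample = Tuple[str, str]  # (text, label); this layout is an assumption


@dataclass
class AcquisitionModule:
    lookup: Callable[[str], str]  # target word -> interpretation sentence

    def acquire(self, samples: List[Sample], target_word: str) -> Tuple[List[Sample], str]:
        return samples, self.lookup(target_word)


@dataclass
class PreprocessingModule:
    transform: Callable[[List[Sample], str, str], List[Sample]]

    def preprocess(self, samples: List[Sample], target_word: str, interpretation: str) -> List[Sample]:
        return self.transform(samples, target_word, interpretation)


@dataclass
class TrainingModule:
    trainer: Callable[[List[Sample]], Any]  # returns whatever "model" the trainer builds

    def train(self, samples: List[Sample]) -> Any:
        return self.trainer(samples)


@dataclass
class TextEnhancementSystem:
    acquisition: AcquisitionModule
    preprocessing: PreprocessingModule
    training: TrainingModule

    def run(self, samples: List[Sample], target_word: str) -> Any:
        # Acquisition -> preprocessing -> training, in that order.
        samples, interpretation = self.acquisition.acquire(samples, target_word)
        processed = self.preprocessing.preprocess(samples, target_word, interpretation)
        return self.training.train(processed)
```

Injecting the behaviour as callables keeps the sketch neutral about whether the modules are realized in software, hardware, or both.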

Through this system, the interpretation sentence of the target word in the text is acquired and either set as the label or added to the text. This addresses the poor model training effect and low accuracy that arise in text classification when the text contains words, such as new words, for which the related corpus is insufficient, and improves the accuracy of the model.

It should be noted that, for specific examples in other embodiments in the present application, reference may be made to examples described in the embodiment and the optional implementation manner of the text enhancement method based on word interpretation, and this embodiment is not described herein again.

Note that each of the modules may be a functional module or a program module, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.

The present embodiment also provides an electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

In addition, in combination with the method for text enhancement based on word interpretation in the above embodiments, the embodiments of the present application may provide a storage medium for implementation. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements any of the methods for text enhancement based on word interpretation in the above embodiments.

In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of text enhancement based on word interpretation. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

In one embodiment, an electronic device is provided, which may be a server. FIG. 4 is a schematic diagram of the internal structure of the electronic device according to an embodiment of the present application. As shown in FIG. 4, the electronic device comprises a processor, a network interface, an internal memory and a non-volatile memory connected by an internal bus, wherein the non-volatile memory stores an operating system, a computer program and a database. The processor provides computing and control capabilities; the network interface communicates with an external terminal through a network connection; the internal memory provides an environment for running the operating system and the computer program; the computer program, when executed by the processor, implements the text enhancement method based on word interpretation; and the database stores data.

Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with the present application, and does not constitute a limitation on the electronic device to which the present application is applied, and a particular electronic device may include more or less components than those shown in the drawings, or combine certain components, or have a different arrangement of components.

It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

It should be understood by those skilled in the art that various features of the above-described embodiments can be combined in any combination, and for the sake of brevity, all possible combinations of features in the above-described embodiments are not described in detail, but rather, all combinations of features which are not inconsistent with each other should be construed as being within the scope of the present disclosure.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

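1. A method of word interpretation based text enhancement, the method comprising:

acquiring a text to be detected, and acquiring an interpretation sentence of a target word in the text to be detected;

preprocessing the text to be detected, setting the interpretation sentence of the target word as a label for a text classification task taking the target word as a label, and adding the interpretation sentence of the target word into the text for a text classification task not taking the target word as a label;

and training a natural language classification model through the preprocessed text.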

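2. The method of claim 1, wherein, for the text classification task taking the target word as a label, setting the interpretation sentence of the target word as the label comprises:

setting the interpretation sentence of the target word as the label through average pooling, and converting the interpretation sentence of the target word into a label vector with the same dimension as the word vector.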

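3. The method of claim 2, wherein converting the interpretation sentence of the target word into a label vector having the same dimension as the word vector comprises:

performing word segmentation processing on the interpretation sentence of the target word, and acquiring the BERT pre-trained word vector corresponding to each word segment;

and calculating the average value of the word vectors in each dimension to obtain the label vector with the same dimension as the word vectors.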

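4. The method of claim 1, wherein acquiring the interpretation sentence of the target word in the text to be detected comprises:

obtaining the interpretation sentence of the target word from a domain expert and a language expert, or searching a professional knowledge base to obtain the interpretation sentence of the target word.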

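5. A system for word interpretation based text enhancement, the system comprising:

an acquisition module, configured to acquire a text to be detected and acquire an interpretation sentence of a target word in the text to be detected;

a preprocessing module, configured to preprocess the text to be detected, set the interpretation sentence of the target word as a label for a text classification task taking the target word as a label, and add the interpretation sentence of the target word into the text for a text classification task not taking the target word as a label;

and a training module, configured to train a natural language classification model through the preprocessed text.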

6. The system of claim 5,

the preprocessing module is further configured to set the interpretation sentence of the target word as the label through average pooling, and convert the interpretation sentence of the target word into a label vector with the same dimension as the word vector.

7. The system of claim 6,

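the preprocessing module is further configured to perform word segmentation processing on the interpretation sentence of the target word, acquire the BERT pre-trained word vector corresponding to each word segment, and calculate the average value of the word vectors in each dimension to obtain the label vector with the same dimension as the word vectors.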

8. The system of claim 5,

the acquisition module is further configured to acquire an interpretation sentence of the target word by a domain expert and a language expert, or search a professional knowledge base to obtain the interpretation sentence of the target word.

9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and the processor is configured to execute the computer program to perform the method of word interpretation based text enhancement of any of claims 1 to 4.

10. A storage medium having stored thereon a computer program, wherein the computer program is arranged to perform the method for word interpretation based text enhancement as claimed in any of claims 1 to 4 when executed.