WO2021063060A1

WO2021063060A1 - Text information extraction method and apparatus, storage medium and device

Info

Publication number: WO2021063060A1
Application number: PCT/CN2020/100483
Authority: WO
Inventors: 戴泽辉
Original assignee: 北京国双科技有限公司
Priority date: 2019-09-30
Filing date: 2020-07-06
Publication date: 2021-04-08
Also published as: CN112580358A

Abstract

A text information extraction method and apparatus, a storage medium and a device. The method comprises: determining a target text matrix corresponding to a text to be processed (11); inputting the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model (12), wherein the information extraction result comprises label information of each word segment in the text to be processed, the label information indicates the type of the word segment, and, if the word segment is of the attribute type, the label information also indicates the emotion type of the word segment; and determining, according to the first word segments (those of the attribute type) in the text to be processed and the emotion types of the first word segments, the attribute words in the text to be processed and the emotion type of each attribute word (13). In this way, by means of the trained model, the word segments belonging to the attribute class in the text to be processed and the emotion types of these word segments are obtained at the same time, so that the attribute words in the text to be processed and the emotion type of each attribute word are obtained, which is efficient and ensures accuracy.

Description

Text information extraction method, device, storage medium and equipment

Technical field

The present disclosure relates to the field of computer technology, and in particular, to a method, device, storage medium, and equipment for extracting text information.

Background art

In a text (for example, a user review), in addition to the subject, there are also words (or phrases) belonging to the attribute category and words (or phrases) belonging to the emotion category. Attribute words (or phrases) describe a function or performance of the subject, and emotion words (or phrases) describe the emotion expressed toward an attribute word (or phrase). The emotion type of an emotion word generally falls into one of three types: positive, neutral, and negative. For example, if the text is "The appearance of A car is ugly", the subject of the text is "A car", the word belonging to the attribute category is "appearance", and the word belonging to the emotion category is "ugly". Attribute sentiment can be obtained from such texts, that is, specific attributes and their emotion types can be extracted from the content of the text.

At present, attribute sentiment is generally obtained through a two-step method using a pipeline structure. First, the words (or phrases) belonging to the attribute category are extracted from the text by sequence labeling (for example, LSTM-CRF or BERT-CRF). Then, for each attribute word (or phrase), the attribute word (or phrase) and the sentence containing it are used as training data, and a deep learning method (for example, LSTM-attention, BERT-CLS, ATAE, Recurrent Attention, or Transformation Network) is used to train a model that makes a prediction for a single attribute word. However, such a two-step approach causes information loss and error accumulation, which biases the resulting attribute emotions and shows up as a loss of accuracy.

Summary of the invention

The purpose of the present disclosure is to provide a text information extraction method, device, storage medium, and equipment, which can achieve text information extraction more quickly and accurately, and quickly obtain attribute words and their emotions.

In order to achieve the foregoing objective, according to the first aspect of the present disclosure, a method for extracting text information is provided, the method including:

Determining the target text matrix corresponding to the text to be processed, where the text matrix corresponding to a text includes the vectorized representation corresponding to each word segment in the text;

Inputting the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, where the information extraction result includes the label information of each word segment in the text to be processed, the label information is used to indicate the word segment type, the word segment type includes an attribute class, and, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate the emotion type of the word segment;

Determining the attribute words in the text to be processed and the emotion type of each attribute word according to the first word segments, whose word segment type is the attribute class, in the text to be processed and the emotion types of the first word segments, where each attribute word is composed of at least one first word segment.

Optionally, the word segment type further includes an emotion class;

The method further includes:

Determining the emotion words in the text to be processed according to the second word segments, whose word segment type is the emotion class, in the text to be processed, where each emotion word is composed of at least one second word segment;

Combining each attribute word with each emotion word to obtain attribute emotion word pairs, where each attribute emotion word pair includes one attribute word and one emotion word;

Inputting the attribute emotion word pairs and the position information of the attribute emotion word pairs into an association model to obtain an association result output by the association model, where the association result is used to indicate whether the attribute word and the emotion word in each attribute emotion word pair are related, and the position information of an attribute emotion word pair is used to indicate the positional relationship, in the text to be processed, between the attribute word and the emotion word in the pair;

Determining, according to the association result, the target attribute emotion word pairs in which the attribute word and the emotion word are related.
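As an illustration only, the pairing step can be sketched as a Cartesian product of attribute words and emotion words. The disclosure does not fix a format for the position information at this point, so the signed token offset used below, the tuple layout, and the name `build_pairs` are all assumptions made for this sketch. The association model itself is not shown.

```python
from itertools import product

def build_pairs(attribute_words, emotion_words):
    """Pair every attribute word with every emotion word.

    Each word is a (text, start_index) tuple, where start_index is the
    position of its first word segment in the text to be processed.
    The "offset" field is a hypothetical stand-in for the position
    information: emotion start index minus attribute start index.
    """
    pairs = []
    for (attr, attr_pos), (emo, emo_pos) in product(attribute_words, emotion_words):
        pairs.append({"attribute": attr, "emotion": emo, "offset": emo_pos - attr_pos})
    return pairs

# Hypothetical extraction output for a short review.
attribute_words = [("appearance", 1), ("engine", 4)]
emotion_words = [("ugly", 2), ("strong", 5)]

pairs = build_pairs(attribute_words, emotion_words)
for pair in pairs:
    print(pair)
```

Every candidate pair, including unrelated ones such as ("engine", "ugly"), is produced here; deciding which pairs are actually related is the job of the association model.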

Optionally, the association model is obtained in the following manner:

Training a deep neural network model, using the historical attribute emotion word pairs corresponding to a first historical text and the position information of the historical attribute emotion word pairs as input data, and the historical association result of each historical attribute emotion word pair in the first historical text as output data, to obtain the association model.

Optionally, the determining the target text matrix corresponding to the text to be processed includes:

Performing word segmentation on the text to be processed, and determining the word vector and part-of-speech vector corresponding to each word segment in the text to be processed;

Concatenating the word vector and the part-of-speech vector of each word segment to obtain the vectorized representation corresponding to each word segment;

Determining the target text matrix according to the vectorized representation corresponding to each word segment, where the vectorized representation of each word segment in the text to be processed corresponds to a row in the target text matrix.

Optionally, the text information extraction model is obtained in the following manner:

Training a deep neural network model, using the historical text matrix corresponding to a second historical text as input data and the historical label information corresponding to each word segment in the second historical text as output data, to obtain the text information extraction model.

Optionally, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate whether the word segment is in the first position in a preset word;

Determining the attribute words in the text to be processed includes:

Determining the attribute words in the text to be processed according to the label information of each first word segment and its position in the text to be processed, where a first word segment that is in the first position in a preset word is in the first position in the attribute word in which it is located.

Optionally, determining the sentiment type of the attribute word includes:

The emotion type corresponding to the first of the first word segments constituting the attribute word is determined as the emotion type corresponding to the attribute word.

According to a second aspect of the present disclosure, there is provided a text information extraction device, the device including:

The first determining module is used to determine the target text matrix corresponding to the text to be processed, where the text matrix corresponding to a text includes the vectorized representation corresponding to each word segment in the text;

The first processing module is configured to input the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, where the information extraction result includes the label information of each word segment in the text to be processed, the label information is used to indicate the word segment type, the word segment type includes an attribute class, and, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate the emotion type of the word segment;

The second determining module is configured to determine the attribute words in the text to be processed and the emotion type of each attribute word according to the first word segments, whose word segment type is the attribute class, in the text to be processed and the emotion types of the first word segments, where each attribute word is composed of at least one first word segment.

Optionally, the word segment type further includes an emotion class;

The device further includes:

The third determining module is configured to determine the emotion words in the text to be processed according to the second word segments, whose word segment type is the emotion class, in the text to be processed, where each emotion word is composed of at least one second word segment;

The fourth determining module is used to combine each attribute word with each emotion word to obtain attribute emotion word pairs, where each attribute emotion word pair includes one attribute word and one emotion word;

The second processing module is configured to input the attribute emotion word pairs and the position information of the attribute emotion word pairs into an association model to obtain an association result output by the association model, where the association result is used to indicate whether the attribute word and the emotion word in each attribute emotion word pair are related, and the position information of an attribute emotion word pair is used to indicate the positional relationship, in the text to be processed, between the attribute word and the emotion word in the pair;

The fifth determining module is used to determine, according to the association result, the target attribute emotion word pairs in which the attribute word and the emotion word are related.

Optionally, the association model is obtained in the following manner:

Optionally, the first determining module includes:

The first determining sub-module is configured to perform word segmentation on the text to be processed, and determine the word vector and part-of-speech vector corresponding to each word segment in the text to be processed;

The processing sub-module is configured to concatenate the word vector and the part-of-speech vector of each word segment to obtain the vectorized representation corresponding to each word segment;

The second determining sub-module is configured to determine the target text matrix according to the vectorized representation corresponding to each word segment, where the vectorized representation of each word segment in the text to be processed corresponds to a row in the target text matrix.

The second determining module includes:

The attribute word determination sub-module is used to determine the attribute words in the text to be processed according to the label information of each first word segment and its position in the text to be processed, where a first word segment that is in the first position in a preset word is in the first position in the attribute word in which it is located.

Optionally, the second determining module includes:

The emotion type determination sub-module is used to determine the emotion type corresponding to the first of the first word segments constituting the attribute word as the emotion type corresponding to the attribute word.

According to a third aspect of the present disclosure, there is provided a storage medium having a program stored thereon, and when the program is executed by a processor, the steps of the method described in the first aspect of the present disclosure are implemented.

According to a fourth aspect of the present disclosure, there is provided a device including:

At least one processor, and at least one memory and bus connected to the processor;

Wherein, the processor and the memory complete mutual communication through the bus;

The processor is configured to call program instructions in the memory to execute the steps of the method described in the first aspect of the present disclosure.

According to the fifth aspect of the present disclosure, there is provided a computer program product, which when executed on a data processing device, is adapted to perform the steps of the method described in the first aspect of the present disclosure.

Through the above technical solution, the target text matrix corresponding to the text to be processed is determined, and the target text matrix is input to the text information extraction model to obtain the information extraction result output by the text information extraction model. The information extraction result includes the label information of each word segment in the text to be processed. After that, the attribute words in the text to be processed and the emotion type of each attribute word are determined according to the first word segments, whose word segment type is the attribute class, in the text to be processed and the emotion types of the first word segments. In this way, according to the information extraction result of the text information extraction model, the word segments belonging to the attribute class in the text to be processed and the emotion types of these word segments can be obtained at the same time, thereby obtaining the attribute words in the text to be processed and the emotion type of each attribute word, which is highly efficient and can guarantee accuracy.

Other features and advantages of the present disclosure will be described in detail in the following specific embodiments.

Description of the drawings

The accompanying drawings are used to provide a further understanding of the present disclosure and constitute a part of the specification. Together with the following specific embodiments, they are used to explain the present disclosure, but do not constitute a limitation to the present disclosure. In the drawings:

Fig. 1 is a flowchart of a method for extracting text information according to an embodiment of the present disclosure;

Figs. 2A and 2B are exemplary schematic diagrams of label information in the text information extraction method provided according to the present disclosure;

FIG. 3 is a flowchart of a method for extracting text information according to another embodiment of the present disclosure;

Fig. 4 is a block diagram of a text information extraction device provided according to an embodiment of the present disclosure;

Fig. 5 is a block diagram of a device provided according to an embodiment of the present disclosure.

Detailed description

The specific embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are only used to illustrate and explain the present disclosure, and are not used to limit the present disclosure.

Fig. 1 is a flowchart of a method for extracting text information according to an embodiment of the present disclosure. As shown in Figure 1, the method may include the following steps.

In step 11, the target text matrix corresponding to the text to be processed is determined.

The text matrix corresponding to the text includes the vectorized representation corresponding to each word segmentation in the text. For a piece of text, first perform word segmentation processing on the text, and then obtain the vectorized representation corresponding to each word segmentation of the text, where the vectorized representation corresponding to the word segmentation can reflect the characteristics of the word segmentation itself and the part-of-speech characteristics of the word segmentation. Correspondingly, the target text matrix corresponding to the text to be processed includes the vectorized representation corresponding to each word segmentation in the text to be processed.

In a possible implementation manner, step 11 may include the following steps:

Perform word segmentation processing on the text to be processed, and determine the word vector and part-of-speech vector corresponding to each word segmentation in the text to be processed;

Join the word vector and part of speech vector of each word segmentation to obtain the vectorized representation corresponding to each word segmentation;

According to the vectorized representation corresponding to each word segmentation, the target text matrix is determined.

Here, the word vector maps a word into a vector space, and the similarity between word vectors reflects the similarity between the corresponding words. The part-of-speech vector reflects the part-of-speech characteristics of a word, that is, the part-of-speech vector can be used to determine the part of speech of the word. The part-of-speech vector can be represented by a random vector of a certain dimension. For example, if there are 30 parts of speech A1 to A30, they can be represented in turn by the vectors a1 to a30, where the dimension of a1 to a30 is a specified fixed value (for example, 20) and each component can be a randomly generated decimal close to 0.

In this method, word segmentation is performed in advance on the text in a relevant corpus (for example, the corpus to which the text to be processed belongs), and a word vector model (for example, Word2vec, GloVe, or ELMo) is trained to obtain the word vector corresponding to each word. For example, words can be mapped to a 100-dimensional vector space, that is, the word vector corresponding to each word is a 100-dimensional vector.

When processing the text to be processed, word segmentation is first performed on the text to be processed to obtain the word segmentation result. Then, according to the word segmentation result and the pre-trained word vector corresponding to each word, the word vector corresponding to each word segment in the text to be processed is determined, and the part-of-speech vector is determined according to the word segmentation result. After the word vector and the part-of-speech vector corresponding to each word segment in the text to be processed are obtained, the word vector and the part-of-speech vector of each word segment are concatenated to obtain the vectorized representation corresponding to that word segment. For example, if the word vector of a word segment is a 100-dimensional vector [B1, B2, B3,..., B100] and its part-of-speech vector is a 20-dimensional vector [C1, C2, C3,..., C20], then the vectorized representation corresponding to the word segment can be the 120-dimensional vector [B1, B2, B3,..., B100, C1, C2, C3,..., C20].
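The concatenation described above can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation: the function names, the part-of-speech tags "n" and "a", the value range of the random components, and the constant placeholder word vectors (standing in for vectors pre-trained with a model such as Word2vec) are all assumptions.

```python
import random

WORD_DIM = 100  # word-vector size from the example in the text
POS_DIM = 20    # part-of-speech vector size from the example in the text

def make_pos_table(pos_tags, dim=POS_DIM, seed=0):
    """Give each part of speech a fixed random vector with components near 0."""
    rng = random.Random(seed)
    return {tag: [rng.uniform(-0.05, 0.05) for _ in range(dim)] for tag in pos_tags}

def vectorize_segment(word, pos, word_vectors, pos_table):
    """Concatenate the word vector and the part-of-speech vector of one word segment."""
    return word_vectors[word] + pos_table[pos]

# Hypothetical placeholders for pre-trained word vectors.
word_vectors = {"engine": [0.1] * WORD_DIM, "powerful": [0.2] * WORD_DIM}
pos_table = make_pos_table(["n", "a"])

vec = vectorize_segment("engine", "n", word_vectors, pos_table)
print(len(vec))  # 120
```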

According to the vectorized representation corresponding to each word segment in the text to be processed, the target text matrix is determined, where the vectorized representation of each word segment in the text to be processed corresponds to a row in the target text matrix. In addition, the order in which the vectorized representations of the word segments appear in the target text matrix is consistent with the order in which the word segments appear in the text to be processed. For example, if the word segments appear in the text to be processed in the order word segment 1, word segment 2, word segment 3, then in the target text matrix they correspond to row k, row k+1, and row k+2 respectively, where k is a positive integer, such as 1.

In a possible embodiment, the target text matrix may be formed by direct combination of the vectorized representations corresponding to each word segmentation in the text to be processed. For example, if the text to be processed has a total of 200 word segments, and the vectorized representation corresponding to each word segmentation is a 120-dimensional vector, the target text matrix is a 200*120 matrix.

In another possible embodiment, after the vectorized representations corresponding to the word segments in the text to be processed are combined into a matrix, the matrix can be expanded as appropriate (for example, horizontally and/or vertically) to form the target text matrix, where the expanded part is filled with zeros. For example, if the text to be processed has a total of 200 word segments and the vectorized representation corresponding to each word segment is a 120-dimensional vector, a 200*120 matrix is obtained after the combination, and it is expanded to a 200*200 matrix as the target text matrix. In this way, even if text lengths differ, the resulting target text matrices have the same format, which keeps the form of the target text matrix consistent and facilitates subsequent data processing.
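The two ways of forming the target text matrix (direct stacking, or stacking followed by zero padding) can be sketched as follows. The function name `build_text_matrix` and its parameters are hypothetical, and plain Python lists are used in place of a numerical library to keep the sketch self-contained.

```python
def build_text_matrix(segment_vectors, n_rows=None, n_cols=None):
    """Stack per-segment vectors as rows, then zero-pad to n_rows x n_cols.

    Rows keep the order in which the word segments appear in the text.
    With n_rows and n_cols left as None, no padding is applied.
    """
    if not segment_vectors:
        raise ValueError("at least one word segment is required")
    width = len(segment_vectors[0])
    n_rows = len(segment_vectors) if n_rows is None else n_rows
    n_cols = width if n_cols is None else n_cols
    if len(segment_vectors) > n_rows or width > n_cols:
        raise ValueError("target shape is smaller than the data")
    # Horizontal expansion: pad each row on the right with zeros.
    matrix = [list(vec) + [0.0] * (n_cols - width) for vec in segment_vectors]
    # Vertical expansion: append all-zero rows at the bottom.
    matrix += [[0.0] * n_cols for _ in range(n_rows - len(segment_vectors))]
    return matrix

# Three word segments with 4-dimensional vectors, padded to 5 x 6.
vectors = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
direct = build_text_matrix(vectors)
padded = build_text_matrix(vectors, n_rows=5, n_cols=6)
print(len(direct), len(direct[0]))  # 3 4
print(len(padded), len(padded[0]))  # 5 6
```

Padding every text to one fixed shape is what allows texts of different lengths to be fed to the same model input.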

Using the above method, after the word segmentation of the text to be processed, the word segmentation feature and the part-of-speech feature of the word segmentation are extracted to obtain the vectorized expression of each word segmentation and form a matrix, which can provide effective data support for subsequent data processing.

In step 12, the target text matrix is input to the text information extraction model to obtain the information extraction result output by the text information extraction model.

Here, the information extraction result includes the label information of each word segment in the text to be processed. The label information can be used to indicate the word segment type. The word segment type can include an attribute class, an emotion class, and another class covering everything other than the attribute class and the emotion class. As mentioned above, a word segment belonging to the attribute class describes a function or performance, and a word segment belonging to the emotion class describes the emotion toward a word segment belonging to the attribute class.

Optionally, if the word segmentation type of the word segmentation is an attribute type, the tag information of the word segmentation may also indicate the emotion type of the word segmentation. In other words, for attribute word segmentation, the label information can reflect the word segmentation type and the emotion type of the word segmentation at the same time. Among them, the word segmentation type and the emotion type can be distinguished by identification information such as keywords. For example, emotion types can be divided into three types: positive, neutral, and negative.

Optionally, if the word segment type of a word segment is the attribute class or the emotion class, the label information of the word segment can also be used to indicate whether the word segment is in the first position in a preset word. Here, a preset word is a word or phrase belonging to the attribute class or the emotion class. For example, if the preset word is "engine power" in the attribute class, the label information of "engine" indicates that it is in the first position in the preset word, and the label information of "power" indicates that it is not in the first position in the preset word. For another example, if the preset word is "very ugly" in the emotion class, the label information of "very" indicates that it is in the first position in the preset word, and the label information of "ugly" indicates that it is not in the first position in the preset word.

For example, FIG. 2A is an example of label information for each word segment in a text, where the attribute class corresponds to Attr, the emotion class corresponds to Opin, the other class corresponds to O, the first position in a preset word corresponds to B, a non-first position in a preset word corresponds to I, positive emotion corresponds to Pos, neutral emotion corresponds to Neu, and negative emotion corresponds to Neg. As shown in FIG. 2A, the label information of the word segment "engine" is B_Attr_Pos: "engine" belongs to the attribute class, is in the first position in the preset word, and has a positive emotion type. For another example, the label information of each word segment in the text may also be as shown in FIG. 2B, where the label information corresponding to each word segment has the same meaning as in FIG. 2A and only the form of the label information differs. It should be noted that the label information in FIG. 2A and FIG. 2B is only an example. The label information in this method is not limited to the above forms, as long as the different cases can be distinguished. Other possible examples will not be repeated here.
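A label in the FIG. 2A style can be decoded mechanically, as the following sketch suggests. Only the label B_Attr_Pos is confirmed by the text; the assumptions that emotion-class labels take the two-part form B_Opin or I_Opin and that the other class is the bare label O are made here for illustration, as are the names `Tag` and `parse_tag`.

```python
from typing import NamedTuple, Optional

class Tag(NamedTuple):
    seg_type: str              # "Attr", "Opin", or "O"
    is_first: Optional[bool]   # True for B, False for I, None for "O"
    emotion: Optional[str]     # "Pos", "Neu", "Neg", or None

def parse_tag(label: str) -> Tag:
    """Split a label such as 'B_Attr_Pos' into its three parts."""
    if label == "O":
        return Tag("O", None, None)
    parts = label.split("_")
    if len(parts) not in (2, 3) or parts[0] not in ("B", "I") or parts[1] not in ("Attr", "Opin"):
        raise ValueError(f"unrecognized label: {label!r}")
    emotion = parts[2] if len(parts) == 3 else None
    if emotion is not None and emotion not in ("Pos", "Neu", "Neg"):
        raise ValueError(f"unrecognized emotion in label: {label!r}")
    return Tag(parts[1], parts[0] == "B", emotion)

print(parse_tag("B_Attr_Pos"))  # Tag(seg_type='Attr', is_first=True, emotion='Pos')
print(parse_tag("I_Opin"))      # Tag(seg_type='Opin', is_first=False, emotion=None)
```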

Through the tag information of the word segmentation, it can be determined what kind of word segmentation the word segment is, for example, it belongs to the attribute type or emotion type or other types, and if it belongs to the attribute type, what is its emotion type.

In a possible implementation, the text information extraction model can be obtained in the following manner:

Using the historical text matrix corresponding to the second historical text as input data, and the historical label information corresponding to each word segmentation in the second historical text as output data, the deep neural network model is trained to obtain a text information extraction model.

The second historical text can be taken from a corpus related to the text to be processed. The method of obtaining the historical text matrix corresponding to the second historical text has the same principle as the method of obtaining the target text matrix, which has been described in the foregoing and will not be repeated here. The historical label information corresponding to each word segmentation in the second historical text can be manually labeled. The label information is also described in the previous section, and the description will not be repeated here.

Thus, the historical text matrix corresponding to the second historical text is used as input data, and the historical label information corresponding to each word segment in the second historical text is used as output data, and the deep neural network model is trained to obtain the text information extraction model. For example, during model training, the deep neural network model is trained with a learning framework such as TensorFlow, MXNet, or PyTorch, one or more encoders (for example, LSTM, Transformer, or BERT) are used for encoding, and a decoder (for example, CRF) decodes at each word segment position to extract the label information corresponding to that position. It should be noted that the method for training the deep neural network model belongs to the prior art and is well known to those skilled in the art, and will not be repeated here.

In the above manner, model training is performed based on the existing data to obtain the text information extraction model. In actual application, the corresponding data is directly input into the text information extraction model to obtain the information extraction results output by the text information extraction model. The application is simple and convenient.

In step 13, the attribute words in the text to be processed and the emotion types of each attribute word are determined according to the first word segmentation whose word segmentation type is the attribute type in the text to be processed and its sentiment type.

Here, each attribute word is composed of at least one first word segment. If an attribute word consists of one first word segment, the attribute word is that first word segment. If an attribute word is composed of more than one first word segment, the attribute word is a compound word formed by these first word segments, and the first word segments that constitute the attribute word occupy consecutive positions in the text to be processed.

In a possible implementation manner, determining the attribute words in the text to be processed in step 13 may include the following steps:

According to the label information of each first word segmentation and its position in the text to be processed, the attribute words in the text to be processed are determined.

Here, a first word segment that is in the first position in a preset word remains in the first position in the attribute word in which it is located.

If the label information of a first word segment indicates that it is in the first position in a preset word, this first word segment (hereinafter referred to as the "start word") can be used as the first word segment of an attribute word, and the remainder of the attribute word can then be determined.

In one case, if the word segment to the right of the start word in the text to be processed does not belong to the attribute class, the start word alone is determined as an attribute word. For example, suppose the label information corresponding to a piece of text {e1, e2, e3} is {O, Begin1, O} in order, where O indicates that the word segment type is the other class, and Begin1 indicates that the word segment type is the attribute class and the word segment is in the first position in a preset word. The start word is then e2, and because the word segment e3 to the right of the start word e2 does not belong to the attribute class, the word segment e2 can be directly determined as an attribute word.

In another case, if the word segment to the right of the start word (hereinafter referred to as the "right neighbor") in the text to be processed belongs to the attribute class, and the label information of the right neighbor indicates that it is not in the first position in a preset word, then, taking the right neighbor as the starting point, a search is made for consecutive word segments that belong to the attribute class and are not in the first position in a preset word, and the search result is used as the remaining part of the attribute word. For example, suppose the label information corresponding to a piece of text {e4, e5, e6, e7, e8} is {O, Begin1, Inside1, Inside1, O}, where O indicates that the word segment type is the other class, Begin1 indicates that the word segment type is the attribute class and the word segment is in the first position in a preset word, and Inside1 indicates that the word segment type is the attribute class and the word segment is not in the first position in a preset word. The start word is then e5, and its right neighbor e6 belongs to the attribute class and is not in the first position in a preset word. Taking e6 as the starting point, the search for consecutive word segments that belong to the attribute class and are not in the first position in a preset word yields e6e7, so the remaining part of the attribute word is e6e7 and the attribute word is finally determined to be e5e6e7.

By referring to the above method, all the attribute words in the text to be processed can be determined.

In the above manner, using the label information of the first word segmentation and the position of the first word segmentation in the text to be processed, each attribute word in the text to be processed can be quickly determined, and the efficiency is high.
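The two cases above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the disclosure: the function names are hypothetical, the tag strings are taken from the examples above, and joining word segments without a separator is an assumption that matches the "e5e6e7" notation.

```python
from typing import List, Tuple

# Tag strings taken from the examples above; treating them as plain strings
# is an assumption made for this sketch.
BEGIN_TAG = "Begin1"    # attribute class, first position in the preset word
INSIDE_TAG = "Inside1"  # attribute class, not in the first position


def extract_attribute_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, one per attribute word."""
    spans = []
    i = 0
    while i < len(tags):
        if tags[i] == BEGIN_TAG:
            # The start word opens an attribute word; extend it over the
            # consecutive right neighbors tagged as "inside".
            end = i + 1
            while end < len(tags) and tags[end] == INSIDE_TAG:
                end += 1
            spans.append((i, end))
            i = end
        else:
            i += 1
    return spans


def extract_attribute_words(segments: List[str], tags: List[str]) -> List[str]:
    """Join the word segments of each span into one attribute word."""
    return ["".join(segments[s:e]) for s, e in extract_attribute_spans(tags)]


print(extract_attribute_words(["e1", "e2", "e3"], ["O", "Begin1", "O"]))
print(extract_attribute_words(
    ["e4", "e5", "e6", "e7", "e8"],
    ["O", "Begin1", "Inside1", "Inside1", "O"],
))
```

A single left-to-right pass is enough here because every attribute word begins at a start word and extends only to the right.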

In a possible implementation manner, determining the sentiment type of the attribute word in step 13 may include the following steps:

The emotion type corresponding to the first of the first word segments constituting the attribute word is determined as the emotion type corresponding to the attribute word.

Generally speaking, all the first word segments constituting an attribute word correspond to the same emotion type. Therefore, the emotion type corresponding to any one of them, for example the first one, can be directly determined as the emotion type of the attribute word. If the emotion types differ, the same approach can still be used: the emotion type corresponding to the first of the first word segments is directly determined as the emotion type corresponding to the attribute word.
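The rule can be illustrated with a minimal Python sketch. The function name and the emotion labels ("positive", "neutral", "negative") are assumptions made for illustration; the only behavior taken from the text is that the first word segment decides the emotion type, even when the segments disagree.

```python
from typing import Sequence


def emotion_of_attribute_word(segment_emotions: Sequence[str]) -> str:
    """Return the emotion type of an attribute word.

    `segment_emotions` holds the emotion type of each first word segment
    constituting the attribute word, in text order (hypothetical input format).
    """
    if not segment_emotions:
        raise ValueError("an attribute word has at least one word segment")
    # The first word segment decides, whether or not the others agree.
    return segment_emotions[0]


print(emotion_of_attribute_word(["negative", "negative", "negative"]))
print(emotion_of_attribute_word(["positive", "negative"]))
```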

Through the above solution, the target text matrix corresponding to the text to be processed is determined and input into the text information extraction model to obtain the information extraction result output by the model, where the information extraction result includes the label information of each word segment in the text to be processed. Then, according to the first word segments, whose word segmentation type is the attribute class, and their emotion types, the attribute words in the text to be processed and the emotion type of each attribute word are determined. In this way, the information extraction result of the text information extraction model yields, at the same time, the word segments belonging to the attribute class and their emotion types, from which the attribute words and the emotion type of each attribute word are obtained. This is efficient and can ensure accuracy.

Fig. 3 is a flowchart of a method for extracting text information according to another embodiment of the present disclosure. As shown in FIG. 3, based on the steps shown in FIG. 1, the method provided by the present disclosure may further include the following steps.

In step 31, the emotion words in the text to be processed are determined according to the second word segments in the text to be processed, that is, the word segments whose word segmentation type is the emotion class.

Each emotion word is composed of at least one second word segment. If an emotion word consists of a single second word segment, the emotion word is that second word segment. If an emotion word is composed of more than one second word segment, the emotion word is the compound word formed by these second word segments, and the second word segments constituting it occupy consecutive positions in the text to be processed.

In a possible implementation manner, the emotion words in the text to be processed are determined according to the label information of each second word segment and its position in the text to be processed. The emotion words are determined on the same principle as the attribute words, which has been described above and is not repeated here.

In step 32, each attribute word is combined with each emotion word to obtain an attribute emotion word pair.

Each attribute emotion word pair contains one attribute word and one emotion word.

For example, if the text to be processed contains attribute words {m1, m2, m3, m4} and emotion words {n1, n2}, combining them yields 8 attribute emotion word pairs, namely: m1-n1, m1-n2, m2-n1, m2-n2, m3-n1, m3-n2, m4-n1, m4-n2.
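The combination in step 32 is a Cartesian product of the two word sets. A minimal Python sketch follows; the function name is hypothetical.

```python
from itertools import product
from typing import List, Tuple


def combine_pairs(attribute_words: List[str],
                  emotion_words: List[str]) -> List[Tuple[str, str]]:
    """Pair every attribute word with every emotion word."""
    return list(product(attribute_words, emotion_words))


pairs = combine_pairs(["m1", "m2", "m3", "m4"], ["n1", "n2"])
print(len(pairs))
print(["-".join(p) for p in pairs])
```

Because every attribute word is paired with every emotion word, the number of candidate pairs is the product of the two set sizes; the association model described in the following steps is what filters these candidates.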

In step 33, each attribute emotion word pair and the position information of that attribute emotion word pair are input into the association model to obtain the association result output by the association model.

The position information of an attribute emotion word pair indicates the positional relationship, in the text to be processed, between the attribute word and the emotion word of that pair. Examples include the positions of the attribute word and the emotion word in the text to be processed, the distance between them, and whether they are in the same sentence.
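One way such position information might be represented is sketched below in Python. Everything here is an assumption made for illustration: the `WordSpan` type, its fields, the use of word-segment indices as positions, and the choice of features; the text only names position, distance, and same-sentence membership as examples.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class WordSpan:
    """A word located in the text (hypothetical structure)."""
    text: str
    start: int     # index of its first word segment, inclusive
    end: int       # index after its last word segment, exclusive
    sentence: int  # index of the sentence containing the word


def position_features(attr: WordSpan, emo: WordSpan) -> Dict[str, Union[int, bool]]:
    """Build position information for one attribute emotion word pair."""
    if emo.start >= attr.end:
        distance = emo.start - attr.end   # emotion word follows the attribute word
    elif attr.start >= emo.end:
        distance = attr.start - emo.end   # emotion word precedes the attribute word
    else:
        distance = 0                      # spans overlap
    return {
        "attribute_start": attr.start,
        "emotion_start": emo.start,
        "distance": distance,             # number of word segments in between
        "same_sentence": attr.sentence == emo.sentence,
    }


attr = WordSpan("engine power", start=3, end=5, sentence=0)
emo = WordSpan("strong", start=6, end=7, sentence=0)
print(position_features(attr, emo))
```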

The association result is used to indicate whether the attribute word and the emotion word in each attribute emotion word pair are related. An attribute word and an emotion word are related when the object described by the emotion word is the attribute word. Whether they are related can be reflected by their positions. For example, related attribute words and emotion words are generally located in the same sentence or close to each other.

In a possible implementation, the association model can be obtained in the following manner:

The historical attribute emotion word pairs corresponding to the first historical text and the position information of those pairs are used as input data, the historical association result of each historical attribute emotion word pair in the first historical text is used as output data, and a model is trained to obtain the association model.

The first historical text can be taken from a corpus related to the text to be processed, and may be the same as the second historical text mentioned above. The historical attribute emotion word pairs corresponding to the first historical text are obtained in the same way as the attribute emotion word pairs in step 32 (including the related steps for obtaining attribute words, emotion words, and so on), which has been described above and is not repeated here. The historical association result of each historical attribute emotion word pair in the first historical text, that is, whether the pair is related, can be labeled manually. For example, if the first historical text is "The engine power of A car is strong, but the appearance is ugly", the attribute words are "engine power" and "appearance" and the emotion words are "strong" and "ugly", giving a total of 4 historical attribute emotion word pairs, namely {engine power-strong, engine power-ugly, appearance-strong, appearance-ugly}. In manual labeling, "engine power-strong" and "appearance-ugly" are marked as related, and "engine power-ugly" and "appearance-strong" are marked as unrelated.

Therefore, with the historical attribute emotion word pairs corresponding to the first historical text and the position information of those pairs as input data, and the historical association result of each historical attribute emotion word pair in the first historical text as output data, the deep neural network model is trained to obtain the association model. For example, during model training, the deep neural network model is trained based on learning methods such as RandomForest, LSTM-attention, and Recurrent Attention.

In the above manner, model training is performed based on the existing data to obtain the association model. In actual application, the corresponding data is directly input into the association model to obtain the association result output by the association model, and the application is simple and convenient.

In step 34, the target attribute emotional word pair is determined according to the association result.

A target attribute emotion word pair is an attribute emotion word pair in the text to be processed whose attribute word and emotion word are related.

After the association result is obtained, the target attribute emotion word pairs, whose attribute word and emotion word are related, can be selected according to it for the user to view or use.
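Step 34 amounts to filtering the candidate pairs by the association result. The Python sketch below assumes, purely for illustration, that the association result is a list of booleans aligned with the list of pairs; the disclosure does not specify the output format of the association model, and the function name is hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (attribute word, emotion word)


def select_target_pairs(pairs: List[Pair], related: List[bool]) -> List[Pair]:
    """Keep only the pairs that the association result marks as related."""
    if len(pairs) != len(related):
        raise ValueError("one association result is expected per pair")
    return [pair for pair, is_related in zip(pairs, related) if is_related]


# Labels follow the manually labeled example given earlier in the text.
pairs = [
    ("engine power", "strong"),
    ("engine power", "ugly"),
    ("appearance", "strong"),
    ("appearance", "ugly"),
]
related = [True, False, False, True]
print(select_target_pairs(pairs, related))
```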

Using the above method, after the attribute words in the text to be processed and their emotion types are determined, the emotion words related to each attribute word can also be extracted from the text to be processed. The information extraction function is thus more complete, and the data is convenient for users to view and use.

Fig. 4 is a block diagram of a text information extraction device provided according to an embodiment of the present disclosure. As shown in Fig. 4, the device 40 includes:

The first determining module 41 is configured to determine a target text matrix corresponding to the text to be processed, where the text matrix corresponding to the text includes a vectorized representation corresponding to each word segmentation in the text;

The first processing module 42 is configured to input the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, where the information extraction result includes the label information of each word segment in the text to be processed, the label information is used to indicate the word segmentation type, the word segmentation type includes an attribute class, and, if the word segmentation type of a word segment is the attribute class, the label information of that word segment is also used to indicate the emotion type of the word segment;

The second determining module 43 is configured to determine the attribute words in the text to be processed and the emotion type of each attribute word according to the first participle of the attribute type in the text to be processed and the emotion type of the first participle, wherein each attribute word is composed of at least one of the first participles.

Optionally, the word segmentation type further includes sentiment;

The device 40 also includes:

Optionally, the association model is obtained in the following manner:

Optionally, the first determining module 41 includes:

The second determining module 43 includes:

Optionally, the second determining module 43 includes:

Regarding the device in the foregoing embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment of the method, and detailed description will not be given here.

The text information extraction device includes a processor and a memory. The above-mentioned first determining module, first processing module, second determining module, third determining module, fourth determining module, second processing module, fifth determining module, and so on are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.

The processor contains a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting the kernel parameters, text information can be extracted more quickly and accurately, and the attribute words and their emotion types can be obtained quickly.

The embodiment of the present invention provides a storage medium on which a program is stored, and when the program is executed by a processor, the text information extraction method is implemented.

The embodiment of the present invention provides a processor configured to run a program, wherein the method for extracting text information is executed when the program is running.

An embodiment of the present invention provides a device. As shown in FIG. 5, the device 70 includes at least one processor 701, and at least one memory 702 and a bus 703 connected to the processor 701, wherein the processor 701 and the memory 702 communicate with each other through the bus 703, and the processor 701 is configured to call program instructions in the memory 702 to execute the above-mentioned text information extraction method. The device herein can be a server, a PC, etc.

This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:

Determine the attribute words in the to-be-processed text and the emotion type of each attribute word according to the first participle of the attribute type in the text to be processed and the emotion type thereof, where each attribute word is composed of at least one of the first participles.

Optionally, the word segmentation type further includes sentiment;

The method also includes:

Input each attribute emotion word pair and the position information of the attribute emotion word pair into the association model to obtain the association result output by the association model, where the association result is used to indicate whether the attribute word and the emotion word in each attribute emotion word pair are related, and the position information of the attribute emotion word pair is used to indicate the positional relationship, in the text to be processed, between the attribute word and the emotion word in the attribute emotion word pair;

Optionally, the association model is obtained in the following manner:

Determining the attribute words in the to-be-processed text includes:

Optionally, determining the sentiment type of the attribute word includes:

This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. It should be understood that each process and/or block in the flowchart and/or block diagram, and the combination of processes and/or blocks in the flowchart and/or block diagram, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device that realizes the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

In a typical configuration, the device includes one or more processors (CPUs), memory, and buses. The device may also include input/output interfaces, network interfaces, and so on.

The memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or equipment including a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, commodity, or equipment. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, commodity, or equipment that includes the element.

Those skilled in the art should understand that the embodiments of the present application can be provided as a method, a system, or a computer program product. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

The above are only examples of the application, and are not used to limit the application. For those skilled in the art, this application can have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in the scope of the claims of this application.

Claims

A method for extracting text information, characterized in that the method includes:

Determine the target text matrix corresponding to the text to be processed, where the text matrix corresponding to the text includes the vectorized representation corresponding to each word segmentation in the text;

The target text matrix is input to a text information extraction model to obtain an information extraction result output by the text information extraction model, wherein the information extraction result includes the label information of each word segment in the text to be processed, the label information is used to indicate the word segmentation type, the word segmentation type includes an attribute type, and if the word segmentation type of the word segmentation is the attribute type, the tag information of the word segmentation is also used to indicate the emotion type of the word segmentation;

Determine the attribute words in the to-be-processed text and the emotion type of each attribute word according to the first participle of the attribute class and the emotion type of the first participle in the to-be-processed text, wherein each of the attribute words is composed of at least one of the first participles.
The method according to claim 1, wherein the word segmentation type further includes emotion type;

The method also includes:

Determine the emotional word in the to-be-processed text according to the second word segmentation in the to-be-processed text whose word segmentation type is the emotion category, wherein each of the emotional words is composed of at least one of the second participles;

Combine each attribute word with each emotion word respectively to obtain attribute emotion word pairs, and each attribute emotion word pair includes one attribute word and one emotion word;

Input the attribute emotion word pair and the position information of the attribute emotion word pair into the association model to obtain the association result output by the association model, wherein the association result is used to indicate whether the attribute word and the emotion word in each attribute emotion word pair are related, and the position information of the attribute emotion word pair is used to indicate the position relationship between the attribute word in the attribute emotion word pair and the emotion word in the text to be processed;

According to the association result, a target attribute emotional word pair related to the attribute word and the emotional word is determined.
The method according to claim 2, wherein the association model is obtained in the following manner:

Using the historical attribute emotional word pair corresponding to the first historical text and the position information of the historical attribute emotional word pair as input data, and the historical association result of each historical attribute emotional word pair in the first historical text as output data, training the deep neural network model to obtain the association model.
The method according to claim 1, wherein said determining the target text matrix corresponding to the text to be processed comprises:

Perform word segmentation processing on the to-be-processed text, and determine the word vector and part-of-speech vector corresponding to each word segmentation in the to-be-processed text;

Splicing the word vector and the part-of-speech vector of each said word segmentation to obtain a vectorized representation corresponding to each said word segmentation;

The target text matrix is determined according to the vectorized representation corresponding to each word segmentation, wherein the vectorized representation of each word segmentation in the to-be-processed text corresponds to a row in the target text matrix.
The method according to claim 1, wherein the text information extraction model is obtained in the following manner:

The historical text matrix corresponding to the second historical text is used as input data, and the historical label information corresponding to each word segmentation in the second historical text is used as output data, and the deep neural network model is trained to obtain the text information extraction model.
The method according to claim 1, wherein if the word segmentation type of the word segmentation is the attribute class, the tag information of the word segmentation is also used to indicate whether the word segmentation is in the first position in the preset word;

Determining the attribute words in the to-be-processed text includes:

According to the label information of each of the first participles and its position in the to-be-processed text, determine the attribute words in the to-be-processed text, wherein a first participle that is in the first position in the preset word is in the first position in the attribute word in which it is located.
The method according to claim 1, wherein determining the sentiment type of the attribute word comprises:

The sentiment type corresponding to the first said first participle constituting the attribute word is determined as the sentiment type corresponding to the attribute word.
A text information extraction device, characterized in that the device comprises:

The first determining module is used to determine the target text matrix corresponding to the text to be processed, wherein the text matrix corresponding to the text includes the vectorized representation corresponding to each word segmentation in the text;

The first processing module is configured to input the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, wherein the information extraction result includes the label information of each word segment in the text to be processed, the tag information is used to indicate the word segmentation type, the word segmentation type includes an attribute class, and, if the word segmentation type of the word segmentation is the attribute class, the tag information of the word segmentation is also used to indicate the emotion type of the word segmentation;

The second determining module is configured to determine the attribute words in the to-be-processed text and the emotion type of each of the attribute words according to the first participle of the attribute type in the to-be-processed text and the emotion type of the first participle, wherein each of the attribute words is composed of at least one of the first participles.
The device according to claim 8, wherein the word segmentation type further includes sentiment;

The device also includes:

The third determining module is configured to determine the emotional words in the to-be-processed text according to the second participle, whose word segmentation type is the emotion category, in the to-be-processed text, wherein each of the emotional words is composed of at least one of the second participles;

The fourth determining module is used to combine each attribute word with each emotion word to obtain attribute emotion word pairs, and each attribute emotion word pair includes one attribute word and one emotion word;

The second processing module is configured to input the attribute emotion word pair and the position information of the attribute emotion word pair into the association model to obtain the association result output by the association model, wherein the association result is used to indicate whether the attribute word and the emotion word in each attribute emotion word pair are related, and the position information of the attribute emotion word pair is used to indicate the position relationship between the attribute word and the emotion word in the attribute emotion word pair in the text to be processed;

The fifth determining module is used to determine the target attribute emotional word pair related to the attribute word and the emotional word according to the association result.
The device according to claim 9, wherein the association model is obtained in the following manner:

Using the historical attribute emotional word pair corresponding to the first historical text and the position information of the historical attribute emotional word pair as input data, and the historical association result of each historical attribute emotional word pair in the first historical text as output data, training the deep neural network model to obtain the association model.
The device according to claim 8, wherein the first determining module comprises:

The first determining sub-module is configured to perform word segmentation processing on the to-be-processed text, and determine the word vector and part-of-speech vector corresponding to each word segmentation in the to-be-processed text;

A processing sub-module for splicing the word vector and part-of-speech vector of each said word segmentation to obtain the vectorized representation corresponding to each said word segmentation;

The second determining submodule is configured to determine the target text matrix according to the vectorized representation corresponding to each of the word segmentation, wherein the vectorized representation of each word segmentation in the to-be-processed text corresponds to a row in the target text matrix.
The device according to claim 8, wherein the text information extraction model is obtained in the following manner:

The historical text matrix corresponding to the second historical text is used as input data, and the historical label information corresponding to each word segmentation in the second historical text is used as output data, and the deep neural network model is trained to obtain the text information extraction model.
The device according to claim 8, wherein if the word segmentation type of the word segmentation is the attribute class, the tag information of the word segmentation is also used to indicate whether the word segmentation is in the first position in the preset word;

The second determining module includes:

The attribute word determination sub-module is used to determine the attribute words in the to-be-processed text according to the tag information of each of the first participles and its position in the to-be-processed text, wherein a first participle that is in the first position in the preset word is in the first position in the attribute word in which it is located.
The device according to claim 8, wherein the second determining module comprises:

The emotion type determination sub-module is used to determine the emotion type corresponding to the first of the first participles constituting the attribute word as the emotion type corresponding to the attribute word.
A storage medium having a program stored thereon, wherein the program is executed by a processor to implement the steps of the method according to any one of claims 1-7.
A device, characterized in that the device includes:

At least one processor, and at least one memory and bus connected to the processor;

Wherein, the processor and the memory complete mutual communication through the bus;

The processor is used to call program instructions in the memory to execute the steps of the method in any one of claims 1-7.
A computer program product, characterized in that, when executed on a data processing device, it is adapted to execute a program initialized with the steps of the method according to any one of claims 1-7.