WO2021063060A1 - Method and apparatus for extracting text information, storage medium, and device - Google Patents

Method and apparatus for extracting text information, storage medium, and device

Info

Publication number
WO2021063060A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
text
attribute
emotion
processed
Prior art date
Application number
PCT/CN2020/100483
Other languages
English (en)
Chinese (zh)
Inventor
戴泽辉
Original Assignee
北京国双科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京国双科技有限公司
Publication of WO2021063060A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a method, device, storage medium, and equipment for extracting text information.
  • a text (for example, a user review)
  • the words (or phrases) belonging to the emotion category, that is, the emotion vocabulary (or phrases), are used to describe emotions.
  • attribute vocabulary (or phrases) is used to describe the function or performance of the subject.
  • the emotion type of emotion vocabulary generally includes three types: positive, neutral, and negative. For example, if the text is "The appearance of A car is ugly", it can be known that the subject of the text is "A car", the vocabulary belonging to the attribute category is "appearance", and the vocabulary belonging to the emotion category is "ugly".
  • attribute sentiment can be obtained from it, that is, specific attributes and sentiment types can be extracted according to the content of the text.
  • attribute sentiment is generally obtained through a two-step method using a pipeline structure. First, the vocabulary (or phrases) belonging to the attribute category is extracted from the text by sequence labeling (for example, LSTM-CRF or BERT-CRF). Then, for each attribute vocabulary item (or phrase), the item and the sentence containing it are used as model training data, and a deep learning method (for example, LSTM-attention, BERT-CLS, ATAE, Recurrent Attention, or Transformation Network) is used for training to obtain a model that predicts the sentiment of a single attribute vocabulary item.
  • the purpose of the present disclosure is to provide a text information extraction method, device, storage medium, and equipment, which can achieve text information extraction more quickly and accurately, and quickly obtain attribute words and their emotions.
  • a method for extracting text information including:
  • the target text matrix is input to a text information extraction model to obtain an information extraction result output by the text information extraction model.
  • the information extraction result includes the label information of each word segment in the text to be processed, and the label information is used to indicate the word segmentation type;
  • the word segmentation type includes an attribute type, and if the word segmentation type of the word segmentation is the attribute type, the tag information of the word segmentation is also used to indicate the emotion type of the word segmentation;
  • Each of the attribute words is composed of at least one of the first participles.
  • the word segmentation type further includes sentiment
  • the method also includes:
  • combining each attribute word with each emotion word respectively to obtain attribute emotion word pairs, where each attribute emotion word pair includes one attribute word and one emotion word;
  • a target attribute emotional word pair related to the attribute word and the emotional word is determined.
  • association model is obtained in the following manner:
  • the determining the target text matrix corresponding to the text to be processed includes:
  • the target text matrix is determined according to the vectorized representation corresponding to each word segmentation, wherein the vectorized representation of each word segmentation in the to-be-processed text corresponds to a row in the target text matrix.
  • the text information extraction model is obtained in the following manner:
  • the historical text matrix corresponding to the second historical text is used as input data, the historical label information corresponding to each word segment in the second historical text is used as output data, and the deep neural network model is trained to obtain the text information extraction model.
  • the tag information of the word segmentation is also used to indicate whether the word segmentation is in the first position in the preset word;
  • Determining the attribute words in the to-be-processed text includes:
  • according to the label information of each first participle and its position in the to-be-processed text, determining the attribute words in the to-be-processed text, where a first participle that is in the first position in the preset word is in the first position in the attribute word in which it is located.
  • determining the sentiment type of the attribute word includes:
  • the sentiment type corresponding to the first of the first participles constituting the attribute word is determined as the sentiment type corresponding to the attribute word.
  • a text information extraction device including:
  • the first determining module is used to determine the target text matrix corresponding to the text to be processed, wherein the text matrix corresponding to the text includes the vectorized representation corresponding to each word segmentation in the text;
  • the first processing module is configured to input the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, where the information extraction result includes the label information of each word segment in the text to be processed, the label information is used to indicate the word segmentation type, the word segmentation type includes an attribute class, and, if the word segmentation type of a word segment is the attribute class, the label information of that word segment is also used to indicate the emotion type of the word segment;
  • the second determining module is configured to determine the attribute words in the to-be-processed text and the emotion type of each attribute word according to the first participles, whose word segmentation type is the attribute class, in the to-be-processed text and the emotion types of the first participles, wherein each of the attribute words is composed of at least one of the first participles.
  • the word segmentation type further includes sentiment
  • the device also includes:
  • the third determining module is configured to determine the emotion words in the to-be-processed text according to the second participles, whose word segmentation type is the emotion class, in the to-be-processed text, wherein each of the emotion words is composed of at least one of the second participles;
  • the fourth determining module is used to combine each attribute word with each emotion word to obtain attribute emotion word pairs, and each attribute emotion word pair includes one attribute word and one emotion word;
  • the second processing module is configured to input the attribute emotion word pairs and the position information of the attribute emotion word pairs into the association model to obtain the association result output by the association model, where the association result is used to indicate whether the attribute word and the emotion word in each attribute emotion word pair are related, and the position information of an attribute emotion word pair is used to indicate the positional relationship, in the text to be processed, between the attribute word and the emotion word of that pair;
  • the fifth determining module is used to determine the target attribute emotional word pair related to the attribute word and the emotional word according to the association result.
  • association model is obtained in the following manner:
  • the first determining module includes:
  • the first determining sub-module is configured to perform word segmentation processing on the to-be-processed text, and determine the word vector and part-of-speech vector corresponding to each word segmentation in the to-be-processed text;
  • the second determining submodule is configured to determine the target text matrix according to the vectorized representation corresponding to each word segment, wherein the vectorized representation of each word segment in the to-be-processed text corresponds to a row in the target text matrix.
  • the text information extraction model is obtained in the following manner:
  • the historical text matrix corresponding to the second historical text is used as input data, the historical label information corresponding to each word segment in the second historical text is used as output data, and the deep neural network model is trained to obtain the text information extraction model.
  • the tag information of the word segmentation is also used to indicate whether the word segmentation is in the first position in the preset word;
  • the second determining module includes:
  • the attribute word determination sub-module is used to determine the attribute words in the to-be-processed text according to the label information of each first participle and its position in the to-be-processed text, wherein a first participle that is in the first position in the preset word is in the first position in the attribute word in which it is located.
  • the second determining module includes:
  • the emotion type determination sub-module is used to determine the emotion type corresponding to the first of the first participles constituting the attribute word as the emotion type corresponding to the attribute word.
  • a storage medium having a program stored thereon, and when the program is executed by a processor, the steps of the method described in the first aspect of the present disclosure are implemented.
  • a device including:
  • At least one processor and at least one memory and bus connected to the processor;
  • processor and the memory complete mutual communication through the bus
  • the processor is configured to call program instructions in the memory to execute the steps of the method described in the first aspect of the present disclosure.
  • a computer program product which when executed on a data processing device, is adapted to perform the steps of the method described in the first aspect of the present disclosure.
  • the target text matrix corresponding to the text to be processed is determined, and the target text matrix is input to the text information extraction model to obtain the information extraction result output by the text information extraction model.
  • the information extraction result includes the label information of each word segment in the text to be processed.
  • the word segments belonging to the attribute category in the text to be processed and the emotion types of these word segments can be obtained at the same time, thereby obtaining the attribute words in the text to be processed and the emotion type of each attribute word, which is highly efficient and can guarantee accuracy.
  • Fig. 1 is a flowchart of a method for extracting text information according to an embodiment of the present disclosure
  • FIGS. 2A and 2B are exemplary schematic diagrams of tag information in the text information extraction method provided according to the present disclosure
  • FIG. 3 is a flowchart of a method for extracting text information according to another embodiment of the present disclosure
  • Fig. 4 is a block diagram of a text information extraction device provided according to an embodiment of the present disclosure.
  • Fig. 5 is a block diagram of a device provided according to an embodiment of the present disclosure.
  • Fig. 1 is a flowchart of a method for extracting text information according to an embodiment of the present disclosure. As shown in Figure 1, the method may include the following steps.
  • in step 11, the target text matrix corresponding to the text to be processed is determined.
  • the text matrix corresponding to the text includes the vectorized representation corresponding to each word segmentation in the text.
  • first perform word segmentation processing on the text and then obtain the vectorized representation corresponding to each word segmentation of the text, where the vectorized representation corresponding to the word segmentation can reflect the characteristics of the word segmentation itself and the part-of-speech characteristics of the word segmentation.
  • the target text matrix corresponding to the text to be processed includes the vectorized representation corresponding to each word segmentation in the text to be processed.
  • step 11 may include the following steps:
  • the target text matrix is determined.
  • the word vector maps the vocabulary to the vector space, and the similarity relationship between the word vectors can reflect the similarity relationship between the words.
  • the part-of-speech vector can reflect the part-of-speech characteristics of the vocabulary, that is, the part-of-speech vector can be used to determine the part of speech of the vocabulary.
  • the part-of-speech vector can be represented by a random vector of a certain dimension. For example, if there are 30 parts of speech A1 to A30, they can be represented by the vectors a1 to a30 in turn,
  • where the dimension of a1 to a30 is a specified fixed value (for example, 20), and each component can be a randomly generated decimal close to 0.
  • word segmentation is performed in advance on the text in the relevant corpus (for example, the corpus that the text to be processed belongs to), and a word vector model (for example, Word2vec, GloVe, or ELMo) is used for word vector training to obtain the word vector corresponding to each vocabulary item.
  • the vocabulary can be mapped to a 100-dimensional vector space, that is, the word vector corresponding to the vocabulary is a 100-dimensional vector.
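To illustrate the remark that similarity between word vectors reflects similarity between words, the following minimal sketch compares toy vectors by cosine similarity. The choice of cosine similarity, the 3-dimensional vectors, and the words are assumptions made for illustration; the text does not name a similarity measure, and real word vectors would come from a trained model such as Word2vec.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional word vectors (real ones would be, e.g., 100-dimensional).
toy_vectors = {
    "engine": [0.9, 0.1, 0.0],
    "motor": [0.8, 0.2, 0.1],
    "ugly": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["engine"], toy_vectors["motor"])
sim_far = cosine_similarity(toy_vectors["engine"], toy_vectors["ugly"])
print(sim_close > sim_far)  # the related pair scores higher
```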
  • the word vector corresponding to each word segmentation in the text to be processed is determined, and the part-of-speech vector is determined according to the word segmentation result.
  • the word vector and the part-of-speech vector are spliced together to obtain the vectorized representation corresponding to each word segmentation.
  • the vectorized representation corresponding to the word segment can be a 120-dimensional vector [B1, B2, B3, ..., B100, C1, C2, C3, ..., C20].
  • the target text matrix is determined, where the vectorized representation of each word segment in the text to be processed corresponds to a row in the target text matrix.
  • the vectorized representations of the word segments appear in the target text matrix in the same order as the word segments appear in the text to be processed. For example, if the word segments appear in the text to be processed in the order word segment 1, word segment 2, word segment 3, then in the target text matrix they correspond to row k, row k+1, and row k+2 respectively, where k is a positive integer, such as 1.
  • the target text matrix may be formed by direct combination of the vectorized representations corresponding to each word segmentation in the text to be processed. For example, if the text to be processed has a total of 200 word segments, and the vectorized representation corresponding to each word segmentation is a 120-dimensional vector, the target text matrix is a 200*120 matrix.
  • after the vectorized representations corresponding to the word segments in the text to be processed are combined into a matrix, the obtained matrix can also be appropriately expanded (for example, horizontally and/or vertically) to form the target text matrix, where the expanded part can be filled with zeros. For example, if the text to be processed has a total of 200 word segments, and the vectorized representation corresponding to each word segment is a 120-dimensional vector, a 200*120 matrix is obtained after the combination, and it is expanded to a 200*200 matrix as the target text matrix. In this way, even if the text lengths are different, the format of the obtained target text matrix is the same, which keeps the form of the target text matrix consistent and facilitates subsequent data processing.
  • the word segmentation feature and the part-of-speech feature of the word segmentation are extracted to obtain the vectorized expression of each word segmentation and form a matrix, which can provide effective data support for subsequent data processing.
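The construction of the target text matrix in step 11 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the example segments, the part-of-speech names `n` and `adj`, the helper names, and the use of random values in place of trained word vectors are all hypothetical. Only the dimensions (100 + 20 = 120, padded to 200) come from the text.

```python
import random

WORD_DIM = 100  # word-vector size from the example in the text
POS_DIM = 20    # part-of-speech vector size from the example in the text


def random_vector(dim, rng):
    """Return a vector of small random decimals close to 0."""
    return [rng.uniform(-0.01, 0.01) for _ in range(dim)]


def build_text_matrix(tokens, word_vecs, pos_vecs, width=None):
    """Build one row per (word, part_of_speech) token, in text order.

    Each row is the word vector spliced with the part-of-speech vector.
    If `width` is given, rows are zero-padded on the right to that width.
    """
    rows = []
    for word, pos in tokens:
        row = list(word_vecs[word]) + list(pos_vecs[pos])
        if width is not None:
            row += [0.0] * (width - len(row))
        rows.append(row)
    return rows


rng = random.Random(0)
tokens = [("engine", "n"), ("power", "n"), ("strong", "adj")]  # hypothetical segmentation
# Stand-ins for trained word vectors and for the random part-of-speech vectors.
word_vecs = {word: random_vector(WORD_DIM, rng) for word, _ in tokens}
pos_vecs = {pos: random_vector(POS_DIM, rng) for pos in ["n", "adj"]}

matrix = build_text_matrix(tokens, word_vecs, pos_vecs)
padded = build_text_matrix(tokens, word_vecs, pos_vecs, width=200)
print(len(matrix), len(matrix[0]))  # 3 120
print(len(padded), len(padded[0]))  # 3 200
```

Padding only the columns is one possible reading of "horizontal expansion"; padding extra all-zero rows up to a fixed text length would cover the vertical case in the same way.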
  • in step 12, the target text matrix is input to the text information extraction model to obtain the information extraction result output by the text information extraction model.
  • the information extraction result includes tag information of each word segmentation in the text to be processed.
  • Tag information can be used to indicate the word segmentation type.
  • the word segmentation type can include the attribute type, the emotion type, and other types besides these two. A word segment of the attribute type is used to describe a function or performance, and a word segment of the emotion type is used to describe the emotion toward a word segment of the attribute type.
  • the tag information of the word segmentation may also indicate the emotion type of the word segmentation.
  • the label information can reflect the word segmentation type and the emotion type of the word segmentation at the same time.
  • the word segmentation type and the emotion type can be distinguished by identification information such as keywords.
  • emotion types can be divided into three types: positive, neutral, and negative.
  • the tag information of the word segmentation can also be used to indicate whether the word segmentation is in the first position in the preset word.
  • the preset word is a word or phrase belonging to the attribute category or the emotion category. For example, if the preset word is "engine power" in the attribute category, the tag information of "engine" indicates that it is in the first position in the preset word, and the tag information of "power" indicates that it is not in the first position in the preset word.
  • For another example, if the preset word is "very ugly" in the emotion category, the tag information of "very" indicates that it is in the first position in the preset word, and the tag information of "ugly" indicates that it is not in the first position in the preset word.
  • Figure 2A is an example of label information for each word segment in the text, where the attribute class corresponds to Attr, the emotion class corresponds to Opin, the other classes correspond to O, being in the first position in the preset word corresponds to B, not being in the first position in the preset word corresponds to I, positive emotion corresponds to Pos, neutral emotion corresponds to Neu, and negative emotion corresponds to Neg.
  • For example, the tag information of the word segment "engine" is B_Attr_Pos, which indicates that "engine" is in the first position in the preset word, belongs to the attribute category, and has a positive emotion type.
  • the label information of each word segment in the text may also be as shown in FIG. 2B. The label information corresponding to each word has the same meaning as in FIG. 2A, and only its form differs.
  • FIG. 2A and FIG. 2B are only examples. The label information in this method is not limited to the above forms, as long as the types can be distinguished. Other possible examples will not be repeated here.
  • according to the tag information of a word segment, it can be determined what kind of word segment it is, for example, whether it belongs to the attribute type, the emotion type, or another type, and, if it belongs to the attribute type, what its emotion type is.
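As a rough illustration of how tag information in the Figure 2A style could be decoded, the sketch below splits a tag such as `B_Attr_Pos` into its parts. The `Tag` structure and `parse_tag` function are hypothetical names. It is also an assumption that emotion-class tags look like `B_Opin` with no polarity suffix, since the figure itself is not reproduced in this text.

```python
from typing import NamedTuple, Optional

KINDS = {"Attr": "attribute", "Opin": "emotion"}
POLARITIES = {"Pos": "positive", "Neu": "neutral", "Neg": "negative"}


class Tag(NamedTuple):
    kind: str                # "attribute", "emotion", or "other"
    is_first: bool           # True if the segment is first in its preset word
    polarity: Optional[str]  # emotion type; only present for attribute-class tags


def parse_tag(tag: str) -> Tag:
    """Decode a tag such as "B_Attr_Pos", "I_Opin" (assumed form), or "O"."""
    if tag == "O":
        return Tag("other", False, None)
    parts = tag.split("_")
    kind = KINDS[parts[1]]
    polarity = POLARITIES[parts[2]] if len(parts) > 2 else None
    return Tag(kind, parts[0] == "B", polarity)


print(parse_tag("B_Attr_Pos"))
# Tag(kind='attribute', is_first=True, polarity='positive')
```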
  • the text information extraction model can be obtained in the following manner:
  • the deep neural network model is trained to obtain a text information extraction model.
  • the second historical text can be taken from a corpus related to the text to be processed.
  • the method of obtaining the historical text matrix corresponding to the second historical text has the same principle as the method of obtaining the target text matrix, which has been described in the foregoing and will not be repeated here.
  • the historical label information corresponding to each word segmentation in the second historical text can be manually labeled. The label information is also described in the previous section, and the description will not be repeated here.
  • the historical text matrix corresponding to the second historical text is used as input data
  • the historical label information corresponding to each word segmentation in the second historical text is used as output data
  • the deep neural network model is trained to obtain a text information extraction model.
  • the deep neural network model is trained based on a learning framework such as TensorFlow, MXNet, or PyTorch; one or more encoders (for example, LSTM, Transformer, BERT) are used for encoding, and a decoder (for example, CRF) decodes at the position of each word segment to extract the label information corresponding to that position.
  • the method for training the deep neural network model belongs to the prior art and is well known to those skilled in the art, and will not be repeated here.
  • model training is performed based on the existing data to obtain the text information extraction model.
  • the corresponding data is directly input into the text information extraction model to obtain the information extraction results output by the text information extraction model.
  • the application is simple and convenient.
  • in step 13, the attribute words in the text to be processed and the emotion type of each attribute word are determined according to the first participles, i.e. the word segments whose word segmentation type is the attribute type, in the text to be processed and their emotion types.
  • each attribute word is composed of at least one first participle. If the attribute word consists of one first participle, the attribute word is that first participle. If the attribute word is composed of more than one first participle, the attribute word is a compound word formed by these first participles, and the positions of these first participles in the text to be processed are consecutive.
  • determining the attribute words in the text to be processed in step 13 may include the following steps:
  • the attribute words in the text to be processed are determined.
  • a first participle that is in the first position in the preset word is also in the first position in the attribute word in which it is located.
  • such a first participle (hereinafter referred to as the "start word") can be used as the first participle of the attribute word, and the remainder of the attribute word can then be determined.
  • In one case, if the participle to the right of the start word in the text to be processed does not belong to the attribute category, the start word itself is determined as an attribute word.
  • For example, the tag information corresponding to a piece of text {e1, e2, e3} is {O, Begin1, O} in order, where O indicates that the word segmentation type is another type, and Begin1 indicates that the word segmentation type is the attribute type and the participle is in the first position in the preset word. The start word is therefore e2; the participle e3 to the right of e2 does not belong to the attribute category, so e2 can be directly determined as an attribute word.
  • In another case, if the participle to the right of the start word in the text to be processed (hereinafter referred to as the "right neighbor") belongs to the attribute category, and the label information of the right neighbor indicates that it is not in the first position in the preset word, the right neighbor is used as the starting point to search for consecutive participles that belong to the attribute category and are not in the first position in the preset word, and the search result is used as the remainder of the attribute word.
  • For example, the tag information corresponding to a piece of text {e4, e5, e6, e7, e8} is {O, Begin1, Inside1, Inside1, O} in order, where O indicates that the word segmentation type is another type, Begin1 indicates that the word segmentation type is the attribute type and the participle is in the first position in the preset word, and Inside1 indicates that the word segmentation type is the attribute type and the participle is not in the first position in the preset word. The start word is therefore e5, and its right neighbor e6 belongs to the attribute category and is not in the first position in the preset word. Taking e6 as the starting point, the consecutive participles that belong to the attribute category and are not in the first position in the preset word are e6e7, so the remainder of the attribute word is e6e7, and the attribute word is finally determined to be e5e6e7.
  • each attribute word in the text to be processed can be quickly determined, and the efficiency is high.
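The start-word search described above can be sketched as a single left-to-right scan. The function name `extract_words` is hypothetical; the tag names `Begin1`, `Inside1`, and `O` and both example inputs are taken from the text.

```python
def extract_words(segments, tags, begin="Begin1", inside="Inside1"):
    """Group segments into words: a `begin` tag starts a word, and the
    consecutive `inside` tags immediately to its right extend it."""
    words = []
    i = 0
    while i < len(tags):
        if tags[i] == begin:
            j = i + 1
            # Extend over right neighbors that are not first in the preset word.
            while j < len(tags) and tags[j] == inside:
                j += 1
            words.append("".join(segments[i:j]))
            i = j
        else:
            i += 1
    return words


print(extract_words(["e1", "e2", "e3"], ["O", "Begin1", "O"]))
# ['e2']
print(extract_words(["e4", "e5", "e6", "e7", "e8"],
                    ["O", "Begin1", "Inside1", "Inside1", "O"]))
# ['e5e6e7']
```

Because each segment is visited once, the scan is linear in the number of segments. Passing different `begin` and `inside` tag names would reuse the same routine for emotion words.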
  • determining the sentiment type of the attribute word in step 13 may include the following steps:
  • the emotion type corresponding to the first first participle constituting the attribute word is determined as the emotion type corresponding to the attribute word.
  • in general, the emotion types corresponding to all the first participles constituting an attribute word are the same. Therefore, the emotion type corresponding to any one of these first participles, for example the first one, can be directly determined as the emotion type of the attribute word. If the emotion types differ, the same method can still be used, that is, the emotion type corresponding to the first of these first participles is determined as the emotion type corresponding to the attribute word.
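A minimal sketch of taking the attribute word's emotion type from its first participle follows. It assumes tags in the `B_Attr_Pos` style of Figure 2A; the function name, the example sentence, and the `B_Opin`/`I_Opin` tag forms are illustrative assumptions.

```python
def attribute_sentiments(segments, tags):
    """Return (attribute_word, emotion_type) pairs.

    The emotion type is read from the first segment of each attribute word,
    even if later segments of the same word carry a different one.
    """
    results = []
    i = 0
    while i < len(tags):
        if tags[i].startswith("B_Attr_"):
            polarity = tags[i].split("_")[2]  # emotion type of the first segment
            j = i + 1
            while j < len(tags) and tags[j].startswith("I_Attr_"):
                j += 1
            results.append((" ".join(segments[i:j]), polarity))
            i = j
        else:
            i += 1
    return results


segments = ["engine", "power", "is", "very", "strong"]  # hypothetical example
tags = ["B_Attr_Pos", "I_Attr_Pos", "O", "B_Opin", "I_Opin"]
print(attribute_sentiments(segments, tags))
# [('engine power', 'Pos')]
```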
  • the target text matrix corresponding to the text to be processed is determined, and the target text matrix is input to the text information extraction model to obtain the information extraction result output by the text information extraction model.
  • the information extraction result includes the label information of each word segment in the text to be processed.
  • the word segments belonging to the attribute category in the text to be processed and the emotion types of these word segments can be obtained at the same time, thereby obtaining the attribute words in the text to be processed and the emotion type of each attribute word, which is highly efficient and can guarantee accuracy.
  • Fig. 3 is a flowchart of a method for extracting text information according to another embodiment of the present disclosure. As shown in FIG. 3, based on the steps shown in FIG. 1, the method provided by the present disclosure may further include the following steps.
  • in step 31, the emotion words in the text to be processed are determined according to the second participles, i.e. the word segments whose word segmentation type is the emotion type, in the text to be processed.
  • each emotion word is composed of at least one second participle. If the emotion word consists of one second participle, the emotion word is that second participle. If the emotion word is composed of more than one second participle, the emotion word is a compound word formed by these second participles, and the positions of these second participles in the text to be processed are consecutive.
  • the emotional word in the text to be processed is determined according to the label information of each second word segmentation and its position in the text to be processed.
  • the method of determining the emotional word is the same as the principle of determining the attribute word, which has been described above and will not be repeated here.
  • in step 32, each attribute word is combined with each emotion word to obtain attribute emotion word pairs.
  • Each attribute emotion word pair contains one attribute word and one emotion word.
  • For example, if the attribute words are m1, m2, m3, and m4 and the emotion words are n1 and n2, 8 attribute emotion word pairs can be obtained after the combination, namely: m1-n1, m1-n2, m2-n1, m2-n2, m3-n1, m3-n2, m4-n1, m4-n2.
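Combining every attribute word with every emotion word is a Cartesian product. A short sketch using the standard library, with the m1 to m4 and n1 to n2 placeholders from the example:

```python
from itertools import product

attribute_words = ["m1", "m2", "m3", "m4"]
emotion_words = ["n1", "n2"]

# Every attribute word is paired with every emotion word.
pairs = list(product(attribute_words, emotion_words))
print(len(pairs))  # 8
print(["-".join(pair) for pair in pairs])
# ['m1-n1', 'm1-n2', 'm2-n1', 'm2-n2', 'm3-n1', 'm3-n2', 'm4-n1', 'm4-n2']
```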
  • in step 33, the attribute emotion word pairs and the position information of the attribute emotion word pairs are input to the association model to obtain the association result output by the association model.
  • the position information of the attribute emotion word pair is used to indicate the position relationship between the attribute word in the attribute emotion word pair and the emotion word in the text to be processed. For example, the position of the attribute word and the emotion word in the text to be processed, the distance between the attribute word and the emotion word in the text to be processed, whether the attribute word and the emotion word are in the same sentence, and so on.
  • the association result is used to indicate whether the attribute words and emotion words in each attribute emotion word pair are related.
  • the correlation between attribute words and emotional words means that the object described by the emotional word is the attribute word. Whether attribute words and emotional words are related can be reflected by their location. For example, related attribute words and emotional words are generally located in the same sentence or in similar locations.
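The kinds of position information listed above (positions, distance, same-sentence flag) could be computed along these lines. This is only a sketch: the feature names, the span convention (start inclusive, end exclusive, counted in word segments), and the example text are assumptions, and the text does not specify how the association model encodes these values.

```python
def position_features(attr_span, emo_span, sentence_ids):
    """Compute simple position features for one attribute-emotion word pair.

    Spans are (start, end) segment indices with `end` exclusive;
    sentence_ids[i] is the index of the sentence containing segment i.
    """
    a_start, a_end = attr_span
    e_start, e_end = emo_span
    if e_start >= a_end:      # emotion word lies to the right
        distance = e_start - a_end
    elif a_start >= e_end:    # emotion word lies to the left
        distance = a_start - e_end
    else:                     # overlapping spans
        distance = 0
    return {
        "attr_start": a_start,
        "emo_start": e_start,
        "distance": distance,  # number of segments between the two words
        "same_sentence": sentence_ids[a_start] == sentence_ids[e_start],
    }


# Hypothetical text: "engine power is very strong . price is high"
sentence_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1]
near = position_features((0, 2), (3, 5), sentence_ids)  # "engine power" / "very strong"
far = position_features((0, 2), (8, 9), sentence_ids)   # "engine power" / "high"
print(near["distance"], near["same_sentence"])  # 1 True
print(far["distance"], far["same_sentence"])    # 6 False
```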
  • association model can be obtained in the following manner:
  • the historical attribute emotion word pairs corresponding to the first historical text and their position information are used as input data, the historical association result of each historical attribute emotion word pair in the first historical text is used as output data, and the model is trained to obtain the association model.
  • the first historical text can be taken from a corpus related to the text to be processed.
  • the first historical text and the second historical text in the preceding text may be the same.
  • the method of obtaining the historical attribute emotional word pair corresponding to the first historical text is the same as the method of obtaining the attribute emotional word pair in step 32 (and related steps of how to obtain attribute words, emotional words, etc.), which has been described in the foregoing. Do not repeat it here.
  • the historical association results of each historical attribute emotional word pair in the first historical text can be manually labeled, that is, whether each historical attribute emotional word pair is related or not.
  • the position information of the historical attribute emotional word pair and the historical attribute emotional word pair corresponding to the first historical text is used as input data, and the historical association result of each historical attribute emotional word pair in the first historical text is used as output data.
  • the deep neural network model is trained to obtain the correlation model. For example, during model training, the deep neural network model is trained based on learning methods such as RandomForest, LSTM-attention, and Recurrent Attention.
  • model training is performed based on the existing data to obtain the association model.
  • the corresponding data is directly input into the association model to obtain the association result output by the association model, and the application is simple and convenient.
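The train-then-apply flow described above can be illustrated with a deliberately small stand-in. The sketch below does not use a deep neural network: a plain logistic regression fitted by gradient descent takes its place, only the position information is used as input, and the four labeled examples, the feature scaling, and all names are hypothetical.

```python
import math

# Hypothetical labeled history: (token distance, same sentence?) -> related (1) or not (0).
HISTORY = [
    ((2, True), 1),
    ((7, False), 0),
    ((1, True), 1),
    ((5, False), 0),
]


def to_features(position):
    """Scale the distance and encode the same-sentence flag as 0/1."""
    distance, same_sentence = position
    return [distance / 10.0, 1.0 if same_sentence else 0.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(history, epochs=300, learning_rate=0.5):
    """Fit a logistic regression (stand-in model) by full-batch gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    samples = [(to_features(position), label) for position, label in history]
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in samples:
            error = sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias) - y
            grad_w[0] += error * x[0]
            grad_w[1] += error * x[1]
            grad_b += error
        n = len(samples)
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def is_related(model, position):
    """Apply the trained stand-in model to one pair's position information."""
    weights, bias = model
    x = to_features(position)
    return sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias) >= 0.5


model = train(HISTORY)
predictions = [is_related(model, position) for position, _ in HISTORY]
print(predictions)
```

The point of the sketch is the division of labor: manual labels are needed only for the historical pairs, after which new pairs are classified by a single call to the trained model.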
  • In step 34, the target attribute emotion word pairs are determined according to the association result.
  • A target attribute emotion word pair is an attribute emotion word pair in the text to be processed whose attribute word and emotion word are related.
  • According to the association result, the target attribute emotion word pairs whose attribute word and emotion word are related can be selected from all attribute emotion word pairs for the user to view or use.
  • In this way, the emotion words related to each attribute word can also be extracted from the text to be processed.
  • The information extraction function is thus more complete, and the data is convenient for users to view and use.
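Step 34 reduces to filtering the candidate pairs by the association result and, if desired, grouping the surviving emotion words by attribute word. A minimal sketch follows; the association results are made-up booleans standing in for the association model's output, and the variable names are illustrative.

```python
from collections import defaultdict

pairs = [("m1", "n1"), ("m1", "n2"), ("m2", "n1"), ("m2", "n2")]
# Hypothetical association results, one boolean per pair (True = related).
association_results = [True, False, False, True]

# Keep only the pairs the association model marked as related.
target_pairs = [pair for pair, related in zip(pairs, association_results) if related]

# Group the related emotion words under each attribute word.
emotion_words_by_attribute = defaultdict(list)
for attribute_word, emotion_word in target_pairs:
    emotion_words_by_attribute[attribute_word].append(emotion_word)

print(target_pairs)
print(dict(emotion_words_by_attribute))
```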
  • Fig. 4 is a block diagram of a text information extraction device provided according to an embodiment of the present disclosure. As shown in Fig. 4, the device 40 includes:
  • the first determining module 41 is configured to determine a target text matrix corresponding to the text to be processed, where the text matrix corresponding to a text includes a vectorized representation of each word segment in the text;
  • the first processing module 42 is configured to input the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, where the information extraction result includes tag information of each word segment in the text to be processed, the tag information is used to indicate the word segment type, the word segment type includes an attribute class, and if the word segment type of a word segment is the attribute class, the tag information of the word segment is also used to indicate the emotion type of the word segment;
  • the second determining module 43 is configured to determine the attribute words in the text to be processed and the emotion type of each attribute word according to the first word segments, whose word segment type is the attribute class, in the text to be processed and the emotion types of the first word segments.
  • the word segment type further includes an emotion class;
  • the device 40 also includes:
  • the third determining module is configured to determine the emotion words in the text to be processed according to the second word segments, whose word segment type is the emotion class, in the text to be processed, where each emotion word consists of at least one second word segment;
  • the fourth determining module is configured to combine each attribute word with each emotion word to obtain attribute emotion word pairs, where each attribute emotion word pair includes one attribute word and one emotion word;
  • the second processing module is configured to input the attribute emotion word pairs and the position information of the attribute emotion word pairs into the association model to obtain the association result output by the association model, where the association result is used to indicate whether the attribute word and the emotion word in each attribute emotion word pair are related, and the position information of an attribute emotion word pair is used to indicate the positional relationship, in the text to be processed, between the attribute word and the emotion word in the pair;
  • the fifth determining module is configured to determine, according to the association result, the target attribute emotion word pairs whose attribute word and emotion word are related.
  • association model is obtained in the following manner:
  • the first determining module 41 includes:
  • the first determining sub-module is configured to perform word segmentation on the text to be processed and determine the word vector and part-of-speech vector corresponding to each word segment in the text to be processed;
  • the second determining sub-module is configured to determine the target text matrix according to the vectorized representation corresponding to each word segment, where the vectorized representation of each word segment in the text to be processed corresponds to one row of the target text matrix.
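The construction of the target text matrix can be sketched as one row per word segment. The example below is an illustration under assumptions: the lookup tables and their values are invented, and concatenating the word vector with the part-of-speech vector is one plausible way to form the vectorized representation, not something this passage states.

```python
# Hypothetical embedding tables; a real system would load trained vectors.
WORD_VECTORS = {
    "screen": [0.1, 0.3],
    "is": [0.0, 0.2],
    "bright": [0.7, 0.5],
}
POS_VECTORS = {
    "NOUN": [1.0, 0.0, 0.0],
    "VERB": [0.0, 1.0, 0.0],
    "ADJ": [0.0, 0.0, 1.0],
}


def build_text_matrix(tagged_segments):
    """Return one row per word segment: word vector followed by part-of-speech vector (assumed layout)."""
    return [WORD_VECTORS[word] + POS_VECTORS[pos] for word, pos in tagged_segments]


# Output of a hypothetical word segmentation and part-of-speech tagging step.
segments = [("screen", "NOUN"), ("is", "VERB"), ("bright", "ADJ")]
matrix = build_text_matrix(segments)
for row in matrix:
    print(row)
```

The row order follows the order of the word segments in the text, so the row index doubles as the segment's position when the tags are decoded later.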
  • the text information extraction model is obtained in the following manner:
  • the historical text matrix corresponding to the second historical text is used as input data, the historical tag information corresponding to each word segment in the second historical text is used as output data, and the deep neural network model is trained to obtain the text information extraction model.
  • the tag information of a word segment is also used to indicate whether the word segment is in the first position of a preset word;
  • the second determining module 43 includes:
  • the attribute word determination sub-module, configured to determine the attribute words in the text to be processed according to the tag information of each first word segment and its position in the text to be processed, where a first word segment that is in the first position of a preset word is in the first position of the attribute word to which it belongs.
  • the second determining module 43 includes:
  • the emotion type determination sub-module, configured to determine the emotion type corresponding to the first of the first word segments constituting an attribute word as the emotion type corresponding to that attribute word.
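The two sub-modules together behave like a decoder over per-segment tags. The sketch below assumes a hypothetical tag format, `B-ATTR-<emotion>` for a segment in the first position of an attribute word, `I-ATTR-<emotion>` for a later segment of the same word, and anything else for non-attribute segments; the source does not give the actual tag strings. Joining segments with a space is only for the English example.

```python
def decode_attribute_words(segments, tags):
    """Group attribute-class segments into attribute words with an emotion type."""
    words = []
    current = None  # attribute word currently being assembled
    for segment, tag in zip(segments, tags):
        parts = tag.split("-")
        if parts[0] == "B" and len(parts) == 3 and parts[1] == "ATTR":
            # A segment in first position starts a new attribute word; its emotion type is kept.
            if current is not None:
                words.append(current)
            current = {"segments": [segment], "emotion": parts[2]}
        elif parts[0] == "I" and len(parts) == 3 and parts[1] == "ATTR" and current is not None:
            # A later segment extends the attribute word; its own emotion type is ignored.
            current["segments"].append(segment)
        else:
            # Any other tag closes the attribute word in progress.
            if current is not None:
                words.append(current)
                current = None
    if current is not None:
        words.append(current)
    return [(" ".join(word["segments"]), word["emotion"]) for word in words]


segments = ["battery", "life", "is", "short"]
tags = ["B-ATTR-NEG", "I-ATTR-NEG", "O", "B-EMO"]
attribute_words = decode_attribute_words(segments, tags)
print(attribute_words)
```

Taking the emotion type from the first segment only means a multi-segment attribute word always gets exactly one emotion type, even if the model tags its later segments inconsistently.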
  • The text information extraction device includes a processor and a memory. The first determining module, first processing module, second determining module, third determining module, fourth determining module, second processing module, fifth determining module, etc. described above are all stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
  • The processor contains a kernel, and the kernel calls the corresponding program unit from the memory.
  • One or more kernels can be provided; by adjusting kernel parameters, text information can be extracted more quickly and accurately, and the attribute words and their emotion types can be obtained quickly.
  • the embodiment of the present invention provides a storage medium on which a program is stored, and when the program is executed by a processor, the text information extraction method is implemented.
  • the embodiment of the present invention provides a processor configured to run a program, wherein the method for extracting text information is executed when the program is running.
  • the device 70 includes at least one processor 701, and at least one memory 702 and a bus 703 connected to the processor 701, where the processor 701 and the memory 702 communicate with each other through the bus 703, and the processor 701 is configured to call program instructions in the memory 702 to execute the above-mentioned text information extraction method.
  • The device herein can be a server, a PC, etc.
  • This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
  • the target text matrix is input to a text information extraction model to obtain an information extraction result output by the text information extraction model, where the information extraction result includes the tag information of each word segment in the text to be processed, the tag information is used to indicate the word segment type, the word segment type includes an attribute class, and if the word segment type of a word segment is the attribute class, the tag information of the word segment is also used to indicate the emotion type of the word segment;
  • the word segment type further includes an emotion class;
  • the method also includes:
  • combining each attribute word with each emotion word to obtain attribute emotion word pairs, where each attribute emotion word pair includes one attribute word and one emotion word;
  • determining the target attribute emotion word pairs whose attribute word and emotion word are related.
  • association model is obtained in the following manner:
  • the determining the target text matrix corresponding to the text to be processed includes:
  • the target text matrix is determined according to the vectorized representation corresponding to each word segment, where the vectorized representation of each word segment in the text to be processed corresponds to one row of the target text matrix.
  • the text information extraction model is obtained in the following manner:
  • the historical text matrix corresponding to the second historical text is used as input data, the historical tag information corresponding to each word segment in the second historical text is used as output data, and the deep neural network model is trained to obtain the text information extraction model.
  • the tag information of a word segment is also used to indicate whether the word segment is in the first position of a preset word;
  • determining the attribute words in the text to be processed includes:
  • determining the attribute words in the text to be processed according to the tag information of each first word segment and its position in the text to be processed, where a first word segment that is in the first position of a preset word is in the first position of the attribute word to which it belongs.
  • determining the emotion type of an attribute word includes:
  • determining the emotion type corresponding to the first of the first word segments constituting the attribute word as the emotion type corresponding to the attribute word.
  • the device includes one or more processors (CPUs), memory, and buses.
  • the device may also include input/output interfaces, network interfaces, and so on.
  • the memory may include non-permanent memory in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
  • the memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
  • this application can be provided as a method, a system, or a computer program product. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A text information extraction method and apparatus, a storage medium, and a device. The method comprises: determining a target text matrix corresponding to a text to be processed (11); inputting the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model (12), the information extraction result comprising tag information of each word segment in the text to be processed, the tag information being used to indicate the word segment type and, if the word segment type is the attribute class, also the emotion type of the word segment; and determining, according to the first word segments in the text to be processed and the emotion types of the first word segments, the attribute words in the text to be processed and the emotion type of each attribute word (13). In this way, by means of the trained model, the word segments belonging to the attribute class in the text to be processed and the emotion types of those word segments can be obtained at the same time, so that the attribute words in the text to be processed and the emotion type of each attribute word can be obtained; efficiency is therefore high and accuracy is ensured.
PCT/CN2020/100483 2019-09-30 2020-07-06 Text information extraction method and apparatus, storage medium, and device WO2021063060A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910943335.XA CN112580358A (zh) 2019-09-30 2019-09-30 Text information extraction method, apparatus, storage medium, and device
CN201910943335.X 2019-09-30

Publications (1)

Publication Number Publication Date
WO2021063060A1 (fr) 2021-04-08

Family

ID=75116511

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/100483 WO2021063060A1 (fr) 2019-09-30 2020-07-06 Text information extraction method and apparatus, storage medium, and device

Country Status (2)

Country Link
CN (1) CN112580358A (fr)
WO (1) WO2021063060A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239590B (zh) * 2021-12-01 2023-09-19 马上消费金融股份有限公司 Data processing method and apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140278365A1 (en) * 2013-03-12 2014-09-18 Guangsheng Zhang System and methods for determining sentiment based on context
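CN107832305A (zh) * 2017-11-28 2018-03-23 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN109299457A (zh) * 2018-09-06 2019-02-01 北京奇艺世纪科技有限公司 Opinion mining method, apparatus, and device
CN109829033A (zh) * 2017-11-23 2019-05-31 阿里巴巴集团控股有限公司 Data display method and terminal device
CN109885826A (zh) * 2019-01-07 2019-06-14 平安科技(深圳)有限公司 Text word vector acquisition method and apparatus, computer device, and storage medium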

Also Published As

Publication number Publication date
CN112580358A (zh) 2021-03-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20872454

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20872454

Country of ref document: EP

Kind code of ref document: A1