WO2021063060A1 - Text information extraction method and apparatus, storage medium and device - Google Patents

Text information extraction method and apparatus, storage medium and device

Info

Publication number
WO2021063060A1
WO2021063060A1 PCT/CN2020/100483 CN2020100483W WO2021063060A1 WO 2021063060 A1 WO2021063060 A1 WO 2021063060A1 CN 2020100483 W CN2020100483 W CN 2020100483W WO 2021063060 A1 WO2021063060 A1 WO 2021063060A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
text
attribute
emotion
processed
Prior art date
Application number
PCT/CN2020/100483
Other languages
French (fr)
Chinese (zh)
Inventor
戴泽辉
Original Assignee
北京国双科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京国双科技有限公司
Publication of WO2021063060A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a method, device, storage medium, and equipment for extracting text information.
  • a text for example, a user review
  • the words (or phrases) in a text can belong to the attribute category or the emotion category; emotion vocabulary (or phrase) is used to describe emotions.
  • attribute vocabulary (or phrase) is used to describe the function or performance of the subject.
  • the emotion type of emotion vocabulary generally includes three types: positive, neutral, and negative. For example, if the text is "The appearance of A car is ugly", it can be known that the subject of the text is "A car", the vocabulary belonging to the attribute category is "appearance", and the vocabulary belonging to the emotional category is "ugly".
  • attribute sentiment can be obtained from it, that is, specific attributes and sentiment types can be extracted according to the content of the text.
  • attribute sentiment is generally obtained through a two-step method using a pipeline structure: first, the vocabulary (or phrases) belonging to the attribute category in the text is extracted through sequence labeling (for example, LSTM-CRF, BERT-CRF, etc.); then, for each attribute vocabulary (or phrase), the attribute vocabulary (or phrase) and the sentence containing it are used as model training data, and a deep learning method (for example, LSTM-attention, BERT-CLS, ATAE, Recurrent Attention, Transformation Network, etc.) is used for training to obtain a model that makes predictions for a single attribute vocabulary.
  • the purpose of the present disclosure is to provide a text information extraction method, device, storage medium, and equipment, which can achieve text information extraction more quickly and accurately, and quickly obtain attribute words and their emotions.
  • a method for extracting text information including:
  • the target text matrix is input to a text information extraction model to obtain an information extraction result output by the text information extraction model.
  • the information extraction result includes the label information of each word segment in the text to be processed, and the label information is used to indicate the word segmentation type;
  • the word segmentation type includes an attribute type, and if the word segmentation type of a word segment is the attribute type, the tag information of that word segment is also used to indicate the emotion type of the word segment;
  • Each of the attribute words is composed of at least one of the first participles.
  • the word segmentation type further includes a sentiment type;
  • the method also includes:
  • combining each attribute word with each emotion word to obtain attribute emotion word pairs, where each attribute emotion word pair includes one attribute word and one emotion word;
  • a target attribute emotional word pair related to the attribute word and the emotional word is determined.
  • association model is obtained in the following manner:
  • determining the target text matrix corresponding to the text to be processed includes:
  • the target text matrix is determined according to the vectorized representation corresponding to each word segmentation, wherein the vectorized representation of each word segmentation in the to-be-processed text corresponds to a row in the target text matrix.
  • the text information extraction model is obtained in the following manner:
  • the historical text matrix corresponding to the second historical text is used as input data, the historical label information corresponding to each word segmentation in the second historical text is used as output data, and the deep neural network model is trained to obtain the text information extraction model.
  • the tag information of the word segmentation is also used to indicate whether the word segmentation is in the first position in the preset word;
  • Determining the attribute words in the to-be-processed text includes:
  • according to the label information of each of the first participles and its position in the to-be-processed text, the attribute words in the to-be-processed text are determined, wherein a first participle that is in the first position in the preset word is in the first position in the attribute word in which it is located.
  • determining the sentiment type of the attribute word includes:
  • the sentiment type corresponding to the first of the first participles constituting the attribute word is determined as the sentiment type corresponding to the attribute word.
  • a text information extraction device including:
  • the first determining module is used to determine the target text matrix corresponding to the text to be processed, wherein the text matrix corresponding to the text includes the vectorized representation corresponding to each word segmentation in the text;
  • the first processing module is configured to input the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model; the information extraction result includes the label information of each word segment in the text to be processed, the tag information is used to indicate the word segmentation type, the word segmentation type includes an attribute class, and, if the word segmentation type of a word segment is the attribute class, the tag information of that word segment is also used to indicate the emotion type of the word segment;
  • the second determining module is configured to determine the attribute words in the to-be-processed text and the emotional type of each of the attribute words according to the first participles whose word segmentation type is the attribute type in the to-be-processed text and the sentiment types of the first participles, wherein each of the attribute words is composed of at least one of the first participles.
  • the word segmentation type further includes a sentiment type;
  • the device also includes:
  • the third determining module is configured to determine the emotional words in the to-be-processed text according to the second participles whose word segmentation type is the emotional category in the to-be-processed text, wherein each of the emotional words is composed of at least one of the second participles;
  • the fourth determining module is used to combine each attribute word with each emotion word to obtain attribute emotion word pairs, and each attribute emotion word pair includes one attribute word and one emotion word;
  • the second processing module is configured to input the attribute emotion word pairs and the position information of the attribute emotion word pairs into the association model to obtain the association result output by the association model; the association result is used to indicate whether the attribute word and the emotion word in each attribute emotion word pair are related, and the position information of an attribute emotion word pair is used to indicate the position relationship between the attribute word and the emotion word of that pair in the text to be processed;
  • the fifth determining module is used to determine the target attribute emotional word pair related to the attribute word and the emotional word according to the association result.
  • association model is obtained in the following manner:
  • the first determining module includes:
  • the first determining sub-module is configured to perform word segmentation processing on the to-be-processed text, and determine the word vector and part-of-speech vector corresponding to each word segmentation in the to-be-processed text;
  • the second determining submodule is configured to determine the target text matrix according to the vectorized representation corresponding to each of the word segmentation, wherein the vectorized representation of each word segmentation in the to-be-processed text corresponds to a row in the target text matrix.
  • the text information extraction model is obtained in the following manner:
  • the historical text matrix corresponding to the second historical text is used as input data, the historical label information corresponding to each word segmentation in the second historical text is used as output data, and the deep neural network model is trained to obtain the text information extraction model.
  • the tag information of the word segmentation is also used to indicate whether the word segmentation is in the first position in the preset word;
  • the second determining module includes:
  • the attribute word determination sub-module is used to determine the attribute words in the to-be-processed text according to the tag information of each of the first participles and its position in the to-be-processed text, wherein a first participle that is in the first position in the preset word is in the first position in the attribute word in which it is located.
  • the second determining module includes:
  • the emotion type determination sub-module is used to determine the emotion type corresponding to the first of the first participles constituting the attribute word as the emotion type corresponding to the attribute word.
  • a storage medium having a program stored thereon, and when the program is executed by a processor, the steps of the method described in the first aspect of the present disclosure are implemented.
  • a device including:
  • At least one processor and at least one memory and bus connected to the processor;
  • the processor and the memory communicate with each other through the bus;
  • the processor is configured to call program instructions in the memory to execute the steps of the method described in the first aspect of the present disclosure.
  • a computer program product which when executed on a data processing device, is adapted to perform the steps of the method described in the first aspect of the present disclosure.
  • the target text matrix corresponding to the text to be processed is determined, and the target text matrix is input to the text information extraction model to obtain the information extraction result output by the text information extraction model.
  • the information extraction result includes the label information of each word segmentation in the text to be processed.
  • the word segmentation belonging to the attribute category in the text to be processed and the emotional types of these word segmentations can be obtained at the same time, thereby obtaining the attribute words in the text to be processed and the emotional type of each attribute word, which is highly efficient and can guarantee the accuracy rate.
  • Fig. 1 is a flowchart of a method for extracting text information according to an embodiment of the present disclosure
  • FIGS. 2A and 2B are exemplary schematic diagrams of tag information in the text information extraction method provided according to the present disclosure
  • FIG. 3 is a flowchart of a method for extracting text information according to another embodiment of the present disclosure
  • Fig. 4 is a block diagram of a text information extraction device provided according to an embodiment of the present disclosure.
  • Fig. 5 is a block diagram of a device provided according to an embodiment of the present disclosure.
  • Fig. 1 is a flowchart of a method for extracting text information according to an embodiment of the present disclosure. As shown in Figure 1, the method may include the following steps.
  • in step 11, the target text matrix corresponding to the text to be processed is determined.
  • the text matrix corresponding to the text includes the vectorized representation corresponding to each word segmentation in the text.
  • first perform word segmentation processing on the text and then obtain the vectorized representation corresponding to each word segmentation of the text, where the vectorized representation corresponding to the word segmentation can reflect the characteristics of the word segmentation itself and the part-of-speech characteristics of the word segmentation.
  • the target text matrix corresponding to the text to be processed includes the vectorized representation corresponding to each word segmentation in the text to be processed.
  • step 11 may include the following steps:
  • the target text matrix is determined.
  • the word vector maps the vocabulary to the vector space, and the similarity relationship between the word vectors can reflect the similarity relationship between the words.
  • the part-of-speech vector can reflect the part-of-speech characteristics of the vocabulary, that is, the part-of-speech vector can be used to determine the part of speech of the vocabulary.
  • the part-of-speech vector can be represented by a random vector of a certain dimension. For example, if there are 30 parts of speech A1 to A30, they can be represented by the vectors a1 to a30 in turn, where the dimension of each of a1 to a30 is a specified fixed value (for example, 20), and each component can be a randomly generated decimal close to 0.
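  • As an illustration of the random part-of-speech vectors described above, the following is a minimal sketch rather than the disclosed implementation. The function name `make_pos_vectors`, the `scale` of 0.01 used for "close to 0", and the fixed seed are assumptions made for the example; only the 30 tags and the dimension of 20 come from the text.

```python
import random

def make_pos_vectors(pos_tags, dim=20, scale=0.01, seed=0):
    """Assign each part-of-speech tag a fixed random vector with small components.

    `scale` bounds each component to [-scale, scale]; the value 0.01 is an
    assumption standing in for "a decimal close to 0".
    """
    rng = random.Random(seed)  # seeded so the same tag always maps to the same vector
    return {tag: [rng.uniform(-scale, scale) for _ in range(dim)] for tag in pos_tags}

# Parts of speech A1..A30, each mapped to a 20-dimensional vector a1..a30.
pos_tags = [f"A{i}" for i in range(1, 31)]
pos_vectors = make_pos_vectors(pos_tags)
```

  • Seeding the generator keeps the mapping stable between training and prediction, which matters because the vectors carry no meaning beyond identifying the part of speech.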
  • word segmentation is performed in advance for the text in the relevant corpus (for example, the corpus of the text to be processed), and a word vector model (for example, Word2vec, GloVe, ELMo, etc.) is used for word vector training to obtain the word vector corresponding to each vocabulary.
  • the vocabulary can be mapped to a 100-dimensional vector space, that is, the word vector corresponding to the vocabulary is a 100-dimensional vector.
  • the word vector corresponding to each word segmentation in the text to be processed is determined, and the part-of-speech vector is determined according to the word segmentation result.
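  • the word vector and the part-of-speech vector are spliced together to obtain the vectorized representation corresponding to each word segmentation.
  • for example, if the word vector is a 100-dimensional vector [B1, B2, B3,..., B100] and the part-of-speech vector is a 20-dimensional vector [C1, C2, C3,..., C20], the vectorized representation corresponding to the word segmentation can be a 120-dimensional vector [B1, B2, B3,..., B100, C1, C2, C3,..., C20].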
  • the target text matrix is determined, where the vectorized representation of each word segment in the text to be processed corresponds to a row in the target text matrix.
  • the vectorized representations of the word segmentations appear in the target text matrix in the same order as the word segmentations appear in the text to be processed. For example, if the word segmentations appear in the text to be processed in the order word segmentation 1, word segmentation 2, word segmentation 3, then in the target text matrix they correspond to row k, row k+1, and row k+2 respectively, where k is a positive integer, such as 1.
  • the target text matrix may be formed by direct combination of the vectorized representations corresponding to each word segmentation in the text to be processed. For example, if the text to be processed has a total of 200 word segments, and the vectorized representation corresponding to each word segmentation is a 120-dimensional vector, the target text matrix is a 200*120 matrix.
  • after the vectorized representations corresponding to each word segmentation in the text to be processed are combined to obtain a matrix, the obtained matrix can also be appropriately expanded (for example, horizontally and/or vertically) to form the target text matrix, where the expanded part can be filled with zeros. For example, if the text to be processed has a total of 200 word segments, and the vectorized representation corresponding to each word segmentation is a 120-dimensional vector, a 200*120 matrix is obtained after the combination, and it is expanded to a 200*200 text matrix as the target text matrix. In this way, even if the text lengths are different, the format of the obtained target text matrix is the same, which keeps the form of the target text matrix consistent and facilitates subsequent data processing.
  • the word segmentation feature and the part-of-speech feature of the word segmentation are extracted to obtain the vectorized expression of each word segmentation and form a matrix, which can provide effective data support for subsequent data processing.
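  • The construction of the target text matrix in step 11 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the function name `build_text_matrix` and its parameters `n_rows` and `n_cols` are hypothetical, and plain Python lists stand in for whatever matrix type an actual system would use.

```python
def build_text_matrix(word_vecs, pos_vecs, n_rows=None, n_cols=None):
    """Build a text matrix with one row per word segment, in text order.

    Each row is the word vector followed by the part-of-speech vector.
    If n_rows / n_cols are given, the matrix is expanded to that shape and
    the expanded part is filled with zeros.
    """
    rows = [list(w) + list(p) for w, p in zip(word_vecs, pos_vecs)]
    width = len(rows[0]) if rows else 0
    n_cols = width if n_cols is None else n_cols
    n_rows = len(rows) if n_rows is None else n_rows
    if n_cols < width or n_rows < len(rows):
        raise ValueError("target shape is smaller than the data")
    rows = [r + [0.0] * (n_cols - len(r)) for r in rows]          # horizontal expansion
    rows += [[0.0] * n_cols for _ in range(n_rows - len(rows))]   # vertical expansion
    return rows

# Three word segments, each with a 100-dimensional word vector
# and a 20-dimensional part-of-speech vector (placeholder values).
word_vecs = [[0.1] * 100, [0.2] * 100, [0.3] * 100]
pos_vecs = [[0.01] * 20, [0.02] * 20, [0.01] * 20]

matrix = build_text_matrix(word_vecs, pos_vecs)                       # 3 x 120
padded = build_text_matrix(word_vecs, pos_vecs, n_rows=5, n_cols=200) # 5 x 200
```

  • Padding every text to the same shape is what lets texts of different lengths share one model input format.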
  • in step 12, the target text matrix is input to the text information extraction model to obtain the information extraction result output by the text information extraction model.
  • the information extraction result includes tag information of each word segmentation in the text to be processed.
  • Tag information can be used to indicate the word segmentation type.
  • the word segmentation type can include the attribute type, the emotion type, and other types besides the attribute type and the emotion type.
  • a word segmentation belonging to the attribute type is used to describe a function or performance, and a word segmentation belonging to the emotion type is used to describe the emotion toward a word segmentation belonging to the attribute type.
  • the tag information of the word segmentation may also indicate the emotion type of the word segmentation.
  • the label information can reflect the word segmentation type and the emotion type of the word segmentation at the same time.
  • the word segmentation type and the emotion type can be distinguished by identification information such as keywords.
  • emotion types can be divided into three types: positive, neutral, and negative.
  • the tag information of the word segmentation can also be used to indicate whether the word segmentation is in the first position in the preset word.
  • the preset word is a word or phrase belonging to the attribute category or emotion category. For example, if the preset word is "engine power" in the attribute category, the tag information of "engine" indicates that it is in the first position in the preset word, and the tag information of "power" indicates that it is not in the first position in the preset word.
  • similarly, for a preset word such as "very ugly" in the emotion category, the tag information of "very" indicates that it is in the first position in the preset word, and the tag information of "ugly" indicates that it is not in the first position in the preset word.
  • Figure 2A is an example of label information for each word segmentation in the text, where the attribute class corresponds to Attr, the sentiment class corresponds to Opin, the other classes correspond to O, the first position in the preset word corresponds to B, a non-first position in the preset word corresponds to I, the positive emotion corresponds to Pos, the neutral emotion corresponds to Neu, and the negative emotion corresponds to Neg.
  • for example, if the tag information of the word segmentation "engine" is B_Attr_Pos, then "engine" belongs to the attribute category, is in the first position in the preset word, and its emotion type is positive.
  • the label information of each word segmentation in the text may also be as shown in FIG. 2B, where the label information corresponding to each word has the same meaning as in FIG. 2A and only the form of the label information is different.
  • FIG. 2A and FIG. 2B are only examples. The label information in this method is not limited to the above-mentioned forms, as long as the different cases can be distinguished. Other possible examples will not be repeated here.
  • according to the tag information of a word segmentation, it can be determined what kind of word segmentation it is, for example, whether it belongs to the attribute type, the emotion type, or other types, and, if it belongs to the attribute type, what its emotion type is.
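  • To make the reading of a label concrete, the following sketch decodes labels written in the style of the B_Attr_Pos example. It is a hypothetical illustration: the `Tag` type and `parse_tag` function are names introduced here, and the forms assumed for emotion-class labels (`B_Opin`, `I_Opin`, with no emotion suffix) and for other-class labels (`O`) are assumptions, since the text only spells out one attribute-class label.

```python
from typing import NamedTuple, Optional

class Tag(NamedTuple):
    is_first: Optional[bool]  # True for B, False for I, None for other-class segments
    kind: str                 # "Attr", "Opin", or "O"
    emotion: Optional[str]    # "Pos", "Neu", or "Neg" for attribute segments, else None

def parse_tag(label: str) -> Tag:
    """Decode a label such as 'B_Attr_Pos' into position, type, and emotion."""
    if label == "O":
        return Tag(None, "O", None)
    parts = label.split("_")
    if len(parts) < 2 or parts[0] not in ("B", "I") or parts[1] not in ("Attr", "Opin"):
        raise ValueError(f"unrecognized label: {label!r}")
    emotion = None
    if parts[1] == "Attr":
        # Only attribute-class labels carry an emotion type.
        if len(parts) != 3 or parts[2] not in ("Pos", "Neu", "Neg"):
            raise ValueError(f"attribute label needs an emotion type: {label!r}")
        emotion = parts[2]
    return Tag(parts[0] == "B", parts[1], emotion)

engine_tag = parse_tag("B_Attr_Pos")
```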
  • the text information extraction model can be obtained in the following manner:
  • the deep neural network model is trained to obtain a text information extraction model.
  • the second historical text can be taken from a corpus related to the text to be processed.
  • the method of obtaining the historical text matrix corresponding to the second historical text has the same principle as the method of obtaining the target text matrix, which has been described in the foregoing and will not be repeated here.
  • the historical label information corresponding to each word segmentation in the second historical text can be manually labeled. The label information is also described in the previous section, and the description will not be repeated here.
  • the historical text matrix corresponding to the second historical text is used as input data
  • the historical label information corresponding to each word segmentation in the second historical text is used as output data
  • the deep neural network model is trained to obtain a text information extraction model.
  • for example, during model training, the deep neural network model is trained based on a learning framework such as TensorFlow, MXNet, or PyTorch; one or more encoders (for example, LSTM, Transformer, BERT) are used for encoding, and a decoder (for example, CRF) decodes each word segmentation position to extract the label information corresponding to that position.
  • the method for training the deep neural network model belongs to the prior art and is well known to those skilled in the art, and will not be repeated here.
  • model training is performed based on the existing data to obtain the text information extraction model.
  • the corresponding data is directly input into the text information extraction model to obtain the information extraction results output by the text information extraction model.
  • the application is simple and convenient.
  • in step 13, the attribute words in the text to be processed and the emotion type of each attribute word are determined according to the first participles, that is, the word segmentations whose word segmentation type is the attribute type in the text to be processed, and their sentiment types.
  • each attribute word is composed of at least one first participle. If the attribute word consists of a first participle, the attribute word is the first participle. If the attribute word is composed of more than one first participle, the attribute word is a compound word formed by these first participles, and when the attribute word is composed of more than one first participle, these first participles that constitute the attribute word are to be processed The position in the text is continuous.
  • determining the attribute words in the text to be processed in step 13 may include the following steps:
  • the attribute words in the text to be processed are determined.
  • a first participle that is in the first position in the preset word is also in the first position in the attribute word in which it is located.
  • therefore, such a first participle (hereinafter referred to as the "start word") can be used as the first participle of an attribute word, and the remainder of the attribute word can then be determined.
  • if the participle to the right of the start word in the text to be processed does not belong to the attribute category, the start word alone is determined as an attribute word.
  • for example, the tag information corresponding to a piece of text {e1, e2, e3} is {O, Begin1, O} in order, where O indicates that the word segmentation type is other types, and Begin1 indicates that the word segmentation type is the attribute type and the participle is in the first position in the preset word. It can be seen that the start word is e2 and that the participle e3 to the right of the start word e2 does not belong to the attribute category, so the participle e2 can be directly determined as an attribute word.
  • if the participle to the right of the start word (hereinafter referred to as the "right neighbor") in the text to be processed belongs to the attribute category, and the label information of the right neighbor indicates that it is not in the first position in the preset word, the right neighbor is used as the starting point to search for consecutive participles that belong to the attribute class and are not in the first position in the preset word, and the search result is used as the remaining part of the attribute word.
  • for example, the tag information corresponding to a piece of text {e4, e5, e6, e7, e8} is {O, Begin1, Inside1, Inside1, O} in order, where O indicates that the word segmentation type is other types, Begin1 indicates that the word segmentation type is the attribute type and the participle is in the first position in the preset word, and Inside1 indicates that the word segmentation type is the attribute type and the participle is not in the first position in the preset word. It can be seen that the start word is e5, and the right neighbor e6 of the start word e5 belongs to the attribute category and is not in the first position in the preset word, so e6 can be used as the starting point to find consecutive participles that belong to the attribute class and are not in the first position in the preset word.
  • the search result is e6e7, so the remaining part of the attribute word is determined as e6e7, and the attribute word is finally determined as e5e6e7.
  • each attribute word in the text to be processed can be quickly determined, and the efficiency is high.
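  • The start-word search described above can be sketched as a single left-to-right pass over the labels. This is a minimal sketch, not the disclosed implementation; the function name `extract_attribute_words` is hypothetical, the labels `O`, `Begin1`, and `Inside1` are taken from the examples in the text, and the choice to ignore an `Inside1` that has no preceding `Begin1` is an assumption.

```python
def extract_attribute_words(tokens, labels):
    """Group consecutive attribute-class word segments into attribute words.

    A 'Begin1' label opens a new attribute word (the start word); each
    directly following 'Inside1' label extends it; anything else closes it.
    """
    words, current = [], []
    for token, label in zip(tokens, labels):
        if label == "Begin1":
            if current:                 # a new start word ends the previous attribute word
                words.append(current)
            current = [token]
        elif label == "Inside1" and current:
            current.append(token)       # remaining part of the attribute word
        else:
            if current:
                words.append(current)
            current = []                # assumption: a stray Inside1 is ignored
    if current:
        words.append(current)
    return words

single = extract_attribute_words(["e1", "e2", "e3"], ["O", "Begin1", "O"])
multi = extract_attribute_words(
    ["e4", "e5", "e6", "e7", "e8"],
    ["O", "Begin1", "Inside1", "Inside1", "O"],
)
```

  • For the two examples in the text, `single` contains the one-segment attribute word e2 and `multi` contains the attribute word made of e5, e6, and e7.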
  • determining the sentiment type of the attribute word in step 13 may include the following steps:
  • the emotion type corresponding to the first first participle constituting the attribute word is determined as the emotion type corresponding to the attribute word.
  • in general, the emotion types corresponding to all the first participles constituting an attribute word are the same. Therefore, the emotion type corresponding to any one of these first participles, for example the first one, can be directly determined as the emotion type of the attribute word. If the emotion types differ, the above method can still be used to directly determine the emotion type corresponding to the first of the first participles as the emotion type corresponding to the attribute word.
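  • The rule that an attribute word takes the emotion type of its first participle can be sketched as follows. This is an illustrative sketch only: the function name `attribute_words_with_emotion` is hypothetical, the labels follow the B_Attr_Pos style of FIG. 2A, and the `B_Opin` form used for an emotion-class segment is an assumption.

```python
def attribute_words_with_emotion(tokens, labels):
    """Return (attribute word segments, emotion type) pairs.

    The emotion type is read from the first participle of each attribute
    word; emotion types on later participles are not consulted.
    """
    results, current, emotion = [], [], None
    for token, label in zip(tokens, labels):
        parts = label.split("_")
        is_attr = len(parts) == 3 and parts[1] == "Attr"
        if is_attr and parts[0] == "B":
            if current:
                results.append((current, emotion))
            current, emotion = [token], parts[2]   # emotion fixed by the first participle
        elif is_attr and parts[0] == "I" and current:
            current.append(token)                  # extends the word, emotion unchanged
        else:
            if current:
                results.append((current, emotion))
            current, emotion = [], None
    if current:
        results.append((current, emotion))
    return results

# The second participle deliberately carries a different emotion type
# to show that the first participle decides.
found = attribute_words_with_emotion(
    ["engine", "power", "is", "strong"],
    ["B_Attr_Pos", "I_Attr_Neu", "O", "B_Opin"],
)
```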
  • the target text matrix corresponding to the text to be processed is determined, and the target text matrix is input to the text information extraction model to obtain the information extraction result output by the text information extraction model.
  • the information extraction result includes the label information of each word segmentation in the text to be processed.
  • the word segmentation belonging to the attribute category in the text to be processed and the emotional types of these word segmentations can be obtained at the same time, thereby obtaining the attribute words in the text to be processed and the emotional type of each attribute word, which is highly efficient and can guarantee the accuracy rate.
  • Fig. 3 is a flowchart of a method for extracting text information according to another embodiment of the present disclosure. As shown in FIG. 3, based on the steps shown in FIG. 1, the method provided by the present disclosure may further include the following steps.
  • in step 31, the emotional words in the text to be processed are determined according to the second participles, that is, the word segmentations whose word segmentation type is the sentiment type in the text to be processed.
  • each emotional word is composed of at least one second participle. If the emotional word consists of one second participle, the emotional word is that second participle. If the emotional word is composed of more than one second participle, the emotional word is a compound word formed by these second participles, and the second participles that constitute the emotional word occupy consecutive positions in the text to be processed.
  • the emotional word in the text to be processed is determined according to the label information of each second word segmentation and its position in the text to be processed.
  • the method of determining the emotional word is the same as the principle of determining the attribute word, which has been described above and will not be repeated here.
  • in step 32, each attribute word is combined with each emotion word to obtain attribute emotion word pairs.
  • each attribute emotion word pair contains one attribute word and one emotion word.
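  • for example, if the text to be processed contains 4 attribute words m1, m2, m3, m4 and 2 emotion words n1, n2, 8 attribute emotion word pairs can be obtained after the combination, namely: m1-n1, m1-n2, m2-n1, m2-n2, m3-n1, m3-n2, m4-n1, m4-n2.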
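  • Combining each attribute word with each emotion word is a Cartesian product of the two lists. The short sketch below reproduces the eight pairs listed above; the variable names are chosen for the example.

```python
from itertools import product

attribute_words = ["m1", "m2", "m3", "m4"]
emotion_words = ["n1", "n2"]

# Every attribute word paired with every emotion word, attribute word first.
pairs = list(product(attribute_words, emotion_words))
pair_names = [f"{m}-{n}" for m, n in pairs]
```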
  • in step 33, the attribute emotion word pairs and the position information of the attribute emotion word pairs are input to the association model to obtain the association result output by the association model.
  • the position information of the attribute emotion word pair is used to indicate the position relationship between the attribute word in the attribute emotion word pair and the emotion word in the text to be processed. For example, the position of the attribute word and the emotion word in the text to be processed, the distance between the attribute word and the emotion word in the text to be processed, whether the attribute word and the emotion word are in the same sentence, and so on.
  • the association result is used to indicate whether the attribute words and emotion words in each attribute emotion word pair are related.
  • the correlation between attribute words and emotional words means that the object described by the emotional word is the attribute word. Whether attribute words and emotional words are related can be reflected by their location. For example, related attribute words and emotional words are generally located in the same sentence or in similar locations.
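  • One possible encoding of the position information for an attribute emotion word pair is sketched below. It is an assumption-laden illustration: the function name `position_info`, the representation of each word as a `(start, end)` span of word segmentation indices with an exclusive end, the per-segment `sentence_ids` list, and the returned field names are all hypothetical; the text only names the kinds of information involved (positions, distance, same sentence or not).

```python
def position_info(attr_span, emo_span, sentence_ids):
    """Describe where an attribute word and an emotion word sit relative to each other.

    attr_span, emo_span: (start, end) word segmentation indices, end exclusive.
    sentence_ids: for each word segmentation in the text, the index of its sentence.
    """
    a_start, a_end = attr_span
    e_start, e_end = emo_span
    if a_end <= e_start:
        distance = e_start - a_end      # segments between attribute word and emotion word
    elif e_end <= a_start:
        distance = a_start - e_end      # emotion word comes first
    else:
        distance = 0                    # overlapping spans
    return {
        "attr_start": a_start,
        "emo_start": e_start,
        "distance": distance,
        "same_sentence": sentence_ids[a_start] == sentence_ids[e_start],
    }

# "the appearance is ugly . engine power is strong" as nine word segmentations.
sentence_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1]
near = position_info((1, 2), (3, 4), sentence_ids)   # "appearance" and "ugly"
far = position_info((1, 2), (8, 9), sentence_ids)    # "appearance" and "strong"
```

  • Features of this kind reflect the observation above that related attribute words and emotional words tend to share a sentence or sit close together.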
  • association model can be obtained in the following manner:
  • the historical attribute emotional word pairs corresponding to the first historical text and the position information of those pairs are used as input data, the historical association result of each historical attribute emotional word pair in the first historical text is used as output data, and the model is trained to obtain the association model.
  • the first historical text can be taken from a corpus related to the text to be processed.
  • the first historical text and the second historical text in the preceding text may be the same.
  • the method of obtaining the historical attribute emotional word pair corresponding to the first historical text is the same as the method of obtaining the attribute emotional word pair in step 32 (and related steps of how to obtain attribute words, emotional words, etc.), which has been described in the foregoing. Do not repeat it here.
  • the historical association results of each historical attribute emotional word pair in the first historical text can be manually labeled, that is, whether each historical attribute emotional word pair is related or not.
  • the position information of the historical attribute emotional word pair and the historical attribute emotional word pair corresponding to the first historical text is used as input data, and the historical association result of each historical attribute emotional word pair in the first historical text is used as output data.
  • the deep neural network model is trained to obtain the correlation model. For example, during model training, the deep neural network model is trained based on learning methods such as RandomForest, LSTM-attention, and Recurrent Attention.
  • model training is performed based on the existing data to obtain the association model.
  • the corresponding data is directly input into the association model to obtain the association result output by the association model, and the application is simple and convenient.
  • In step 34, the target attribute emotion word pairs are determined according to the association result.
  • A target attribute emotion word pair is an attribute emotion word pair in the text to be processed whose attribute word and emotion word are related.
  • The target attribute emotion word pairs can be selected from the attribute emotion word pairs for the user to view or use.
  • The emotion words related to each attribute word can thus also be extracted from the text to be processed.
  • The information extraction function is therefore more complete, and the data is convenient for users to view and use.
  • Fig. 4 is a block diagram of a text information extraction device provided according to an embodiment of the present disclosure. As shown in Fig. 4, the device 40 includes:
  • the first determining module 41, configured to determine a target text matrix corresponding to the text to be processed, where the text matrix corresponding to a text includes a vectorized representation corresponding to each word segment in the text;
  • the first processing module 42, configured to input the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, where the information extraction result includes the label information of each word segment in the text to be processed, the label information is used to indicate the word segment type, the word segment type includes an attribute class, and, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate the emotion type of the word segment;
  • the second determining module 43, configured to determine the attribute words in the text to be processed and the emotion type of each attribute word according to the first word segments, that is, the word segments in the text to be processed whose word segment type is the attribute class, and the emotion types of the first word segments, where each attribute word is composed of at least one first word segment.
  • Optionally, the word segment type further includes an emotion class;
  • the device 40 further includes:
  • the third determining module, configured to determine the emotion words in the text to be processed according to the second word segments, that is, the word segments in the text to be processed whose word segment type is the emotion class, where each emotion word is composed of at least one second word segment;
  • the fourth determining module, configured to combine each attribute word with each emotion word to obtain attribute emotion word pairs, each attribute emotion word pair containing one attribute word and one emotion word;
  • the second processing module, configured to input the attribute emotion word pairs and the position information of the attribute emotion word pairs into the association model to obtain the association result output by the association model, where the association result indicates whether the attribute word and the emotion word in each attribute emotion word pair are related, and the position information of an attribute emotion word pair indicates the positional relationship, in the text to be processed, between the attribute word and the emotion word of the pair;
  • the fifth determining module, configured to determine, according to the association result, the target attribute emotion word pairs, in which the attribute word and the emotion word are related.
  • Optionally, the association model is obtained in the following manner: a deep neural network model is trained with the historical attribute emotion word pairs corresponding to the first historical text and their position information as input data, and with the historical association result of each historical attribute emotion word pair in the first historical text as output data.
  • Optionally, the first determining module 41 includes:
  • the first determining sub-module, configured to perform word segmentation on the text to be processed and determine the word vector and part-of-speech vector corresponding to each word segment in the text to be processed;
  • the second determining sub-module, configured to determine the target text matrix according to the vectorized representation corresponding to each word segment, where the vectorized representation of each word segment in the text to be processed corresponds to one row of the target text matrix.
  • Optionally, the text information extraction model is obtained in the following manner:
  • the historical text matrix corresponding to the second historical text is used as input data, the historical label information corresponding to each word segment in the second historical text is used as output data, and the deep neural network model is trained to obtain the text information extraction model.
  • Optionally, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate whether the word segment is in the first position in a preset word;
  • the second determining module 43 includes:
  • the attribute word determination sub-module, configured to determine the attribute words in the text to be processed according to the label information of each first word segment and its position in the text to be processed, where a first word segment that is in the first position in a preset word is in the first position in the attribute word to which it belongs.
  • Optionally, the second determining module 43 includes:
  • the emotion type determination sub-module, configured to determine the emotion type corresponding to the first of the first word segments constituting an attribute word as the emotion type corresponding to that attribute word.
  • The text information extraction device includes a processor and a memory. The above-mentioned first determining module, first processing module, second determining module, third determining module, fourth determining module, second processing module, fifth determining module, and so on are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
  • The processor contains a kernel, and the kernel calls the corresponding program unit from the memory.
  • One or more kernels can be provided. By adjusting the kernel parameters, text information can be extracted more quickly and accurately, and the attribute words and their emotions can be obtained quickly.
  • An embodiment of the present invention provides a storage medium on which a program is stored, and when the program is executed by a processor, the text information extraction method is implemented.
  • An embodiment of the present invention provides a processor configured to run a program, where the text information extraction method is executed when the program runs.
  • The device 70 includes at least one processor 701, and at least one memory 702 and a bus 703 connected to the processor 701, where the processor 701 and the memory 702 communicate with each other through the bus 703, and the processor 701 is configured to call program instructions in the memory 702 to execute the above-mentioned text information extraction method.
  • the device in this article can be a server, a PC, etc.
  • This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
  • inputting the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, where the information extraction result includes the label information of each word segment in the text to be processed, the label information is used to indicate the word segment type, the word segment type includes an attribute class, and, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate the emotion type of the word segment;
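  • Optionally, the word segment type further includes an emotion class;
  • the method further includes:
  • combining each attribute word with each emotion word to obtain attribute emotion word pairs, each attribute emotion word pair containing one attribute word and one emotion word;
  • determining the target attribute emotion word pairs, in which the attribute word and the emotion word are related.
  • Optionally, the association model is obtained in the following manner: a deep neural network model is trained with the historical attribute emotion word pairs corresponding to the first historical text and their position information as input data, and with the historical association result of each historical attribute emotion word pair in the first historical text as output data.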
  • Optionally, determining the target text matrix corresponding to the text to be processed includes:
  • determining the target text matrix according to the vectorized representation corresponding to each word segment, where the vectorized representation of each word segment in the text to be processed corresponds to one row of the target text matrix.
  • Optionally, the text information extraction model is obtained in the following manner:
  • the historical text matrix corresponding to the second historical text is used as input data, the historical label information corresponding to each word segment in the second historical text is used as output data, and the deep neural network model is trained to obtain the text information extraction model.
  • Optionally, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate whether the word segment is in the first position in a preset word;
  • determining the attribute words in the text to be processed includes:
  • determining the attribute words in the text to be processed according to the label information of each first word segment and its position in the text to be processed, where a first word segment that is in the first position in a preset word is in the first position in the attribute word to which it belongs.
  • Optionally, determining the emotion type of an attribute word includes:
  • determining the emotion type corresponding to the first of the first word segments constituting the attribute word as the emotion type corresponding to that attribute word.
  • The device includes one or more processors (CPUs), memory, and a bus.
  • The device may also include input/output interfaces, network interfaces, and so on.
  • The memory may include non-permanent memory in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
  • The memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • The information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
  • This application can be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A text information extraction method and apparatus, a storage medium and a device. The method comprises: determining a target text matrix corresponding to a text to be processed (11); inputting the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model (12), wherein the information extraction result comprises label information of each word segment in the text to be processed, the label information is used to indicate the word segment type, and, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate the emotion type of the word segment; and determining, according to first word segments of the attribute class in the text to be processed and the emotion types of the first word segments, the attribute words in the text to be processed and the emotion type of each attribute word (13). In this way, by means of the trained model, the word segments belonging to the attribute class in the text to be processed and the emotion types of these word segments are obtained at the same time, so that the attribute words in the text to be processed and the emotion type of each attribute word are obtained with high efficiency while accuracy is ensured.

Description

Text information extraction method, apparatus, storage medium and device
Technical field
The present disclosure relates to the field of computer technology, and in particular to a text information extraction method, apparatus, storage medium, and device.
Background
In a piece of text (for example, a user review), in addition to the subject, there are words (or phrases) belonging to the attribute class and words (or phrases) belonging to the emotion class. Words (or phrases) of the emotion class give an emotional description of words (or phrases) of the attribute class. Words (or phrases) of the attribute class describe a function or performance aspect of the subject, and the emotion type of an emotion-class word is generally one of three: positive, neutral, or negative. For example, if the text is "The appearance of car A is ugly", the subject is "car A", the attribute-class word is "appearance", and the emotion-class word is "ugly". Attribute sentiment can be obtained from such text, that is, specific attributes and their emotion types can be extracted according to the text content.
At present, attribute sentiment is generally obtained by a two-step method with a pipeline structure. First, the words (or phrases) belonging to the attribute class are extracted from the text by sequence labeling (for example, LSTM-CRF or BERT-CRF). Then, for each attribute-class word (or phrase), the word (or phrase) and the sentence containing it are used as model training data, and deep learning methods (for example, LSTM-attention, BERT-CLS, ATAE, Recurrent Attention, or Transformation Network) are used for training to obtain a model that makes predictions for a single attribute-class word. However, this two-step training causes information loss and error accumulation, which biases the resulting attribute sentiment and shows up as a loss of accuracy.
Summary of the invention
The purpose of the present disclosure is to provide a text information extraction method, apparatus, storage medium, and device that can extract text information more quickly and accurately and quickly obtain attribute words and their emotions.
To achieve the above purpose, according to a first aspect of the present disclosure, a text information extraction method is provided, the method including:
determining a target text matrix corresponding to a text to be processed, wherein the text matrix corresponding to a text includes a vectorized representation corresponding to each word segment in the text;
inputting the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, wherein the information extraction result includes label information of each word segment in the text to be processed, the label information is used to indicate a word segment type, the word segment type includes an attribute class, and, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate the emotion type of the word segment;
determining, according to first word segments, that is, the word segments in the text to be processed whose word segment type is the attribute class, and the emotion types of the first word segments, the attribute words in the text to be processed and the emotion type of each attribute word, wherein each attribute word is composed of at least one first word segment.
Optionally, the word segment type further includes an emotion class;
the method further includes:
determining the emotion words in the text to be processed according to second word segments, that is, the word segments in the text to be processed whose word segment type is the emotion class, wherein each emotion word is composed of at least one second word segment;
combining each attribute word with each emotion word to obtain attribute emotion word pairs, each attribute emotion word pair containing one attribute word and one emotion word;
inputting the attribute emotion word pairs and position information of the attribute emotion word pairs into an association model to obtain an association result output by the association model, wherein the association result is used to indicate whether the attribute word and the emotion word in each attribute emotion word pair are related, and the position information of an attribute emotion word pair is used to indicate the positional relationship, in the text to be processed, between the attribute word and the emotion word of the pair;
determining, according to the association result, the target attribute emotion word pairs, in which the attribute word and the emotion word are related.
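The pairing and filtering steps can be illustrated with a minimal Python sketch. It assumes four attribute words (m1 to m4) and two emotion words (n1, n2), which yields 4 × 2 = 8 candidate pairs. The span records, the two position features, and `stand_in_association_model` are hypothetical: the disclosure uses a trained association model, for which a fixed distance rule is substituted here purely for illustration.

```python
from itertools import product

# Hypothetical span records: (text, token_index, sentence_index).
attribute_words = [("m1", 0, 0), ("m2", 7, 0), ("m3", 9, 1), ("m4", 16, 1)]
emotion_words = [("n1", 2, 0), ("n2", 11, 1)]


def position_info(attr, emo):
    """Positional relationship of one pair: token distance and same-sentence flag."""
    return {
        "distance": abs(attr[1] - emo[1]),
        "same_sentence": attr[2] == emo[2],
    }


def stand_in_association_model(pair, info):
    """Placeholder for the trained association model (an assumption, not the disclosed model)."""
    return info["same_sentence"] and info["distance"] <= 3


# Every attribute word is combined with every emotion word.
pairs = list(product(attribute_words, emotion_words))

# Keep only the pairs that the stand-in model marks as related.
target_pairs = [
    (attr[0], emo[0])
    for attr, emo in pairs
    if stand_in_association_model((attr, emo), position_info(attr, emo))
]
```

With this toy data, the rule keeps the pairs whose words share a sentence and lie within three tokens of each other; a trained model would learn such a decision from labeled historical pairs instead.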
Optionally, the association model is obtained in the following manner:
training a deep neural network model with the historical attribute emotion word pairs corresponding to a first historical text and the position information of the historical attribute emotion word pairs as input data, and with the historical association result of each historical attribute emotion word pair in the first historical text as output data, to obtain the association model.
Optionally, determining the target text matrix corresponding to the text to be processed includes:
performing word segmentation on the text to be processed, and determining the word vector and part-of-speech vector corresponding to each word segment in the text to be processed;
concatenating the word vector and the part-of-speech vector of each word segment to obtain the vectorized representation corresponding to each word segment;
determining the target text matrix according to the vectorized representation corresponding to each word segment, wherein the vectorized representation of each word segment in the text to be processed corresponds to one row of the target text matrix.
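A minimal sketch of this construction follows, using plain Python lists. The lookup tables `WORD_VECTORS` and `POS_VECTORS`, their dimensions, and the segmenter output are all hypothetical; a real system would take them from trained embeddings and an actual word segmentation and part-of-speech tagging tool.

```python
# Hypothetical lookup tables; real systems would use trained embeddings.
WORD_VECTORS = {
    "appearance": [0.1, 0.2, 0.3],
    "is": [0.0, 0.1, 0.0],
    "ugly": [0.9, 0.8, 0.7],
}
POS_VECTORS = {"NOUN": [1.0, 0.0], "VERB": [0.0, 1.0], "ADJ": [0.5, 0.5]}


def build_text_matrix(segments):
    """Return one row per word segment: word vector concatenated with POS vector."""
    return [WORD_VECTORS[word] + POS_VECTORS[pos] for word, pos in segments]


# Assumed output of a word segmenter with part-of-speech tagging.
segments = [("appearance", "NOUN"), ("is", "VERB"), ("ugly", "ADJ")]
target_text_matrix = build_text_matrix(segments)
```

Each row has the word-vector length plus the part-of-speech-vector length (here 3 + 2 = 5), and the number of rows equals the number of word segments.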
Optionally, the text information extraction model is obtained in the following manner:
training a deep neural network model with the historical text matrix corresponding to a second historical text as input data, and with the historical label information corresponding to each word segment in the second historical text as output data, to obtain the text information extraction model.
Optionally, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate whether the word segment is in the first position in a preset word;
determining the attribute words in the text to be processed includes:
determining the attribute words in the text to be processed according to the label information of each first word segment and its position in the text to be processed, wherein a first word segment that is in the first position in a preset word is in the first position in the attribute word to which it belongs.
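One way to read this step is as a grouping pass over the labeled word segments, sketched below. The label encoding, a `(segment_type, is_first, emotion_type)` tuple with `"ATTR"` and `"O"` as type names, is an assumption made for illustration; the disclosure does not fix a concrete label format.

```python
def group_attribute_words(segments, labels):
    """Group consecutive attribute-class segments into attribute words.

    labels[i] is a hypothetical (segment_type, is_first, emotion_type) tuple.
    A segment flagged is_first always opens a new attribute word.
    """
    words, current = [], []
    for segment, (seg_type, is_first, _emotion) in zip(segments, labels):
        if seg_type != "ATTR":
            # A non-attribute segment closes any attribute word in progress.
            if current:
                words.append(current)
            current = []
            continue
        if is_first and current:
            # A first-position segment closes the previous adjacent word.
            words.append(current)
            current = []
        current.append(segment)
    if current:
        words.append(current)
    return words


segments = ["fuel", "consumption", "seats", "are", "bad"]
labels = [
    ("ATTR", True, "negative"),
    ("ATTR", False, "negative"),
    ("ATTR", True, "negative"),
    ("O", False, None),
    ("O", False, None),
]
attribute_words = group_attribute_words(segments, labels)
```

The first-position flag is what separates the adjacent attribute words "fuel consumption" and "seats"; without it, the three consecutive attribute-class segments would merge into one word.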
Optionally, determining the emotion type of an attribute word includes:
determining the emotion type corresponding to the first of the first word segments constituting the attribute word as the emotion type corresponding to that attribute word.
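The rule above can be sketched in a few lines. The label tuples `(segment_type, is_first, emotion_type)` are a hypothetical encoding chosen for this example, not a format specified by the disclosure.

```python
def attribute_word_emotion(word_labels):
    """Emotion type of an attribute word, taken from its first word segment.

    word_labels holds the hypothetical (segment_type, is_first, emotion_type)
    tuples of the segments that make up one attribute word, in text order.
    """
    if not word_labels:
        raise ValueError("an attribute word has at least one segment")
    return word_labels[0][2]


# The two segments disagree; the first segment decides.
labels = [("ATTR", True, "negative"), ("ATTR", False, "neutral")]
emotion = attribute_word_emotion(labels)
```

Taking the first segment gives a deterministic result even when the model assigns different emotion types to segments of the same attribute word.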
According to a second aspect of the present disclosure, a text information extraction apparatus is provided, the apparatus including:
a first determining module, configured to determine a target text matrix corresponding to a text to be processed, wherein the text matrix corresponding to a text includes a vectorized representation corresponding to each word segment in the text;
a first processing module, configured to input the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, wherein the information extraction result includes label information of each word segment in the text to be processed, the label information is used to indicate a word segment type, the word segment type includes an attribute class, and, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate the emotion type of the word segment;
a second determining module, configured to determine, according to first word segments, that is, the word segments in the text to be processed whose word segment type is the attribute class, and the emotion types of the first word segments, the attribute words in the text to be processed and the emotion type of each attribute word, wherein each attribute word is composed of at least one first word segment.
Optionally, the word segment type further includes an emotion class;
the apparatus further includes:
a third determining module, configured to determine the emotion words in the text to be processed according to second word segments, that is, the word segments in the text to be processed whose word segment type is the emotion class, wherein each emotion word is composed of at least one second word segment;
a fourth determining module, configured to combine each attribute word with each emotion word to obtain attribute emotion word pairs, each attribute emotion word pair containing one attribute word and one emotion word;
a second processing module, configured to input the attribute emotion word pairs and position information of the attribute emotion word pairs into an association model to obtain an association result output by the association model, wherein the association result is used to indicate whether the attribute word and the emotion word in each attribute emotion word pair are related, and the position information of an attribute emotion word pair is used to indicate the positional relationship, in the text to be processed, between the attribute word and the emotion word of the pair;
a fifth determining module, configured to determine, according to the association result, the target attribute emotion word pairs, in which the attribute word and the emotion word are related.
Optionally, the association model is obtained in the following manner:
training a deep neural network model with the historical attribute emotion word pairs corresponding to a first historical text and the position information of the historical attribute emotion word pairs as input data, and with the historical association result of each historical attribute emotion word pair in the first historical text as output data, to obtain the association model.
Optionally, the first determining module includes:
a first determining sub-module, configured to perform word segmentation on the text to be processed and determine the word vector and part-of-speech vector corresponding to each word segment in the text to be processed;
a processing sub-module, configured to concatenate the word vector and the part-of-speech vector of each word segment to obtain the vectorized representation corresponding to each word segment;
a second determining sub-module, configured to determine the target text matrix according to the vectorized representation corresponding to each word segment, wherein the vectorized representation of each word segment in the text to be processed corresponds to one row of the target text matrix.
Optionally, the text information extraction model is obtained in the following manner:
training a deep neural network model with the historical text matrix corresponding to a second historical text as input data, and with the historical label information corresponding to each word segment in the second historical text as output data, to obtain the text information extraction model.
Optionally, if the word segment type of a word segment is the attribute class, the label information of the word segment is also used to indicate whether the word segment is in the first position in a preset word;
the second determining module includes:
an attribute word determination sub-module, configured to determine the attribute words in the text to be processed according to the label information of each first word segment and its position in the text to be processed, wherein a first word segment that is in the first position in a preset word is in the first position in the attribute word to which it belongs.
Optionally, the second determining module includes:
an emotion type determination sub-module, configured to determine the emotion type corresponding to the first of the first word segments constituting an attribute word as the emotion type corresponding to that attribute word.
According to a third aspect of the present disclosure, a storage medium is provided, on which a program is stored, and when the program is executed by a processor, the steps of the method described in the first aspect of the present disclosure are implemented.
According to a fourth aspect of the present disclosure, a device is provided, the device including:
at least one processor, and at least one memory and a bus connected to the processor;
wherein the processor and the memory communicate with each other through the bus;
the processor is configured to call program instructions in the memory to execute the steps of the method described in the first aspect of the present disclosure.
According to a fifth aspect of the present disclosure, a computer program product is provided which, when executed on a data processing device, is adapted to execute a program initialized with the steps of the method described in the first aspect of the present disclosure.
With the above technical solution, the target text matrix corresponding to the text to be processed is determined and input to the text information extraction model to obtain the information extraction result output by the model, where the information extraction result includes the label information of each word segment in the text to be processed. Then, according to the first word segments, that is, the word segments in the text to be processed whose segment type is the attribute class, and the emotion types of the first word segments, the attribute words in the text to be processed and the emotion type of each attribute word are determined. In this way, the information extraction result of the text information extraction model yields, at the same time, the word segments belonging to the attribute class and the emotion types of those word segments, from which the attribute words in the text to be processed and the emotion type of each attribute word are obtained, with high efficiency and assured accuracy.
Other features and advantages of the present disclosure will be described in detail in the detailed description that follows.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the present disclosure and constitute a part of the specification. Together with the following detailed description, they serve to explain the present disclosure but do not limit it. In the drawings:
Fig. 1 is a flowchart of a text information extraction method provided according to an embodiment of the present disclosure;
Fig. 2A and Fig. 2B are exemplary schematic diagrams of label information in the text information extraction method provided according to the present disclosure;
Fig. 3 is a flowchart of a text information extraction method provided according to another embodiment of the present disclosure;
Fig. 4 is a block diagram of a text information extraction apparatus provided according to an embodiment of the present disclosure;
Fig. 5 is a block diagram of a device provided according to an embodiment of the present disclosure.
Detailed Description
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are only used to illustrate and explain the present disclosure and are not intended to limit it.
Fig. 1 is a flowchart of a text information extraction method provided according to an embodiment of the present disclosure. As shown in Fig. 1, the method may include the following steps.
In step 11, the target text matrix corresponding to the text to be processed is determined.
The text matrix corresponding to a text includes the vectorized representation of each word segment in that text. For a piece of text, the text is first segmented into words, and the vectorized representation of each word segment is then obtained, where the vectorized representation of a word segment can reflect both the features of the segment itself and its part-of-speech features. Accordingly, the target text matrix corresponding to the text to be processed includes the vectorized representation of each word segment in the text to be processed.
In a possible implementation, step 11 may include the following steps:
performing word segmentation on the text to be processed, and determining the word vector and the part-of-speech vector corresponding to each word segment in the text to be processed;
splicing the word vector and the part-of-speech vector of each word segment to obtain the vectorized representation of each word segment;
determining the target text matrix according to the vectorized representation of each word segment.
Here, a word vector maps a word into a vector space, and the similarity between word vectors can reflect the similarity between words. A part-of-speech vector reflects the part-of-speech feature of a word, that is, the part of speech of a word can be determined from its part-of-speech vector. A part-of-speech vector may be represented by a random vector of a given dimension. For example, if there are 30 parts of speech A1 to A30, they may be represented in turn by vectors a1 to a30, whose dimension is a specified fixed value (for example, 20), and each component may be a randomly generated decimal close to 0.
In this method, the texts in a relevant corpus (for example, a corpus in the field of the text to be processed) are segmented in advance, and a word vector model (for example, Word2vec, GloVe, or ELMo) is used for word vector training to obtain the word vector of each word. For example, words may be mapped into a 100-dimensional vector space, that is, the word vector of each word is a 100-dimensional vector.
When the text to be processed is handled, it is first segmented to obtain a word segmentation result. Then, according to the word segmentation result and the pre-trained word vectors, the word vector corresponding to each word segment in the text to be processed is determined, and the part-of-speech vector is determined according to the word segmentation result. After the word vector and part-of-speech vector of each word segment have been obtained, they are spliced for each word segment to obtain its vectorized representation. For example, if the word vector of a word segment is the 100-dimensional vector [B1, B2, B3, …, B100] and its part-of-speech vector is the 20-dimensional vector [C1, C2, C3, …, C20], the vectorized representation of that word segment may be the 120-dimensional vector [B1, B2, B3, …, B100, C1, C2, C3, …, C20].
The target text matrix is determined according to the vectorized representation of each word segment in the text to be processed, where the vectorized representation of each word segment corresponds to one row of the target text matrix. In addition, the vectorized representations appear in the target text matrix in the same order as the corresponding word segments appear in the text to be processed. For example, if the word segments appear in the text to be processed in the order segment 1, segment 2, segment 3, then segment 1, segment 2, and segment 3 correspond to rows k, k+1, and k+2 of the target text matrix respectively, where k is a positive integer, for example 1.
In one possible embodiment, the target text matrix may be formed by directly stacking the vectorized representations of the word segments in the text to be processed. For example, if the text to be processed has 200 word segments in total and the vectorized representation of each word segment is a 120-dimensional vector, the target text matrix is a 200*120 matrix.
In another possible embodiment, after the vectorized representations of the word segments in the text to be processed have been stacked into a matrix, that matrix may be further expanded as appropriate (for example, horizontally and/or vertically) to form the target text matrix, where the expanded part may be zero-padded. For example, if the text to be processed has 200 word segments in total and the vectorized representation of each word segment is a 120-dimensional vector, stacking gives a 200*120 matrix, which is expanded into a 200*200 text matrix serving as the target text matrix. In this way, even if texts differ in length, the resulting target text matrices have the same format, which keeps their form consistent and facilitates subsequent data processing.
In the above manner, after the text to be processed is segmented, the features of each word segment and its part-of-speech features are extracted to obtain the vectorized representation of each word segment, and these are assembled into a matrix, which provides effective data support for subsequent data processing.
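The construction of the target text matrix in step 11 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `build_text_matrix`, its parameters, and the toy 4-dimensional word vectors and 2-dimensional part-of-speech vectors are assumptions chosen to keep the example small (the text uses 100 and 20 dimensions), and plain Python lists stand in for whatever matrix type an actual system would use.

```python
import random


def build_text_matrix(segments, word_vectors, pos_vectors, num_rows=None, num_cols=None):
    """Build a text matrix with one row per word segment, in text order.

    segments: list of (word, part_of_speech) pairs in the order they occur.
    num_rows / num_cols: optional target size for zero-padded expansion.
    """
    # Splice each word vector with its part-of-speech vector.
    rows = [list(word_vectors[w]) + list(pos_vectors[p]) for w, p in segments]
    if num_cols is None:
        num_cols = max((len(r) for r in rows), default=0)
    # Horizontal expansion: zero-pad every row up to num_cols.
    rows = [r + [0.0] * (num_cols - len(r)) for r in rows]
    # Vertical expansion: append all-zero rows up to num_rows.
    if num_rows is not None:
        rows += [[0.0] * num_cols for _ in range(num_rows - len(rows))]
    return rows


rng = random.Random(0)
# Hypothetical part-of-speech vectors: random values close to 0.
pos_vectors = {tag: [rng.uniform(-0.01, 0.01) for _ in range(2)] for tag in ("noun", "adj")}
# Toy word vectors standing in for pretrained ones.
word_vectors = {"engine": [0.1, 0.2, 0.3, 0.4], "strong": [0.5, 0.6, 0.7, 0.8]}

matrix = build_text_matrix(
    [("engine", "noun"), ("strong", "adj")],
    word_vectors,
    pos_vectors,
    num_rows=3,
    num_cols=8,
)
```

The sketch only pads; it does not truncate texts longer than `num_rows`, and it does not handle words missing from `word_vectors`, both of which a real system would need to decide on.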
In step 12, the target text matrix is input to the text information extraction model to obtain the information extraction result output by the model.
The information extraction result includes the label information of each word segment in the text to be processed. The label information may be used to indicate the segment type, which may be the attribute class, the emotion class, or another class that is neither of these. As described above, word segments belonging to the attribute class describe a function or a performance characteristic, and word segments belonging to the emotion class give an emotional description of word segments belonging to the attribute class.
Optionally, if the segment type of a word segment is the attribute class, its label information may also indicate the emotion type of that word segment. In other words, for a word segment of the attribute class, the label information reflects both its segment type and its emotion type. The segment type and the emotion type may be distinguished by identification information such as keywords. For example, emotion types may be divided into positive, neutral, and negative.
Optionally, if the segment type of a word segment is the attribute class or the emotion class, its label information may also be used to indicate whether the word segment is in the first position in a preset word. A preset word is a word or phrase belonging to the attribute class or the emotion class. For example, if the preset word is the attribute-class phrase "engine power", the label information of "engine" indicates that it is in the first position in the preset word, and the label information of "power" indicates that it is not. As another example, if the preset word is the emotion-class phrase "very ugly", the label information of "very" indicates that it is in the first position in the preset word, and the label information of "ugly" indicates that it is not.
For example, Fig. 2A shows the label information of each word segment in a text, where the attribute class corresponds to Attr, the emotion class to Opin, and the other class to O; being in the first position in a preset word corresponds to B, and not being in the first position corresponds to I; positive emotion corresponds to Pos, neutral emotion to Neu, and negative emotion to Neg. As shown in Fig. 2A, the label information of the word segment "engine" is B_Attr_Pos: "engine" belongs to the attribute class, is in the first position in a preset word, and has a positive emotion type. As another example, the label information of each word segment in the text may be as shown in Fig. 2B, where the label information of each word has the same meaning as in Fig. 2A and differs only in form. It should be noted that the label information in Fig. 2A and Fig. 2B is only exemplary. The label information in this method is not limited to these forms, as long as the required distinctions can be made, and other possible forms are not described here.
From the label information of a word segment, it can be determined what kind of word segment it is, for example, whether it belongs to the attribute class, the emotion class, or another class, and, if it belongs to the attribute class, what its emotion type is.
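As an illustration of how label information of the Fig. 2A kind could be decoded, the following sketch splits a label string into segment type, first-position flag, and emotion type. The underscore-separated layout follows the one example given in the text (B_Attr_Pos); the form "B_Opin" for emotion-class segments, the `Label` type, and the `parse_label` function are assumptions made for this example only.

```python
from typing import NamedTuple, Optional


class Label(NamedTuple):
    seg_type: str              # "Attr", "Opin", or "O"
    is_first: Optional[bool]   # True for B, False for I, None for the other class
    emotion: Optional[str]     # "Pos", "Neu", "Neg", or None if not indicated


def parse_label(tag: str) -> Label:
    """Decode a label string such as 'B_Attr_Pos', 'I_Opin', or 'O'."""
    if tag == "O":
        return Label("O", None, None)
    parts = tag.split("_")
    position, seg_type = parts[0], parts[1]
    # Only attribute-class labels are assumed to carry an emotion type.
    emotion = parts[2] if len(parts) > 2 else None
    return Label(seg_type, position == "B", emotion)


engine = parse_label("B_Attr_Pos")
power = parse_label("I_Attr_Pos")
ugly = parse_label("I_Opin")
other = parse_label("O")
```

Here `engine` decodes to an attribute-class segment in the first position with positive emotion, matching the reading of B_Attr_Pos given above.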
In a possible implementation, the text information extraction model may be obtained as follows:
a deep neural network model is trained with the historical text matrix corresponding to a second historical text as input data and the historical label information corresponding to each word segment in the second historical text as output data, to obtain the text information extraction model.
The second historical text may be taken from a corpus related to the text to be processed. The historical text matrix corresponding to the second historical text is obtained on the same principle as the target text matrix, which has been described above and is not repeated here. The historical label information corresponding to each word segment in the second historical text may be annotated manually. Label information has also been described above and is not repeated here.
Thus, the deep neural network model is trained with the historical text matrix corresponding to the second historical text as input data and the historical label information corresponding to each word segment in the second historical text as output data, to obtain the text information extraction model. For example, during model training, the deep neural network model is trained on a learning framework such as TensorFlow, MXNet, or PyTorch, one or more encoders (for example, LSTM, Transformer, or BERT) are used for encoding, and a decoder (for example, CRF) decodes at the position of each word segment to extract the label information corresponding to that position. It should be noted that the way a deep neural network model is trained belongs to the prior art and is well known to those skilled in the art, so it is not described here.
In the above manner, a model is trained on existing data to obtain the text information extraction model. In actual use, the relevant data is input directly into the text information extraction model to obtain the information extraction result it outputs, which makes the application simple and convenient.
In step 13, the attribute words in the text to be processed and the emotion type of each attribute word are determined according to the first word segments, that is, the word segments in the text to be processed whose segment type is the attribute class, and their emotion types.
Each attribute word consists of at least one first word segment. If an attribute word consists of one first word segment, the attribute word is that first word segment. If an attribute word consists of more than one first word segment, the attribute word is the compound word formed by those first word segments, and the first word segments constituting it occupy consecutive positions in the text to be processed.
In a possible implementation, determining the attribute words in the text to be processed in step 13 may include the following step:
determining the attribute words in the text to be processed according to the label information of each first word segment and its position in the text to be processed.
Here, a first word segment that is in the first position in a preset word is likewise in the first position in the attribute word to which it belongs.
If the label information of a first word segment indicates that it is in the first position in a preset word, that first word segment (hereinafter the "start word") may be taken as the first word segment of an attribute word, and the remainder of the attribute word is then determined as follows.
In one case, if the word segment to the right of the start word in the text to be processed does not belong to the attribute class, the start word itself is determined to be the attribute word. For example, suppose the label information corresponding to a text {e1, e2, e3} is {O, Begin1, O}, where O indicates that the segment type is the other class and Begin1 indicates that the segment type is the attribute class and the word segment is in the first position in a preset word. The start word is then e2, and because the word segment e3 to its right does not belong to the attribute class, e2 can be directly determined to be an attribute word.
In another case, if the word segment to the right of the start word (hereinafter the "right neighbor") belongs to the attribute class and its label information indicates that it is not in the first position in a preset word, a search is made, starting from the right neighbor, for consecutive word segments that belong to the attribute class and are not in the first position in a preset word, and the search result is taken as the remainder of the attribute word. For example, suppose the label information corresponding to a text {e4, e5, e6, e7, e8} is {O, Begin1, Inside1, Inside1, O}, where O indicates that the segment type is the other class, Begin1 indicates that the segment type is the attribute class and the word segment is in the first position in a preset word, and Inside1 indicates that the segment type is the attribute class and the word segment is not in the first position in a preset word. The start word is then e5, and its right neighbor e6 belongs to the attribute class and is not in the first position in a preset word. Searching from e6 for consecutive word segments that belong to the attribute class and are not in the first position in a preset word gives e6e7, so the remainder of the attribute word is e6e7 and the attribute word finally determined is e5e6e7.
All attribute words in the text to be processed can be determined by the above method.
In the above manner, the label information of the first word segments and their positions in the text to be processed allow each attribute word in the text to be processed to be determined quickly and efficiently.
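The two cases above amount to a left-to-right scan over the label sequence. The sketch below reuses the Begin1 / Inside1 / O labels from the examples in the text; the function name `extract_spans` and the half-open (start, end) index convention are assumptions for illustration. How to treat an Inside1 label that does not follow a start word is not specified in the text, so the sketch simply skips it.

```python
def extract_spans(tags, begin="Begin1", inside="Inside1"):
    """Return (start, end) index pairs, end exclusive, one per attribute word."""
    spans = []
    i = 0
    while i < len(tags):
        if tags[i] == begin:
            # Start word found: extend over consecutive "inside" segments.
            j = i + 1
            while j < len(tags) and tags[j] == inside:
                j += 1
            spans.append((i, j))
            i = j
        else:
            # Other-class segment, or an "inside" segment without a start word.
            i += 1
    return spans


segments = ["e4", "e5", "e6", "e7", "e8"]
tags = ["O", "Begin1", "Inside1", "Inside1", "O"]
spans = extract_spans(tags)
attribute_words = ["".join(segments[start:end]) for start, end in spans]

single = extract_spans(["O", "Begin1", "O"])
```

With the labels from the second example, the scan yields the single span covering e5, e6, and e7, so `attribute_words` contains "e5e6e7"; with the labels from the first example, the start word alone forms the span. The same scan applies to emotion words if the emotion-class labels are passed as `begin` and `inside`.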
In a possible implementation, determining the emotion type of an attribute word in step 13 may include the following step:
determining the emotion type corresponding to the first of the first word segments constituting the attribute word as the emotion type corresponding to the attribute word.
Generally, all the first word segments constituting an attribute word have the same emotion type, so the emotion type of any one of them, for example the first, can be taken directly as the emotion type of the attribute word. If the emotion types differ, the same approach may still be used, that is, the emotion type of the first of the first word segments is taken as the emotion type of the attribute word.
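A short sketch of this rule, assuming attribute words are given as half-open (start, end) index spans over the word segments and that a per-segment list of emotion types is available. Both representations, and the function name `attribute_words_with_emotion`, are illustrative assumptions.

```python
def attribute_words_with_emotion(segments, emotions, spans):
    """Pair each attribute word with the emotion type of its first word segment."""
    results = []
    for start, end in spans:
        word = "".join(segments[start:end])
        # The first segment decides, even if later segments disagree.
        results.append((word, emotions[start]))
    return results


segments = ["e4", "e5", "e6", "e7", "e8"]
# Hypothetical per-segment emotion types; e7 deliberately disagrees with e5.
emotions = [None, "Pos", "Pos", "Neu", None]
result = attribute_words_with_emotion(segments, emotions, [(1, 4)])
```

In this example the attribute word e5e6e7 receives the emotion type of e5, although e7 carries a different one.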
With the above solution, the target text matrix corresponding to the text to be processed is determined and input to the text information extraction model to obtain the information extraction result output by the model, where the information extraction result includes the label information of each word segment in the text to be processed. Then, according to the first word segments, that is, the word segments in the text to be processed whose segment type is the attribute class, and their emotion types, the attribute words in the text to be processed and the emotion type of each attribute word are determined. In this way, the information extraction result of the text information extraction model yields, at the same time, the word segments belonging to the attribute class and the emotion types of those word segments, from which the attribute words in the text to be processed and the emotion type of each attribute word are obtained, with high efficiency and assured accuracy.
Fig. 3 is a flowchart of a text information extraction method provided according to another embodiment of the present disclosure. As shown in Fig. 3, in addition to the steps shown in Fig. 1, the method provided by the present disclosure may further include the following steps.
In step 31, the emotion words in the text to be processed are determined according to the second word segments, that is, the word segments in the text to be processed whose segment type is the emotion class.
Each emotion word consists of at least one second word segment. If an emotion word consists of one second word segment, the emotion word is that second word segment. If an emotion word consists of more than one second word segment, the emotion word is the compound word formed by those second word segments, and the second word segments constituting it occupy consecutive positions in the text to be processed.
In a possible implementation, the emotion words in the text to be processed are determined according to the label information of each second word segment and its position in the text to be processed. Emotion words are determined on the same principle as attribute words, which has been described above and is not repeated here.
In step 32, each attribute word is combined with each emotion word to obtain attribute-emotion word pairs.
Each attribute-emotion word pair contains one attribute word and one emotion word.
For example, if the text to be processed contains the attribute words {m1, m2, m3, m4} and the emotion words {n1, n2}, combining them gives 8 attribute-emotion word pairs: m1-n1, m1-n2, m2-n1, m2-n2, m3-n1, m3-n2, m4-n1, and m4-n2.
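The combination in step 32 is a Cartesian product of the attribute words and the emotion words. A minimal sketch using the standard library, with the placeholder words from the example above:

```python
from itertools import product

attribute_words = ["m1", "m2", "m3", "m4"]
emotion_words = ["n1", "n2"]

# Every attribute word is paired with every emotion word.
pairs = list(product(attribute_words, emotion_words))
```

`itertools.product` iterates the last argument fastest, so the pairs come out in the same order as listed in the example, from m1-n1 to m4-n2.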
In step 33, the attribute-emotion word pairs and their position information are input to an association model to obtain the association result output by the association model.
The position information of an attribute-emotion word pair is used to indicate the positional relationship, in the text to be processed, between the attribute word and the emotion word of the pair. Examples include the respective positions of the attribute word and the emotion word in the text to be processed, the distance between them in the text to be processed, and whether they are in the same sentence.
The association result is used to indicate whether the attribute word and the emotion word in each attribute-emotion word pair are related. An attribute word and an emotion word are related if the object described by the emotion word is the attribute word. Whether an attribute word and an emotion word are related can be reflected by position, for example, related attribute words and emotion words are generally located in the same sentence or close to each other.
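One possible encoding of the position information is sketched below. The text names the kinds of information (positions, distance, same sentence) but not their representation, so everything here is an assumption: words are half-open (start, end) spans over segment indices, `sentence_ids` gives a hypothetical sentence or clause index for each segment, and distance is the number of segments lying between the two words.

```python
def position_features(attr_span, emo_span, sentence_ids):
    """Describe the positional relationship of one attribute-emotion word pair."""
    a_start, a_end = attr_span
    e_start, e_end = emo_span
    if e_start >= a_end:
        distance = e_start - a_end   # emotion word follows the attribute word
    elif a_start >= e_end:
        distance = a_start - e_end   # emotion word precedes the attribute word
    else:
        distance = 0                 # overlapping spans
    return {
        "attr_start": a_start,
        "emo_start": e_start,
        "distance": distance,
        "same_sentence": sentence_ids[a_start] == sentence_ids[e_start],
    }


# Segments of a toy text: car A / 's / engine / power / strong / , / but / appearance / very / ugly
sentence_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
engine_power, strong = (2, 4), (4, 5)
appearance, very_ugly = (7, 8), (8, 10)

near = position_features(engine_power, strong, sentence_ids)
far = position_features(engine_power, very_ugly, sentence_ids)
```

In this toy text, "engine power" and "strong" are adjacent and in the same clause, whereas "engine power" and "very ugly" are four segments apart and in different clauses, which is the kind of signal an association model could learn from.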
在一种可能的实施方式中,关联模型可以通过如下方式获得:In a possible implementation, the association model can be obtained in the following manner:
将第一历史文本对应的历史属性情感词对和历史属性情感词对的位置信息作为输入数据、并将第一历史文本中各历史属性情感词对的历史关联结果作为输出数据,对深度神经网络模型进行训练,以获得关联模型。The position information of the historical attribute emotional word pair and the historical attribute emotional word pair corresponding to the first historical text is used as input data, and the historical association result of each historical attribute emotional word pair in the first historical text is used as the output data. The model is trained to obtain an associated model.
第一历史文本可以取自待处理文本相关的语料库。第一历史文本和前文中的第二历史文本可以相同。第一历史文本对应的历史属性情感词对的获得方式与步骤32中获得属性情感词对的方式(以及,如何获得属性词、情感词等相关步骤)原理相同,在前文中已有描述,此处不赘述。第一历史文本中各历史属性情感词对的历史关联结果可以人工进行标注,也就是标注各历史属性情感词对是否相关。示例地,若第一历史文本为“A汽车的发动机动力强,但是外观很难看”,其中属性词为“发动机动力”和“外观”,情感词为“强”和“很难看”,其中,共有4个历史属性情感词对,分别为{发动机动力-强,发动机动力-很难看,外观-强,外观-很难看},在人工标注时,将“发动机动力-强”和“外观-很难看”标注为相关,将“发动机动力-很难看”和“外观-强”标注为不相关。The first historical text can be taken from a corpus related to the text to be processed. The first historical text and the second historical text in the preceding text may be the same. The method of obtaining the historical attribute emotional word pair corresponding to the first historical text is the same as the method of obtaining the attribute emotional word pair in step 32 (and related steps of how to obtain attribute words, emotional words, etc.), which has been described in the foregoing. Do not repeat it here. The historical association results of each historical attribute emotional word pair in the first historical text can be manually labeled, that is, whether each historical attribute emotional word pair is related or not. For example, if the first historical text is "The engine power of A car is strong, but the appearance is ugly", the attribute words are "engine power" and "appearance", and the emotional words are "strong" and "unsightly", among which, There are a total of 4 historical attribute emotional word pairs, namely {engine power-strong, engine power-ugly, appearance-strong, appearance-ugly}. When manually labeling, "engine power-strong" and "appearance-very "Unsightly" is marked as relevant, and "engine power-ugly" and "appearance-strong" are marked as irrelevant.
Thus, the deep neural network model is trained to obtain the association model, using the historical attribute-emotion word pairs corresponding to the first historical text and the position information of those pairs as input data, and the historical association result of each historical attribute-emotion word pair in the first historical text as output data. For example, during model training, the deep neural network model may be trained based on learning methods such as RandomForest, LSTM-attention, and Recurrent Attention.
With the above approach, the association model is obtained by training on existing data. In actual use, the corresponding data is simply input into the association model to obtain the association result it outputs, which makes the model simple and convenient to apply.
In step 34, the target attribute-emotion word pairs are determined according to the association result.
A target attribute-emotion word pair is an attribute-emotion word pair in the text to be processed whose attribute word and emotion word are related.
Once the association result is obtained, the target attribute-emotion word pairs, whose attribute word and emotion word are related, can be selected from it for the user to view or use.
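As a minimal sketch of step 34, assume the association result is available as a mapping from each attribute-emotion word pair to a boolean (related or not). That representation and the function name `select_target_pairs` are assumptions made for illustration; the text only states that the association result indicates whether each pair is related.

```python
def select_target_pairs(association_result):
    """Keep only the pairs that the association result marks as related."""
    return [pair for pair, is_related in association_result.items() if is_related]


# Hypothetical association result for the example sentence about car A.
association_result = {
    ("engine power", "strong"): True,
    ("engine power", "very ugly"): False,
    ("appearance", "strong"): False,
    ("appearance", "very ugly"): True,
}

target_pairs = select_target_pairs(association_result)
print(target_pairs)
```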
With the above approach, after the attribute words in the text to be processed and their emotion types are determined, the emotion words related to each attribute word can additionally be extracted from the text to be processed. This makes the information extraction more complete and makes the data easier for users to view and use.
Fig. 4 is a block diagram of a text information extraction apparatus provided according to an embodiment of the present disclosure. As shown in Fig. 4, the apparatus 40 includes:
a first determining module 41, configured to determine a target text matrix corresponding to the text to be processed, wherein the text matrix corresponding to a text includes the vectorized representation of each segmented word in that text;
a first processing module 42, configured to input the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, wherein the information extraction result includes tag information of each segmented word in the text to be processed, the tag information indicates a segmented-word type, the segmented-word types include an attribute class, and, if the segmented-word type of a segmented word is the attribute class, the tag information of that segmented word further indicates the emotion type of that segmented word;
a second determining module 43, configured to determine the attribute words in the text to be processed and the emotion type of each attribute word according to the first segmented words, whose segmented-word type is the attribute class, in the text to be processed and the emotion types of the first segmented words, wherein each attribute word consists of at least one first segmented word.
Optionally, the segmented-word types further include an emotion class;
the apparatus 40 further includes:
a third determining module, configured to determine the emotion words in the text to be processed according to the second segmented words, whose segmented-word type is the emotion class, in the text to be processed, wherein each emotion word consists of at least one second segmented word;
a fourth determining module, configured to combine each attribute word with each emotion word to obtain attribute-emotion word pairs, each attribute-emotion word pair containing one attribute word and one emotion word;
a second processing module, configured to input the attribute-emotion word pairs and the position information of the attribute-emotion word pairs into an association model to obtain an association result output by the association model, wherein the association result indicates whether the attribute word and the emotion word in each attribute-emotion word pair are related, and the position information of an attribute-emotion word pair indicates the positional relationship, in the text to be processed, between the attribute word and the emotion word in that pair;
a fifth determining module, configured to determine, according to the association result, the target attribute-emotion word pairs whose attribute word and emotion word are related.
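The position information handled by the second processing module is described only as indicating the positional relationship between the attribute word and the emotion word in the text. One possible encoding, shown here purely as an assumption, records which word comes first and how many segmented words lie between them. The span convention (start index inclusive, end index exclusive, non-overlapping spans) and the function name `pair_position_info` are hypothetical.

```python
def pair_position_info(attr_span, emo_span):
    """Describe where the emotion word sits relative to the attribute word.

    Each span is (start, end) in segmented-word indices, end exclusive.
    The two spans are assumed not to overlap.
    """
    if emo_span[0] >= attr_span[1]:
        order = "attribute_first"
        gap = emo_span[0] - attr_span[1]  # segmented words between the two
    else:
        order = "emotion_first"
        gap = attr_span[0] - emo_span[1]
    return {"order": order, "gap": gap}


# Segmented example sentence (an illustrative segmentation):
# 0:"car A" 1:"engine" 2:"power" 3:"strong" 4:"," 5:"but" 6:"appearance" 7:"very ugly"
spans = {
    "engine power": (1, 3),
    "appearance": (6, 7),
    "strong": (3, 4),
    "very ugly": (7, 8),
}

near = pair_position_info(spans["engine power"], spans["strong"])
far = pair_position_info(spans["engine power"], spans["very ugly"])
reverse = pair_position_info(spans["appearance"], spans["strong"])
print(near, far, reverse)
```

Under this encoding the related pairs in the example have a gap of 0, while the unrelated pairs are farther apart, which is the kind of signal an association model could learn from.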
Optionally, the association model is obtained in the following manner:
training a deep neural network model to obtain the association model, using the historical attribute-emotion word pairs corresponding to a first historical text and the position information of the historical attribute-emotion word pairs as input data, and the historical association result of each historical attribute-emotion word pair in the first historical text as output data.
Optionally, the first determining module 41 includes:
a first determining submodule, configured to perform word segmentation on the text to be processed and determine the word vector and the part-of-speech vector corresponding to each segmented word in the text to be processed;
a processing submodule, configured to concatenate the word vector and the part-of-speech vector of each segmented word to obtain the vectorized representation corresponding to that segmented word;
a second determining submodule, configured to determine the target text matrix according to the vectorized representations corresponding to the segmented words, wherein the vectorized representation of each segmented word in the text to be processed corresponds to one row of the target text matrix.
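The three submodules above amount to a concatenate-then-stack operation, which can be sketched as follows. The vector dimensions, the lookup tables `word_vectors` and `pos_vectors`, and their values are all made up for illustration; in practice they would come from trained embeddings, and the segmentation and part-of-speech tags would come from a word segmentation tool.

```python
def build_text_matrix(tagged_words, word_vectors, pos_vectors):
    """Return one row per segmented word: its word vector followed by its POS vector."""
    return [word_vectors[word] + pos_vectors[pos] for word, pos in tagged_words]


# Toy 3-dimensional word vectors and 2-dimensional part-of-speech vectors.
word_vectors = {
    "engine": [0.1, 0.2, 0.3],
    "power": [0.4, 0.5, 0.6],
    "strong": [0.7, 0.8, 0.9],
}
pos_vectors = {"n": [1.0, 0.0], "a": [0.0, 1.0]}

# Segmented words paired with their part-of-speech tags.
tagged_words = [("engine", "n"), ("power", "n"), ("strong", "a")]

matrix = build_text_matrix(tagged_words, word_vectors, pos_vectors)
for row in matrix:
    print(row)
```

Each row has length 3 + 2 = 5, and the number of rows equals the number of segmented words in the text.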
Optionally, the text information extraction model is obtained in the following manner:
training a deep neural network model to obtain the text information extraction model, using the historical text matrix corresponding to a second historical text as input data, and the historical tag information corresponding to each segmented word in the second historical text as output data.
Optionally, if the segmented-word type of a segmented word is the attribute class, the tag information of that segmented word further indicates whether the segmented word is in the first position of a preset word;
the second determining module 43 includes:
an attribute word determining submodule, configured to determine the attribute words in the text to be processed according to the tag information of each first segmented word and its position in the text to be processed, wherein a first segmented word that is in the first position of a preset word is in the first position of the attribute word to which it belongs.
Optionally, the second determining module 43 includes:
an emotion type determining submodule, configured to determine the emotion type corresponding to the first of the first segmented words constituting an attribute word as the emotion type corresponding to that attribute word.
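The two submodules of the second determining module can be illustrated together: consecutive attribute-class segmented words are merged into one attribute word, a segmented word tagged as being in the first position starts a new attribute word, and the emotion type of the first segmented word becomes the emotion type of the whole attribute word. The tag strings used below (`B-ATTR-POS`, `I-ATTR-NEG`, `B-EMO`, `O`) are a hypothetical encoding chosen for this sketch, not a format taken from the text.

```python
def extract_attribute_words(tagged_tokens, sep=" "):
    """Merge attribute-class tokens into attribute words with an emotion type.

    tagged_tokens: list of (token, tag). Attribute tags are assumed to look like
    "B-ATTR-<emotion>" (first position) or "I-ATTR-<emotion>" (non-first position).
    """
    results = []
    current_tokens, current_emotion = [], None

    def flush():
        nonlocal current_tokens, current_emotion
        if current_tokens:
            results.append((sep.join(current_tokens), current_emotion))
        current_tokens, current_emotion = [], None

    for token, tag in tagged_tokens:
        parts = tag.split("-")
        is_attr = len(parts) == 3 and parts[1] == "ATTR"
        if is_attr and (parts[0] == "B" or not current_tokens):
            # A first-position token starts a new attribute word and fixes its emotion type.
            flush()
            current_tokens, current_emotion = [token], parts[2]
        elif is_attr:
            # A non-first token extends the current attribute word; its own emotion is ignored.
            current_tokens.append(token)
        else:
            flush()
    flush()
    return results


tagged = [
    ("car A", "O"),
    ("engine", "B-ATTR-POS"),
    ("power", "I-ATTR-POS"),
    ("strong", "B-EMO"),
    ("but", "O"),
    ("appearance", "B-ATTR-NEG"),
    ("very ugly", "B-EMO"),
]
attribute_words = extract_attribute_words(tagged)
print(attribute_words)
```

For Chinese text, where segmented words are joined without spaces, `sep=""` would be the natural choice.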
Regarding the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
The text information extraction apparatus includes a processor and a memory. The first determining module, first processing module, second determining module, third determining module, fourth determining module, second processing module, fifth determining module, and so on are all stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, text information extraction can be performed more quickly and accurately, so that attribute words and their emotion types are obtained quickly.
An embodiment of the present invention provides a storage medium on which a program is stored; when the program is executed by a processor, the text information extraction method is implemented.
An embodiment of the present invention provides a processor configured to run a program, wherein the text information extraction method is performed when the program runs.
An embodiment of the present invention provides a device. As shown in Fig. 5, the device 70 includes at least one processor 701, and at least one memory 702 and a bus 703 connected to the processor 701. The processor 701 and the memory 702 communicate with each other through the bus 703, and the processor 701 is configured to call program instructions in the memory 702 to perform the above text information extraction method. The device herein may be a server, a PC, or the like.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
determining a target text matrix corresponding to the text to be processed, wherein the text matrix corresponding to a text includes the vectorized representation of each segmented word in that text;
inputting the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, wherein the information extraction result includes tag information of each segmented word in the text to be processed, the tag information indicates a segmented-word type, the segmented-word types include an attribute class, and, if the segmented-word type of a segmented word is the attribute class, the tag information of that segmented word further indicates the emotion type of that segmented word;
determining the attribute words in the text to be processed and the emotion type of each attribute word according to the first segmented words, whose segmented-word type is the attribute class, in the text to be processed and their emotion types, wherein each attribute word consists of at least one first segmented word.
Optionally, the segmented-word types further include an emotion class;
the method further includes:
determining the emotion words in the text to be processed according to the second segmented words, whose segmented-word type is the emotion class, in the text to be processed, wherein each emotion word consists of at least one second segmented word;
combining each attribute word with each emotion word to obtain attribute-emotion word pairs, each attribute-emotion word pair containing one attribute word and one emotion word;
inputting the attribute-emotion word pairs and the position information of the attribute-emotion word pairs into an association model to obtain an association result output by the association model, wherein the association result indicates whether the attribute word and the emotion word in each attribute-emotion word pair are related, and the position information of an attribute-emotion word pair indicates the positional relationship, in the text to be processed, between the attribute word and the emotion word in that pair;
determining, according to the association result, the target attribute-emotion word pairs whose attribute word and emotion word are related.
Optionally, the association model is obtained in the following manner:
training a deep neural network model to obtain the association model, using the historical attribute-emotion word pairs corresponding to a first historical text and the position information of the historical attribute-emotion word pairs as input data, and the historical association result of each historical attribute-emotion word pair in the first historical text as output data.
Optionally, determining the target text matrix corresponding to the text to be processed includes:
performing word segmentation on the text to be processed, and determining the word vector and the part-of-speech vector corresponding to each segmented word in the text to be processed;
concatenating the word vector and the part-of-speech vector of each segmented word to obtain the vectorized representation corresponding to that segmented word;
determining the target text matrix according to the vectorized representations corresponding to the segmented words, wherein the vectorized representation of each segmented word in the text to be processed corresponds to one row of the target text matrix.
Optionally, the text information extraction model is obtained in the following manner:
training a deep neural network model to obtain the text information extraction model, using the historical text matrix corresponding to a second historical text as input data, and the historical tag information corresponding to each segmented word in the second historical text as output data.
Optionally, if the segmented-word type of a segmented word is the attribute class, the tag information of that segmented word further indicates whether the segmented word is in the first position of a preset word;
determining the attribute words in the text to be processed includes:
determining the attribute words in the text to be processed according to the tag information of each first segmented word and its position in the text to be processed, wherein a first segmented word that is in the first position of a preset word is in the first position of the attribute word to which it belongs.
Optionally, determining the emotion type of an attribute word includes:
determining the emotion type corresponding to the first of the first segmented words constituting the attribute word as the emotion type corresponding to that attribute word.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, the device includes one or more processors (CPUs), a memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include non-permanent memory in computer-readable media, in forms such as random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and variations of the present application are possible for those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (17)

  1. A text information extraction method, characterized in that the method comprises:
    determining a target text matrix corresponding to the text to be processed, wherein the text matrix corresponding to a text includes the vectorized representation of each segmented word in that text;
    inputting the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, wherein the information extraction result includes tag information of each segmented word in the text to be processed, the tag information indicates a segmented-word type, the segmented-word types include an attribute class, and, if the segmented-word type of a segmented word is the attribute class, the tag information of that segmented word further indicates the emotion type of that segmented word;
    determining the attribute words in the text to be processed and the emotion type of each attribute word according to the first segmented words, whose segmented-word type is the attribute class, in the text to be processed and the emotion types of the first segmented words, wherein each attribute word consists of at least one first segmented word.
  2. The method according to claim 1, characterized in that the segmented-word types further include an emotion class;
    the method further comprises:
    determining the emotion words in the text to be processed according to the second segmented words, whose segmented-word type is the emotion class, in the text to be processed, wherein each emotion word consists of at least one second segmented word;
    combining each attribute word with each emotion word to obtain attribute-emotion word pairs, each attribute-emotion word pair containing one attribute word and one emotion word;
    inputting the attribute-emotion word pairs and the position information of the attribute-emotion word pairs into an association model to obtain an association result output by the association model, wherein the association result indicates whether the attribute word and the emotion word in each attribute-emotion word pair are related, and the position information of an attribute-emotion word pair indicates the positional relationship, in the text to be processed, between the attribute word and the emotion word in that pair;
    determining, according to the association result, the target attribute-emotion word pairs whose attribute word and emotion word are related.
  3. The method according to claim 2, characterized in that the association model is obtained in the following manner:
    training a deep neural network model to obtain the association model, using the historical attribute-emotion word pairs corresponding to a first historical text and the position information of the historical attribute-emotion word pairs as input data, and the historical association result of each historical attribute-emotion word pair in the first historical text as output data.
  4. The method according to claim 1, characterized in that determining the target text matrix corresponding to the text to be processed comprises:
    performing word segmentation on the text to be processed, and determining the word vector and the part-of-speech vector corresponding to each segmented word in the text to be processed;
    concatenating the word vector and the part-of-speech vector of each segmented word to obtain the vectorized representation corresponding to that segmented word;
    determining the target text matrix according to the vectorized representations corresponding to the segmented words, wherein the vectorized representation of each segmented word in the text to be processed corresponds to one row of the target text matrix.
  5. The method according to claim 1, characterized in that the text information extraction model is obtained in the following manner:
    training a deep neural network model to obtain the text information extraction model, using the historical text matrix corresponding to a second historical text as input data, and the historical tag information corresponding to each segmented word in the second historical text as output data.
  6. The method according to claim 1, characterized in that, if the segmented-word type of a segmented word is the attribute class, the tag information of that segmented word further indicates whether the segmented word is in the first position of a preset word;
    determining the attribute words in the text to be processed comprises:
    determining the attribute words in the text to be processed according to the tag information of each first segmented word and its position in the text to be processed, wherein a first segmented word that is in the first position of a preset word is in the first position of the attribute word to which it belongs.
  7. The method according to claim 1, characterized in that determining the emotion type of an attribute word comprises:
    determining the emotion type corresponding to the first of the first segmented words constituting the attribute word as the emotion type corresponding to that attribute word.
  8. A text information extraction apparatus, characterized in that the apparatus comprises:
    a first determining module, configured to determine a target text matrix corresponding to the text to be processed, wherein the text matrix corresponding to a text includes the vectorized representation of each segmented word in that text;
    a first processing module, configured to input the target text matrix into a text information extraction model to obtain an information extraction result output by the text information extraction model, wherein the information extraction result includes tag information of each segmented word in the text to be processed, the tag information indicates a segmented-word type, the segmented-word types include an attribute class, and, if the segmented-word type of a segmented word is the attribute class, the tag information of that segmented word further indicates the emotion type of that segmented word;
    a second determining module, configured to determine the attribute words in the text to be processed and the emotion type of each attribute word according to the first segmented words, whose segmented-word type is the attribute class, in the text to be processed and the emotion types of the first segmented words, wherein each attribute word consists of at least one first segmented word.
  9. The apparatus according to claim 8, characterized in that the segmented-word types further include an emotion class;
    the apparatus further comprises:
    a third determining module, configured to determine the emotion words in the text to be processed according to the second segmented words, whose segmented-word type is the emotion class, in the text to be processed, wherein each emotion word consists of at least one second segmented word;
    a fourth determining module, configured to combine each attribute word with each emotion word to obtain attribute-emotion word pairs, each attribute-emotion word pair containing one attribute word and one emotion word;
    a second processing module, configured to input the attribute-emotion word pairs and the position information of the attribute-emotion word pairs into an association model to obtain an association result output by the association model, wherein the association result indicates whether the attribute word and the emotion word in each attribute-emotion word pair are related, and the position information of an attribute-emotion word pair indicates the positional relationship, in the text to be processed, between the attribute word and the emotion word in that pair;
    a fifth determining module, configured to determine, according to the association result, the target attribute-emotion word pairs whose attribute word and emotion word are related.
  10. The apparatus according to claim 9, wherein the association model is obtained as follows:
    training a deep neural network model with the historical attribute-emotion word pairs corresponding to a first historical text and the position information of those pairs as input data, and with the historical association results of the historical attribute-emotion word pairs in the first historical text as output data, to obtain the association model.
  11. The apparatus according to claim 8, wherein the first determining module comprises:
    a first determining submodule, configured to perform word segmentation on the to-be-processed text and determine the word vector and the part-of-speech vector corresponding to each segmented word in the to-be-processed text;
    a processing submodule, configured to concatenate the word vector and the part-of-speech vector of each segmented word to obtain a vectorized representation of each segmented word;
    a second determining submodule, configured to determine the target text matrix according to the vectorized representations of the segmented words, wherein the vectorized representation of each segmented word in the to-be-processed text corresponds to one row of the target text matrix.
  12. The apparatus according to claim 8, wherein the text information extraction model is obtained as follows:
    training a deep neural network model with the historical text matrix corresponding to a second historical text as input data, and with the historical label information corresponding to each segmented word in the second historical text as output data, to obtain the text information extraction model.
  13. The apparatus according to claim 8, wherein, if the segmented-word type of a segmented word is the attribute class, the label information of that segmented word further indicates whether the segmented word is in the first position of a preset word;
    the second determining module comprises:
    an attribute word determining submodule, configured to determine the attribute words in the to-be-processed text according to the label information of each first segmented word and its position in the to-be-processed text, wherein a first segmented word that is in the first position of a preset word is in the first position of the attribute word to which it belongs.
  14. The apparatus according to claim 8, wherein the second determining module comprises:
    an emotion type determining submodule, configured to determine the emotion type corresponding to the first of the first segmented words constituting an attribute word as the emotion type of that attribute word.
  15. A storage medium storing a program, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
  16. A device, comprising:
    at least one processor, and at least one memory and a bus connected to the processor;
    wherein the processor and the memory communicate with each other through the bus;
    the processor is configured to call program instructions in the memory to perform the steps of the method according to any one of claims 1-7.
  17. A computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the steps of the method according to any one of claims 1-7.
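
For illustration only, the data flow described in claims 9, 11, 13 and 14 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the applicant's implementation. The tag format ("B-ATTR-POS", "I-ATTR-POS", "B-EMO", "O") is hypothetical: the claims only require that the label information indicate the type of each token (segmented word), the emotion type of attribute-class tokens, and whether a token is in the first position. Using a signed token offset as the "position information" of a pair is likewise an assumption. The two deep neural network models are not shown; the function names are invented for this example.

```python
from itertools import product


def build_text_matrix(tokens, word_vecs, pos_vecs):
    """Claim 11: concatenate each token's word vector and part-of-speech
    vector; every token contributes one row of the text matrix."""
    return [word_vecs[word] + pos_vecs[pos] for word, pos in tokens]


def decode_spans(tokens, tags):
    """Claims 13 and 14: group tagged tokens into attribute and emotion words.

    Tags are hypothetical: "B-ATTR-<emotion>" / "I-ATTR-<emotion>" for
    attribute-class tokens, "B-EMO" / "I-EMO" for emotion-class tokens,
    "O" for everything else. Returns (attributes, emotions), where
    attributes are (text, start, end, emotion_type) and emotions are
    (text, start, end); end indices are exclusive token positions.
    """
    attributes, emotions = [], []
    current = None  # [kind, start, emotion_type, words] of the open span

    def flush(end):
        nonlocal current
        if current is None:
            return
        kind, start, emotion_type, words = current
        text = " ".join(words)  # use "".join(words) for unspaced scripts
        if kind == "ATTR":
            attributes.append((text, start, end, emotion_type))
        else:
            emotions.append((text, start, end))
        current = None

    for i, ((word, _pos), tag) in enumerate(zip(tokens, tags)):
        parts = tag.split("-")
        if parts[0] == "B":
            flush(i)
            # The span inherits the emotion type of its first token (claim 14).
            emotion_type = parts[2] if len(parts) > 2 else None
            current = [parts[1], i, emotion_type, [word]]
        elif parts[0] == "I" and current is not None and current[0] == parts[1]:
            current[3].append(word)
        else:
            # "O" tokens and stray "I" tokens close any open span.
            flush(i)
    flush(len(tokens))
    return attributes, emotions


def candidate_pairs(attributes, emotions):
    """Claim 9: pair every attribute word with every emotion word and attach
    a signed token offset as an assumed form of position information."""
    return [
        {"attribute": a_text, "emotion": e_text, "offset": e_start - a_start}
        for (a_text, a_start, _a_end, _a_type), (e_text, e_start, _e_end)
        in product(attributes, emotions)
    ]


if __name__ == "__main__":
    tokens = [("battery", "n"), ("life", "n"), ("is", "v"), ("great", "a"),
              ("but", "c"), ("screen", "n"), ("is", "v"), ("dim", "a")]
    tags = ["B-ATTR-POS", "I-ATTR-POS", "O", "B-EMO",
            "O", "B-ATTR-NEG", "O", "B-EMO"]
    attributes, emotions = decode_spans(tokens, tags)
    for pair in candidate_pairs(attributes, emotions):
        print(pair)
```

In this sketch every candidate pair, together with its offset, would be passed to the association model, and only the pairs the model marks as related would be kept as target attribute-emotion word pairs.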
PCT/CN2020/100483 2019-09-30 2020-07-06 Text information extraction method and apparatus, storage medium and device WO2021063060A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910943335.XA CN112580358A (en) 2019-09-30 2019-09-30 Text information extraction method, device, storage medium and equipment
CN201910943335.X 2019-09-30

Publications (1)

Publication Number Publication Date
WO2021063060A1 (en) 2021-04-08

Family

ID=75116511

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/100483 WO2021063060A1 (en) 2019-09-30 2020-07-06 Text information extraction method and apparatus, storage medium and device

Country Status (2)

Country Link
CN (1) CN112580358A (en)
WO (1) WO2021063060A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239590B (en) * 2021-12-01 2023-09-19 马上消费金融股份有限公司 Data processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140278365A1 (en) * 2013-03-12 2014-09-18 Guangsheng Zhang System and methods for determining sentiment based on context
CN107832305A (en) * 2017-11-28 2018-03-23 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN109299457A (en) * 2018-09-06 2019-02-01 北京奇艺世纪科技有限公司 A kind of opining mining method, device and equipment
CN109829033A (en) * 2017-11-23 2019-05-31 阿里巴巴集团控股有限公司 Method for exhibiting data and terminal device
CN109885826A (en) * 2019-01-07 2019-06-14 平安科技(深圳)有限公司 Text term vector acquisition methods, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112580358A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN106776936B (en) Intelligent interaction method and system
CN105718586B (en) The method and device of participle
CN108763510B (en) Intention recognition method, device, equipment and storage medium
WO2018049960A1 (en) Method and apparatus for matching resource for text information
WO2020140487A1 (en) Speech recognition method for human-machine interaction of smart apparatus, and system
WO2019085697A1 (en) Man-machine interaction method and system
WO2020114100A1 (en) Information processing method and apparatus, and computer storage medium
WO2018086519A1 (en) Method and device for identifying specific text information
WO2020186712A1 (en) Voice recognition method and apparatus, and terminal
CN111295661A (en) Word sense disambiguation method and apparatus, word sense expansion method, device and apparatus, computer readable storage medium
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
JP2015529901A (en) Information classification based on product recognition
CN110991161B (en) Similar text determination method, neural network model obtaining method and related device
CN111078842A (en) Method, device, server and storage medium for determining query result
CN116304748B (en) Text similarity calculation method, system, equipment and medium
JP2019082931A (en) Retrieval device, similarity calculation method, and program
CN113449084A (en) Relationship extraction method based on graph convolution
CN113901289A (en) Unsupervised learning-based recommendation method and system
CN108875743B (en) Text recognition method and device
WO2021063060A1 (en) Text information extraction method and apparatus, storage medium and device
CN116522905B (en) Text error correction method, apparatus, device, readable storage medium, and program product
CN117076636A (en) Information query method, system and equipment for intelligent customer service
WO2020119346A1 (en) Natural semantic comprehension method and apparatus, and computing device
CN116186219A (en) Man-machine dialogue interaction method, system and storage medium
CN115859121A (en) Text processing model training method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20872454

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20872454

Country of ref document: EP

Kind code of ref document: A1