WO2022267454A1 - Method and apparatus for analyzing text, device and storage medium - Google Patents


Info

Publication number
WO2022267454A1
Authority
WO
WIPO (PCT)
Prior art keywords: analyzed, text, entities, entity, attribute
Prior art date: 2021-06-24
Application number: PCT/CN2022/071433
Other languages: French (fr), Chinese (zh)
Inventors: 陈凯, 徐冰, 汪伟
Original Assignee: 平安科技(深圳)有限公司
Priority date: 2021-06-24
Priority claimed from CN202110705319.4A (CN113420122B)
Application filed by 平安科技(深圳)有限公司
Publication of WO2022267454A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Definitions

  • the present application belongs to the technical field of artificial intelligence, and in particular relates to a method, device, equipment and storage medium for analyzing text.
  • Sentiment analysis holds great promise in natural language processing applications. For example, users' satisfaction with products, companies, services, etc. can be evaluated through the comments posted by users on Internet platforms. Therefore, sentiment analysis is particularly important in natural language processing.
  • The inventor has realized that in existing sentiment analysis, the extracted analysis points are not comprehensive, which leads to inaccurate sentiment analysis results.
  • One of the purposes of the embodiments of the present application is to provide a method, apparatus, device and storage medium for analyzing text, so as to solve the problem in existing sentiment analysis that the extracted analysis points are not comprehensive, which leads to inaccurate sentiment analysis results.
  • In a first aspect, an embodiment of the present application provides a method for analyzing text, the method including: acquiring a text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities; identifying the at least two entities in the text to be analyzed; extracting attribute information from the text to be analyzed through a pre-trained attribute extraction model; and analyzing the at least two entities, the attribute information, and the text to be analyzed with a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  • the embodiment of the present application provides a device for analyzing text, wherein the device includes:
  • an acquisition unit configured to acquire the text to be analyzed
  • An identification unit configured to identify at least two entities in the text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities;
  • An extraction unit configured to extract attribute information in the text to be analyzed through a pre-trained attribute extraction model
  • the analysis unit is configured to analyze the at least two entities, the attribute information, and the text to be analyzed by using a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  • In a third aspect, an embodiment of the present application provides a device for analyzing text, including a memory, a processor, and a computer program stored in the memory and operable on the processor, where the processor, when executing the computer program, implements the following: acquiring a text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities; identifying the at least two entities in the text to be analyzed; extracting attribute information from the text to be analyzed through a pre-trained attribute extraction model; and analyzing the at least two entities, the attribute information, and the text to be analyzed with a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  • the embodiment of the present application provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium stores a computer program that, when executed by a processor, implements the following: acquiring a text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities; identifying the at least two entities in the text to be analyzed; extracting attribute information from the text to be analyzed through a pre-trained attribute extraction model; and analyzing the at least two entities, the attribute information, and the text to be analyzed with a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  • Compared with the prior art, the embodiments of the present application have the following beneficial effects: the text to be analyzed is obtained; at least two entities in the text to be analyzed are identified, where the text to be analyzed includes a comment sentence containing the at least two entities; the attribute information in the text to be analyzed is extracted through a pre-trained attribute extraction model; and the at least two entities, the attribute information, and the text to be analyzed are analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  • In the above scheme, the entities in the text to be analyzed are identified and the attribute information in the text to be analyzed is extracted through the attribute extraction model; the entities, the attribute information, and the text to be analyzed are then analyzed through the sentiment analysis model. Because an attribute factor is added to the comparison process, the simple "entity-advantages/disadvantages" comparison of the prior art is converted into an "entity-attribute information-advantages/disadvantages" comparison; the extracted analysis points are comprehensive and accurate, so the final entity comparison results are more accurate.
  • Fig. 1 is a schematic flowchart of a method for analyzing text provided by an exemplary embodiment of the present application
  • FIG. 2 is a specific flowchart of step S102 of the method for analyzing text shown in an exemplary embodiment of the present application;
  • FIG. 3 is a schematic flowchart of a method for analyzing text provided by another embodiment of the present application.
  • Fig. 4 is a specific flowchart of step S204 of the method for analyzing text shown in an exemplary embodiment of the present application;
  • Fig. 5 is a schematic flowchart of a method for analyzing text shown in an exemplary embodiment of the present application
  • Fig. 6 is a schematic diagram of a device for analyzing text provided by an embodiment of the present application.
  • Fig. 7 is a schematic diagram of a device for analyzing text provided by another embodiment of the present application.
  • The term "and/or" describes an association relationship and indicates that three relationships may exist; for example, "A and/or B" means: A exists alone, A and B exist simultaneously, or B exists alone.
  • "Plural" refers to two or more.
  • The terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of such features. In the description of this embodiment, unless otherwise specified, "plurality" means two or more.
  • Sentiment analysis holds great promise in natural language processing applications. For example, users' satisfaction with products, companies, services, etc. can be evaluated through the comments posted by users on Internet platforms. Therefore, sentiment analysis is particularly important in natural language processing.
  • To this end, the present application provides a method for analyzing text: the text to be analyzed is obtained; at least two entities in the text to be analyzed are identified, where the text to be analyzed includes a comment sentence containing the at least two entities; attribute information in the text to be analyzed is extracted through a pre-trained attribute extraction model; and the at least two entities, the attribute information, and the text to be analyzed are analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  • In this scheme, the entities in the text to be analyzed are identified and the attribute information in the text to be analyzed is extracted through the attribute extraction model; the entities, the attribute information, and the text to be analyzed are then analyzed through the sentiment analysis model. Because an attribute factor is added to the comparison process, the simple "entity-advantages/disadvantages" comparison of the prior art is converted into an "entity-attribute information-advantages/disadvantages" comparison; the extracted analysis points are comprehensive and accurate, so the final entity comparison results are more accurate.
  • FIG. 1 is a schematic flowchart of a method for analyzing text provided by an exemplary embodiment of the present application.
  • The execution subject of the method for analyzing text provided in this application is a device for analyzing text, which includes but is not limited to terminals such as smartphones, tablet computers, computers, personal digital assistants (Personal Digital Assistant, PDA), and desktop computers, and may also include various types of servers.
  • In the following, a terminal is used as an example for illustration.
  • The method for analyzing text shown in FIG. 1 may include steps S101 to S104, specifically as follows:
  • S101 Obtain the text to be analyzed.
  • The text to be analyzed refers to text on whose entities sentiment analysis needs to be performed. Since the sentiment analysis in this embodiment refers to a comparison of entities, the comparison is meaningful only when there are at least two entities, so the text to be analyzed includes a comment sentence containing at least two entities. There is no limit on the length or number of comment sentences. For example, a text to be analyzed may be "the market value of company A exceeds that of company B", or "the market value of company A exceeds that of company B, but the reputation of company B exceeds that of company A", and so on.
  • the text to be analyzed may also be an article, a paragraph of text, etc. composed of comment sentences containing at least two entities. The description here is only for illustration and not for limitation.
  • In some embodiments, when the terminal detects an analysis instruction, it acquires the text to be analyzed.
  • The analysis instruction may be triggered by a user, for example, by the user clicking an analysis option in the terminal.
  • The text to be analyzed may be text uploaded by the user to the terminal, or the terminal may obtain, according to a file identifier contained in the analysis instruction, the text file corresponding to that identifier as the text to be analyzed.
  • S102 Identify at least two entities in the text to be analyzed.
  • Entities refer to things that exist objectively and can be distinguished from each other. All entities in the text to be analyzed can be identified through a pre-trained named entity recognition model.
  • S103 Extract the attribute information in the text to be analyzed through the pre-trained attribute extraction model.
  • Specifically, word segmentation processing is performed on the text to be analyzed to obtain multiple word segments.
  • Word segmentation processing refers to dividing a continuous word sequence in the text to be analyzed into multiple word sequences, that is, multiple word segmentations, through a word segmentation algorithm.
  • the attribute extraction model may include a word segmentation algorithm, through which the word segmentation process is performed on the text to be analyzed to obtain multiple word segments corresponding to the text to be analyzed. That is, the content in the text to be analyzed is divided into multiple word segmentations through a word segmentation algorithm.
  • A word segment may be a word or a single character.
  • multiple word segmentation methods corresponding to the text to be analyzed can be determined according to the word segmentation algorithm, and the most suitable word segmentation method is selected to perform word segmentation on the text to be analyzed to obtain multiple word segmentations corresponding to the text to be analyzed. For example, word segmentation processing is performed on "the market value of company A exceeds that of company B" to obtain "company A/market value/exceeded/company B".
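  • As an illustration only (not part of the original disclosure), the word segmentation step can be sketched with an off-the-shelf Chinese segmenter; the choice of the jieba library and the sample sentence below are assumptions made for demonstration, since the patent does not name a specific word segmentation algorithm.

```python
# Minimal word-segmentation sketch (assumption: the jieba segmenter is used).
import jieba

text_to_analyze = "A公司市值超过B公司"  # "the market value of company A exceeds that of company B"

# Divide the continuous character sequence into multiple word segments.
segments = list(jieba.cut(text_to_analyze))
print("/".join(segments))  # e.g. "A公司/市值/超过/B公司"
```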
  • The pre-trained attribute extraction model includes a BERT network, a Dense network, and a CRF network.
  • The BERT network is used to convert the multiple word segments corresponding to the text to be analyzed into a word vector for each word segment;
  • the Dense network is used to classify each word vector and to output the probability that each word vector belongs to the attribute-information category;
  • the CRF network is used to label the word vectors that belong to attribute information.
  • Specifically, the multiple word segments are input into the BERT network for processing; the BERT network maps each word segment into a common semantic space and outputs a word vector corresponding to each word segment.
  • the description here is only for illustration and not for limitation.
  • The word vector corresponding to each word segment is then input into the Dense network, which judges whether each word vector belongs to attribute information and outputs the probability that each word vector belongs to attribute information.
  • For example, the probabilities that the word vectors of "company A", "market value", "exceeds", and "company B" belong to attribute information are 0.2, 0.9, 0.1, and 0.2, respectively.
  • the output of the Dense network is input into the CRF network, and the CRF network labels the word vector with the highest probability, and outputs the attribute information corresponding to the word vector. For example, the probability corresponding to the market value is the highest, and it is most likely to be attribute information.
  • For example, the word vector corresponding to "market value" is marked with the "BIO" scheme through the CRF network, where B marks the initial character of the attribute information, I marks a middle character of the attribute information, and O marks non-attribute characters. That is, B marks the first character of "market value", I marks its remaining characters, and O marks the characters after "market value" and before "exceeds". This is only an exemplary description and is not a limitation.
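  • The following is a minimal sketch of an attribute extraction network with this BERT + Dense + CRF structure, given for illustration only. It assumes the HuggingFace transformers package and the pytorch-crf package; the pretrained model name, the three-tag BIO inventory, and all hyperparameters are assumptions, not the patent's actual implementation.

```python
# Sketch: BERT (word vectors) -> Dense (per-token attribute scores) -> CRF (BIO labels).
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

TAGS = ["B", "I", "O"]  # B/I mark attribute-information characters, O marks the rest

class AttributeExtractor(nn.Module):
    def __init__(self, pretrained="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)                # word segments -> word vectors
        self.dense = nn.Linear(self.bert.config.hidden_size, len(TAGS))  # per-token tag scores
        self.crf = CRF(len(TAGS), batch_first=True)                      # labels the attribute tokens

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.dense(hidden)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: return the negative log-likelihood of the gold BIO tags.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: return the best BIO tag sequence for each input.
        return self.crf.decode(emissions, mask=mask)
```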
  • S104 Analyze at least two entities, attribute information, and text to be analyzed by using a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to at least two entities.
  • One piece of attribute information corresponds to one sentiment analysis result; when there are multiple pieces of attribute information, multiple sentiment analysis results are output accordingly.
  • Each sentiment analysis result judges the relative advantage of the two entities with respect to one piece of attribute information. For example, if the text to be analyzed is "Company A's market value exceeds that of Company B, but Company B has a good reputation", the corresponding entities in the text to be analyzed are Company A and Company B, and the attribute information is market value and reputation.
  • The final sentiment analysis results may then be: the market value of Company A is better than that of Company B and the reputation of Company B is better than that of Company A; or, equivalently, the market value of Company A is better than that of Company B and the reputation of Company A is worse than that of Company B.
  • the description here is only for illustration and not for limitation.
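  • Purely as an illustration of the "entity-attribute information-advantages/disadvantages" output described above, one possible in-memory representation of a sentiment analysis result is sketched below; the field names and values are assumptions, not terminology from the patent.

```python
# Illustrative container for one sentiment analysis result per piece of attribute information.
from dataclasses import dataclass

@dataclass
class EntityComparisonResult:
    subject_entity: str   # e.g. "Company A"
    object_entity: str    # e.g. "Company B"
    attribute: str        # e.g. "market value"
    polarity: str         # e.g. "better" or "worse" for the subject entity

results = [
    EntityComparisonResult("Company A", "Company B", "market value", "better"),
    EntityComparisonResult("Company A", "Company B", "reputation", "worse"),
]
```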
  • In this embodiment, the text to be analyzed is obtained; at least two entities in the text to be analyzed are identified, where the text to be analyzed includes a comment sentence containing the at least two entities; the attribute information in the text to be analyzed is extracted through the pre-trained attribute extraction model; and the at least two entities, the attribute information, and the text to be analyzed are analyzed through the pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  • In this way, the entities in the text to be analyzed are identified and the attribute information in the text to be analyzed is extracted through the attribute extraction model; the entities, the attribute information, and the text to be analyzed are then analyzed through the sentiment analysis model. Because an attribute factor is added to the comparison process, the simple "entity-advantages/disadvantages" comparison of the prior art is converted into an "entity-attribute information-advantages/disadvantages" comparison; the extracted analysis points are comprehensive and accurate, so the final entity comparison results are more accurate.
  • FIG. 2 is a specific flowchart of step S102 of the method for analyzing text shown in an exemplary embodiment of the present application; in some possible implementations of the present application, the above S102 may include S1021 to S1022, specifically as follows:
  • S1021 Perform word segmentation processing on the text to be analyzed to obtain multiple first word segments.
  • a word segmentation algorithm is used to perform word segmentation processing on the text to be analyzed to obtain a plurality of first word segments corresponding to the text to be analyzed.
  • For the word segmentation process, please refer to the description of word segmentation in S103; it will not be repeated here.
  • the text to be analyzed may also be preprocessed to obtain a preprocessing result.
  • preprocessing refers to extracting and removing redundant information in the text to be analyzed.
  • Redundant information refers to information that has no practical meaning in the text to be analyzed.
  • Redundant information may be stop words, punctuation marks, and the like in the text to be analyzed. Stop words are usually determiners, modal particles, adverbs, prepositions, conjunctions, English characters, numbers, mathematical characters, etc.; here, an English character means a letter that stands alone and has no practical meaning.
  • If an English string is a meaningful combination of letters, it is regarded as a valid character and is not removed.
  • For example, English strings such as CPU, MAC, and HR are retained as valid characters and are not removed.
  • Word segmentation processing is performed on the preprocessing result to obtain multiple first word segmentations.
  • The text to be analyzed is preprocessed and the redundant information in it is removed in advance, so that when the named entity recognition model subsequently processes the preprocessed text, the interference from redundant information is reduced, the processing speed of the named entity recognition model is increased, and the accuracy of the processing results is improved.
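  • As a sketch only, the preprocessing step described above (removing stop words, punctuation, and stand-alone letters while keeping meaningful strings such as CPU or HR) might look as follows; the stop-word list and the keep/drop rules are illustrative assumptions.

```python
# Sketch of preprocessing: strip punctuation, stop words, and single meaningless
# English letters, but keep meaningful letter combinations such as "CPU" or "HR".
import re
import string

STOP_WORDS = {"的", "了", "吗", "呢", "and", "or", "the"}  # illustrative stop-word list

def preprocess(text: str) -> str:
    # Replace common ASCII and Chinese punctuation marks with spaces.
    text = re.sub(r"[，。！？、；：,.!?;:'\"()（）]", " ", text)
    kept = []
    for token in text.split():
        if token in STOP_WORDS:
            continue                                      # drop stop words
        if len(token) == 1 and token in string.ascii_letters:
            continue                                      # drop stand-alone English letters
        kept.append(token)                                # keep valid tokens, e.g. "CPU", "HR"
    return " ".join(kept)

print(preprocess("the CPU of A公司 is good ， x"))  # -> "CPU of A公司 is good"
```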
  • S1022 Process the multiple first word segments based on the pre-trained named entity recognition model to obtain at least two entities in the text to be analyzed.
  • the named entity recognition model is used to identify entities in the text to be analyzed.
  • the type of the named entity recognition model is not limited.
  • the named entity recognition model can be a BERT+CRF model or a BERT+BiLSTM+CRF model.
  • The first word segments are input into the named entity recognition model; if too many first word segments are input, only the leading word segments are kept. For example, if the total length of all input first word segments exceeds a preset length, the first word segments within the preset length are kept. Alternatively, if the total number of characters of all input first word segments exceeds a preset character length, the first word segments within that character length are kept; for example, if the total exceeds 512 characters, the first word segments corresponding to the first 512 characters are kept.
  • The retained first word segments are input into the BERT network in the named entity recognition model for processing; the BERT network maps each first word segment into a common semantic space and outputs a word vector corresponding to each first word segment.
  • the output of the Bert network is input into the CRF network, and the CRF network in the named entity recognition model labels the entities in these word vectors and outputs the recognized entities.
  • For example, the word vectors corresponding to "Company A" are tagged with the "bio" scheme through the CRF network, where b marks the starting character of an entity, i marks a middle character of an entity, and o marks a non-entity character.
  • That is, b marks the first character of "Company A", i marks its remaining characters, and o marks the characters after "Company A" and before "market value". This is only an illustration and is not a limitation.
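  • To illustrate how the "bio" labels produced by the CRF can be turned back into entity spans, a small decoding helper is sketched below; this post-processing step is an assumption for illustration and is not quoted from the patent.

```python
# Sketch: recover entity strings from per-character "b" / "i" / "o" labels.
def decode_bio(chars, labels):
    """chars: list of characters; labels: list of 'b'/'i'/'o' labels of equal length."""
    entities, current = [], []
    for ch, lab in zip(chars, labels):
        if lab == "b":                 # start of a new entity
            if current:
                entities.append("".join(current))
            current = [ch]
        elif lab == "i" and current:   # continuation of the current entity
            current.append(ch)
        else:                          # 'o' (or a stray 'i'): close any open entity
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities

print(decode_bio(list("A公司市值超过B公司"),
                 ["b", "i", "i", "o", "o", "o", "o", "b", "i", "i"]))
# -> ['A公司', 'B公司']
```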
  • training a named entity recognition model may also be included.
  • the named entity recognition model is obtained by training the training set using a machine learning algorithm.
  • a plurality of sample comment sentences are collected in advance, and entities in each sample comment sentence are marked.
  • a training set is formed based on these sample comment sentences and the labeled entities in the sample comment sentences.
  • a part of the data in the training set can also be used as a test set to facilitate subsequent testing of the model.
  • several sample comment sentences are selected in the training set, and the sample entities corresponding to these sample comment sentences are used as the test set.
  • each sample comment sentence in the training set is processed by an initial named entity recognition network (named entity recognition model before training), to obtain the entity corresponding to each sample comment sentence.
  • When the preset number of training iterations is reached, the initial named entity recognition network at that point is tested.
  • the sample comment sentence in the test set is input into the current initial named entity recognition network for processing, and the current initial named entity recognition network outputs the entity corresponding to the sample comment sentence.
  • a first loss value between the entity corresponding to the sample comment sentence and the sample entity corresponding to the sample comment sentence in the test set is calculated based on the loss function.
  • the loss function may be a cross-entropy loss function.
  • If the first loss value satisfies a first preset condition, the training of the initial named entity recognition network is stopped, and the trained network is used as the trained named entity recognition model.
  • For example, the first preset condition may be that the loss value is less than or equal to a preset loss-value threshold: when the first loss value is greater than the threshold, the parameters of the initial named entity recognition network are adjusted and training continues; when the first loss value is less than or equal to the threshold, training stops and the trained network is used as the trained named entity recognition model.
  • the loss function convergence means that the value of the loss function tends to be stable.
  • The named entity recognition model is obtained by training on the training set with a machine learning algorithm, and the entities in the text to be analyzed are then identified through the named entity recognition model; this identifies the entities accurately and quickly, which facilitates the subsequent sentiment analysis of the entities and thus yields accurate sentiment analysis results.
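  • A minimal sketch of the training scheme described above (train for a preset number of iterations, test, and stop once the first loss value meets the preset condition) is given below; the optimizer, learning rate, thresholds, batch format, and the assumption that the model's forward pass returns a loss when gold tags are supplied are all illustrative choices.

```python
# Sketch: periodic-test training loop with a loss-value threshold as the stop condition.
import torch

def evaluate(model, test_loader):
    with torch.no_grad():
        losses = [model(b["input_ids"], b["attention_mask"], b["tags"]).item()
                  for b in test_loader]
    return sum(losses) / max(len(losses), 1)

def train_ner(model, train_loader, test_loader,
              loss_threshold=0.1, test_every=1000, max_steps=100000, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    step = 0
    while step < max_steps:
        for batch in train_loader:
            loss = model(batch["input_ids"], batch["attention_mask"], batch["tags"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % test_every == 0:                     # preset number of training iterations reached
                first_loss = evaluate(model, test_loader)  # loss value on the test set
                if first_loss <= loss_threshold:           # first preset condition satisfied
                    return model                           # stop training; model is trained
    return model
```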
  • In some possible implementations of the present application, the above S104 may include S1041 to S1044, specifically as follows:
  • S1041 Acquire an entity tag group, where the entity tag group includes tags corresponding to the entities to be compared.
  • At least two entities corresponding to the text to be analyzed include a group of entities to be compared.
  • these two entities are entities that can be compared, and it can be understood that these two entities are entities of different subjects.
  • at least one group of entities can be compared.
  • the entity label group refers to the labels corresponding to the two entities to be compared.
  • the text to be analyzed is "the market value of company A exceeds that of company B", and the corresponding entities are "company A” and "company B”.
  • “Company A” and “Company B” are a group of entities to be compared.
  • the entity tag group refers to the entity tag corresponding to "Company A” and the entity tag corresponding to "Company B".
  • Specifically, the entities in the text to be analyzed are identified by the named entity recognition model and marked with "bio" labels, through which the position of each entity in the text to be analyzed can be determined. Entity labels are set for each entity in the order in which the entities are determined, and the entity labels corresponding to the two entities to be compared are extracted.
  • Similarly, the attribute information in the text to be analyzed is extracted through the attribute extraction model and marked with "BIO" labels, through which the position of each piece of attribute information in the text to be analyzed can be determined.
  • An attribute tag is set for each piece of attribute information.
  • For example, if the text to be analyzed is "the market value of company A exceeds that of company B", the corresponding attribute information is "market value", and the attribute tag "<asp></asp>" is set for "market value".
  • The entity labels corresponding to the two entities are added to the text to be analyzed, and the attribute information together with its corresponding attribute tag is added to the beginning of the text to be analyzed to obtain the second target text to be analyzed.
  • Alternatively, the attribute information and its corresponding attribute tag can be added to the end of the text to be analyzed to obtain "<s>Company A</s> market value exceeds <o>Company B</o> <asp>market value</asp>".
  • the description here is only for illustration and not for limitation.
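  • The construction of the second target text to be analyzed from the entity tag group and the attribute tag might be sketched as follows; the helper name and parameters are assumptions, while the tag placement follows the example above.

```python
# Sketch: add <s></s> / <o></o> entity tags and an <asp></asp> attribute tag
# to the text to be analyzed, producing the second target text to be analyzed.
def build_second_target_text(text, subject_entity, object_entity, attribute,
                             attribute_at_end=True):
    tagged = text.replace(subject_entity, f"<s>{subject_entity}</s>", 1)
    tagged = tagged.replace(object_entity, f"<o>{object_entity}</o>", 1)
    asp = f"<asp>{attribute}</asp>"
    # The attribute tag may be appended at the end or prepended at the beginning.
    return f"{tagged} {asp}" if attribute_at_end else f"{asp} {tagged}"

print(build_second_target_text(
    "Company A market value exceeds Company B",
    "Company A", "Company B", "market value"))
# -> "<s>Company A</s> market value exceeds <o>Company B</o> <asp>market value</asp>"
```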
  • S1044 Analyze the second target text to be analyzed by using the sentiment analysis model, and obtain sentiment analysis results corresponding to at least two entities.
  • mapping processing is performed on the second target text to be analyzed to obtain a semantic vector corresponding to the second target text to be analyzed.
  • To classify the semantic vector is to judge which emotional tendency the semantic vector belongs to.
  • When the second target text to be analyzed is analyzed through the sentiment analysis model, because the second target text contains the attribute tag corresponding to the attribute information and the entity tags corresponding to the two entities to be compared, attribute factors are taken into account in the analysis process; the extracted analysis points are comprehensive and accurate, which makes the entity comparison results obtained by the analysis more accurate.
  • In some possible implementations of the present application, the above S1044 may include S10441 to S10444, specifically as follows:
  • S10441 Perform word segmentation processing on the second target text to be analyzed to obtain multiple third word segments.
  • S10442 Perform mapping processing on each third word segment through the sentiment analysis model to obtain a word vector corresponding to each third word segment.
  • Specifically, the multiple third word segments are input into the BERT network in the sentiment analysis model for processing; the BERT network maps each third word segment into a common semantic space and outputs a word vector corresponding to each third word segment.
  • S10443 Based on the processing sequence of performing word segmentation processing on the second target text to be analyzed, combine the word vectors corresponding to each third word segment to obtain a target word vector set.
  • Specifically, the word vectors corresponding to each third word segment are processed by a long short-term memory (LSTM) network in the sentiment analysis model; the LSTM combines the word vectors in the order in which the second target text to be analyzed was segmented and outputs a target word vector set.
  • S10444 Analyze the target word vector set to obtain a sentiment analysis result.
  • the target word vector set is input to the Dense network in the sentiment analysis model for processing.
  • the Dense network judges the probability that the target word vector set belongs to each emotional tendency, and outputs the emotional tendency with the highest probability, that is, the output sentiment analysis result.
  • For example, the final sentiment analysis result corresponding to the text to be analyzed may be: Company A's market value is higher than Company B's and Company A has the advantage; or Company B's market value is lower than Company A's and Company B is at a disadvantage; and so on.
  • the description here is only for illustration and not for limitation.
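  • A minimal sketch of a sentiment analysis network with this BERT, LSTM, and Dense structure is shown below, for illustration only; the pretrained model name, hidden sizes, the three-way sentiment label set, and the pooling choice are assumptions rather than the patent's implementation.

```python
# Sketch: sentiment analysis model = BERT (word vectors) -> LSTM (combine word
# vectors in segmentation order) -> Dense (probability of each sentiment tendency).
import torch
import torch.nn as nn
from transformers import BertModel

class SentimentAnalyzer(nn.Module):
    def __init__(self, pretrained="bert-base-chinese", num_labels=3):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, 256,
                            batch_first=True, bidirectional=True)
        self.dense = nn.Linear(2 * 256, num_labels)

    def forward(self, input_ids, attention_mask):
        word_vectors = self.bert(input_ids=input_ids,
                                 attention_mask=attention_mask).last_hidden_state
        combined, _ = self.lstm(word_vectors)      # target word vector set
        pooled = combined[:, 0, :]                 # simple pooling choice for this sketch
        probs = torch.softmax(self.dense(pooled), dim=-1)
        return probs                               # probability of each sentiment tendency
```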
  • In this way, when the second target text to be analyzed is analyzed through the sentiment analysis model, because the second target text contains the attribute tag corresponding to the attribute information and the entity tags corresponding to the two entities to be compared, attribute factors are taken into account in the analysis process; the extracted analysis points are comprehensive and accurate, which makes the entity comparison results obtained by the analysis more accurate.
  • Fig. 3 is a schematic flowchart of a method for analyzing text provided by another embodiment of the present application.
  • The method for analyzing text shown in FIG. 3 may include steps S201 to S206, specifically as follows:
  • S201 Obtain a text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities.
  • S202 Identify at least two entities in the text to be analyzed.
  • Specifically, the entities in the text to be analyzed are identified by the named entity recognition model and marked with "bio" labels, through which the position of each entity in the text to be analyzed can be determined. Entity labels are set for each entity in the order in which the entities are determined.
  • The entity label corresponding to each entity is then added to the text to be analyzed to obtain the first target text to be analyzed. For example, "<s></s>" and "<o></o>" are added to the text to be analyzed to obtain the first target text to be analyzed, that is, "<s>Company A</s> market value exceeds <o>Company B</o>".
  • the description here is only for illustration and not for limitation.
  • S206 Analyze at least two entities, attribute information, and text to be analyzed by using a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to at least two entities.
  • In this embodiment, entity tags are added to the entities in the text to be analyzed.
  • When the attribute information in the first target text to be analyzed is extracted through the attribute extraction model, the word segments carrying entity tags can be ignored and only the other word segments are processed; because the interference from the entities is removed, the accuracy and speed of extracting the attribute information are improved.
  • FIG. 4 is a specific flowchart of step S204 of the method for analyzing text shown in an exemplary embodiment of the present application; in some possible implementations of the present application, the above S204 may include S2041 to S2043, specifically as follows:
  • S2041 Perform word segmentation processing on the text to be analyzed to obtain multiple second word segments.
  • S2042 Perform mapping processing on each second word segment through the attribute extraction model to obtain a word vector corresponding to each second word segment.
  • Specifically, the multiple second word segments are input into the BERT network in the attribute extraction model for processing; the BERT network maps each second word segment into a common semantic space and outputs a word vector corresponding to each second word segment.
  • S2043 Add the entity label corresponding to each entity to each word vector to obtain the first target text to be analyzed.
  • Adding the entity label corresponding to each entity to each word vector strengthens the connection between each word vector and the entities, helps the attribute extraction model extract attribute information that is highly related to the entities in the text to be analyzed, and also improves the accuracy of extracting the attribute information.
  • Fig. 5 is a schematic flowchart of a method for analyzing text shown in an exemplary embodiment of the present application; it mainly involves the process of obtaining an attribute extraction model before executing the method for analyzing text as shown in Fig. 1 .
  • The method includes steps S301 to S303, specifically as follows:
  • S301 Obtain a sample training set, where the sample training set includes multiple sample texts and an attribute label corresponding to each sample text.
  • the sample training set may come from data published in the network. Collect multiple sample texts, and set attribute labels for the attribute information in each sample text. It is worth noting that the sample text here may be the same as or different from the sample comment sentences used in training the named entity recognition model, and there is no limitation on this.
  • a part of the data in the sample training set can also be used as a sample test set to facilitate subsequent testing of the attribute extraction model in training. For example, several sample texts are selected in the sample training set, and the respective attribute labels corresponding to these sample texts are used as the sample test set.
  • each sample text in the sample training set is processed through an initial attribute extraction network (attribute extraction model before training), to obtain attribute information corresponding to each sample text.
  • When the preset number of training iterations is reached, the initial attribute extraction network at that point is tested.
  • the sample text in the sample test set is input into the current initial attribute extraction network for processing, and the current initial attribute extraction network outputs the actual attribute information corresponding to the sample text.
  • a second loss value between the actual attribute information corresponding to the sample text and the attribute information corresponding to the sample text in the sample test set is calculated based on a loss function.
  • the loss function may be a cross-entropy loss function.
  • If the second loss value does not meet a second preset condition, the parameters of the initial attribute extraction network are adjusted (for example, the weight values corresponding to each network layer of the initial attribute extraction network are adjusted) and the initial attribute extraction network continues to be trained.
  • If the second loss value meets the second preset condition, the training of the initial attribute extraction network is stopped, and the trained initial attribute extraction network is used as the trained attribute extraction model.
  • the second preset condition is that the loss value is less than or equal to a preset loss value threshold. Then, when the second loss value is greater than the loss value threshold, adjust the parameters of the initial attribute extraction network, and continue to train the initial attribute extraction network. When the second loss value is less than or equal to the loss value threshold, stop training the initial attribute extraction network, and use the trained initial attribute extraction network as a trained attribute extraction model.
  • the description here is only for illustration and not for limitation.
  • the loss function convergence means that the value of the loss function tends to be stable.
  • the description here is only for illustration and not for limitation.
  • the method for analyzing text provided in this application may further include training a sentiment analysis model.
  • the sentiment analysis model is obtained by training the training set using a machine learning algorithm.
  • a plurality of sample sentiment analysis sentences containing emotional tendencies are collected in advance, and a sample sentiment analysis result corresponding to each sample sentiment analysis sentence is set.
  • a training set is formed based on these sample sentiment analysis sentences and sample sentiment analysis results corresponding to the sample sentiment analysis sentences.
  • a part of the data in the training set can also be used as a test set to facilitate subsequent testing of the sentiment analysis model. For example, several sample sentiment analysis sentences are selected in the training set, and the sample sentiment analysis results corresponding to these sample sentiment analysis sentences are used as the test set.
  • each sample sentiment analysis sentence in the training set is processed by an initial sentiment analysis network (sentiment analysis model before training), to obtain an actual sentiment analysis result corresponding to each sample sentiment analysis sentence.
  • When the preset number of training iterations is reached, the initial sentiment analysis network at that point is tested.
  • the sample sentiment analysis sentence in the test set is input into the current initial sentiment analysis network for processing, and the current initial sentiment analysis network outputs the actual sentiment analysis result corresponding to the sample sentiment analysis sentence.
  • a third loss value between the actual sentiment analysis result corresponding to the sample sentiment analysis sentence and the sample sentiment analysis result corresponding to the sample sentiment analysis sentence in the test set is calculated based on the loss function.
  • the loss function may be a cross-entropy loss function.
  • If the third loss value does not meet a third preset condition, the parameters of the initial sentiment analysis network are adjusted (for example, the weight values corresponding to each network layer of the initial sentiment analysis network are adjusted) and the initial sentiment analysis network continues to be trained.
  • If the third loss value meets the third preset condition, the training of the initial sentiment analysis network is stopped, and the trained initial sentiment analysis network is used as the trained sentiment analysis model.
  • For example, the third preset condition is that the loss value is less than or equal to a preset loss-value threshold: when the third loss value is greater than the threshold, the parameters of the initial sentiment analysis network are adjusted and training continues; when the third loss value is less than or equal to the threshold, the training of the initial sentiment analysis network is stopped, and the trained network is used as the trained sentiment analysis model.
  • the description here is only for illustration and not for limitation.
  • the loss function convergence means that the value of the loss function tends to be stable.
  • In other embodiments, the named entity recognition model, the attribute extraction model, and the sentiment analysis model may be trained simultaneously.
  • the training sample sets used by the three models can be similar. For example, they can all be sample analysis texts.
  • the labels corresponding to the sample analysis texts are different.
  • For the specific training process of each model, please refer to the process of training each model individually described above.
  • During simultaneous training, the loss values corresponding to the three models can be weighted and summed; if the weighted sum does not satisfy a fourth preset condition, the corresponding parameters of the three models are adjusted and the three models continue to be trained; if the weighted sum satisfies the fourth preset condition, the training of the three models is stopped, and the three trained models are obtained.
  • the fourth preset condition is that the loss value is less than or equal to a preset loss value threshold. Then, when the loss value after weighted superposition is greater than the loss value threshold, adjust the corresponding parameters of the three models during the training process, and continue to train the three models. When the loss value after weighted superposition is less than or equal to the loss value threshold, the training of these three models is stopped, and three trained models are obtained.
  • the description here is only for illustration and not for limitation.
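  • The weighted superposition of the three loss values can be sketched as follows; the weights and the threshold are illustrative assumptions.

```python
# Sketch: joint training criterion as a weighted sum of the three model losses.
def combined_loss(ner_loss, attribute_loss, sentiment_loss, weights=(1.0, 1.0, 1.0)):
    w1, w2, w3 = weights
    return w1 * ner_loss + w2 * attribute_loss + w3 * sentiment_loss

def satisfies_fourth_condition(ner_loss, attribute_loss, sentiment_loss, threshold=0.3):
    # Fourth preset condition: the weighted, superimposed loss value is less than
    # or equal to the preset loss-value threshold.
    return combined_loss(ner_loss, attribute_loss, sentiment_loss) <= threshold
```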
  • training the three models at the same time can improve the fit of the three models when processing data, and the three models supervise each other, so that in actual use, the entity comparison results obtained by analysis are more accurate.
  • FIG. 6 is a schematic diagram of an apparatus for analyzing text provided by an embodiment of the present application.
  • the units included in the device are used to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 5 .
  • For details, please refer to the relevant descriptions in the embodiments corresponding to FIG. 1 to FIG. 5.
  • For ease of description, only the parts related to this embodiment are shown. Referring to FIG. 6, the apparatus includes:
  • An acquisition unit 410 configured to acquire text to be analyzed
  • An identification unit 420 configured to identify at least two entities in the text to be analyzed, where the text to be analyzed includes comment sentences containing at least two entities;
  • An extraction unit 430 configured to extract attribute information in the text to be analyzed through a pre-trained attribute extraction model
  • the analysis unit 440 is configured to analyze the at least two entities, the attribute information, and the text to be analyzed by using a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  • the identification unit 420 is specifically configured to:
  • the device also includes:
  • a label acquisition unit configured to acquire an entity label corresponding to each entity
  • an adding unit configured to add an entity label corresponding to each entity to the text to be analyzed to obtain a first target text to be analyzed
  • the extraction unit 430 is specifically used for:
  • the attribute information in the first target text to be analyzed is extracted by using a pre-trained attribute extraction model.
  • the adding unit is specifically used for:
  • An entity label corresponding to each entity is added to each word vector to obtain the first target text to be analyzed.
  • the at least two entities include a group of entities to be compared, and the analysis unit 440 is specifically configured to:
  • the entity tag group includes tags corresponding to the entities to be compared;
  • the second target text to be analyzed is analyzed by the sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  • the analyzing unit 440 is also used for:
  • Each third word segment is mapped through the sentiment analysis model to obtain a word vector corresponding to each third word segment;
  • the word vectors corresponding to each third word segment are combined to obtain a target word vector set
  • the target word vector set is analyzed to obtain the sentiment analysis result.
  • the device also includes a training unit, specifically for:
  • the sample training set includes a plurality of sample texts, and an attribute label corresponding to each sample text;
  • the attribute extraction model is obtained.
  • FIG. 7 is a schematic diagram of a device for analyzing text provided by another embodiment of the present application.
  • the text analysis device 5 of this embodiment includes: a processor 50 , a memory 51 , and computer instructions 52 stored in the memory 51 and operable on the processor 50 .
  • When the processor 50 executes the computer instructions 52, the steps in the above embodiments of the method for analyzing text are implemented, for example, S101 to S104 shown in FIG. 1.
  • Alternatively, when the processor 50 executes the computer instructions 52, the functions of the units in the above apparatus embodiments are realized, for example, the functions of units 410 to 440 shown in FIG. 6.
  • the computer instruction 52 may be divided into one or more units, and the one or more units are stored in the memory 51 and executed by the processor 50 to complete the present application.
  • the one or more units may be a series of computer instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer instruction 52 in the text analysis device 5 .
  • the computer instruction 52 may be divided into an acquisition unit, an identification unit, an extraction unit and an analysis unit, and the specific functions of each unit are as described above.
  • Those skilled in the art can understand that the device for analyzing text may include, but is not limited to, the processor 50 and the memory 51.
  • FIG. 7 is only an example of the device 5 for analyzing text and does not constitute a limitation; the device may include more or fewer components than those shown in the figure, may combine certain components, or may use different components. For example, the device for analyzing text may also include input and output devices, network access devices, buses, and so on.
  • the so-called processor 50 may be a central processing unit (Central Processing Unit, CPU), and may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • The memory 51 may be an internal storage unit of the device for analyzing text, such as a hard disk or memory of the device.
  • The memory 51 may also be an external storage device of the device for analyzing text, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the device for analyzing text.
  • the memory 51 may also include both an internal storage unit of the device for analyzing text and an external storage terminal.
  • the memory 51 is used to store the computer instructions and other programs and data required by the terminal.
  • the memory 51 can also be used to temporarily store data that has been output or will be output.
  • the embodiment of the present application also provides a computer storage medium.
  • the computer storage medium may be non-volatile or volatile.
  • The computer storage medium stores a computer program; when the computer program is executed by a processor, the steps in the above embodiments of the method for analyzing text are implemented.
  • the present application also provides a computer program product.
  • When the computer program product runs on the device, the device is caused to execute the steps in the above embodiments of the method for analyzing text.
  • The embodiment of the present application also provides a chip or integrated circuit, including a processor configured to call and run a computer program from a memory, so that a device equipped with the chip or integrated circuit executes the steps in the above embodiments of the method for analyzing text.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and apparatus for analyzing text, a device and a storage medium, which are applicable to the technical field of artificial intelligence. The method comprises: acquiring text to be analyzed (S101); identifying at least two entities in the text (S102), the text comprising a comment sentence which includes the at least two entities; extracting attribute information in the text by means of a pre-trained attribute extraction model (S103); and analyzing the at least two entities, the attribute information and the text by means of a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities (S104). According to the method, attribute factors are added during a comparison process, simple "entity-advantage and disadvantage" comparison in the prior art is converted into "entity-attribute information-advantage and disadvantage" comparison, and extracted analysis points are comprehensive and accurate, so that the entity comparison results obtained from the analysis are more accurate.

Description

Method, apparatus, device and storage medium for analyzing text
This application claims priority to the Chinese patent application No. 202110705319.4, entitled "Method, apparatus, device and storage medium for analyzing text", filed with the Patent Office of the State Intellectual Property Office of the People's Republic of China on June 24, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the technical field of artificial intelligence, and in particular relates to a method, apparatus, device and storage medium for analyzing text.
Background
Sentiment analysis holds great promise in natural language processing applications. For example, users' satisfaction with products, companies, services, and so on can be evaluated through the comments they post on Internet platforms. Therefore, sentiment analysis is particularly important in natural language processing.
The inventor has realized that in existing sentiment analysis, the extracted analysis points are not comprehensive, which leads to inaccurate sentiment analysis results.
Technical Problem
One of the purposes of the embodiments of the present application is to provide a method, apparatus, device and storage medium for analyzing text, so as to solve the problem in existing sentiment analysis that the extracted analysis points are not comprehensive, which leads to inaccurate sentiment analysis results.
技术解决方案technical solution
第一方面,本申请实施例提供了一种分析文本的方法,其中,该方法包括:In the first aspect, the embodiment of the present application provides a method for analyzing text, wherein the method includes:
获取待分析文本,所述待分析文本包括包含至少两个实体的评论句;Obtaining the text to be analyzed, the text to be analyzed includes comment sentences containing at least two entities;
识别所述待分析文本中的至少两个实体;identifying at least two entities in the text to be analyzed;
通过预先训练好的属性抽取模型提取所述待分析文本中的属性信息;extracting attribute information in the text to be analyzed through a pre-trained attribute extraction model;
通过预先训练好的情感分析模型对所述至少两个实体、所述属性信息以及所述待分析文本进行分析,得到所述至少两个实体对应的情感分析结果。The at least two entities, the attribute information, and the text to be analyzed are analyzed by using a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
第二方面,本申请实施例提供了一种分析文本的装置,其中,该装置包括:In the second aspect, the embodiment of the present application provides a device for analyzing text, wherein the device includes:
获取单元,用于获取待分析文本;an acquisition unit, configured to acquire the text to be analyzed;
识别单元,用于识别所述待分析文本中的至少两个实体,所述待分析文本包括包含至少两个实体的评论句;An identification unit, configured to identify at least two entities in the text to be analyzed, the text to be analyzed includes commentary sentences containing at least two entities;
提取单元,用于通过预先训练好的属性抽取模型提取所述待分析文本中的属性信息;An extraction unit, configured to extract attribute information in the text to be analyzed through a pre-trained attribute extraction model;
分析单元,用于通过预先训练好的情感分析模型对所述至少两个实体、所述属性信息以及所述待分析文本进行分析,得到所述至少两个实体对应的情感分析结果。The analysis unit is configured to analyze the at least two entities, the attribute information, and the text to be analyzed by using a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
第三方面,本申请实施例提供了一种分析文本的设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其中,所述处理器执行所述计算机程序时实现:In a third aspect, an embodiment of the present application provides a device for analyzing text, including a memory, a processor, and a computer program stored in the memory and operable on the processor, wherein the processor executes the Realize when describing a computer program:
获取待分析文本,所述待分析文本包括包含至少两个实体的评论句;Obtaining the text to be analyzed, the text to be analyzed includes comment sentences containing at least two entities;
识别所述待分析文本中的至少两个实体;identifying at least two entities in the text to be analyzed;
通过预先训练好的属性抽取模型提取所述待分析文本中的属性信息;extracting attribute information in the text to be analyzed through a pre-trained attribute extraction model;
通过预先训练好的情感分析模型对所述至少两个实体、所述属性信息以及所述待分析文本进行分析,得到所述至少两个实体对应的情感分析结果。The at least two entities, the attribute information, and the text to be analyzed are analyzed by using a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile, and which stores a computer program that, when executed by a processor, implements:
acquiring text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities;
identifying at least two entities in the text to be analyzed;
extracting attribute information from the text to be analyzed through a pre-trained attribute extraction model; and
analyzing the at least two entities, the attribute information and the text to be analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
Beneficial Effects
Compared with the prior art, the embodiments of the present application have the following beneficial effects: text to be analyzed is acquired; at least two entities in the text to be analyzed are identified, where the text to be analyzed includes a comment sentence containing at least two entities; attribute information in the text to be analyzed is extracted through a pre-trained attribute extraction model; and the at least two entities, the attribute information and the text to be analyzed are analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities. In this scheme, the entities in the text to be analyzed are identified, the attribute information in the text to be analyzed is extracted through the attribute extraction model, and the entities, the attribute information and the text to be analyzed are then analyzed through the sentiment analysis model. Because attribute factors are taken into account during the analysis and comparison, the simple "entity-advantage/disadvantage" comparison of the prior art is converted into an "entity-attribute information-advantage/disadvantage" comparison, the extracted analysis points are comprehensive and accurate, and the final entity comparison results are more accurate.
Description of Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for analyzing text provided by an exemplary embodiment of the present application;
Fig. 2 is a detailed flowchart of step S102 of the method for analyzing text shown in an exemplary embodiment of the present application;
Fig. 3 is a schematic flowchart of a method for analyzing text provided by another embodiment of the present application;
Fig. 4 is a detailed flowchart of step S204 of the method for analyzing text shown in an exemplary embodiment of the present application;
Fig. 5 is a schematic flowchart of a method for analyzing text shown in an exemplary embodiment of the present application;
Fig. 6 is a schematic diagram of an apparatus for analyzing text provided by an embodiment of the present application;
Fig. 7 is a schematic diagram of a device for analyzing text provided by another embodiment of the present application.
The realization of the purposes, the functional features and the advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Embodiments of the Present Invention
To make the purposes, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B may mean A or B. "And/or" herein only describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, in the description of the embodiments of the present application, "multiple" means two or more than two.
Hereinafter, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features. In the description of these embodiments, unless otherwise specified, "multiple" means two or more.
Sentiment analysis holds great promise in natural language processing applications. For example, users' satisfaction with products, companies, services and the like can be evaluated through the comments posted by users on Internet platforms. Sentiment analysis is therefore particularly important in natural language processing.
However, existing sentiment analysis often simplifies the problem to an "entity-advantage/disadvantage" comparison, so the extracted analysis points are not comprehensive, which leads to inaccurate sentiment analysis results. For example, in the comment sentence "Brand A's mobile phone is more expensive than Brand B's, but its performance is better", the compared entities are "Brand A" and "Brand B"; with respect to "price", Brand A is the inferior party, but with respect to "performance", Brand A is the superior party. The prior art pays no attention to the two pieces of attribute information "price" and "performance" and can only produce a single comparison result; that result is necessarily wrong for at least one of the two attributes, so the comparison result is not accurate.
In view of this, the present application provides a method for analyzing text: text to be analyzed is acquired; at least two entities in the text to be analyzed are identified, where the text to be analyzed includes a comment sentence containing at least two entities; attribute information in the text to be analyzed is extracted through a pre-trained attribute extraction model; and the at least two entities, the attribute information and the text to be analyzed are analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities. In this scheme, the entities in the text to be analyzed are identified, the attribute information in the text to be analyzed is extracted through the attribute extraction model, and the entities, the attribute information and the text to be analyzed are then analyzed through the sentiment analysis model. Because attribute factors are taken into account during the analysis and comparison, the simple "entity-advantage/disadvantage" comparison of the prior art is converted into an "entity-attribute information-advantage/disadvantage" comparison, the extracted analysis points are comprehensive and accurate, and the final entity comparison results are more accurate.
Please refer to Fig. 1, which is a schematic flowchart of a method for analyzing text provided by an exemplary embodiment of the present application. The method for analyzing text provided by the present application is executed by a device for analyzing text, where the device includes but is not limited to terminals such as smart phones, tablet computers, computers, personal digital assistants (Personal Digital Assistant, PDA) and desktop computers, and may also include various types of servers. In this example, a terminal is used for illustration. The method for analyzing text shown in Fig. 1 may include S101 to S104, which are described in detail as follows:
S101: Acquire text to be analyzed.
The text to be analyzed refers to text whose entities need to undergo sentiment analysis. Since the sentiment analysis in this embodiment refers to a comparison of entities, the comparison is only meaningful when at least two entities are present; the text to be analyzed therefore includes a comment sentence containing at least two entities. The length and number of comment sentences are not limited. For example, a text to be analyzed may be "Company A's market value exceeds Company B's" or "Company A's market value exceeds Company B's, but Company B's reputation exceeds Company A's". Optionally, the text to be analyzed may also be an article, a paragraph of text or the like composed of comment sentences containing at least two entities. The description here is only illustrative and not limiting.
Exemplarily, when the terminal detects an analysis instruction, it acquires the text to be analyzed. The analysis instruction may be triggered by a user, for example by clicking an analysis option on the terminal. The text to be analyzed may be uploaded to the terminal by the user, or the terminal may obtain, according to a file identifier contained in the analysis instruction, the text file corresponding to that file identifier to obtain the text to be analyzed.
S102: Identify at least two entities in the text to be analyzed.
An entity is a thing that exists objectively and can be distinguished from other things. All entities in the text to be analyzed can be identified through a pre-trained named entity recognition model.
S103: Extract attribute information from the text to be analyzed through a pre-trained attribute extraction model.
Word segmentation is performed on the text to be analyzed to obtain multiple word segments. Word segmentation refers to dividing the continuous character sequence of the text to be analyzed into multiple word sequences, that is, multiple word segments, through a word segmentation algorithm. The attribute extraction model may include a word segmentation algorithm, and word segmentation is performed on the text to be analyzed through this algorithm to obtain the multiple word segments corresponding to the text; in other words, the content of the text to be analyzed is divided into multiple word segments. A word segment may be a word or a single character. Exemplarily, multiple candidate segmentations of the text to be analyzed can be determined according to the word segmentation algorithm, and the most suitable one is selected to segment the text, yielding the multiple word segments corresponding to the text to be analyzed. For example, performing word segmentation on "A公司市值超过B公司" ("Company A's market value exceeds Company B's") yields "A公司 / 市值 / 超过 / B公司" (Company A / market value / exceeds / Company B).
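For illustration only, the word segmentation described above can be sketched in Python as follows; the open-source jieba segmenter used here is an assumption made for the example and is not prescribed by this embodiment. Any segmentation algorithm that produces the word segments described above may be substituted.

```python
# A minimal word-segmentation sketch. jieba is used here only as an example
# segmenter; the embodiment does not prescribe a particular algorithm.
import jieba

text = "A公司市值超过B公司"  # "Company A's market value exceeds Company B's"
tokens = jieba.lcut(text)    # roughly ['A公司', '市值', '超过', 'B公司'], depending on the dictionary
print("/".join(tokens))      # e.g. "A公司/市值/超过/B公司"
```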
The pre-trained attribute extraction model includes a Bert network, a Dense network and a CRF network. The Bert network converts the multiple word segments corresponding to the text to be analyzed into a word vector for each segment; the Dense network classifies each word vector and outputs the probability that each word vector belongs to the attribute-information category; and the CRF network labels the word vectors that belong to attribute information.
Exemplarily, the multiple word segments are input into the Bert network for processing; the Bert network maps each word segment to a common semantic space and outputs the word vector corresponding to each segment. The processing order of the segments is not limited: the segments may be input one by one in their original order and mapped to obtain their word vectors, or they may be input out of order and mapped in the same way. The description here is only illustrative and not limiting.
Because the pre-trained attribute extraction model has learned, during training, to judge whether each word segment belongs to attribute information, the word vector corresponding to each segment is input into the Dense network for processing; the Dense network judges whether each word vector belongs to attribute information and outputs the probability that it does. For example, for the word vectors corresponding to the segments "Company A", "market value", "exceeds" and "Company B", the probabilities of belonging to attribute information are 0.2, 0.9, 0.1 and 0.2, respectively.
The output of the Dense network is input into the CRF network; the CRF network labels the word vector with the highest probability and outputs the attribute information corresponding to that word vector. For example, "market value" (市值) has the highest probability and is most likely the attribute information, so the CRF network applies "BIO" labels to the word vector corresponding to "市值", where B marks the starting character of the attribute information, I marks a middle character of the attribute information, and O marks a non-attribute character. For example, B marks "市", I marks "值", and O marks the character after "值" and before "超". The description here is only illustrative and not limiting.
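A non-limiting sketch of such an attribute extraction model is given below using PyTorch and the Hugging Face transformers library (both are assumptions; the embodiment only requires a Bert network, a Dense network and a CRF network). For brevity the CRF decoding is reduced to a per-token argmax; a full implementation would decode the BIO sequence with Viterbi over learned transition scores.

```python
# Sketch of a Bert + Dense + (simplified) CRF attribute extractor.
# Assumes PyTorch and Hugging Face transformers; the tag set is BIO as described above.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

TAGS = ["O", "B", "I"]  # B/I mark attribute-information characters, O marks the rest

class AttributeExtractor(nn.Module):
    def __init__(self, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)                   # word vectors
        self.dense = nn.Linear(self.bert.config.hidden_size, len(TAGS))    # per-token class scores
        # A real CRF layer would hold a transition matrix and run Viterbi decoding;
        # decoding is reduced to a per-token argmax to keep the sketch short.

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        logits = self.dense(hidden)        # (batch, seq_len, num_tags)
        return logits.argmax(dim=-1)       # predicted BIO tag id for every token

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = AttributeExtractor()
enc = tokenizer("A公司市值超过B公司", return_tensors="pt")
tags = model(enc["input_ids"], enc["attention_mask"])   # e.g. "市" -> B, "值" -> I after training
```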
S104: Analyze the at least two entities, the attribute information and the text to be analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
The label corresponding to each entity and the attribute label corresponding to the attribute information are obtained and added to the text to be analyzed; the labeled text is then input into the pre-trained sentiment analysis model for processing, and the sentiment analysis results are output.
Exemplarily, each piece of attribute information corresponds to one sentiment analysis result; when there are multiple pieces of attribute information, multiple sentiment analysis results are output accordingly. Each sentiment analysis result judges the relative strengths and weaknesses of the two entities with respect to one piece of attribute information. For example, if the text to be analyzed is "Company A's market value exceeds Company B's, but Company B has a good reputation", the corresponding entities are Company A and Company B, and the attribute information is market value and reputation; the final sentiment analysis results for this text may be: Company A's market value is better than Company B's and Company B's reputation is better than Company A's, or Company A's market value is better than Company B's and Company A's reputation is worse than Company B's, and so on. The description here is only illustrative and not limiting.
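One way to picture the output described above is one polarity per pair of compared entities and per piece of attribute information. The following data layout is purely illustrative; the field names and polarity labels are assumptions rather than part of the embodiment.

```python
# Illustrative output structure: one sentiment analysis result per piece of attribute information.
text = "A公司市值超过B公司，但B公司口碑好"
results = [
    # (subject entity, object entity, attribute information, polarity of subject versus object)
    ("A公司", "B公司", "市值", "better"),
    ("A公司", "B公司", "口碑", "worse"),
]
for subj, obj, aspect, polarity in results:
    print(f"{subj} is {polarity} than {obj} with respect to {aspect}")
```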
In the above embodiment, text to be analyzed is acquired; at least two entities in the text to be analyzed are identified, where the text to be analyzed includes a comment sentence containing at least two entities; attribute information in the text to be analyzed is extracted through a pre-trained attribute extraction model; and the at least two entities, the attribute information and the text to be analyzed are analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities. In this implementation, the entities in the text to be analyzed are identified, the attribute information is extracted through the attribute extraction model, and the entities, the attribute information and the text to be analyzed are then analyzed through the sentiment analysis model. Because attribute factors are taken into account during the analysis and comparison, the simple "entity-advantage/disadvantage" comparison of the prior art is converted into an "entity-attribute information-advantage/disadvantage" comparison, the extracted analysis points are comprehensive and accurate, and the final entity comparison results are more accurate.
Fig. 2 is a detailed flowchart of step S102 of the method for analyzing text shown in an exemplary embodiment of the present application. In some possible implementations of the present application, the above S102 may include S1021 to S1022, which are described in detail as follows:
S1021: Perform word segmentation on the text to be analyzed to obtain multiple first word segments.
Exemplarily, word segmentation is performed on the text to be analyzed through a word segmentation algorithm to obtain the multiple first word segments corresponding to the text. For the specific word segmentation process, reference may be made to the word segmentation process in S103, which is not repeated here.
Optionally, in one possible implementation, before S1021, the text to be analyzed may also be preprocessed to obtain a preprocessing result. Preprocessing refers to removing redundant information from the text to be analyzed, that is, information with no practical meaning, such as stop words and punctuation marks. Stop words are usually determiners, modal particles, adverbs, prepositions, conjunctions, English characters, digits, mathematical characters and the like. Here, English characters are letters that appear in isolation and have no practical meaning; if a combination of letters is meaningful, it is regarded as a valid character and is not removed. For example, English terms such as CPU, MAC and HR are kept as valid characters and are not removed. The description here is only illustrative and not limiting. Word segmentation is then performed on the preprocessing result to obtain the multiple first word segments.
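A minimal preprocessing sketch along these lines is shown below; for simplicity it filters an already-segmented token list, and the stop-word list and the whitelist of meaningful letter combinations are placeholders chosen for illustration.

```python
# Remove punctuation, digits, lone letters and stop words, but keep meaningful letter combinations.
import re

STOP_WORDS = {"的", "了", "呢", "吧", "啊"}   # placeholder stop-word list
KEEP_TERMS = {"CPU", "MAC", "HR"}             # meaningful letter combinations kept as valid characters

def preprocess(tokens):
    cleaned = []
    for tok in tokens:
        if tok in KEEP_TERMS:
            cleaned.append(tok)               # valid characters, not removed
        elif tok in STOP_WORDS or re.fullmatch(r"[\W\d_]+|[A-Za-z]", tok):
            continue                          # stop words, punctuation, digits, isolated letters
        else:
            cleaned.append(tok)
    return cleaned

print(preprocess(["A公司", "的", "市值", "超过", "B公司", "!"]))  # ['A公司', '市值', '超过', 'B公司']
```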
In this implementation, the text to be analyzed is preprocessed and its redundant information is removed in advance, so that when the subsequent named entity recognition model processes the preprocessed text it is not disturbed by redundant information; this speeds up the processing of the named entity recognition model and improves the accuracy of its results.
S1022: Process the multiple first word segments based on a pre-trained named entity recognition model to obtain the at least two entities in the text to be analyzed.
The named entity recognition model is used to identify the entities in the text to be analyzed. The type of the named entity recognition model is not limited; for example, it may be a BERT+CRF model or a BERT+BiLSTM+CRF model.
Exemplarily, the multiple first word segments are input into the named entity recognition model; if too many first word segments are input, only the leading segments are kept. For example, if the total length of all input first word segments exceeds a preset length, first word segments up to the preset length are taken. Alternatively, if the total number of characters of all input first word segments exceeds a preset character length, first word segments up to the preset character length are taken. For example, if the total number of characters of all input first word segments exceeds 512, the first word segments corresponding to the first 512 characters are taken.
The truncated first word segments are input into the Bert network of the named entity recognition model for processing; the Bert network maps each first word segment to a common semantic space and outputs the word vector corresponding to each first word segment. The output of the Bert network is input into the CRF network; the CRF network of the named entity recognition model labels the entities among these word vectors and outputs the recognized entities. For example, the CRF network applies "bio" labels to the word vectors corresponding to the entities, where b marks the starting character of an entity, i marks a middle character of an entity, and o marks a non-entity character; for instance, b marks "A", i marks "公", and o marks the character after "司" and before "市". The description here is only illustrative and not limiting.
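The truncation and the reading of the "bio" labels described above can be pictured as follows; the 512-character limit matches the example in the text, and the label-to-span decoding is a generic BIO reading rather than a prescribed implementation.

```python
# Truncate overly long input and turn per-character bio labels into entity spans.
MAX_CHARS = 512

def truncate(tokens, max_chars=MAX_CHARS):
    kept, used = [], 0
    for tok in tokens:
        if used + len(tok) > max_chars:
            break                      # keep only the word segments within the first max_chars characters
        kept.append(tok)
        used += len(tok)
    return kept

def decode_bio(chars, tags):
    """chars: list of characters; tags: list of 'b'/'i'/'o' labels of the same length."""
    entities, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "b":                 # start of a new entity
            if current:
                entities.append(current)
            current = ch
        elif tag == "i" and current:   # continuation of the current entity
            current += ch
        else:                          # non-entity character ends any open entity
            if current:
                entities.append(current)
            current = ""
    if current:
        entities.append(current)
    return entities

chars = list("A公司市值超过B公司")
tags  = ["b", "i", "i", "o", "o", "o", "o", "b", "i", "i"]
print(decode_bio(chars, tags))   # -> ['A公司', 'B公司']
```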
Optionally, before S1021, the method may further include training the named entity recognition model. The named entity recognition model is obtained by training on a training set using a machine learning algorithm. Exemplarily, multiple sample comment sentences are collected in advance and the entities in each sample comment sentence are labeled; the training set is composed of these sample comment sentences and the entities labeled in them.
Optionally, part of the data in the training set may also be used as a test set to facilitate subsequent testing of the model. For example, several sample comment sentences and their corresponding sample entities are selected from the training set as the test set.
Exemplarily, each sample comment sentence in the training set is processed through an initial named entity recognition network (the named entity recognition model before training) to obtain the entity corresponding to each sample comment sentence. For the specific process by which the initial named entity recognition network processes a sample comment sentence, reference may be made to S1021 to S1022 above, which is not repeated here.
When a preset number of training iterations is reached, the current initial named entity recognition network is tested. Exemplarily, a sample comment sentence in the test set is input into the current initial named entity recognition network for processing, and the network outputs the entity corresponding to that sample comment sentence. A first loss value between the entity corresponding to the sample comment sentence and the sample entity corresponding to that sentence in the test set is calculated based on a loss function, which may be a cross-entropy loss function.
When the first loss value does not satisfy a first preset condition, the parameters of the initial named entity recognition network are adjusted (for example, the weight values corresponding to the network layers of the initial named entity recognition network are adjusted) and training continues. When the first loss value satisfies the first preset condition, training of the initial named entity recognition network is stopped, and the trained network is used as the trained named entity recognition model. For example, assume the first preset condition is that the loss value is less than or equal to a preset loss threshold: when the first loss value is greater than the threshold, the parameters are adjusted and training continues; when the first loss value is less than or equal to the threshold, training stops and the trained network is used as the trained named entity recognition model. The description here is only illustrative and not limiting.
Optionally, during training of the initial named entity recognition network, the convergence of its loss function may instead be observed. When the loss function has not converged, the parameters of the initial named entity recognition network are adjusted and training continues based on the training set. When the loss function converges, training stops and the trained network is used as the trained named entity recognition model. Convergence of the loss function means that its value tends to be stable. The description here is only illustrative and not limiting.
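The training procedure described above (train for a preset number of iterations, evaluate on the test set, compare the loss value against a threshold, otherwise adjust the parameters and continue) can be summarized by a generic loop such as the following; the model, data loaders, threshold value and optimizer settings are placeholders and not prescribed by this embodiment.

```python
# Generic training loop with a cross-entropy loss and a loss-value threshold.
# model, train_loader and test_loader are assumed to yield (inputs, labels) batches
# where the model outputs class logits of shape (N, num_classes) and labels have shape (N,).
import torch
import torch.nn as nn

def evaluate(model, loader, criterion):
    # Average test-set loss (the "first loss value" compared against the threshold).
    model.eval()
    with torch.no_grad():
        losses = [criterion(model(x), y).item() for x, y in loader]
    model.train()
    return sum(losses) / len(losses)

def train_until_threshold(model, train_loader, test_loader,
                          eval_every=1000, loss_threshold=0.05, max_steps=100_000):
    criterion = nn.CrossEntropyLoss()                          # cross-entropy loss, as in the text
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # placeholder optimizer settings
    step = 0
    while step < max_steps:
        for x, y in train_loader:
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()                                    # adjust the network parameters
            optimizer.step()
            step += 1
            if step % eval_every == 0:                         # preset number of iterations reached
                if evaluate(model, test_loader, criterion) <= loss_threshold:
                    return model                               # preset condition met: stop training
            if step >= max_steps:
                break
    return model
```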
In the above implementation, the named entity recognition model is obtained by training on the training set with a machine learning algorithm, and the entities in the text to be analyzed are then identified through this model. The entities can thus be identified accurately and quickly, which facilitates the subsequent sentiment analysis on these entities and yields accurate sentiment analysis results.
Optionally, in some possible implementations of the present application, the above S104 may include S1041 to S1044, which are described in detail as follows:
S1041: Obtain an entity label group, where the entity label group includes the labels corresponding to the entities to be compared.
In this embodiment, the at least two entities corresponding to the text to be analyzed include one group of entities to be compared. Exemplarily, when the text to be analyzed corresponds to two entities, those two entities can be compared, which can be understood as meaning that they are entities of different subjects. When the text to be analyzed corresponds to more than two entities, at least one group of them are entities that can be compared.
The entity label group refers to the labels corresponding to the two entities to be compared. For example, if the text to be analyzed is "Company A's market value exceeds Company B's", the corresponding entities are "Company A" and "Company B", which form a group of entities to be compared; the entity label group consists of the entity label corresponding to "Company A" and the entity label corresponding to "Company B".
When the entities in the text to be analyzed are identified through the named entity recognition model, they are marked with "bio" labels, from which the position of each entity in the text can be determined. An entity label is set for each entity in the order in which the entities were determined, and the entity labels corresponding to the two entities to be compared are extracted.
S1042: Obtain the attribute label corresponding to the attribute information.
When the attribute information in the text to be analyzed is extracted through the attribute extraction model, it is marked with "BIO" labels, from which the position of each piece of attribute information in the text can be determined. An attribute label is set for each piece of attribute information.
For example, if the text to be analyzed is "Company A's market value exceeds Company B's", the corresponding attribute information is "market value" (市值), and the attribute label "<asp></asp>" is set for "市值". The description here is only illustrative and not limiting.
S1043: Add the entity label group and the attribute label to the text to be analyzed to obtain a second target text to be analyzed.
According to the positions of the two entities to be compared in the text to be analyzed and their respective entity labels, the entity labels corresponding to the two entities are added to the text to be analyzed; at the same time, the attribute information together with its attribute label is added to the beginning of the text to be analyzed, yielding the second target text to be analyzed.
For example, adding "<s></s>", "<o></o>" and "<asp>市值</asp>" to the text to be analyzed yields "<asp>市值</asp><s>A公司</s>市值超过<o>B公司</o>".
Optionally, the attribute information and its attribute label may instead be added to the end of the text to be analyzed, yielding "<s>A公司</s>市值超过<o>B公司</o><asp>市值</asp>". The description here is only illustrative and not limiting.
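Constructing the second target text to be analyzed can be as simple as inserting the tags around the recorded entity and attribute positions; the sketch below reproduces the example above and shows only one possible way of inserting the tags.

```python
# Build the tagged input ("second target text to be analyzed") from the raw sentence,
# the two entities to be compared and the extracted attribute information.
def build_tagged_text(text, subject, obj, aspect, aspect_first=True):
    tagged = text.replace(subject, f"<s>{subject}</s>", 1)   # label the first compared entity
    tagged = tagged.replace(obj, f"<o>{obj}</o>", 1)          # label the second compared entity
    aspect_tag = f"<asp>{aspect}</asp>"                       # attribute label around the attribute
    return aspect_tag + tagged if aspect_first else tagged + aspect_tag

print(build_tagged_text("A公司市值超过B公司", "A公司", "B公司", "市值"))
# -> "<asp>市值</asp><s>A公司</s>市值超过<o>B公司</o>"
```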
S1044: Analyze the second target text to be analyzed through the sentiment analysis model to obtain the sentiment analysis results corresponding to the at least two entities.
Exemplarily, mapping processing is performed on the second target text to be analyzed to obtain the semantic vector corresponding to it, and this semantic vector is then classified, that is, the sentiment tendency to which the semantic vector belongs is determined.
In the above implementation, the second target text to be analyzed is analyzed through the sentiment analysis model. Because the second target text contains the attribute label corresponding to the attribute information and the entity labels corresponding to the two entities to be compared, attribute factors are taken into account during the analysis, the extracted analysis points are comprehensive and accurate, and the entity comparison results obtained by the analysis are more accurate.
Optionally, in some possible implementations of the present application, the above S1044 may include S10441 to S10444, which are described in detail as follows:
S10441: Perform word segmentation on the second target text to be analyzed to obtain multiple third word segments.
For the specific process of segmenting the second target text to be analyzed into multiple third word segments, reference may be made to the word segmentation process in S103, which is not repeated here.
S10442: Perform mapping processing on each third word segment through the sentiment analysis model to obtain the word vector corresponding to each third word segment.
Exemplarily, the multiple third word segments are input into the Bert network of the sentiment analysis model for processing; the Bert network maps each segment to a common semantic space and outputs the word vector corresponding to each third word segment.
S10443: Based on the order in which the second target text to be analyzed was segmented, combine the word vectors corresponding to the third word segments to obtain a target word vector set.
Exemplarily, a long short-term memory network (Long Short-Term Memory, LSTM) may be used to process the word vector corresponding to each third word segment; this network combines the word vectors in the order in which the second target text was segmented and outputs the target word vector set.
S10444: Analyze the target word vector set to obtain the sentiment analysis result.
The target word vector set is input into the Dense network of the sentiment analysis model for processing; the Dense network determines the probability that the target word vector set belongs to each sentiment tendency and outputs the tendency with the highest probability, that is, the sentiment analysis result. For example, the final sentiment analysis result corresponding to the text to be analyzed may be: Company A's market value is better than Company B's, Company A is at an advantage, Company B's market value is worse than Company A's, Company B is at a disadvantage, and so on. The description here is only illustrative and not limiting.
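Steps S10441 to S10444 can be pictured as a Bert encoder followed by an LSTM and a Dense classification layer. The following PyTorch sketch is one possible realization under those assumptions; the set of sentiment tendencies, the pooling choice and the model names are placeholders rather than the prescribed implementation.

```python
# Sketch of the sentiment analysis model: Bert word vectors -> LSTM -> Dense softmax.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

CLASSES = ["subject better", "subject worse", "no comparison"]   # placeholder label set

class ComparativeSentimentModel(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)             # word vector per token (S10442)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)    # combine vectors in order (S10443)
        self.dense = nn.Linear(2 * hidden, len(CLASSES))             # per-tendency scores (S10444)

    def forward(self, input_ids, attention_mask):
        vectors = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        seq, _ = self.lstm(vectors)
        logits = self.dense(seq[:, -1, :])        # final LSTM step stands in for the target vector set
        return logits.softmax(dim=-1)             # probability of each sentiment tendency

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = ComparativeSentimentModel()
enc = tokenizer("<asp>市值</asp><s>A公司</s>市值超过<o>B公司</o>", return_tensors="pt")
probs = model(enc["input_ids"], enc["attention_mask"])
print(CLASSES[int(probs.argmax())])               # the tendency with the highest probability
```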
In the above implementation, the second target text to be analyzed is analyzed through the sentiment analysis model. Because the second target text contains the attribute label corresponding to the attribute information and the entity labels corresponding to the two entities to be compared, attribute factors are taken into account during the analysis, the extracted analysis points are comprehensive and accurate, and the entity comparison results obtained by the analysis are more accurate.
Fig. 3 is a schematic flowchart of a method for analyzing text provided by another embodiment of the present application. Exemplarily, in some possible implementations of the present application, the method for analyzing text shown in Fig. 3 may include S201 to S206, which are described in detail as follows:
S201: Acquire text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities.
S202: Identify at least two entities in the text to be analyzed.
For S201 to S202 in this example, reference may be made to the description of S101 to S102 in the embodiment corresponding to Fig. 1, which is not repeated here.
S203: Obtain the entity label corresponding to each entity.
When the entities in the text to be analyzed are identified through the named entity recognition model, they are marked with "bio" labels, from which the position of each entity in the text can be determined. An entity label is set for each entity in the order in which the entities were determined.
For example, if the text to be analyzed is "Company A's market value exceeds Company B's", the corresponding entities are "Company A" and "Company B"; the entity label "<s></s>" is set for "Company A" and the entity label "<o></o>" is set for "Company B". The description here is only illustrative and not limiting.
S204: Add the entity label corresponding to each entity to the text to be analyzed to obtain a first target text to be analyzed.
According to the position of each entity in the text to be analyzed and the entity label corresponding to each entity, the entity label corresponding to each entity is added to the text to be analyzed, yielding the first target text to be analyzed. For example, adding "<s></s>" and "<o></o>" to the text to be analyzed yields the first target text to be analyzed, namely "<s>A公司</s>市值超过<o>B公司</o>". The description here is only illustrative and not limiting.
S205: Extract attribute information from the first target text to be analyzed through the pre-trained attribute extraction model.
For the specific process of extracting the attribute information from the first target text to be analyzed through the attribute extraction model, reference may be made to the process in S103 of extracting attribute information from the text to be analyzed. It is worth noting that, because entity labels have been added to the entities in this embodiment, the word segments carrying entity labels can be skipped when the attribute extraction model extracts attribute information from the first target text to be analyzed, and only the other word segments need to be processed; with the interference of the entities removed, the accuracy and speed of attribute extraction are improved.
S206: Analyze the at least two entities, the attribute information and the text to be analyzed through the pre-trained sentiment analysis model to obtain the sentiment analysis results corresponding to the at least two entities.
For S206 in this example, reference may be made to the description of S104 in the embodiment corresponding to Fig. 1, which is not repeated here.
In the above embodiment, entity labels are added to the entities; when the attribute extraction model extracts attribute information from the first target text to be analyzed, the word segments carrying entity labels can be skipped and only the other word segments are processed. With the interference of the entities removed, the accuracy and speed of attribute extraction are improved.
Fig. 4 is a detailed flowchart of step S204 of the method for analyzing text shown in an exemplary embodiment of the present application. In some possible implementations of the present application, the above S204 may include S2041 to S2043, which are described in detail as follows:
S2041: Perform word segmentation on the text to be analyzed to obtain multiple second word segments.
For the specific process of segmenting the text to be analyzed into multiple second word segments, reference may be made to the word segmentation process in S103, which is not repeated here.
S2042: Perform mapping processing on each second word segment through the attribute extraction model to obtain the word vector corresponding to each second word segment.
Exemplarily, the multiple second word segments are input into the Bert network of the attribute extraction model for processing; the Bert network maps each segment to a common semantic space and outputs the word vector corresponding to each second word segment.
S2043: Add the entity label corresponding to each entity to each word vector to obtain the first target text to be analyzed.
The entity label corresponding to each entity is added to the word vector corresponding to each second word segment, yielding the first target text to be analyzed. For example, the "<s></s>" and "<o></o>" entity labels are added to the word vectors corresponding to the second word segments to obtain the first target text to be analyzed. The description here is only illustrative and not limiting.
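Because this step attaches the entity labels at the word-vector level rather than to the raw characters, one plausible realization is to concatenate a small tag embedding to each word vector. The sketch below illustrates that idea; it is an assumption rather than the mechanism prescribed by this embodiment.

```python
# One possible way to attach entity labels at the word-vector level:
# concatenate a learned tag embedding ("<s>", "<o>" or none) to each token's word vector.
import torch
import torch.nn as nn

TAG_IDS = {"none": 0, "<s>": 1, "<o>": 2}

class EntityTagFusion(nn.Module):
    def __init__(self, tag_dim=16):
        super().__init__()
        self.tag_embedding = nn.Embedding(len(TAG_IDS), tag_dim)

    def forward(self, word_vectors, tag_ids):
        # word_vectors: (batch, seq_len, word_dim); tag_ids: (batch, seq_len)
        tags = self.tag_embedding(tag_ids)
        return torch.cat([word_vectors, tags], dim=-1)   # labeled "first target text" representation

fusion = EntityTagFusion()
vectors = torch.randn(1, 4, 768)                          # e.g. vectors for A公司 / 市值 / 超过 / B公司
tags = torch.tensor([[TAG_IDS["<s>"], 0, 0, TAG_IDS["<o>"]]])
print(fusion(vectors, tags).shape)                        # torch.Size([1, 4, 784])
```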
In this embodiment, the entity label corresponding to each entity is added to each word vector, which strengthens the connection between each word vector and the entities, so that the attribute information extracted from the text to be analyzed by the attribute extraction model is highly relevant to the entities; this also improves the accuracy of attribute extraction.
Fig. 5 is a schematic flowchart of a method for analyzing text shown in an exemplary embodiment of the present application, mainly involving the process of obtaining the attribute extraction model before the method for analyzing text shown in Fig. 1 is executed. The method includes S301 to S303, which are described in detail as follows:
S301: Obtain a sample training set, where the sample training set includes multiple sample texts and the attribute label corresponding to each sample text.
Exemplarily, the sample training set may come from data publicly available on the network. Multiple sample texts are collected, and an attribute label is set for the attribute information in each sample text. It is worth noting that the sample texts here may be the same as or different from the sample comment sentences used when training the named entity recognition model; this is not limited.
Optionally, part of the data in the sample training set may also be used as a sample test set to facilitate subsequent testing of the attribute extraction model during training. For example, several sample texts and their corresponding attribute labels are selected from the sample training set as the sample test set.
S302: Train an initial attribute extraction network based on the sample training set, and update the parameters of the initial attribute extraction network based on the training results.
Exemplarily, each sample text in the sample training set is processed through the initial attribute extraction network (the attribute extraction model before training) to obtain the attribute information corresponding to each sample text. For the specific process by which the initial attribute extraction network processes a sample text, reference may be made to S103 above, which is not repeated here.
When a preset number of training iterations is reached, the current initial attribute extraction network is tested. Exemplarily, a sample text in the sample test set is input into the current initial attribute extraction network for processing, and the network outputs the actual attribute information corresponding to that sample text. A second loss value between the actual attribute information corresponding to the sample text and the attribute information corresponding to that sample text in the sample test set is calculated based on a loss function, which may be a cross-entropy loss function.
When the second loss value does not satisfy a second preset condition, the parameters of the initial attribute extraction network are adjusted (for example, the weight values corresponding to the network layers of the initial attribute extraction network are adjusted) and training continues. When the second loss value satisfies the second preset condition, training of the initial attribute extraction network is stopped, and the trained network is used as the trained attribute extraction model.
For example, assume the second preset condition is that the loss value is less than or equal to a preset loss threshold: when the second loss value is greater than the threshold, the parameters of the initial attribute extraction network are adjusted and training continues; when the second loss value is less than or equal to the threshold, training stops and the trained network is used as the trained attribute extraction model. The description here is only illustrative and not limiting.
S303: When it is detected that the loss function corresponding to the initial attribute extraction network has converged, obtain the attribute extraction model.
Exemplarily, during training of the initial attribute extraction network, the convergence of its loss function may instead be observed. When the loss function has not converged, the parameters of the initial attribute extraction network are adjusted and training continues based on the sample training set. When the loss function converges, training stops and the trained network is used as the trained attribute extraction model. Convergence of the loss function means that its value tends to be stable. The description here is only illustrative and not limiting.
可选地,本申请提供的分析文本的方法还可包括训练情感分析模型。该情感分析模型是通过使用机器学习算法对训练集进行训练得到。示例性地,预先采集多个包含情感倾向的样本情感分析句,设置每个样本情感分析句对应的样本情感分析结果。基于这些样本情感分析句以及样本情感分析句对应的样本情感分析结果构成训练集。Optionally, the method for analyzing text provided in this application may further include training a sentiment analysis model. The sentiment analysis model is obtained by training the training set using a machine learning algorithm. Exemplarily, a plurality of sample sentiment analysis sentences containing emotional tendencies are collected in advance, and a sample sentiment analysis result corresponding to each sample sentiment analysis sentence is set. A training set is formed based on these sample sentiment analysis sentences and sample sentiment analysis results corresponding to the sample sentiment analysis sentences.
可选地,还可将训练集中的一部分数据作为测试集,便于后续对情感分析模型进行测试。例如,在训练集中选取若干个样本情感分析句,以及这些样本情感分析句各自对应的样本情感分析结果作为测试集。Optionally, a part of the data in the training set can also be used as a test set to facilitate subsequent testing of the sentiment analysis model. For example, several sample sentiment analysis sentences are selected in the training set, and the sample sentiment analysis results corresponding to these sample sentiment analysis sentences are used as the test set.
示例性地,通过初始情感分析网络(训练前的情感分析模型)对训练集中的每个样本情感分析句进行处理,得到每个样本情感分析句对应的实际情感分析结果。初始情感分析网络对样本情感分析句进行处理的具体过程,可参考上述S104中的具体过程,此处不再赘述。Exemplarily, each sample sentiment analysis sentence in the training set is processed by an initial sentiment analysis network (sentiment analysis model before training), to obtain an actual sentiment analysis result corresponding to each sample sentiment analysis sentence. For the specific process of processing the sample sentiment analysis sentence by the initial sentiment analysis network, refer to the specific process in S104 above, which will not be repeated here.
在达到预设的训练次数时,对此时的初始情感分析网络进行测试。示例性地,将测试集中的样本情感分析句输入此时的初始情感分析网络中进行处理,此时的初始情感分析网络输出该样本情感分析句对应的实际情感分析结果。基于损失函数计算该样本情感分析句对应的实际情感分析结果与测试集中该样本情感分析句对应的样本情感分析结果之间的第三损失值。其中,损失函数可以为交叉熵损失函数。When the preset number of training times is reached, the initial sentiment analysis network at this time is tested. Exemplarily, the sample sentiment analysis sentence in the test set is input into the current initial sentiment analysis network for processing, and the current initial sentiment analysis network outputs the actual sentiment analysis result corresponding to the sample sentiment analysis sentence. A third loss value between the actual sentiment analysis result corresponding to the sample sentiment analysis sentence and the sample sentiment analysis result corresponding to the sample sentiment analysis sentence in the test set is calculated based on the loss function. Wherein, the loss function may be a cross-entropy loss function.
When the third loss value does not satisfy a third preset condition, the parameters of the initial sentiment analysis network are adjusted (for example, the weight values corresponding to the network layers of the initial sentiment analysis network are adjusted), and training of the network continues. When the third loss value satisfies the third preset condition, training stops, and the trained network is used as the trained sentiment analysis model. For example, assume the third preset condition is that the loss value is less than or equal to a preset loss value threshold. Then, when the third loss value is greater than the threshold, the parameters are adjusted and training continues; when the third loss value is less than or equal to the threshold, training stops and the trained network is used as the trained sentiment analysis model. This is only an illustrative example and is not limiting.
Optionally, during training of the initial sentiment analysis network, the convergence of the loss function corresponding to the network may be observed instead. When the loss function has not converged, the parameters of the network are adjusted and training continues on the training set. When the loss function converges, training stops and the trained network is used as the trained sentiment analysis model. Here, convergence of the loss function means that its value tends to be stable. This is only an illustrative example and is not limiting.
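For illustration only, the following is a minimal sketch of the threshold-based training flow described above, assuming a PyTorch classifier; the model object, the two data loaders, and the loss threshold are hypothetical placeholders rather than part of the disclosed method.

```python
# A minimal sketch (not the claimed method) of the threshold-based training flow,
# assuming PyTorch; the model object and the two data loaders are placeholders.
import torch
import torch.nn as nn


@torch.no_grad()
def evaluate(model, test_loader, criterion):
    """Return the average loss on the test set (the 'third loss value')."""
    model.eval()
    total, count = 0.0, 0
    for sentences, labels in test_loader:
        total += criterion(model(sentences), labels).item() * len(labels)
        count += len(labels)
    model.train()
    return total / max(count, 1)


def train_sentiment_model(model, train_loader, test_loader,
                          eval_every=1000, loss_threshold=0.05,
                          max_steps=100_000, lr=1e-4):
    criterion = nn.CrossEntropyLoss()                 # cross-entropy loss mentioned above
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    while step < max_steps:
        for sentences, labels in train_loader:
            optimizer.zero_grad()
            logits = model(sentences)                 # actual sentiment analysis result
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()                          # adjust layer weights ("parameters")
            step += 1
            if step % eval_every == 0:                # preset number of training iterations
                third_loss = evaluate(model, test_loader, criterion)
                if third_loss <= loss_threshold:      # third preset condition satisfied
                    return model                      # trained sentiment analysis model
            if step >= max_steps:
                break
    return model
```

The convergence-based variant described above could reuse the same loop by checking whether the evaluated loss has stopped changing between checks instead of comparing it against a fixed threshold.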
Optionally, in a possible implementation, the named entity recognition model, the attribute extraction model, and the sentiment analysis model are trained simultaneously. In this case, the training sample sets used by the three models may be similar; for example, they may all consist of sample analysis texts, with each model using different labels for the same sample analysis texts. For the specific training process, refer to the process of training each model individually described above. It is worth noting that when the three models are trained jointly, the loss values of the three models may be combined by weighted summation and the weighted loss value compared against a fourth preset condition. If the fourth preset condition is not satisfied, the parameters of the three models are adjusted and training of the three models continues; if the weighted loss value satisfies the fourth preset condition, training of the three models stops and the three trained models are obtained.
Assume the fourth preset condition is that the loss value is less than or equal to a preset loss value threshold. Then, when the weighted loss value is greater than the threshold, the parameters of the three models are adjusted and training continues; when the weighted loss value is less than or equal to the threshold, training stops and the three trained models are obtained. This is only an illustrative example and is not limiting.
In the above implementation, training the three models simultaneously improves how well the three models fit together when processing data, and the three models supervise one another, so that in actual use the entity comparison results obtained from the analysis are more accurate.
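As an informal sketch of the joint training step described above, the weighted combination of the three losses might look as follows; the per-model loss() interfaces, the weight values, and an optimizer covering all three models' parameters are assumptions made for illustration only.

```python
# An informal sketch of one joint training step; the per-model loss() interfaces,
# the loss weights, and an optimizer covering all three models' parameters are
# assumptions made for illustration, not the claimed training procedure.
def joint_training_step(ner_model, attr_model, senti_model, batch, optimizer,
                        w_ner=1.0, w_attr=1.0, w_senti=1.0, loss_threshold=0.1):
    ner_loss = ner_model.loss(batch)          # named entity recognition loss
    attr_loss = attr_model.loss(batch)        # attribute extraction loss
    senti_loss = senti_model.loss(batch)      # sentiment analysis loss
    weighted = w_ner * ner_loss + w_attr * attr_loss + w_senti * senti_loss
    if weighted.item() <= loss_threshold:     # fourth preset condition satisfied
        return weighted.item(), True          # stop: the three trained models are obtained
    optimizer.zero_grad()
    weighted.backward()                       # adjust the parameters of all three models
    optimizer.step()
    return weighted.item(), False             # keep training
```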
Please refer to FIG. 6, which is a schematic diagram of an apparatus for analyzing text provided by an embodiment of the present application. The units included in the apparatus are used to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 5; for details, refer to the related descriptions in those embodiments. For ease of description, only the parts related to this embodiment are shown. Referring to FIG. 6, the apparatus includes:
an acquisition unit 410, configured to acquire text to be analyzed;
an identification unit 420, configured to identify at least two entities in the text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities;
an extraction unit 430, configured to extract attribute information in the text to be analyzed through a pre-trained attribute extraction model; and
an analysis unit 440, configured to analyze the at least two entities, the attribute information, and the text to be analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
Optionally, the identification unit 420 is specifically configured to:
perform word segmentation processing on the text to be analyzed to obtain a plurality of first segmented words; and
process the plurality of first segmented words based on a pre-trained named entity recognition model to obtain the at least two entities in the text to be analyzed.
Optionally, the apparatus further includes:
a label acquisition unit, configured to acquire an entity label corresponding to each entity; and
an adding unit, configured to add the entity label corresponding to each entity to the text to be analyzed to obtain a first target text to be analyzed.
The extraction unit 430 is specifically configured to:
extract the attribute information in the first target text to be analyzed through the pre-trained attribute extraction model.
Optionally, the adding unit is specifically configured to:
perform word segmentation processing on the text to be analyzed to obtain a plurality of second segmented words;
perform mapping processing on each second segmented word through the attribute extraction model to obtain a word vector corresponding to each second segmented word; and
add the entity label corresponding to each entity to each word vector to obtain the first target text to be analyzed.
Optionally, the at least two entities include a group of entities to be compared, and the analysis unit 440 is specifically configured to:
acquire an entity label group, where the entity label group includes the labels corresponding to the entities to be compared;
acquire an attribute label corresponding to the attribute information;
add the entity label group and the attribute label to the text to be analyzed to obtain a second target text to be analyzed; and
analyze the second target text to be analyzed through the sentiment analysis model to obtain the sentiment analysis results corresponding to the at least two entities.
Optionally, the analysis unit 440 is further configured to:
perform word segmentation processing on the second target text to be analyzed to obtain a plurality of third segmented words;
perform mapping processing on each third segmented word through the sentiment analysis model to obtain a word vector corresponding to each third segmented word;
combine the word vectors corresponding to the third segmented words, based on the order in which the second target text to be analyzed was segmented, to obtain a target word vector set; and
analyze the target word vector set to obtain the sentiment analysis results.
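A hedged illustration of this analysis flow is sketched below; the tag format and the segment, embed, and classify callables are placeholders and do not prescribe the actual tag scheme or models.

```python
# A hedged illustration of the flow above; the tag format and the segment, embed,
# and classify callables are placeholders, not the actual models or tag scheme.
def analyze_comparison(text, entity_tags, attribute_tag, segment, embed, classify):
    # Add the entity tag group and the attribute tag to get the second target text.
    second_target_text = " ".join(entity_tags) + " " + attribute_tag + " " + text
    # Segment the second target text into "third segmented words".
    third_words = segment(second_target_text)
    # Map each segmented word to a word vector, preserving segmentation order.
    target_word_vectors = [embed(word) for word in third_words]
    # Analyze the ordered target word vector set to get the sentiment analysis result.
    return classify(target_word_vectors)
```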
Optionally, the apparatus further includes a training unit, specifically configured to:
acquire a sample training set, where the sample training set includes a plurality of sample texts and an attribute label corresponding to each sample text;
train an initial attribute extraction network based on the sample training set, and update the parameters of the initial attribute extraction network based on the training results; and
obtain the attribute extraction model when it is detected that the loss function corresponding to the initial attribute extraction network has converged.
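For orientation only, the units of FIG. 6 could be composed in code roughly as follows; the class name and the methods assumed on the three pre-trained models are hypothetical placeholders, not the claimed apparatus.

```python
# An orientation-only sketch of how the FIG. 6 units could be composed; the class
# name and the methods assumed on the three pre-trained models are hypothetical.
class TextAnalysisApparatus:
    def __init__(self, ner_model, attr_model, senti_model):
        self.ner_model = ner_model      # pre-trained named entity recognition model
        self.attr_model = attr_model    # pre-trained attribute extraction model
        self.senti_model = senti_model  # pre-trained sentiment analysis model

    def acquire(self, raw_text):                    # acquisition unit 410
        return raw_text.strip()

    def identify(self, text):                       # identification unit 420
        first_words = self.ner_model.segment(text)
        return self.ner_model.recognize(first_words)        # at least two entities

    def extract(self, text, entities):              # extraction unit 430
        first_target_text = self.attr_model.add_entity_labels(text, entities)
        return self.attr_model.extract(first_target_text)   # attribute information

    def analyze(self, text, entities, attributes):  # analysis unit 440
        return self.senti_model.predict(text, entities, attributes)

    def run(self, raw_text):
        text = self.acquire(raw_text)
        entities = self.identify(text)
        attributes = self.extract(text, entities)
        return self.analyze(text, entities, attributes)
```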
Please refer to FIG. 7, which is a schematic diagram of a device for analyzing text provided by another embodiment of the present application. As shown in FIG. 7, the device 5 for analyzing text of this embodiment includes: a processor 50, a memory 51, and computer instructions 52 stored in the memory 51 and executable on the processor 50. When the processor 50 executes the computer instructions 52, the steps in the above embodiments of the method for analyzing text are implemented, for example S101 to S104 shown in FIG. 1. Alternatively, when the processor 50 executes the computer instructions 52, the functions of the units in the above embodiments are implemented, for example the functions of units 410 to 440 shown in FIG. 6.
Exemplarily, the computer instructions 52 may be divided into one or more units, and the one or more units are stored in the memory 51 and executed by the processor 50 to complete the present application. The one or more units may be a series of computer instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer instructions 52 in the device 5 for analyzing text. For example, the computer instructions 52 may be divided into an acquisition unit, an identification unit, an extraction unit, and an analysis unit, with the specific functions of each unit as described above.
The device for analyzing text may include, but is not limited to, the processor 50 and the memory 51. Those skilled in the art can understand that FIG. 7 is only an example of the device 5 for analyzing text and does not constitute a limitation on the device; the device may include more or fewer components than shown in the figure, combine certain components, or use different components. For example, the device for analyzing text may also include input/output devices, network access devices, buses, and so on.
The processor 50 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may be an internal storage unit of the device for analyzing text, such as a hard disk or memory of the device. The memory 51 may also be an external storage terminal of the device for analyzing text, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the device. Further, the memory 51 may include both an internal storage unit of the device for analyzing text and an external storage terminal. The memory 51 is used to store the computer instructions and other programs and data required by the terminal, and may also be used to temporarily store data that has been output or will be output.
An embodiment of the present application further provides a computer storage medium, which may be non-volatile or volatile. The computer storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the above embodiments of the method for analyzing text are implemented.
The present application further provides a computer program product. When the computer program product runs on the device, the device is caused to execute the steps in the above embodiments of the method for analyzing text.
An embodiment of the present application further provides a chip or integrated circuit, which includes a processor configured to call and run a computer program from a memory, so that a device equipped with the chip or integrated circuit executes the steps in the above embodiments of the method for analyzing text.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments, or make equivalent replacements for some of the technical features; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A method for analyzing text, comprising:
    acquiring text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities;
    identifying the at least two entities in the text to be analyzed;
    extracting attribute information in the text to be analyzed through a pre-trained attribute extraction model; and
    analyzing the at least two entities, the attribute information, and the text to be analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  2. The method according to claim 1, wherein identifying the at least two entities in the text to be analyzed comprises:
    performing word segmentation processing on the text to be analyzed to obtain a plurality of first segmented words; and
    processing the plurality of first segmented words based on a pre-trained named entity recognition model to obtain the at least two entities in the text to be analyzed.
  3. The method according to claim 1, wherein before extracting the attribute information in the text to be analyzed through the pre-trained attribute extraction model, the method further comprises:
    acquiring an entity label corresponding to each entity;
    adding the entity label corresponding to each entity to the text to be analyzed to obtain a first target text to be analyzed;
    wherein extracting the attribute information in the text to be analyzed through the pre-trained attribute extraction model comprises:
    extracting the attribute information in the first target text to be analyzed through the pre-trained attribute extraction model.
  4. The method according to claim 3, wherein adding the entity label corresponding to each entity to the text to be analyzed to obtain the first target text to be analyzed comprises:
    performing word segmentation processing on the text to be analyzed to obtain a plurality of second segmented words;
    performing mapping processing on each second segmented word through the attribute extraction model to obtain a word vector corresponding to each second segmented word; and
    adding the entity label corresponding to each entity to each word vector to obtain the first target text to be analyzed.
  5. The method according to claim 1, wherein the at least two entities include a group of entities to be compared, and analyzing the at least two entities, the attribute information, and the text to be analyzed through the pre-trained sentiment analysis model to obtain the sentiment analysis results corresponding to the at least two entities comprises:
    acquiring an entity label group, where the entity label group includes labels corresponding to the entities to be compared;
    acquiring an attribute label corresponding to the attribute information;
    adding the entity label group and the attribute label to the text to be analyzed to obtain a second target text to be analyzed; and
    analyzing the second target text to be analyzed through the sentiment analysis model to obtain the sentiment analysis results corresponding to the at least two entities.
  6. The method according to claim 5, wherein analyzing the second target text to be analyzed through the sentiment analysis model to obtain the sentiment analysis results corresponding to the at least two entities comprises:
    performing word segmentation processing on the second target text to be analyzed to obtain a plurality of third segmented words;
    performing mapping processing on each third segmented word through the sentiment analysis model to obtain a word vector corresponding to each third segmented word;
    combining the word vectors corresponding to the third segmented words, based on the order in which the second target text to be analyzed was segmented, to obtain a target word vector set; and
    analyzing the target word vector set to obtain the sentiment analysis results.
  7. The method according to any one of claims 1 to 6, wherein before identifying the at least two entities in the text to be analyzed, the method further comprises:
    acquiring a sample training set, where the sample training set includes a plurality of sample texts and an attribute label corresponding to each sample text;
    training an initial attribute extraction network based on the sample training set, and updating parameters of the initial attribute extraction network based on training results; and
    obtaining the attribute extraction model when it is detected that a loss function corresponding to the initial attribute extraction network has converged.
  8. An apparatus for analyzing text, comprising:
    an acquisition unit, configured to acquire text to be analyzed;
    an identification unit, configured to identify at least two entities in the text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities;
    an extraction unit, configured to extract attribute information in the text to be analyzed through a pre-trained attribute extraction model; and
    an analysis unit, configured to analyze the at least two entities, the attribute information, and the text to be analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  9. A device for analyzing text, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the following is implemented:
    acquiring text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities;
    identifying the at least two entities in the text to be analyzed;
    extracting attribute information in the text to be analyzed through a pre-trained attribute extraction model; and
    analyzing the at least two entities, the attribute information, and the text to be analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  10. The device according to claim 9, wherein identifying the at least two entities in the text to be analyzed comprises:
    performing word segmentation processing on the text to be analyzed to obtain a plurality of first segmented words; and
    processing the plurality of first segmented words based on a pre-trained named entity recognition model to obtain the at least two entities in the text to be analyzed.
  11. The device according to claim 9, wherein before extracting the attribute information in the text to be analyzed through the pre-trained attribute extraction model, the method further comprises:
    acquiring an entity label corresponding to each entity;
    adding the entity label corresponding to each entity to the text to be analyzed to obtain a first target text to be analyzed;
    wherein extracting the attribute information in the text to be analyzed through the pre-trained attribute extraction model comprises:
    extracting the attribute information in the first target text to be analyzed through the pre-trained attribute extraction model.
  12. The device according to claim 11, wherein adding the entity label corresponding to each entity to the text to be analyzed to obtain the first target text to be analyzed comprises:
    performing word segmentation processing on the text to be analyzed to obtain a plurality of second segmented words;
    performing mapping processing on each second segmented word through the attribute extraction model to obtain a word vector corresponding to each second segmented word; and
    adding the entity label corresponding to each entity to each word vector to obtain the first target text to be analyzed.
  13. The device according to claim 9, wherein the at least two entities include a group of entities to be compared, and analyzing the at least two entities, the attribute information, and the text to be analyzed through the pre-trained sentiment analysis model to obtain the sentiment analysis results corresponding to the at least two entities comprises:
    acquiring an entity label group, where the entity label group includes labels corresponding to the entities to be compared;
    acquiring an attribute label corresponding to the attribute information;
    adding the entity label group and the attribute label to the text to be analyzed to obtain a second target text to be analyzed; and
    analyzing the second target text to be analyzed through the sentiment analysis model to obtain the sentiment analysis results corresponding to the at least two entities.
  14. The device according to claim 13, wherein analyzing the second target text to be analyzed through the sentiment analysis model to obtain the sentiment analysis results corresponding to the at least two entities comprises:
    performing word segmentation processing on the second target text to be analyzed to obtain a plurality of third segmented words;
    performing mapping processing on each third segmented word through the sentiment analysis model to obtain a word vector corresponding to each third segmented word;
    combining the word vectors corresponding to the third segmented words, based on the order in which the second target text to be analyzed was segmented, to obtain a target word vector set; and
    analyzing the target word vector set to obtain the sentiment analysis results.
  15. The device according to any one of claims 9 to 14, wherein before identifying the at least two entities in the text to be analyzed, the method further comprises:
    acquiring a sample training set, where the sample training set includes a plurality of sample texts and an attribute label corresponding to each sample text;
    training an initial attribute extraction network based on the sample training set, and updating parameters of the initial attribute extraction network based on training results; and
    obtaining the attribute extraction model when it is detected that a loss function corresponding to the initial attribute extraction network has converged.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the following is implemented:
    acquiring text to be analyzed, where the text to be analyzed includes a comment sentence containing at least two entities;
    identifying the at least two entities in the text to be analyzed;
    extracting attribute information in the text to be analyzed through a pre-trained attribute extraction model; and
    analyzing the at least two entities, the attribute information, and the text to be analyzed through a pre-trained sentiment analysis model to obtain sentiment analysis results corresponding to the at least two entities.
  17. The computer-readable storage medium according to claim 16, wherein identifying the at least two entities in the text to be analyzed comprises:
    performing word segmentation processing on the text to be analyzed to obtain a plurality of first segmented words; and
    processing the plurality of first segmented words based on a pre-trained named entity recognition model to obtain the at least two entities in the text to be analyzed.
  18. The computer-readable storage medium according to claim 16, wherein before extracting the attribute information in the text to be analyzed through the pre-trained attribute extraction model, the method further comprises:
    acquiring an entity label corresponding to each entity;
    adding the entity label corresponding to each entity to the text to be analyzed to obtain a first target text to be analyzed;
    wherein extracting the attribute information in the text to be analyzed through the pre-trained attribute extraction model comprises:
    extracting the attribute information in the first target text to be analyzed through the pre-trained attribute extraction model.
  19. The computer-readable storage medium according to claim 18, wherein adding the entity label corresponding to each entity to the text to be analyzed to obtain the first target text to be analyzed comprises:
    performing word segmentation processing on the text to be analyzed to obtain a plurality of second segmented words;
    performing mapping processing on each second segmented word through the attribute extraction model to obtain a word vector corresponding to each second segmented word; and
    adding the entity label corresponding to each entity to each word vector to obtain the first target text to be analyzed.
  20. The computer-readable storage medium according to claim 16, wherein the at least two entities include a group of entities to be compared, and analyzing the at least two entities, the attribute information, and the text to be analyzed through the pre-trained sentiment analysis model to obtain the sentiment analysis results corresponding to the at least two entities comprises:
    acquiring an entity label group, where the entity label group includes labels corresponding to the entities to be compared;
    acquiring an attribute label corresponding to the attribute information;
    adding the entity label group and the attribute label to the text to be analyzed to obtain a second target text to be analyzed; and
    analyzing the second target text to be analyzed through the sentiment analysis model to obtain the sentiment analysis results corresponding to the at least two entities.
PCT/CN2022/071433 2021-06-24 2022-01-11 Method and apparatus for analyzing text, device and storage medium WO2022267454A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110705319.4A CN113420122B (en) 2021-06-24 Method, device, equipment and storage medium for analyzing text
CN202110705319.4 2021-06-24

Publications (1)

Publication Number Publication Date
WO2022267454A1 true WO2022267454A1 (en) 2022-12-29

Family

ID=77717595

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071433 WO2022267454A1 (en) 2021-06-24 2022-01-11 Method and apparatus for analyzing text, device and storage medium

Country Status (1)

Country Link
WO (1) WO2022267454A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116011447A (en) * 2023-03-28 2023-04-25 杭州实在智能科技有限公司 E-commerce comment analysis method, system and computer readable storage medium
CN116069938A (en) * 2023-04-06 2023-05-05 中电科大数据研究院有限公司 Text relevance analysis method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080249764A1 (en) * 2007-03-01 2008-10-09 Microsoft Corporation Smart Sentiment Classifier for Product Reviews
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN104268197A (en) * 2013-09-22 2015-01-07 中科嘉速(北京)并行软件有限公司 Industry comment data fine grain sentiment analysis method
CN105912720A (en) * 2016-05-04 2016-08-31 南京大学 Method for analyzing emotion-involved text data in computer
CN113420122A (en) * 2021-06-24 2021-09-21 平安科技(深圳)有限公司 Method, device and equipment for analyzing text and storage medium



Also Published As

Publication number Publication date
CN113420122A (en) 2021-09-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22826965

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE