WO2020164336A1 - Method and apparatus for extracting stem words through reinforcement learning - Google Patents

Method and apparatus for extracting stem words through reinforcement learning

Info

Publication number
WO2020164336A1
Authority
WO
WIPO (PCT)
Prior art keywords
classification
sentence
network
total loss
current
Prior art date
Application number
PCT/CN2020/070149
Other languages
English (en)
French (fr)
Inventor
刘佳 (Liu Jia)
崔恒斌 (Cui Hengbin)
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Publication of WO2020164336A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • One or more embodiments of this specification relate to the field of machine learning, and more particularly to methods and devices for extracting stem words in sentences by means of reinforcement learning.
  • Computer-executed natural language processing and text analysis have been applied to a variety of technical scenarios, such as intelligent customer service.
  • In intelligent customer service, it is necessary to identify the intent of the question described by the user and then match it to knowledge points in a knowledge base, so as to answer the user's question automatically.
  • However, when a user describes a question, especially by voice, for example in a telephone interaction,
  • there are often spoken fillers such as "um", "ah", "that", "oh", and "well", or some non-essential, unnecessary words are included.
  • This makes it necessary to extract the main words in the sentence, that is, the stem words, for subsequent semantic analysis and intent recognition.
  • In event extraction, it is likewise necessary to exclude stop words and extract the stem words, so as to optimize the effect of event extraction.
  • One or more embodiments of this specification describe a method and device for extracting backbone words using a reinforcement learning system.
  • the training of stem word extraction is carried out in the manner of reinforcement learning, thereby reducing the cost of manual labeling, improving the efficiency of stem word extraction, and optimizing the effect of text analysis.
  • a method for extracting stem words through reinforcement learning including:
  • At least the strategy network is updated to extract the main words from the sentence to be analyzed.
  • the policy network includes a first embedding layer, a first processing layer, and a second processing layer.
  • the use of the policy network to extract the stem words of the first sample sentence in the sentence sample set includes:
  • at the first embedding layer, obtaining the word embedding vector of each word in the first sample sentence;
  • at the first processing layer, determining, according to the word embedding vectors, the probability of each word being a stem word;
  • at the second processing layer, selecting at least some of the words, at least according to the probabilities, to form the first stem word set.
  • In a further embodiment, at the second processing layer, words whose probability value is greater than a preset threshold are selected from the words to form the first stem word set.
  • the classification network includes a second embedding layer and a third processing layer, and said using the classification network to classify the first candidate sentence formed by the first set of stem words includes:
  • at the second embedding layer, obtaining the sentence embedding vector corresponding to the first candidate sentence; and at the third processing layer, determining the first classification result of the first candidate sentence according to the sentence embedding vector.
  • the policy network and/or classification network are based on a recurrent neural network RNN.
  • the above method further includes determining the direction in which the total loss is reduced, including:
  • a direction in which the total loss decreases is determined.
  • the foregoing N classification results are obtained by respectively performing classification processing on the N candidate sentences using the classification network under the same set of classification parameters; in this case, the N total losses correspond to the N sets of strategy parameters;
  • determining the direction in which the total loss is reduced includes:
  • determining the accumulation of the gradients, relative to the current policy parameters, of the at least one set of first policy parameters corresponding to the at least one first total loss, as a positive direction; determining the accumulation of the gradients, relative to the current policy parameters, of the at least one set of second policy parameters corresponding to the at least one second total loss, as a negative direction; and superimposing the positive direction with the opposite of the negative direction as the direction in which the total loss decreases.
  • the current policy parameters in the policy network may be updated in the direction in which the total loss is reduced.
  • determining the direction of the total loss reduction includes:
  • the sum of the first adjustment direction and the second adjustment direction is taken as the direction in which the total loss is reduced.
  • the current policy parameters of the policy network may be updated in the first adjustment direction; and the current classification parameters of the classification network may be updated in the second adjustment direction.
  • According to one implementation, the above method further includes: inputting a second sentence to be analyzed into the policy network, and determining the stem words in the second sentence according to the output of the policy network.
  • a device for extracting stem words through reinforcement learning including:
  • the classification network training unit is configured to use sentence sample sets to train a classification network for sentence classification
  • the first determining unit is configured to extract stem words from the first sample sentence in the sentence sample set using the policy network under the current policy parameters to obtain the first stem word set, and to determine the current first loss according to the number of words in the first sample sentence and the number of words in the first stem word set;
  • the second determining unit is configured to classify the first candidate sentence formed from the first stem word set using the classification network to obtain the first classification result of the first candidate sentence, and to determine the current second loss according to the first classification result and the classification label of the first sample sentence;
  • the total loss determining unit is configured to determine the current total loss according to the current first loss and the current second loss;
  • the update unit is configured to update at least the strategy network in a direction in which the total loss is reduced, so as to extract the main words from the sentence to be analyzed.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
  • a computing device including a memory and a processor, characterized in that executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect is implemented .
  • the learning and training of the extraction of the main words are carried out by means of reinforcement learning. More specifically, an actor-critic reinforcement learning system is used for stem word extraction.
  • the strategy network is used as an actor for stem word extraction;
  • the classification network is used as a critic to classify sentences.
  • the existing sentence sample library can be used as training corpus to train the classification network, thereby avoiding the labor cost of stem word annotation.
  • the classification network after preliminary training can classify the sentences composed of the main words extracted by the strategy network, and thus evaluate the effect of main words extraction.
  • an ideal reinforcement learning system By setting losses for both the output results of the strategy network and the classification network, and repeatedly training the strategy network and the classification network according to the total loss, an ideal reinforcement learning system can be obtained. In this way, an ideal network system can be trained without manual labeling of the main words, and the effective extraction of main words can be realized.
  • Figure 1 shows a schematic diagram of a deep reinforcement learning system using Actor-Critic mode
  • Figure 2 is a schematic diagram of a reinforcement learning system according to an embodiment disclosed in this specification.
  • Fig. 3 shows a flowchart of a method for training a reinforcement learning system for stem word extraction according to an embodiment
  • Fig. 4 shows a schematic structural diagram of a policy network according to an embodiment
  • Fig. 5 shows a schematic structural diagram of a classification network according to an embodiment
  • Figure 6 shows a flow chart of steps for determining the direction of total loss reduction in a training mode
  • Fig. 7 shows a schematic diagram of an apparatus according to an embodiment.
  • a supervised machine learning method can be used to train the stem word extraction model.
  • a large amount of manually labeled annotation data is needed. These annotation data need to label each word in a sentence whether it is a stem word, and the labor cost is high.
  • the method of reinforcement learning is adopted to extract the main words, which reduces the cost of manual labeling and optimizes the effect of the main words extraction.
  • a reinforcement learning system includes an agent and an execution environment.
  • the agent continuously learns and optimizes its strategy through interaction and feedback with the execution environment. Specifically, the agent observes and obtains the state of the execution environment, and according to a certain strategy, determines the action or action to be taken for the state of the current execution environment. Such behavior acts on the execution environment, will change the state of the execution environment, and generate a feedback to the agent at the same time, this feedback is also called reward.
  • the agent judges based on the reward points obtained, whether the previous behavior is correct, whether the strategy needs to be adjusted, and then updates its strategy. By repeatedly observing the state, determining behavior, and receiving feedback, the agent can continuously update the strategy. The ultimate goal is to be able to learn a strategy that maximizes the accumulation of reward points.
  • FIG. 1 shows a schematic diagram of a deep reinforcement learning system using the Actor-Critic method. As shown in Figure 1, the system includes a strategy model as an actor and an evaluation model as a critic.
  • the strategy model obtains the environment state s from the environment, and outputs the action a to be taken in the current environment state according to a certain strategy.
  • the evaluation model obtains the above-mentioned environmental state s and the action a output by the strategy model, scores the current decision of the strategy model to take action a in the state s, and feeds the score back to the strategy model.
  • the strategy model adjusts the strategy according to the score of the evaluation model in order to obtain a higher score. In other words, the goal of strategy model training is to obtain the highest possible score for the evaluation model.
  • the evaluation model will continuously adjust its scoring method, so that the scoring better reflects the accumulation of the reward score r of environmental feedback.
  • the repeated training of the evaluation model and the strategy model makes the evaluation model's score more and more accurate, getting closer and closer to the rewards of environmental feedback, so the strategies adopted by the strategy model are more and more optimized and reasonable, and more environmental rewards are obtained. .
  • the main word extraction is performed by the reinforcement learning system adopting the Actor-Critic method.
  • Fig. 2 is a schematic diagram of a reinforcement learning system according to an embodiment disclosed in this specification.
  • the reinforcement learning system used for stem word extraction includes a strategy network 100 and a classification network 200.
  • the strategy network 100 is used to extract the main words from the sentence, which corresponds to the strategy model shown in FIG. 1 and functions as an Actor;
  • the classification network 200 is used to classify sentences, which corresponds to the evaluation model shown in FIG. 1, and functions as Critic.
  • Both the policy network 100 and the classification network 200 are neural networks.
  • sample sentences with sentence classification tags can be used.
  • the sample sentence (corresponding to the environment state s) is input to the policy network 100.
  • the policy network 100 extracts a number of stem words from the sample sentence to form a stem word set (equivalent to an action a taken), and the stem word set can correspond to a stem sentence.
  • the classification network 200 obtains the main word set, and classifies the main sentence corresponding to the main word set to obtain the classification result. By comparing the classification result with the classification label of the original sample sentence, it is evaluated whether the main word set is extracted correctly.
  • the loss (loss 1 and loss 2 in the figure) can be set for the main word extraction process of the policy network 100 and the classification process of the classification network 200 respectively. Based on the loss, the policy network 100 and the classification network 200 are repeatedly trained to make the loss smaller and the classification More accurate. The strategy network 100 thus trained can be used to extract the main words of the sentence to be analyzed.
  • Fig. 3 shows a flowchart of a method for training a reinforcement learning system for stem word extraction according to an embodiment. It can be understood that the method can be executed by any device, device, platform, or device cluster with computing and processing capabilities.
  • The method includes: step 31, training a classification network for sentence classification using a sentence sample set; step 32, extracting stem words from a first sample sentence in the sentence sample set using the policy network under the current set of policy parameters to obtain a first stem word set, and determining the current first loss according to the number of words in the first sample sentence and the number of words in the first stem word set; step 33, classifying a first candidate sentence formed from the first stem word set using the classification network to obtain a first classification result of the first candidate sentence, and determining the current second loss according to the first classification result and the classification label of the first sample sentence; step 34, determining the current total loss based on the current first loss and the current second loss; and step 35, updating at least the policy network in the direction in which the total loss decreases, for use in extracting stem words from a sentence to be analyzed.
  • the policy network 100 is used to extract the main words from the sentence, and the classification network 200 is used to classify the sentence, thereby evaluating the quality of the main words extracted by the policy network.
  • These two neural networks interact with each other and require repeated training to obtain ideal network parameters.
  • the classification network 200 is trained separately so that it can realize basic sentence classification.
  • step 31 a sentence sample set is used to train a classification network for sentence classification.
  • Sentence classification, also called text classification, is a common task in text analysis, so there already exist abundant sample corpora that can be used for classification training. Therefore, in step 31, sentence samples can be obtained from an existing corpus to form a sentence sample set, where each sentence sample includes an original sentence and a classification label added to that original sentence. Using a sentence sample set composed of such labeled sentence samples, the sentence classification network can be trained.
  • the training method can be carried out by the classic supervised training method.
  • a preliminary trained classification network can be obtained, which can be used to classify sentences.
  • the above classification network can be used to evaluate the strategy network to train the reinforcement learning system.
  • In step 32, stem words are extracted from an arbitrary sample sentence in the sentence sample set, hereinafter referred to as the first sample sentence, using the policy network under the current set of policy parameters, to obtain the corresponding stem word set, referred to as the first stem word set.
  • the policy parameters in the policy network can be initialized randomly; as the policy network is trained, the policy parameters will be continuously adjusted and updated.
  • the current strategy parameter group can be a random parameter group in the initial state, or it can be a strategy parameter in a certain state during the training process.
  • a set of policy parameters of the policy network can be considered to correspond to a policy.
  • the policy network processes the input first sample sentence according to the current policy, and extracts the main word from it.
  • the policy network may include multiple network layers, and the stem word extraction is implemented through the multiple network layers.
  • Fig. 4 shows a schematic structural diagram of a policy network according to an embodiment.
  • the policy network 100 may include an embedded layer 110, a first processing layer 120, and a second processing layer 130.
  • The embedding layer 110 obtains the sample sentence and computes a word embedding vector for each word in the sentence. For example, for the first sample sentence, the word sequence {W1, W2, ..., Wn}, containing n words, can be obtained after word segmentation. The embedding layer computes the corresponding word embedding vector Ei for each word Wi, thereby obtaining {E1, E2, ..., En}.
  • The first processing layer 120 determines, according to the word embedding vectors, the probability of each word being a stem word. For example, for the word embedding vectors {E1, E2, ..., En} of the n words, the probabilities {P1, P2, ..., Pn} of the respective words being stem words are determined.
  • the second processing layer 130 selects at least a part of words from each word according to the above-mentioned probabilities as the main words to form a main word set.
  • a probability threshold is preset. The second processing layer selects words with a probability greater than the above threshold from each word as the main word.
  • the entirety of the network parameters of each layer in the above embedded layer 110, the first processing layer 120, and the second processing layer 130 constitutes a strategy parameter.
  • the policy network 100 uses a recurrent neural network RNN. More specifically, the above embedding layer 110 can be implemented by RNN, so that when embedding each word, the timing effect of the word is considered.
  • the first processing layer 120 and the second processing layer 130 may be implemented by a fully connected processing layer.
  • In other embodiments, the policy network 100 may also adopt a different neural network architecture, such as a long short-term memory (LSTM) network improved on the basis of RNN, a GRU network, or a deep neural network (DNN).
  • a loss function can be used, which is called the first loss function below, to measure the loss of the trunk word extraction process, which is called the first loss below, and is denoted as LK(Loss_Keyword). That is, in step 32, on the basis of obtaining the first stem word set, the current first loss is determined according to the number of words in the first sample sentence and the number of words in the first stem word set.
  • the first loss function is set such that the smaller the number of extracted stem words, the lower the loss value; the more the number of stem words, the higher the loss value.
  • the first loss can also be determined according to the proportion of the extracted stem words relative to the sample sentence. The higher the proportion, the greater the loss value, and the smaller the proportion, the lower the loss value. This is all considered. In an ideal state where the training is expected to be completed, the policy network 100 can exclude as many useless words as possible from the original sentence and retain as few words as possible as the main words.
  • In one example, the first loss function can be set as: LK = Num_Reserve / Num_Total,
  • where Num_Reserve is the number of words retained as stem words, that is, the number of words in the stem word set, and Num_Total is the number of words in the sample sentence.
  • step 33 the classification network is used to classify the first candidate sentence formed by the first stem word set to obtain the first classification result of the first candidate sentence.
  • the preliminary classification parameters of the classification network are determined, and such a classification network can be used to classify sentences.
  • the policy network 100 may output the first set of stem words extracted for the first sample sentence, and the first set of stem words may correspond to a candidate sentence, that is, the first candidate sentence.
  • the first candidate sentence can be understood as a sentence obtained after excluding stop words and meaningless words from the first sample sentence, and only retaining the main word.
  • a classification network may be used to classify the first candidate sentence to obtain a classification result.
  • the classification network may include multiple network layers, and sentence classification is implemented through the multiple network layers.
  • Fig. 5 shows a schematic structural diagram of a classification network according to an embodiment.
  • the classification network 200 may include an embedded layer 210 and a fully connected processing layer 220.
  • The embedding layer 210 obtains the stem word set output by the policy network 100, computes a word embedding vector for each word, and then computes the sentence embedding vector of the candidate sentence formed from the stem word set. For example, for the first stem word set {w1, w2, ..., wm}, the word embedding vectors {e1, e2, ..., em} of the words can be computed respectively, and then, based on the word embedding vectors, the sentence embedding vector Es of the first candidate sentence is obtained.
  • the sentence embedding vector can be obtained by performing operations such as splicing and averaging on each word embedding vector.
  • the fully connected processing layer 220 determines the classification result of the first candidate sentence according to the above sentence embedding vector Es, that is, the first classification result.
  • the entire network parameters of each layer in the above embedded layer 210 and the fully connected processing layer 220 constitute classification parameters.
  • the classification network 200 can be implemented by using a recurrent neural network RNN. More specifically, the above embedding layer 210 can be implemented by RNN. In other embodiments, the classification network 200 may also adopt different neural network architectures, such as LSTM neural network, GRU neural network, or deep neural network DNN, and so on.
  • After the candidate sentence is classified, another loss function, referred to below as the second loss function, can be used to measure the loss of the classification process, referred to below as the second loss and denoted LC (Loss_Classify). That is, in step 33, on the basis of obtaining the first classification result, the current second loss is determined according to the first classification result and the classification label of the first sample sentence.
  • the second loss function is set to determine the second loss LC based on the cross-entropy algorithm.
  • the second loss LC can also be determined based on the difference between the classification result and the classification label through loss functions of other forms and other algorithms.
  • Accordingly, through the second loss function, based on the comparison between the first classification result obtained in this classification and the classification label corresponding to the first sample sentence, the classification loss of this classification, that is, the current second loss, can be determined.
  • the current total loss is determined according to the current first loss and the current second loss.
  • the total loss can be understood as the loss of the entire reinforcement learning system, including the loss of the process of extracting the main words of the strategy network, and the loss of the classification process of the classification network.
  • the total loss is defined as the sum of the above-mentioned first loss and second loss.
  • a certain weight may be assigned to the first loss and the second loss, and the total loss is defined as the weighted sum of the first loss and the second loss.
  • the current total loss can be determined based on the current first loss corresponding to the main word extracted this time, and the current second loss corresponding to this classification.
  • the reinforcement learning system can be trained.
  • the goal of training is to make the total loss as small as possible.
  • Making the total loss as small as possible means that the policy network 100 excludes as many useless words as possible and extracts as few stem words as possible while not changing the meaning of the sentence, so that the sentence classification result of the classification network 200 is as close as possible to the classification label of the original sentence.
  • the reinforcement learning system is updated in the direction of the total loss reduction. Updating the reinforcement learning system includes at least updating the strategy network 100, and may also include updating the classification network 200.
  • the method for determining the direction of the above total loss reduction and the updating method of the reinforcement learning system may be different in different training methods and different training stages, which are described separately below.
  • a training method in order to determine the direction of total loss reduction, different strategies are used in the strategy network 100 to process multiple sample sentences respectively to obtain multiple corresponding stem word sentences and multiple corresponding first losses; Then, the classification network 200 is used to classify each stem word sentence to obtain multiple corresponding classification results and multiple corresponding second losses.
  • multiple total losses for processing multiple sample sentences are obtained.
  • the current loss is compared with multiple total losses, and the gradient of the network parameter corresponding to the total loss that is smaller than the current loss in the multiple total losses relative to the current network parameter is determined as the direction in which the total loss decreases.
  • Fig. 6 shows a flowchart of steps for determining the direction of total loss reduction in this training mode.
  • In order to explore more and better strategies, a certain amount of randomness can be added to the current strategy in the policy network 100 to generate N strategies, and these N strategies correspond to N sets of strategy parameters.
  • With reference to the network structure shown in Fig. 4, random perturbation can be added to the embedding algorithm of the embedding layer to obtain a new strategy; the algorithm for determining the stem word probabilities in the first processing layer can be changed to obtain a new strategy; or the rule algorithm for probability-based selection, for example the probability threshold, can be changed to obtain a new strategy. By combining these variations, N strategies corresponding to N sets of strategy parameters can be obtained.
  • step 61 the first sample sentence is processed by using the strategy network under the above N groups of strategy parameters to obtain the corresponding N trunk word sets.
  • N first losses can be determined respectively according to the first loss function as described above.
  • step 62 the classification network 200 is used to classify the N candidate sentences corresponding to the N stem word sets to obtain N classification results.
  • N second losses corresponding to the N classification results are respectively determined.
  • step 63 according to the N first losses and N second losses, the corresponding N total losses are determined, denoted as L1, L2, ..., Ln.
  • the average value La of the above N total losses can also be determined.
  • step 64 at least one first total loss whose loss value is less than or equal to the mean value and at least one second total loss whose loss value is greater than the mean value are determined.
  • the above-mentioned N total losses are divided into total losses less than or equal to the average value La, called the first total loss, and total losses greater than the average value La, called the second total loss.
  • step 65 based on the above-mentioned first total loss and second total loss, the direction in which the total loss is reduced is determined. More specifically, the first total loss may correspond to the direction of positive learning because the loss is small, and the second total loss may correspond to the direction of negative learning because of the large loss. Therefore, in step 65, the direction of positive learning and the opposite direction of the negative learning direction are combined to obtain the total learning direction, that is, the direction in which the total loss is reduced.
  • the classification network is trained separately, as shown in step 31.
  • To accelerate the convergence of the models, in one embodiment, in the following second stage, the above classification network is fixed and only the strategy network is trained and updated; then, in the third stage, the strategy network and the classification network are trained and updated at the same time.
  • the classification network is fixed, that is, the classification parameters in the classification network remain unchanged and no adjustment is made.
  • the classification network under the same set of classification parameters is used to classify the aforementioned N candidate sentences, that is, based on the same classification method, to obtain the N Classification results.
  • the N total losses determined in step 63 actually correspond to the N strategies of the strategy network, and in turn correspond to the N sets of strategy parameters. That is, the i-th total loss Li corresponds to the i-th group of policy parameters PSi.
  • step 64 on the basis of determining the first total loss and the second total loss, the first strategy parameter corresponding to the first total loss and the second strategy parameter corresponding to the second total loss are determined.
  • the total loss Li is less than or equal to the mean value La, the total loss is classified as the first total loss, and the corresponding strategy parameter group PSi is classified as the first strategy parameter; if the total loss Li is greater than the mean value La, the total loss is classified as the first total loss. The total loss is classified as the second total loss, and the corresponding strategy parameter group PSi is classified as the second strategy parameter.
  • step 65 the direction of the total loss reduction is determined by the following method:
  • This is because the first strategy parameters correspond to total losses whose values are less than or equal to the mean, that is, the smaller total losses; the strategy selection direction corresponding to the first strategy parameters can therefore be considered correct, a "positive sample" for the system to learn from, on which forward learning should be performed. The second strategy parameters correspond to total losses greater than the mean, that is, the larger total losses; the corresponding strategy selection direction can therefore be considered wrong, a "negative sample" for the system to learn from, on which reverse learning should be performed.
  • the first strategy parameter may be multiple sets of first strategy parameters.
  • The multiple sets of first strategy parameters may have different effects on the extraction of stem words at different positions of the sample sentence. Therefore, in one embodiment, forward learning is performed on all of the multiple sets of first strategy parameters: the gradient of each set of first strategy parameters relative to the current strategy parameters is determined, and these gradients are accumulated to obtain the above positive direction.
  • the second strategy parameter may also be multiple sets of second strategy parameters.
  • negative learning is performed on the multiple sets of second strategy parameters, the gradient of each set of second strategy parameters with respect to the current strategy parameter is determined, and the gradients are accumulated to obtain the aforementioned negative direction.
  • In the resulting expression for the direction of total loss reduction, PSi denotes a first strategy parameter, PSj denotes a second strategy parameter, and θ denotes the current strategy parameter.
  • In a specific example, assume N = 10, where L1-L6 are less than the mean loss and are thus first total losses, with the corresponding strategy parameter sets PS1-PS6 being first strategy parameters, and L7-L10 are greater than the mean loss and are thus second total losses, with the corresponding strategy parameter sets PS7-PS10 being second strategy parameters.
  • the direction in which the total loss is reduced is determined by the above method. Therefore, in step 35 of FIG. 3, the current policy parameter set in the policy network 100 is updated in the direction of the total loss reduction.
  • the training of the reinforcement learning system can enter the third stage, and the strategy network 100 and the classification network 200 are trained and updated at the same time.
  • step 61 the first sample sentence is still processed by the strategy network under N sets of different strategy parameters to obtain the corresponding N trunk word sets. These N trunk word sets can correspond to N candidates sentence.
  • the classification parameters used to classify the aforementioned N candidate sentences are not completely the same.
  • step 63 the corresponding N total losses are determined according to the N first losses and N second losses.
  • the network parameters of the policy network and the classification network have changed.
  • the N total losses correspond to N parameter sets, where the i-th parameter set Si includes the i-th group strategy parameter PSi and the classification parameter CSi corresponding to the classification network when processing the i-th candidate sentence.
  • the aforementioned parameter set is an overall set of network parameters of the policy network 100 and the classification network 200.
  • the average value La of N total losses can be determined. Then, in step 64, the above N total losses are divided into a first total loss less than or equal to the average value La, and a second total loss greater than the average value La.
  • the first parameter set corresponding to the first total loss and the second parameter set corresponding to the second total loss can be determined accordingly.
  • If the total loss Li is less than or equal to the mean La, that total loss is classified as a first total loss and the corresponding parameter set Si is classified as a first parameter set; if the total loss Li is greater than the mean La, that total loss is classified as a second total loss and the corresponding parameter set Si is classified as a second parameter set.
  • step 65 the direction of the total loss reduction is determined by the following method:
  • The idea of determining the direction of total loss reduction, that is, the direction of parameter adjustment, is the same as in the second stage: the parameter sets corresponding to the total losses with smaller values, i.e., the first parameter sets, are used as "positive samples" for forward learning by the system; the parameter sets corresponding to the total losses with larger values, i.e., the second parameter sets, are used as "negative samples" for reverse learning.
  • Through such learning, the adjustment and optimization directions of the corresponding strategy parameters and classification parameters are determined for the strategy network and the classification network, respectively.
  • the determination of the adjustment direction is similar to the second stage, except that when calculating the gradient, the gradient of the entire parameter set relative to the current policy parameter is calculated.
  • However, within a parameter set, the strategy parameters and the classification parameters are two independent groups of parameters. Therefore, in the actual gradient calculation, the gradient of the strategy-parameter part of each parameter set relative to the current strategy parameters is still what is calculated, yielding the aforementioned first positive direction and first negative direction, from which the first adjustment direction, that is, the optimization direction of the strategy parameters, is determined.
  • The above first adjustment direction can be expressed in terms of the accumulated gradients, where Si denotes a first parameter set, Sj denotes a second parameter set, and θ denotes the current strategy parameter.
  • For the classification parameters, the determination of the adjustment direction is similar to that for the strategy parameters. Specifically, the accumulation of the gradients of the first parameter sets relative to the current classification parameters is calculated as the second positive direction; the accumulation of the gradients of the second parameter sets relative to the current classification parameters is calculated as the second negative direction; and the second positive direction is superimposed with the opposite of the second negative direction as the classification optimization direction.
  • the strategy parameters and the classification parameters are usually independent of each other, in the actual gradient calculation, the aforementioned second positive direction and second negative direction can be obtained by calculating the gradient of the classification parameter part in each parameter set relative to the current classification parameter. Direction, and then determine the second adjustment direction as the optimization direction of the classification parameters.
  • The above second adjustment direction can likewise be expressed in terms of the accumulated gradients, where Si denotes a first parameter set, Sj denotes a second parameter set, and θ denotes the current classification parameter.
  • the sum of the first adjustment direction and the second adjustment direction can be used as the direction of total loss reduction, that is, the adjustment direction of the entire system.
  • Correspondingly, in step 35 of Fig. 3, updating the reinforcement learning system in the direction in which the total loss decreases includes: updating the current strategy parameters in the strategy network 100 according to the above first adjustment direction, and updating the current classification parameters in the classification network according to the above second adjustment direction. In this way, in the third stage, both the strategy network and the classification network are trained.
  • the above embodiment describes the training process of training the classification network separately in the first stage, fixing the classification network in the second stage, training the strategy network separately, and then training the strategy network and the classification network at the same time in the third stage
  • the second stage can be skipped and the third stage is directly entered, and the strategy network and the classification network are trained at the same time.
  • the strategy network can accurately extract as few main words as possible to make the sentence expression more refined without affecting the meaning of the sentence, that is, the semantic classification result of the sentence.
  • the trained strategy network can be used for stem word extraction.
  • the sentence to be analyzed can be input to the strategy network, and the strategy network uses the strategy parameters obtained by training to process the sentence.
  • the main word in the sentence can be determined.
  • the set of these main words can correspond to a main sentence, which can be used for subsequent intent recognition, semantic matching and other further text analysis to optimize the effect of subsequent text analysis.
  • the learning and training of the extraction of main words are carried out by means of reinforcement learning.
  • the strategy network is used as an actor for stem word extraction; the classification network is used as a critic to classify sentences.
  • the existing sentence sample library can be used as training corpus to train the classification network, thereby avoiding the labor cost of stem word annotation.
  • the classification network after preliminary training can classify the sentences composed of the main words extracted by the strategy network, and thus evaluate the effect of main words extraction.
  • an apparatus for extracting stem words through reinforcement learning is also provided.
  • the device can be deployed on any equipment or platform with computing and processing capabilities.
  • Fig. 7 shows a schematic diagram of an apparatus according to an embodiment. As shown in FIG. 7, the device 700 includes:
  • the classification network training unit 71 is configured to use sentence sample sets to train a classification network for sentence classification
  • the first determining unit 72 is configured to extract stem words from the first sample sentence in the sentence sample set using the policy network under the current policy parameters to obtain the first stem word set, and to determine the current first loss according to the number of words in the first sample sentence and the number of words in the first stem word set;
  • the second determining unit 73 is configured to classify the first candidate sentence formed from the first stem word set using the classification network to obtain the first classification result of the first candidate sentence, and to determine the current second loss according to the first classification result and the classification label of the first sample sentence;
  • the total loss determining unit 74 is configured to determine the current total loss according to the current first loss and the current second loss;
  • the update unit 75 is configured to update at least the strategy network in a direction in which the total loss is reduced, so as to extract the main words from the sentence to be analyzed.
  • the policy network includes a first embedding layer, a first processing layer, and a second processing layer.
  • the first determining unit 72 is specifically configured as:
  • obtain, at the first embedding layer, the word embedding vector of each word in the first sample sentence;
  • determine, at the first processing layer, the probability of each word being a stem word according to the word embedding vectors;
  • select, at the second processing layer, at least some of the words, at least according to the probabilities, to form the first stem word set.
  • a word with a probability value greater than a preset threshold is selected from the various words to form the first trunk word set.
  • the classification network includes a second embedding layer and a third processing layer
  • the second determining unit 73 is specifically configured to:
  • obtain, at the second embedding layer, the sentence embedding vector corresponding to the first candidate sentence; and determine, at the third processing layer, the first classification result of the first candidate sentence according to the sentence embedding vector.
  • the strategy network and/or the classification network are based on a recurrent neural network RNN.
  • the first determining unit 72 is further configured to process the first sample sentence using the policy network under N sets of policy parameters respectively, to obtain the corresponding N stem word sets, and to respectively determine N first losses;
  • the second determining unit 73 is further configured to use the classification network to classify the N candidate sentences corresponding to the N stem word sets, to obtain N classification results, and to respectively determine N second losses;
  • the total loss determining unit 74 is further configured to determine the corresponding N total losses and the average value of the N total losses according to the N first losses and N second losses;
  • the update unit 75 includes a direction determination module 751 and an update module 752.
  • the direction determining module 751 is configured to determine the direction in which the total loss is reduced based on the at least one first total loss and the at least one second total loss;
  • the updating module 752 is configured to perform the network update in the direction determined by the direction determining module 751.
  • the second determining unit 73 is configured to: use the classification network under the same set of classification parameters to perform classification processing on the N candidate sentences respectively to obtain the N classification results ;
  • the N total losses correspond to the N sets of policy parameters;
  • the direction determining module 751 is configured to:
  • determine the accumulation of the gradients of the at least one set of first policy parameters, corresponding to the at least one first total loss, relative to the current policy parameters as a positive direction; determine the accumulation of the gradients of the at least one set of second policy parameters, corresponding to the at least one second total loss, relative to the current policy parameters as a negative direction; and superimpose the positive direction with the opposite of the negative direction as the direction in which the total loss decreases.
  • the update module 752 is configured to update the current policy parameters in the policy network in the direction in which the total loss decreases.
  • the direction determining module 751 is configured as:
  • the sum of the first adjustment direction and the second adjustment direction is taken as the direction in which the total loss is reduced.
  • the update module 752 is configured to: update the current policy parameters of the policy network in the first adjustment direction, and update the current classification parameters of the classification network in the second adjustment direction.
  • In one embodiment, the device 700 further includes a prediction unit (not shown) configured to: input a second sentence to be analyzed into the policy network, and determine the stem words in the second sentence according to the output of the policy network.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2 and FIG. 4.
  • According to an embodiment of yet another aspect, there is also provided a computing device including a memory and a processor; executable code is stored in the memory, and when the processor executes the executable code, the method described in conjunction with Fig. 3 and Fig. 6 is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A method and apparatus for extracting stem words through reinforcement learning. The method includes: first, training a classification network for sentence classification using a sentence sample set; then, extracting stem words from a sample sentence in the sentence sample set using a policy network under current policy parameters to obtain a stem word set, and determining a current first loss according to the number of words in the sample sentence and the number of words in the stem word set; next, classifying a candidate sentence formed from the stem word set using the classification network to obtain a classification result of the candidate sentence, and determining a current second loss according to the classification result and the classification label of the sample sentence. A current total loss can thus be determined from the current first loss and second loss. Further, the reinforcement learning system is updated in the direction in which the total loss decreases, which includes updating at least the policy network, for use in extracting stem words from a sentence to be analyzed.

Description

Method and apparatus for extracting stem words through reinforcement learning
Technical Field
One or more embodiments of this specification relate to the field of machine learning, and in particular to methods and apparatuses for extracting stem words from sentences by means of reinforcement learning.
Background
Computer-executed natural language processing and text analysis, such as intent recognition and event extraction, have been applied in a variety of technical scenarios, for example intelligent customer service. In intelligent customer service, the intent of the question described by the user needs to be recognized and matched to knowledge points in a knowledge base, so that the user's question can be answered automatically. However, when describing a question, especially by voice, for example in a telephone interaction, users often use spoken fillers such as "um", "ah", "that", "oh", and "well", or include non-essential, unnecessary words. It is therefore necessary to extract the main words of the sentence, i.e., the stem words, for subsequent semantic analysis and intent recognition. In event extraction, it is likewise necessary to exclude stop words and extract the stem words, so as to optimize the effect of event extraction.
Therefore, an improved solution that can effectively extract the stem words of a sentence and thereby optimize the effect of text analysis is desired.
Summary
One or more embodiments of this specification describe a method and apparatus for extracting stem words using a reinforcement learning system. With the methods and apparatuses of the embodiments, training for stem word extraction is carried out by means of reinforcement learning, thereby reducing the cost of manual labeling, improving the efficiency of stem word extraction, and optimizing the effect of text analysis.
According to a first aspect, a method for extracting stem words through reinforcement learning is provided, including:
training a classification network for sentence classification using a sentence sample set;
extracting stem words from a first sample sentence in the sentence sample set using a policy network under current policy parameters to obtain a first stem word set, and determining a current first loss according to the number of words in the first sample sentence and the number of words in the first stem word set;
classifying a first candidate sentence formed from the first stem word set using the classification network to obtain a first classification result of the first candidate sentence, and determining a current second loss according to the first classification result and the classification label of the first sample sentence;
determining a current total loss according to the current first loss and the current second loss; and
updating at least the policy network in the direction in which the total loss decreases, for use in extracting stem words from a sentence to be analyzed.
In one embodiment, the policy network includes a first embedding layer, a first processing layer, and a second processing layer, and extracting stem words from the first sample sentence in the sentence sample set using the policy network includes:
at the first embedding layer, obtaining a word embedding vector for each word in the first sample sentence;
at the first processing layer, determining, according to the word embedding vectors, the probability of each word being a stem word; and
at the second processing layer, selecting at least some of the words, at least according to the probabilities, to form the first stem word set.
In a further embodiment, at the second processing layer, words whose probability value is greater than a preset threshold are selected from the words to form the first stem word set.
According to one implementation, the classification network includes a second embedding layer and a third processing layer, and classifying the first candidate sentence formed from the first stem word set using the classification network includes:
at the second embedding layer, obtaining a sentence embedding vector corresponding to the first candidate sentence; and
at the third processing layer, determining the first classification result of the first candidate sentence according to the sentence embedding vector.
In one implementation, the policy network and/or the classification network are based on a recurrent neural network (RNN).
In one embodiment, the above method further includes determining the direction in which the total loss decreases, which includes:
processing the first sample sentence with the policy network under N sets of policy parameters respectively, to obtain N corresponding stem word sets, and respectively determining N first losses;
classifying, with the classification network, the N candidate sentences respectively corresponding to the N stem word sets, to obtain N classification results, and respectively determining N second losses;
determining, according to the N first losses and the N second losses, the corresponding N total losses and the mean of the N total losses;
determining at least one first total loss whose value is less than or equal to the mean, and at least one second total loss whose value is greater than the mean; and
determining the direction in which the total loss decreases based on the at least one first total loss and the at least one second total loss.
Further, in one embodiment, the N classification results are obtained by classifying the N candidate sentences respectively with the classification network under the same set of classification parameters; in this case, the N total losses correspond to the N sets of policy parameters.
In this case, determining the direction in which the total loss decreases includes:
determining the accumulation of the gradients, relative to the current policy parameters, of the at least one set of first policy parameters corresponding to the at least one first total loss, as a positive direction;
determining the accumulation of the gradients, relative to the current policy parameters, of the at least one set of second policy parameters corresponding to the at least one second total loss, as a negative direction; and
superimposing the positive direction with the opposite of the negative direction as the direction in which the total loss decreases.
Further, in the above case, the current policy parameters in the policy network may be updated in the direction in which the total loss decreases.
In another embodiment, the N classification results are obtained by classifying the N candidate sentences with the classification network under M sets of classification parameters, where M <= N; in this case, the N total losses correspond to N parameter sets, where the i-th parameter set includes the i-th set of policy parameters and the classification parameters of the classification network when processing the i-th candidate sentence.
In this case, determining the direction in which the total loss decreases includes:
determining the accumulation of the gradients, relative to the current policy parameters, of the at least one first parameter set corresponding to the at least one first total loss, as a first positive direction;
determining the accumulation of the gradients, relative to the current policy parameters, of the at least one second parameter set corresponding to the at least one second total loss, as a first negative direction;
superimposing the first positive direction with the opposite of the first negative direction as a first adjustment direction;
determining the accumulation of the gradients, relative to the current classification parameters, of the at least one first parameter set corresponding to the at least one first total loss, as a second positive direction;
determining the accumulation of the gradients, relative to the current classification parameters, of the at least one second parameter set corresponding to the at least one second total loss, as a second negative direction;
superimposing the second positive direction with the opposite of the second negative direction as a second adjustment direction; and
taking the sum of the first adjustment direction and the second adjustment direction as the direction in which the total loss decreases.
Further, in the above case, the current policy parameters of the policy network may be updated in the first adjustment direction, and the current classification parameters of the classification network may be updated in the second adjustment direction.
According to one implementation, the above method further includes:
inputting a second sentence to be analyzed into the policy network; and
determining the stem words in the second sentence according to the output of the policy network.
According to a second aspect, an apparatus for extracting stem words through reinforcement learning is provided, including:
a classification network training unit configured to train a classification network for sentence classification using a sentence sample set;
a first determining unit configured to extract stem words from a first sample sentence in the sentence sample set using a policy network under current policy parameters to obtain a first stem word set, and to determine a current first loss according to the number of words in the first sample sentence and the number of words in the first stem word set;
a second determining unit configured to classify a first candidate sentence formed from the first stem word set using the classification network to obtain a first classification result of the first candidate sentence, and to determine a current second loss according to the first classification result and the classification label of the first sample sentence;
a total loss determining unit configured to determine a current total loss according to the current first loss and the current second loss; and
an updating unit configured to update at least the policy network in the direction in which the total loss decreases, for use in extracting stem words from a sentence to be analyzed.
According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method of the first aspect.
According to a fourth aspect, a computing device is provided, including a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect is implemented.
According to the methods and apparatuses provided by the embodiments of this specification, the learning and training of stem word extraction are carried out by means of reinforcement learning. More specifically, a reinforcement learning system in the actor-critic style is used for stem word extraction, in which the policy network acts as the actor and is used to extract stem words, and the classification network acts as the critic and is used to classify sentences. An existing library of sentence samples can be used as the training corpus to train the classification network, thereby avoiding the labor cost of stem word annotation. The preliminarily trained classification network can then classify the sentences formed from the stem words extracted by the policy network, and thus evaluate the effect of stem word extraction. By setting losses for the outputs of both the policy network and the classification network, and repeatedly training the policy network and the classification network according to the total loss, an ideal reinforcement learning system can be obtained. In this way, an ideal network system can be trained without manual annotation of stem words, enabling effective extraction of stem words.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of a deep reinforcement learning system in the Actor-Critic style;
Fig. 2 is a schematic diagram of a reinforcement learning system according to an embodiment disclosed in this specification;
Fig. 3 is a flowchart of a method for training a reinforcement learning system for stem word extraction according to an embodiment;
Fig. 4 is a schematic structural diagram of a policy network according to an embodiment;
Fig. 5 is a schematic structural diagram of a classification network according to an embodiment;
Fig. 6 is a flowchart of the steps for determining the direction of total loss reduction under one training mode;
Fig. 7 is a schematic diagram of an apparatus according to an embodiment.
Detailed Description
The solutions provided in this specification are described below with reference to the drawings.
As mentioned above, stem words need to be extracted from sentences in many text analysis scenarios. To extract stem words automatically, in one approach, a stem word extraction model can be trained using supervised machine learning. Under conventional supervised learning, training such a model requires a large amount of manually labeled annotation data in which each word of a sentence is labeled as being a stem word or not, which incurs a high labor cost.
According to the concept of the embodiments of this specification, stem word extraction is performed by means of reinforcement learning, which reduces the cost of manual annotation and optimizes the effect of stem word extraction.
As known to those skilled in the art, reinforcement learning is a method of learning a policy without labels, based on feedback to sequences of behavior. In general, a reinforcement learning system includes an agent and an execution environment; the agent continuously learns and optimizes its policy through interaction with, and feedback from, the execution environment. Specifically, the agent observes and obtains the state of the execution environment and, according to a certain policy, determines the behavior or action to take for the current state of the execution environment. Such behavior acts on the execution environment, changes the state of the execution environment, and at the same time produces feedback to the agent, which is also called a reward. Based on the reward obtained, the agent judges whether the previous behavior was correct and whether the policy needs to be adjusted, and then updates its policy. By repeatedly observing states, determining behaviors, and receiving feedback, the agent can continuously update its policy, with the ultimate goal of learning a policy that maximizes the accumulated reward.
Various algorithms exist for learning and optimizing the policy in the agent, among which the Actor-Critic method is a policy gradient method for reinforcement learning. Fig. 1 is a schematic diagram of a deep reinforcement learning system in the Actor-Critic style. As shown in Fig. 1, the system includes a policy model acting as the actor and an evaluation model acting as the critic.
The policy model obtains the environment state s from the environment and, according to a certain policy, outputs the action a to take in the current environment state. The evaluation model obtains the environment state s and the action a output by the policy model, scores the policy model's decision to take action a in state s, and feeds the score back to the policy model. The policy model adjusts its policy according to the evaluation model's score in order to obtain a higher score; in other words, the goal of training the policy model is to obtain as high a score as possible from the evaluation model. On the other hand, the evaluation model also continuously adjusts its scoring so that the score better reflects the accumulation of the reward r fed back by the environment.
In this way, the evaluation model and the policy model are trained repeatedly, so that the evaluation model's scores become more and more accurate and closer and closer to the rewards fed back by the environment; accordingly, the policy adopted by the policy model becomes more and more optimized and reasonable, obtaining more rewards from the environment.
Based on the above characteristics, according to the embodiments of this specification, stem word extraction is performed by a reinforcement learning system in the Actor-Critic style.
Fig. 2 is a schematic diagram of a reinforcement learning system according to an embodiment disclosed in this specification. As shown in Fig. 2, the reinforcement learning system for stem word extraction includes a policy network 100 and a classification network 200. The policy network 100 is used to extract stem words from sentences; it corresponds to the policy model shown in Fig. 1 and acts as the Actor. The classification network 200 is used to classify sentences; it corresponds to the evaluation model shown in Fig. 1 and acts as the Critic. Both the policy network 100 and the classification network 200 are neural networks.
To train the policy network 100 and the classification network 200, sample sentences with sentence classification labels can be used.
During training, a sample sentence (corresponding to the environment state s) is input into the policy network 100. According to a certain policy, the policy network 100 extracts several stem words from the sample sentence to form a stem word set (equivalent to taking an action a), and the stem word set can correspond to a stem sentence.
The classification network 200 obtains the stem word set and classifies the stem sentence corresponding to the stem word set to obtain a classification result. By comparing the classification result with the classification label of the original sample sentence, whether the stem word set has been extracted correctly is evaluated.
Losses can be set for the stem word extraction process of the policy network 100 and for the classification process of the classification network 200 respectively (loss 1 and loss 2 in the figure), and the policy network 100 and the classification network 200 are trained repeatedly based on these losses so that the losses become smaller and the classification becomes more accurate. The policy network 100 trained in this way can then be used to extract stem words from sentences to be analyzed.
The training process and processing of the above system are described below.
Fig. 3 is a flowchart of a method for training a reinforcement learning system for stem word extraction according to an embodiment. It can be understood that the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in Fig. 3, the method includes: step 31, training a classification network for sentence classification using a sentence sample set; step 32, extracting stem words from a first sample sentence in the sentence sample set using the policy network under the current set of policy parameters to obtain a first stem word set, and determining the current first loss according to the number of words in the first sample sentence and the number of words in the first stem word set; step 33, classifying a first candidate sentence formed from the first stem word set using the classification network to obtain a first classification result of the first candidate sentence, and determining the current second loss according to the first classification result and the classification label of the first sample sentence; step 34, determining the current total loss according to the current first loss and the current second loss; and step 35, updating at least the policy network in the direction in which the total loss decreases, for use in extracting stem words from a sentence to be analyzed. The specific execution of each of these steps is described below.
As described above with reference to Fig. 2, the policy network 100 is used to extract stem words from sentences, and the classification network 200 is used to classify sentences and thereby evaluate the quality of the stem words extracted by the policy network. These two neural networks interact with each other and must be trained repeatedly before ideal network parameters can be obtained. To help the models converge as quickly as possible, in a first stage, the classification network 200 is trained separately so that it can perform basic sentence classification.
Therefore, first, in step 31, a classification network for sentence classification is trained using a sentence sample set.
Sentence classification, also called text classification, is a common task in text analysis, so there already exist abundant sample corpora that can be used for classification training. Therefore, in step 31, sentence samples can be obtained from an existing corpus to form a sentence sample set, where each sentence sample includes an original sentence and a classification label added to that original sentence. Using a sentence sample set composed of such labeled sentence samples, the sentence classification network can be trained. The training can be carried out in the classic supervised manner.
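As an illustration of this first, supervised stage, the following minimal sketch trains a stand-in sentence classifier on a tiny invented corpus. A bag-of-words logistic-regression classifier from scikit-learn is used here only as a simple substitute for the embedding-plus-fully-connected classification network 200 described later; the sentences, labels, and library choice are assumptions made for the example, not part of the original disclosure.

    # Stage 1 (sketch): supervised pre-training of a sentence classifier
    # on labeled sentence samples, standing in for classification network 200.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy sentence sample set: (original sentence, classification label)
    sentences = [
        "um I want to check my account balance please",
        "how do I reset the password for my account",
        "well that delivery of my package is really late",
        "the package I ordered has not arrived yet",
    ]
    labels = ["account", "account", "delivery", "delivery"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)          # sentence features
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(X, labels)                        # classic supervised training

    # The preliminarily trained classifier can now score candidate sentences.
    print(classifier.predict(vectorizer.transform(["check account balance"])))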
In this way, through step 31, a preliminarily trained classification network is obtained, which can be used to classify sentences. On this basis, the classification network can be used to evaluate the policy network, so as to train the reinforcement learning system.
Specifically, in step 32, stem words are extracted from an arbitrary sample sentence in the sentence sample set, hereinafter referred to as the first sample sentence, using the policy network under the current set of policy parameters, to obtain the corresponding stem word set, referred to as the first stem word set.
It can be understood that, initially, the policy parameters in the policy network may be randomly initialized; as the policy network is trained, the policy parameters are continuously adjusted and updated. The current set of policy parameters may be a random parameter set in the initial state, or the policy parameters at some state during training. A set of policy parameters of the policy network can be considered to correspond to one policy. Accordingly, in step 32, the policy network processes the input first sample sentence according to the current policy and extracts stem words from it.
In one embodiment, the policy network may include multiple network layers through which stem word extraction is implemented.
Fig. 4 is a schematic structural diagram of a policy network according to an embodiment. As shown in Fig. 4, the policy network 100 may include an embedding layer 110, a first processing layer 120, and a second processing layer 130.
The embedding layer 110 obtains the sample sentence and computes a word embedding vector for each word in the sentence. For example, for the first sample sentence, the word sequence {W1, W2, ..., Wn}, containing n words, can be obtained after word segmentation. The embedding layer computes the corresponding word embedding vector Ei for each word Wi, thereby obtaining {E1, E2, ..., En}.
The first processing layer 120 determines, according to the above word embedding vectors, the probability of each word being a stem word. For example, for the word embedding vectors {E1, E2, ..., En} of the n words, the probabilities {P1, P2, ..., Pn} of the respective words being stem words are determined.
The second processing layer 130 selects, according to the above probabilities, at least some of the words as stem words to form the stem word set. In one embodiment, a probability threshold is preset, and the second processing layer selects, from the words, those whose probability is greater than the threshold as stem words.
The network parameters of the embedding layer 110, the first processing layer 120, and the second processing layer 130 as a whole constitute the policy parameters.
In one embodiment, the policy network 100 uses a recurrent neural network (RNN). More specifically, the embedding layer 110 can be implemented with an RNN so that the temporal influence of the words is taken into account when embedding each word. The first processing layer 120 and the second processing layer 130 can be implemented as fully connected processing layers.
In other embodiments, the policy network 100 may also adopt a different neural network architecture, such as a long short-term memory (LSTM) network improved on the basis of RNN, a GRU network, or a deep neural network (DNN).
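The following sketch illustrates, under simplifying assumptions, a forward pass of a policy network of the kind shown in Fig. 4: a plain embedding lookup stands in for embedding layer 110 (the recurrent layer described above is omitted for brevity), a logistic scoring unit plays the role of first processing layer 120 and produces a per-word stem-word probability, and a threshold rule plays the role of second processing layer 130. The vocabulary, dimensions, and random initialization are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative policy parameters (embedding layer 110 + first processing layer 120).
    vocab = {"um": 0, "i": 1, "want": 2, "to": 3, "check": 4, "my": 5,
             "account": 6, "balance": 7, "please": 8}
    dim = 8
    embedding = rng.normal(size=(len(vocab), dim))   # word embedding table
    w, b = rng.normal(size=dim), 0.0                 # scoring weights of layer 120

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def extract_stem_words(words, threshold=0.5):
        """Per-word keep probability, then threshold selection (layer 130)."""
        E = np.stack([embedding[vocab[t]] for t in words])   # {E1 ... En}
        P = sigmoid(E @ w + b)                               # {P1 ... Pn}
        kept = [t for t, p in zip(words, P) if p > threshold]
        return kept, P

    sentence = "um i want to check my account balance please".split()
    stem_words, probs = extract_stem_words(sentence)
    print(stem_words, np.round(probs, 2))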
With the above policy network, stem words can be extracted from the sample sentence. For example, out of the n words of the first sample sentence, the policy network selects, under the current policy, m words (m <= n) as stem words, denoted {w1, w2, ..., wm}. The stem word set is thus obtained.
On the basis of the stem word set, a loss function, hereinafter referred to as the first loss function, can be used to measure the loss of the stem word extraction process, hereinafter referred to as the first loss and denoted LK (Loss_Keyword). That is, in step 32, on the basis of obtaining the first stem word set, the current first loss is determined according to the number of words in the first sample sentence and the number of words in the first stem word set.
In one embodiment, the first loss function is set such that the fewer the extracted stem words, the lower the loss value, and the more the stem words, the higher the loss value. In one embodiment, the first loss can also be determined according to the proportion of extracted stem words relative to the sample sentence: the higher the proportion, the larger the loss value, and the lower the proportion, the smaller the loss value. Both reflect the consideration that, in the ideal state after training, the policy network 100 should exclude as many useless words as possible from the original sentence and retain as few words as possible as stem words.
For example, in one example, the first loss function can be set as:
LK = Num_Reserve / Num_Total,
where Num_Reserve is the number of words retained as stem words, i.e., the number of words in the stem word set, and Num_Total is the number of words in the sample sentence.
In the above example, assuming that the first sample sentence contains n words and the policy network selects m of them under the current policy, the current first loss is LK = m/n.
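A direct reading of this first loss in code, with the same notation, might look as follows; the 10-word sentence and the 4 retained words are invented numbers used only for illustration.

    def first_loss(num_reserve, num_total):
        """LK = Num_Reserve / Num_Total: fewer retained stem words -> lower loss."""
        return num_reserve / num_total

    # Example: n = 10 words in the sample sentence, m = 4 words kept as stem words.
    print(first_loss(4, 10))   # LK = m/n = 0.4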
Next, in step 33, the classification network is used to classify the first candidate sentence formed by the first stem word set, obtaining a first classification result of the first candidate sentence.
It can be understood that the preliminary training in step 31 determines preliminary classification parameters of the classification network, and such a classification network can be used to classify sentences. In addition, in step 32, the policy network 100 outputs the first stem word set extracted for the first sample sentence, and this first stem word set can correspond to a candidate sentence, namely the first candidate sentence. The first candidate sentence can be understood as the sentence obtained from the first sample sentence by removing stop words and meaningless words and retaining only the stem words. Accordingly, in step 33, the classification network can classify the first candidate sentence to obtain a classification result.
In one embodiment, the classification network may include multiple network layers, through which sentence classification is implemented.
Fig. 5 shows a schematic structural diagram of a classification network according to an embodiment. As shown in Fig. 5, the classification network 200 may include an embedding layer 210 and a fully connected processing layer 220.
The embedding layer 210 receives the stem word set output by the policy network 100, computes a word embedding vector for each word, and then computes a sentence embedding vector for the candidate sentence formed by the stem word set. For example, for the first stem word set {w1, w2, ..., wm}, the word embedding vectors {e1, e2, ..., em} of the respective words can be computed, and then the sentence embedding vector Es of the first candidate sentence is obtained based on these word embedding vectors. In different embodiments, the sentence embedding vector may be obtained by concatenating or averaging the word embedding vectors, among other operations.
Then, the fully connected processing layer 220 determines the classification result of the first candidate sentence, i.e., the first classification result, according to the sentence embedding vector Es.
The whole of the network parameters in the embedding layer 210 and the fully connected processing layer 220 constitutes the classification parameters.
Similar to the policy network 100, the classification network 200 can be implemented with a recurrent neural network (RNN). More specifically, the embedding layer 210 can be implemented by an RNN. In other embodiments, the classification network 200 may also adopt a different neural network architecture, such as an LSTM network, a GRU network, or a deep neural network (DNN), and so on.
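A matching sketch of the Fig. 5 structure is shown below, again in PyTorch, with the sentence embedding Es obtained by averaging the stem-word embeddings (one of the options mentioned above); the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClassificationNetwork(nn.Module):
    """Sketch of Fig. 5: an embedding layer 210 that turns the stem word set into a
    sentence embedding Es (here by averaging), and a fully connected layer 220 that
    maps Es to class logits."""

    def __init__(self, vocab_size: int, num_classes: int, emb_dim: int = 64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_classes)

    def forward(self, stem_word_ids: torch.Tensor):
        # stem_word_ids: (batch, m) ids of the retained stem words
        emb = self.embedding(stem_word_ids)       # (batch, m, emb_dim)
        sentence_emb = emb.mean(dim=1)            # sentence embedding Es by averaging
        return self.classifier(sentence_emb)      # class logits of the candidate sentence

# Example: classify a 3-word candidate sentence into one of 4 classes.
clf = ClassificationNetwork(vocab_size=1000, num_classes=4)
logits = clf(torch.tensor([[4, 250, 8]]))
```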
After the candidate sentence is classified, another loss function, hereinafter referred to as the second loss function, can be used to measure the loss of the classification process, hereinafter referred to as the second loss and denoted LC (Loss_Classify). That is, in step 33, after the first classification result is obtained, the current second loss is determined according to the first classification result and the classification label of the first sample sentence.
In one embodiment, the second loss function is set to determine the second loss LC based on a cross-entropy algorithm. In other embodiments, the second loss LC may also be determined by loss functions of other forms and other algorithms, based on the difference between the classification result and the classification label. Accordingly, by comparing the first classification result obtained in this classification with the classification label corresponding to the first sample sentence, the classification loss of this classification, i.e., the current second loss, can be determined through the second loss function.
On the basis of the determined first loss and second loss, in step 34, the current total loss is determined according to the current first loss and the current second loss.
The total loss can be understood as the loss of the whole reinforcement learning system, including the loss of the policy network's stem word extraction process and the loss of the classification network's classification process. In one embodiment, the total loss is defined as the sum of the first loss and the second loss. In another embodiment, the first loss and the second loss may each be given a weight, and the total loss is defined as the weighted sum of the first loss and the second loss.
According to the way the total loss is defined, the current total loss can be determined based on the current first loss corresponding to this stem word extraction and the current second loss corresponding to this classification.
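Putting the two losses together, the snippet below shows a cross-entropy form of LC and a weighted total loss; the probability values and the weights are illustrative assumptions.

```python
import math

def second_loss(pred_probs, true_class: int) -> float:
    """LC: cross-entropy between the predicted class distribution and the label."""
    return -math.log(pred_probs[true_class] + 1e-12)

def total_loss(lk: float, lc: float, w_keep: float = 1.0, w_classify: float = 1.0) -> float:
    """Total loss as a (weighted) sum of the first loss LK and the second loss LC."""
    return w_keep * lk + w_classify * lc

lk = 0.4                                           # first loss LK from the earlier example (m = 4, n = 10)
lc = second_loss([0.1, 0.7, 0.2], true_class=1)    # candidate sentence classified mostly as class 1
print(total_loss(lk, lc))
```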
Based on such a total loss, the reinforcement learning system can be trained with the objective of making the total loss as small as possible. From the above definitions of the first loss, the second loss, and the total loss, it can be understood that a total loss that is as small as possible means that the policy network 100 excludes as many useless words as possible and extracts as few stem words as possible without changing the meaning of the sentence, so that the sentence classification result of the classification network 200 stays as close as possible to the classification label of the original sentence.
To reduce the total loss, in step 35 the reinforcement learning system is updated in the direction in which the total loss decreases. Updating the reinforcement learning system includes at least updating the policy network 100, and may also include updating the classification network 200.
The way the direction of total loss reduction is determined and the way the reinforcement learning system is updated may differ under different training schemes and in different training stages; they are described separately below.
According to one training scheme, to determine the direction in which the total loss decreases, multiple sample sentences are processed in the policy network 100 with different policies, yielding multiple corresponding stem sentences and multiple corresponding first losses; the classification network 200 then classifies each stem sentence, yielding multiple corresponding classification results and multiple corresponding second losses. Multiple total losses for processing the multiple sample sentences are thus obtained. The current loss is compared with these multiple total losses, and the gradient, relative to the current network parameters, of the network parameters corresponding to those total losses that are smaller than the current loss is determined as the direction in which the total loss decreases.
According to another training scheme, to determine the direction in which the total loss decreases, the same sample sentence is processed multiple times to obtain multiple total losses, and the direction of total loss reduction is determined based on these multiple total losses. Fig. 6 shows a flowchart of the steps of determining the direction of total loss reduction under this training scheme.
To explore more and better policies, in the policy network 100, a certain amount of randomness can be added on top of the current policy to produce N policies, corresponding to N groups of policy parameters. With reference to the network structure shown in Fig. 4, random perturbations can be added to the embedding algorithm of the embedding layer to obtain a new policy; the algorithm for determining stem word probabilities in the first processing layer can be varied to obtain a new policy; and the rule for selecting by probability, for example the probability threshold, can also be varied to obtain a new policy. By combining the above variations, N policies corresponding to N groups of policy parameters can be obtained.
Accordingly, in step 61, the policy networks under the above N groups of policy parameters are used to process the first sample sentence respectively, obtaining N corresponding stem word sets. In addition, N first losses can be determined respectively according to the first loss function described above.
Then, in step 62, the classification network 200 is used to classify the N candidate sentences corresponding to the N stem word sets respectively, obtaining N classification results. Furthermore, N second losses corresponding to the N classification results are determined respectively according to the aforementioned second loss function.
In step 63, N corresponding total losses, denoted L1, L2, ..., Ln, are determined according to the N first losses and the N second losses. In addition, the mean La of the N total losses can be determined.
In step 64, at least one first total loss whose loss value is less than or equal to the mean and at least one second total loss whose loss value is greater than the mean are determined. In other words, the N total losses are divided into total losses less than or equal to the mean La, called first total losses, and total losses greater than the mean La, called second total losses.
In step 65, the direction in which the total loss decreases is determined based on the first total losses and the second total losses. More specifically, the first total losses, being smaller, can correspond to the direction of positive learning, while the second total losses, being larger, can correspond to the direction of negative learning. Therefore, in step 65, the overall learning direction, i.e., the direction in which the total loss decreases, can be obtained by combining the direction of positive learning with the reverse of the direction of negative learning.
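The exploration-and-split procedure of steps 61 to 64 can be sketched as follows. Here evaluate_total_loss is an assumed callable that runs extraction, classification, and both loss computations for one perturbed policy parameter group, the parameters are represented as a flat dict of floats, and the Gaussian perturbation is only one possible way of injecting randomness.

```python
import random

def perturb_and_split(current_params: dict, evaluate_total_loss, n: int = 10, noise: float = 0.05):
    """Steps 61-64 (sketch): derive N perturbed policy parameter groups from the current
    policy, evaluate the total loss of each, and split the groups around the mean loss
    into 'first' groups (<= mean, positive samples) and 'second' groups (> mean, negative samples)."""
    groups, losses = [], []
    for _ in range(n):
        group = {k: v + random.gauss(0.0, noise) for k, v in current_params.items()}
        groups.append(group)
        losses.append(evaluate_total_loss(group))     # steps 61-63 for this parameter group
    mean_loss = sum(losses) / n
    first = [g for g, l in zip(groups, losses) if l <= mean_loss]
    second = [g for g, l in zip(groups, losses) if l > mean_loss]
    return first, second
```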
For the above training scheme, the specific execution may also differ in different training stages.
As mentioned above, in the first stage of training the whole reinforcement learning system, the classification network is trained alone, as shown in step 31. To speed up model convergence, in one embodiment, in the subsequent second stage, the classification network is fixed and only the policy network is trained and updated; then, in a third stage, the policy network and the classification network are trained and updated simultaneously. The execution of the flow of Fig. 6 in the second stage and in the third stage is described below.
Specifically, in the second stage, the classification network is fixed, that is, the classification parameters in the classification network remain unchanged and are not adjusted. Accordingly, in step 62 of Fig. 6, the classification network under the same group of classification parameters is used to classify the aforementioned N candidate sentences, i.e., classification is performed in the same manner, yielding the N classification results.
Since the classification parameters remain unchanged, the N total losses determined in step 63 in this case actually correspond to the N policies of the policy network, and hence to the N groups of policy parameters. That is, the i-th total loss Li corresponds to the i-th group of policy parameters PSi.
Then, in step 64, on the basis of the determined first total losses and second total losses, the first policy parameters corresponding to the first total losses and the second policy parameters corresponding to the second total losses are determined.
In other words, if a total loss Li is less than or equal to the mean La, it is classified as a first total loss, and the corresponding policy parameter group PSi is classified as first policy parameters; if a total loss Li is greater than the mean La, it is classified as a second total loss, and the corresponding policy parameter group PSi is classified as second policy parameters.
Next, in step 65, the direction in which the total loss decreases is determined as follows:
the accumulation of the gradients of at least one group of first policy parameters relative to the current policy parameters is determined as the positive direction; the accumulation of the gradients of at least one group of second policy parameters relative to the current policy parameters is determined as the negative direction; and the positive direction is superimposed with the reverse of the negative direction to give the direction in which the total loss decreases.
This is because the first policy parameters correspond to total losses less than or equal to the mean, that is, total losses with smaller values; the policy choices corresponding to the first policy parameters can therefore be regarded as correct and as "positive samples" for the system to learn from in the forward direction. The second policy parameters correspond to total losses greater than the mean, that is, total losses with larger values; the policy choices corresponding to the second policy parameters can therefore be regarded as wrong and as "negative samples" for the system to learn from in the reverse direction.
In general, there may be multiple first total losses whose values are less than or equal to the mean, and accordingly there may be multiple groups of first policy parameters. These groups of first policy parameters may have different effects on extracting stem words at different positions of the sample sentence. Therefore, in one embodiment, forward learning is performed on all of these groups of first policy parameters: the gradient of each group of first policy parameters relative to the current policy parameters is determined and accumulated to obtain the above positive direction.
Similarly, there may also be multiple groups of second policy parameters. In one embodiment, negative learning is performed on all of these groups of second policy parameters: the gradient of each group of second policy parameters relative to the current policy parameters is determined and accumulated to obtain the above negative direction.
Finally, the negative direction is reversed and superimposed with the positive direction to give the direction in which the total loss decreases.
The above direction of total loss reduction can be expressed as:
    Σ_i ∇_θ(PSi) − Σ_j ∇_θ(PSj)
where PSi denotes the first policy parameters, PSj denotes the second policy parameters, and θ denotes the current policy parameters.
In a specific example, assume N = 10, where L1 to L6 are smaller than the loss mean and are therefore first total losses, and the corresponding policy parameter groups PS1 to PS6 are first policy parameters; assume L7 to L10 are greater than the loss mean and are therefore second total losses, and the corresponding policy parameter groups PS7 to PS10 are second policy parameters.
In one embodiment, the gradients of the six policy parameter groups PS1 to PS6 relative to the current policy parameters are computed respectively and accumulated to obtain the above positive direction; the gradients of the four policy parameter groups PS7 to PS10 relative to the current policy parameters are computed respectively and accumulated to obtain the above negative direction, from which the direction of total loss reduction is then obtained.
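Under this second-stage scheme, one way to realize the accumulated "gradients relative to the current policy parameters" is to use the offsets of each perturbed parameter group from the current parameters, as in the sketch below; this offset interpretation and the learning-rate-style step size are assumptions of the example rather than requirements of the method.

```python
def loss_decreasing_direction(first_groups, second_groups, current_params: dict) -> dict:
    """Phase-2 direction (sketch): accumulate the offsets of below-average groups (forward
    learning) and subtract the offsets of above-average groups (reverse learning)."""
    direction = {k: 0.0 for k in current_params}
    for group in first_groups:                        # positive samples
        for k in direction:
            direction[k] += group[k] - current_params[k]
    for group in second_groups:                       # negative samples, taken in reverse
        for k in direction:
            direction[k] -= group[k] - current_params[k]
    return direction

def update_policy(current_params: dict, direction: dict, step: float = 0.1) -> dict:
    """Step 35 in the second stage: move the current policy parameters along the direction."""
    return {k: v + step * direction[k] for k, v in current_params.items()}
```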
Thus, in one embodiment of the second stage of system training, the direction of total loss reduction is determined in the above manner. Then, in step 35 of Fig. 3, the current policy parameter group in the policy network 100 is updated in the direction of total loss reduction.
By repeatedly performing the above process, with the classification manner of the classification network 200 unchanged, more stem word extraction policies are explored, and the policy parameters in the policy network 100 are continually updated and optimized, so that the policy network 100 is trained in a targeted way.
After the training of the policy network reaches a certain training objective, the training of the reinforcement learning system can enter the third stage, in which the policy network 100 and the classification network 200 are trained and updated simultaneously. The execution of Fig. 6 in the third stage is described below.
In the third stage, in step 61, the policy networks under N groups of different policy parameters are still used to process the first sample sentence, obtaining N corresponding stem word sets, which can correspond to N candidate sentences.
The difference is that in the third stage the classification network is not fixed, that is, the classification parameters in the classification network may also be adjusted. Accordingly, in step 62, classification networks under M groups of different classification parameters are used to classify the N candidate sentences obtained in step 61, yielding N classification results corresponding to the N candidate sentences, where M <= N.
When M = N, this amounts to classifying the N candidate sentences with M = N different classification manners (corresponding to N groups of classification parameters); when M < N, this amounts to the classification parameters used to classify the N candidate sentences not being entirely the same.
Then, in step 63, N corresponding total losses are determined according to the N first losses and the N second losses.
It should be understood that, in the process of obtaining the N classification results above, the network parameters of both the policy network and the classification network have changed. In this case, the N total losses correspond to N parameter sets, where the i-th parameter set Si includes the i-th group of policy parameters PSi and the classification parameters CSi of the classification network used when processing the i-th candidate sentence. In other words, a parameter set is the overall set of network parameters of the policy network 100 and the classification network 200.
In addition, similarly to the foregoing, the mean La of the N total losses can be determined. Then, in step 64, the N total losses are divided into first total losses less than or equal to the mean La and second total losses greater than the mean La.
Moreover, on the basis of the determined first total losses and second total losses, the first parameter sets corresponding to the first total losses and the second parameter sets corresponding to the second total losses can be determined accordingly.
In other words, if a total loss Li is less than or equal to the mean La, it is classified as a first total loss and the corresponding parameter set Si as a first parameter set; if a total loss Li is greater than the mean La, it is classified as a second total loss and the corresponding parameter set Si as a second parameter set.
Next, in step 65, the direction in which the total loss decreases is determined as follows:
the accumulation of the gradients of at least one first parameter set relative to the current policy parameters is determined as a first positive direction; the accumulation of the gradients of at least one second parameter set relative to the current policy parameters is determined as a first negative direction; and the first positive direction is superimposed with the reverse of the first negative direction to give a first adjustment direction, i.e., the optimization direction of the policy parameters;
the accumulation of the gradients of at least one first parameter set relative to the current classification parameters is determined as a second positive direction; the accumulation of the gradients of at least one second parameter set relative to the current classification parameters is determined as a second negative direction; and the second positive direction is superimposed with the reverse of the second negative direction to give a second adjustment direction, i.e., the optimization direction of the classification parameters.
The idea behind determining the direction of total loss reduction, i.e., the parameter adjustment direction, is the same as in the second stage: the parameter sets corresponding to total losses with smaller values, i.e., the first parameter sets, serve as "positive samples" for the system to learn from in the forward direction, while the parameter sets corresponding to total losses with larger values, i.e., the second parameter sets, serve as "negative samples" for the system to learn from in the reverse direction. During learning, the adjustment and optimization directions of the policy parameters and the classification parameters are determined separately for the policy network and the classification network.
Specifically, for the policy parameters of the policy network, the adjustment direction is determined similarly to the second stage, except that when computing gradients, the gradient of the whole parameter set relative to the current policy parameters is computed. In general, the policy parameters and the classification parameters in a parameter set are two mutually independent sets of parameters; therefore, in the actual gradient computation, the first positive direction and the first negative direction are still obtained by computing the gradient of the policy parameter portion of each parameter set relative to the current policy parameters, from which the first adjustment direction, i.e., the policy parameter optimization direction, is determined.
The above first adjustment direction can be expressed as:
    Σ_i ∇_θ(Si) − Σ_j ∇_θ(Sj)
where Si denotes a first parameter set, Sj denotes a second parameter set, and θ denotes the current policy parameters.
For the classification parameters in the classification network, the adjustment direction is determined similarly to the policy parameters. Specifically, the accumulation of the gradients of the first parameter sets relative to the current classification parameters is computed as the second positive direction; the accumulation of the gradients of the second parameter sets relative to the current classification parameters is computed as the second negative direction; and the second positive direction is superimposed with the reverse of the second negative direction as the classification optimization direction. As mentioned above, since the policy parameters and the classification parameters are usually mutually independent, in the actual gradient computation the second positive direction and the second negative direction can be obtained by computing the gradient of the classification parameter portion of each parameter set relative to the current classification parameters, from which the second adjustment direction, serving as the classification parameter optimization direction, is determined.
The above second adjustment direction can be expressed as:
    Σ_i ∇_σ(Si) − Σ_j ∇_σ(Sj)
where Si denotes a first parameter set, Sj denotes a second parameter set, and σ denotes the current classification parameters.
The sum of the first adjustment direction and the second adjustment direction can then be taken as the direction of total loss reduction, i.e., the adjustment direction of the whole system.
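In the third stage the same idea is applied twice, once to the policy part and once to the classification part of each parameter set. The sketch below reuses loss_decreasing_direction from the earlier example and assumes, for illustration only, that each parameter set Si is stored as a dict with a "policy" part and a "classify" part.

```python
def joint_direction(first_sets, second_sets, current_policy: dict, current_classify: dict):
    """Phase-3 direction (sketch): the policy parts of the parameter sets yield the first
    adjustment direction relative to the current policy parameters; the classification
    parts yield the second adjustment direction relative to the current classification
    parameters. Both networks are then updated along their respective directions."""
    d_policy = loss_decreasing_direction([s["policy"] for s in first_sets],
                                         [s["policy"] for s in second_sets],
                                         current_policy)
    d_classify = loss_decreasing_direction([s["classify"] for s in first_sets],
                                           [s["classify"] for s in second_sets],
                                           current_classify)
    return d_policy, d_classify
```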
Thus, in one embodiment of the third stage of system training, the direction of total loss reduction is determined in the above manner. Then, in step 35 of Fig. 3, updating the reinforcement learning system in the direction of total loss reduction includes updating the current policy parameters in the policy network 100 according to the first adjustment direction and updating the current classification parameters in the classification network according to the second adjustment direction. In this way, the policy network and the classification network are trained simultaneously in the third stage.
It can be understood that, although the above embodiments describe a training process in which the classification network is trained alone in the first stage, the classification network is then fixed and only the policy network is trained in the second stage, and the policy network and the classification network are trained simultaneously in the third stage, in other embodiments the second stage may also be skipped after the first stage, proceeding directly to the third stage in which the policy network and the classification network are trained simultaneously.
By continually training the policy network and the classification network, better stem word extraction policies and classification algorithms can be explored and determined, and the whole reinforcement learning system is continually optimized so that the total loss of the system keeps decreasing and the training objective is achieved. Once the training objective is achieved, the policy network can accurately extract as few stem words as possible, making the sentence expression more concise without affecting the meaning of the sentence, that is, without affecting the semantic classification result of the sentence.
Once the training objective is achieved, the trained policy network can be used for stem word extraction. In this case, a sentence to be analyzed can be input to the policy network, which processes the sentence using the trained policy parameters. The stem words in the sentence can then be determined from the output of the policy network. The set of these stem words can correspond to a stem sentence used for subsequent further text analysis such as intent recognition and semantic matching, improving the effect of the subsequent text analysis.
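At inference time the trained policy network alone is sufficient. The sketch below assumes the PolicyNetwork from the earlier example, a list of the segmented words of the sentence, and the corresponding tensor of word ids; all of these are assumptions made for illustration.

```python
def extract_stem_sentence(policy_net, word_ids, words):
    """Keep only the words whose stem-word probability exceeds the trained threshold."""
    probs, keep_mask = policy_net(word_ids)                      # (1, n) probabilities and keep flags
    stem_words = [w for w, keep in zip(words, keep_mask[0].tolist()) if keep]
    return " ".join(stem_words)                                  # stem sentence for intent recognition or matching
```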
In summary, learning and training for stem word extraction are performed by means of reinforcement learning. In the reinforcement learning system, the policy network acts as the Actor and is used for stem word extraction; the classification network acts as the Critic and is used to classify sentences. An existing sentence sample library can be used as training corpus to train the classification network, thereby avoiding the labor cost of annotating stem words. The preliminarily trained classification network can classify the sentences formed from the stem words extracted by the policy network, thereby evaluating the effect of stem word extraction. By setting losses on the outputs of both the policy network and the classification network and repeatedly training the two networks according to the total loss, an ideal reinforcement learning system can be obtained. In this way, an ideal network system can be trained without manual annotation of stem words, achieving effective extraction of stem words.
According to an embodiment of another aspect, an apparatus for extracting stem words through reinforcement learning is also provided. The apparatus can be deployed on any device or platform with computing and processing capabilities. Fig. 7 shows a schematic diagram of the apparatus according to an embodiment. As shown in Fig. 7, the apparatus 700 includes:
a classification network training unit 71, configured to train a classification network for sentence classification using a sentence sample set;
a first determination unit 72, configured to use the policy network under the current policy parameters to perform stem word extraction on a first sample sentence in the sentence sample set to obtain a first stem word set, and to determine a current first loss according to the number of words in the first sample sentence and the number of words in the first stem word set;
a second determination unit 73, configured to use the classification network to classify a first candidate sentence formed by the first stem word set to obtain a first classification result of the first candidate sentence, and to determine a current second loss according to the first classification result and the classification label of the first sample sentence;
a total loss determination unit 74, configured to determine a current total loss according to the current first loss and the current second loss;
an updating unit 75, configured to update at least the policy network in the direction in which the total loss decreases, so that it can be used to extract stem words from sentences to be analyzed.
In one embodiment, the policy network includes a first embedding layer, a first processing layer, and a second processing layer, and the first determination unit 72 is specifically configured to:
obtain, at the first embedding layer, a word embedding vector for each word in the first sample sentence;
determine, at the first processing layer, the probability of each word being a stem word according to the word embedding vectors;
select, at the second processing layer, at least some of the words according at least to the probabilities, to form the first stem word set.
Further, in one embodiment, at the second processing layer, words whose probability values are greater than a preset threshold are selected from the words to form the first stem word set.
In one embodiment, the classification network includes a second embedding layer and a third processing layer, and the second determination unit 73 is specifically configured to:
obtain, at the second embedding layer, a sentence embedding vector corresponding to the first candidate sentence;
determine, at the third processing layer, the first classification result of the first candidate sentence according to the sentence embedding vector.
According to one implementation, the policy network and/or the classification network are based on a recurrent neural network (RNN).
In one implementation, the first determination unit 72 is further configured to process the first sample sentence using the policy networks under N groups of policy parameters respectively, obtaining N corresponding stem word sets, and to determine N first losses respectively;
the second determination unit 73 is further configured to use the classification network to classify the N candidate sentences respectively corresponding to the N stem word sets, obtaining N classification results, and to determine N second losses respectively;
the total loss determination unit 74 is further configured to determine N corresponding total losses and the mean of the N total losses according to the N first losses and the N second losses;
and to determine at least one first total loss whose loss value is less than or equal to the mean, and at least one second total loss whose loss value is greater than the mean.
In addition, the updating unit 75 includes a direction determination module 751 and an updating module 752. The direction determination module 751 is configured to determine the direction in which the total loss decreases based on the at least one first total loss and the at least one second total loss; the updating module 752 is configured to perform the network update according to the direction determined by the direction determination module 751.
More specifically, in one embodiment, the second determination unit 73 is configured to classify the N candidate sentences respectively using the classification network under the same group of classification parameters, obtaining the N classification results; in this case, the N total losses correspond to the N groups of policy parameters;
accordingly, the direction determination module 751 is configured to:
determine, as the positive direction, the accumulation of the gradients of at least one group of first policy parameters corresponding to the at least one first total loss relative to the current policy parameters;
determine, as the negative direction, the accumulation of the gradients of at least one group of second policy parameters corresponding to the at least one second total loss relative to the current policy parameters;
superimpose the positive direction with the reverse of the negative direction as the direction in which the total loss decreases.
Correspondingly, in one embodiment, the updating module 752 is configured to update the current policy parameters in the policy network in the direction in which the total loss decreases.
In another embodiment, the second determination unit 73 is configured to classify the N candidate sentences using the classification networks under M groups of classification parameters, obtaining N classification results corresponding to the N candidate sentences, where M <= N; in this case, the N total losses correspond to N parameter sets, where the i-th parameter set includes the i-th group of policy parameters and the classification parameters used by the classification network when processing the i-th candidate sentence;
in this case, the direction determination module 751 is configured to:
determine, as a first positive direction, the accumulation of the gradients of at least one first parameter set corresponding to the at least one first total loss relative to the current policy parameters;
determine, as a first negative direction, the accumulation of the gradients of at least one second parameter set corresponding to the at least one second total loss relative to the current policy parameters;
superimpose the first positive direction with the reverse of the first negative direction as a first adjustment direction;
determine, as a second positive direction, the accumulation of the gradients of at least one first parameter set corresponding to the at least one first total loss relative to the current classification parameters;
determine, as a second negative direction, the accumulation of the gradients of at least one second parameter set corresponding to the at least one second total loss relative to the current classification parameters;
superimpose the second positive direction with the reverse of the second negative direction as a second adjustment direction;
take the sum of the first adjustment direction and the second adjustment direction as the direction in which the total loss decreases.
Correspondingly, in one embodiment, the updating module 752 is configured to:
update the current policy parameters of the policy network in the first adjustment direction;
update the current classification parameters of the classification network in the second adjustment direction.
According to one implementation, the apparatus 700 further includes a prediction unit (not shown), configured to:
input a second sentence to be analyzed into the policy network;
determine the stem words in the second sentence according to the output of the policy network.
With the above apparatus, stem word extraction is achieved using a deep reinforcement learning system.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described with reference to Fig. 3 and Fig. 6.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor, where executable code is stored in the memory, and when the processor executes the executable code, the method described with reference to Fig. 3 and Fig. 6 is implemented.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium, or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further illustrate the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, and the like made on the basis of the technical solutions of the present invention shall be included within the protection scope of the present invention.

Claims (24)

  1. A method for extracting stem words through reinforcement learning, comprising:
    training a classification network for sentence classification using a sentence sample set;
    performing stem word extraction on a first sample sentence in the sentence sample set using a policy network under current policy parameters to obtain a first stem word set, and determining a current first loss according to the number of words in the first sample sentence and the number of words in the first stem word set;
    classifying a first candidate sentence formed by the first stem word set using the classification network to obtain a first classification result of the first candidate sentence, and determining a current second loss according to the first classification result and a classification label of the first sample sentence;
    determining a current total loss according to the current first loss and the current second loss;
    updating at least the policy network in a direction in which the total loss decreases, for extracting stem words from a sentence to be analyzed.
  2. The method according to claim 1, wherein the policy network comprises a first embedding layer, a first processing layer, and a second processing layer, and performing stem word extraction on the first sample sentence in the sentence sample set using the policy network comprises:
    obtaining, at the first embedding layer, a word embedding vector for each word in the first sample sentence;
    determining, at the first processing layer, a probability of each word being a stem word according to the word embedding vectors;
    selecting, at the second processing layer, at least some of the words according at least to the probabilities, to form the first stem word set.
  3. The method according to claim 2, wherein, at the second processing layer, words whose probability values are greater than a preset threshold are selected from the words to form the first stem word set.
  4. The method according to claim 1, wherein the classification network comprises a second embedding layer and a third processing layer, and classifying the first candidate sentence formed by the first stem word set using the classification network comprises:
    obtaining, at the second embedding layer, a sentence embedding vector corresponding to the first candidate sentence;
    determining, at the third processing layer, the first classification result of the first candidate sentence according to the sentence embedding vector.
  5. The method according to claim 1, wherein the policy network and/or the classification network are based on a recurrent neural network (RNN).
  6. The method according to claim 1, further comprising:
    processing the first sample sentence using the policy networks under N groups of policy parameters respectively, obtaining N corresponding stem word sets, and determining N first losses respectively;
    classifying, using the classification network, N candidate sentences respectively corresponding to the N stem word sets, obtaining N classification results, and determining N second losses respectively;
    determining N corresponding total losses and a mean of the N total losses according to the N first losses and the N second losses;
    determining at least one first total loss whose loss value is less than or equal to the mean, and at least one second total loss whose loss value is greater than the mean;
    determining, based on the at least one first total loss and the at least one second total loss, the direction in which the total loss decreases.
  7. The method according to claim 6, wherein classifying, using the classification network, the N candidate sentences respectively corresponding to the N stem word sets to obtain N classification results comprises: classifying the N candidate sentences respectively using the classification network under a same group of classification parameters, obtaining the N classification results;
    wherein the N total losses correspond to the N groups of policy parameters;
    and determining, based on the at least one first total loss and the at least one second total loss, the direction in which the total loss decreases comprises:
    determining, as a positive direction, an accumulation of gradients of at least one group of first policy parameters corresponding to the at least one first total loss relative to the current policy parameters;
    determining, as a negative direction, an accumulation of gradients of at least one group of second policy parameters corresponding to the at least one second total loss relative to the current policy parameters;
    superimposing the positive direction with a reverse of the negative direction as the direction in which the total loss decreases.
  8. The method according to claim 7, wherein updating at least the policy network in the direction in which the total loss decreases comprises:
    updating the current policy parameters in the policy network in the direction in which the total loss decreases.
  9. The method according to claim 6, wherein classifying, using the classification network, the N candidate sentences respectively corresponding to the N stem word sets to obtain N classification results comprises: classifying the N candidate sentences using the classification networks under M groups of classification parameters, obtaining N classification results corresponding to the N candidate sentences, where M <= N;
    wherein the N total losses correspond to N parameter sets, where an i-th parameter set comprises an i-th group of policy parameters and the classification parameters used by the classification network when processing an i-th candidate sentence;
    and determining the direction in which the total loss decreases comprises:
    determining, as a first positive direction, an accumulation of gradients of at least one first parameter set corresponding to the at least one first total loss relative to the current policy parameters;
    determining, as a first negative direction, an accumulation of gradients of at least one second parameter set corresponding to the at least one second total loss relative to the current policy parameters;
    superimposing the first positive direction with a reverse of the first negative direction as a first adjustment direction;
    determining, as a second positive direction, an accumulation of gradients of at least one first parameter set corresponding to the at least one first total loss relative to current classification parameters;
    determining, as a second negative direction, an accumulation of gradients of at least one second parameter set corresponding to the at least one second total loss relative to the current classification parameters;
    superimposing the second positive direction with a reverse of the second negative direction as a second adjustment direction;
    taking a sum of the first adjustment direction and the second adjustment direction as the direction in which the total loss decreases.
  10. The method according to claim 9, wherein updating at least the policy network in the direction in which the total loss decreases comprises:
    updating the current policy parameters of the policy network in the first adjustment direction;
    updating the current classification parameters of the classification network in the second adjustment direction.
  11. The method according to claim 1, further comprising:
    inputting a second sentence to be analyzed into the policy network;
    determining stem words in the second sentence according to an output of the policy network.
  12. An apparatus for extracting stem words through reinforcement learning, comprising:
    a classification network training unit, configured to train a classification network for sentence classification using a sentence sample set;
    a first determination unit, configured to perform stem word extraction on a first sample sentence in the sentence sample set using a policy network under current policy parameters to obtain a first stem word set, and to determine a current first loss according to the number of words in the first sample sentence and the number of words in the first stem word set;
    a second determination unit, configured to classify a first candidate sentence formed by the first stem word set using the classification network to obtain a first classification result of the first candidate sentence, and to determine a current second loss according to the first classification result and a classification label of the first sample sentence;
    a total loss determination unit, configured to determine a current total loss according to the current first loss and the current second loss;
    an updating unit, configured to update at least the policy network in a direction in which the total loss decreases, for extracting stem words from a sentence to be analyzed.
  13. The apparatus according to claim 12, wherein the policy network comprises a first embedding layer, a first processing layer, and a second processing layer, and the first determination unit is configured to perform stem word extraction on the first sample sentence in the sentence sample set using the policy network, specifically comprising:
    obtaining, at the first embedding layer, a word embedding vector for each word in the first sample sentence;
    determining, at the first processing layer, a probability of each word being a stem word according to the word embedding vectors;
    selecting, at the second processing layer, at least some of the words according at least to the probabilities, to form the first stem word set.
  14. The apparatus according to claim 13, wherein, at the second processing layer, words whose probability values are greater than a preset threshold are selected from the words to form the first stem word set.
  15. The apparatus according to claim 12, wherein the classification network comprises a second embedding layer and a third processing layer, and the second determination unit is configured to classify the first candidate sentence formed by the first stem word set using the classification network, specifically comprising:
    obtaining, at the second embedding layer, a sentence embedding vector corresponding to the first candidate sentence;
    determining, at the third processing layer, the first classification result of the first candidate sentence according to the sentence embedding vector.
  16. The apparatus according to claim 12, wherein the policy network and/or the classification network are based on a recurrent neural network (RNN).
  17. The apparatus according to claim 12, wherein:
    the first determination unit is further configured to process the first sample sentence using the policy networks under N groups of policy parameters respectively, obtaining N corresponding stem word sets, and to determine N first losses respectively;
    the second determination unit is further configured to classify, using the classification network, N candidate sentences respectively corresponding to the N stem word sets, obtaining N classification results, and to determine N second losses respectively;
    the total loss determination unit is further configured to determine N corresponding total losses and a mean of the N total losses according to the N first losses and the N second losses, and to determine at least one first total loss whose loss value is less than or equal to the mean and at least one second total loss whose loss value is greater than the mean;
    the updating unit comprises:
    a direction determination module, configured to determine, based on the at least one first total loss and the at least one second total loss, the direction in which the total loss decreases;
    an updating module, configured to perform a network update according to the direction in which the total loss decreases.
  18. The apparatus according to claim 17, wherein the second determination unit is configured to classify the N candidate sentences respectively using the classification network under a same group of classification parameters, obtaining the N classification results;
    wherein the N total losses correspond to the N groups of policy parameters;
    and the direction determination module is configured to:
    determine, as a positive direction, an accumulation of gradients of at least one group of first policy parameters corresponding to the at least one first total loss relative to the current policy parameters;
    determine, as a negative direction, an accumulation of gradients of at least one group of second policy parameters corresponding to the at least one second total loss relative to the current policy parameters;
    superimpose the positive direction with a reverse of the negative direction as the direction in which the total loss decreases.
  19. The apparatus according to claim 18, wherein the updating module is configured to:
    update the current policy parameters in the policy network in the direction in which the total loss decreases.
  20. The apparatus according to claim 17, wherein the second determination unit is configured to classify the N candidate sentences using the classification networks under M groups of classification parameters, obtaining N classification results corresponding to the N candidate sentences, where M <= N;
    wherein the N total losses correspond to N parameter sets, where an i-th parameter set comprises an i-th group of policy parameters and the classification parameters used by the classification network when processing an i-th candidate sentence;
    and the direction determination module is configured to:
    determine, as a first positive direction, an accumulation of gradients of at least one first parameter set corresponding to the at least one first total loss relative to the current policy parameters;
    determine, as a first negative direction, an accumulation of gradients of at least one second parameter set corresponding to the at least one second total loss relative to the current policy parameters;
    superimpose the first positive direction with a reverse of the first negative direction as a first adjustment direction;
    determine, as a second positive direction, an accumulation of gradients of at least one first parameter set corresponding to the at least one first total loss relative to current classification parameters;
    determine, as a second negative direction, an accumulation of gradients of at least one second parameter set corresponding to the at least one second total loss relative to the current classification parameters;
    superimpose the second positive direction with a reverse of the second negative direction as a second adjustment direction;
    take a sum of the first adjustment direction and the second adjustment direction as the direction in which the total loss decreases.
  21. The apparatus according to claim 20, wherein the updating module is configured to:
    update the current policy parameters of the policy network in the first adjustment direction;
    update the current classification parameters of the classification network in the second adjustment direction.
  22. The apparatus according to claim 12, further comprising a prediction unit, configured to:
    input a second sentence to be analyzed into the policy network;
    determine stem words in the second sentence according to an output of the policy network.
  23. A computer-readable storage medium, on which a computer program is stored, wherein, when the computer program is executed in a computer, the computer is caused to perform the method of any one of claims 1 to 11.
  24. A computing device, comprising a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method of any one of claims 1 to 11 is implemented.
PCT/CN2020/070149 2019-02-13 2020-01-02 Method and apparatus for extracting stem words through reinforcement learning WO2020164336A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910116482.X 2019-02-13
CN201910116482.XA CN110008332B (zh) 2019-02-13 2019-02-13 Method and apparatus for extracting stem words through reinforcement learning

Publications (1)

Publication Number Publication Date
WO2020164336A1 true WO2020164336A1 (zh) 2020-08-20

Family

ID=67165738

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/070149 WO2020164336A1 (zh) 2019-02-13 2020-01-02 Method and apparatus for extracting stem words through reinforcement learning

Country Status (3)

Country Link
CN (1) CN110008332B (zh)
TW (1) TWI717826B (zh)
WO (1) WO2020164336A1 (zh)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008332B (zh) * 2019-02-13 2020-11-10 创新先进技术有限公司 通过强化学习提取主干词的方法及装置
CN111582371B (zh) * 2020-05-07 2024-02-02 广州视源电子科技股份有限公司 一种图像分类网络的训练方法、装置、设备及存储介质
CN113377884B (zh) * 2021-07-08 2023-06-27 中央财经大学 基于多智能体增强学习的事件语料库提纯方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130282627A1 (en) * 2012-04-20 2013-10-24 Xerox Corporation Learning multiple tasks with boosted decision trees
CN107679039A (zh) * 2017-10-17 2018-02-09 北京百度网讯科技有限公司 用于确定语句意图的方法和装置
CN108108094A (zh) * 2017-12-12 2018-06-01 深圳和而泰数据资源与云技术有限公司 一种信息处理方法、终端及计算机可读介质
CN110008332A (zh) * 2019-02-13 2019-07-12 阿里巴巴集团控股有限公司 通过强化学习提取主干词的方法及装置



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117350302A (zh) * 2023-11-04 2024-01-05 湖北为华教育科技集团有限公司 Semantic-analysis-based text error correction method and system for written language, and human-computer interaction device
CN117350302B (zh) * 2023-11-04 2024-04-02 湖北为华教育科技集团有限公司 Semantic-analysis-based text error correction method and system for written language, and human-computer interaction device

Also Published As

Publication number Publication date
TW202030625A (zh) 2020-08-16
CN110008332A (zh) 2019-07-12
TWI717826B (zh) 2021-02-01
CN110008332B (zh) 2020-11-10


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20755760

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20755760

Country of ref document: EP

Kind code of ref document: A1