CN112036186A - Corpus labeling method and device, computer storage medium and electronic equipment


Info

Publication number
CN112036186A
Authority
CN
China
Prior art keywords
corpus
vector
slot position
position value
unlabeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910482496.3A
Other languages
Chinese (zh)
Inventor
张金超
牛成
周杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910482496.3A priority Critical patent/CN112036186A/en
Publication of CN112036186A publication Critical patent/CN112036186A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a corpus labeling method and device, a computer storage medium, and electronic equipment. The method comprises the following steps: acquiring a labeled corpus and an unlabeled corpus; performing semantic compression on the labeled corpus and the unlabeled corpus respectively to obtain a first vector corresponding to the labeled corpus and a second vector corresponding to the unlabeled corpus; acquiring intention information of the unlabeled corpus according to the first vector and the second vector; acquiring a labeled slot position value in the labeled corpus and a candidate slot position value in the unlabeled corpus, and performing semantic compression on the labeled slot position value and the candidate slot position value respectively to obtain a third vector corresponding to the labeled slot position value and a fourth vector corresponding to the candidate slot position value; and acquiring slot position information of the unlabeled corpus according to the third vector and the fourth vector. According to the invention, unlabeled corpora are labeled in a semi-supervised manner, so that manual labeling is avoided, labeling cost is reduced, and labeling efficiency and the quantity of labeled corpora are improved.

Description

Corpus labeling method and device, computer storage medium and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a corpus labeling method, a corpus labeling device, a computer storage medium and electronic equipment.
Background
With the gradual development of artificial intelligence, it is being applied to various fields of daily life, such as payment through face recognition, identifying lesions through image recognition, navigation through voice recognition, and playing video, chess, and card games against machines, thereby providing great convenience for people's lives.
In the field of dialogs, task-based dialog is one of the main research focuses. A task-based dialog system aims to accurately and efficiently help users accomplish specific goals through natural language. At present, the corpora used for training task-based dialog systems are basically labeled manually, so the number of available corpora is very limited, and labeling a large number of corpora consumes substantial manpower, time, and cost, which is very unfavorable for training task-based dialog systems.
In view of the above, there is a need in the art to develop a new corpus annotation method.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present application and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
Embodiments of the present invention provide a corpus tagging method, a corpus tagging device, a computer storage medium, and an electronic device, which can improve corpus tagging efficiency, increase the number of labeled corpora, facilitate training of a dialog system, and improve the accuracy of the dialog system.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an aspect of an embodiment of the present invention, a corpus labeling method is provided, including: acquiring a labeled corpus and an unlabeled corpus; performing semantic compression on the labeled corpus and the unlabeled corpus respectively to obtain a first vector corresponding to the labeled corpus and a second vector corresponding to the unlabeled corpus; acquiring intention information of the unlabeled corpus according to the first vector and the second vector; acquiring a labeled slot position value in the labeled corpus and a candidate slot position value in the unlabeled corpus, and performing semantic compression on the labeled slot position value and the candidate slot position value respectively to obtain a third vector corresponding to the labeled slot position value and a fourth vector corresponding to the candidate slot position value; and acquiring the slot position information of the unlabeled corpus according to the third vector and the fourth vector.
According to an aspect of the embodiments of the present invention, there is provided a corpus labeling apparatus, including: a corpus acquiring module, configured to acquire labeled corpora and unlabeled corpora; a first vector quantization module, configured to perform semantic compression on the labeled corpus and the unlabeled corpus respectively through a semantic representation model, so as to obtain a first vector corresponding to the labeled corpus and a second vector corresponding to the unlabeled corpus; an intention information acquisition module, configured to acquire intention information of the unlabeled corpus according to the first vector and the second vector; a second vector quantization module, configured to acquire a labeled slot position value in the labeled corpus and a candidate slot position value in the unlabeled corpus, and perform semantic compression on the labeled slot position value and the candidate slot position value respectively, so as to obtain a third vector corresponding to the labeled slot position value and a fourth vector corresponding to the candidate slot position value; and a slot position information acquisition module, configured to acquire the slot position information of the unlabeled corpus according to the third vector and the fourth vector.
In some embodiments of the present invention, based on the foregoing solution, the corpus tagging device further includes: a training module, configured to acquire unlabeled corpus samples and train a semantic representation model to be trained on the basis of the unlabeled corpus samples, so that the semantic representation model can perform semantic compression on the labeled corpus and the unlabeled corpus respectively.
In some embodiments of the present invention, based on the foregoing scheme, the first vector quantization module is configured to: input the labeled corpus and the unlabeled corpus into the semantic representation model respectively, and map words in the labeled corpus and the unlabeled corpus into a vector space of a first preset dimension through the semantic representation model, so as to obtain the first vector and the second vector.
In some embodiments of the present invention, based on the foregoing solution, the intention information obtaining module includes: and the first calculating unit is used for acquiring the intention information of the unlabeled corpus based on a proximity algorithm according to the first vector and the second vector.
In some embodiments of the invention, the number of first vectors and second vectors is plural; based on the foregoing solution, the first computing unit is configured to: calculate a distance between any one of the second vectors and each of the first vectors; arrange the distances in order from small to large to form a sequence, and acquire a first preset number of distances in the sequence according to a first preset rule; determine target labeled corpora according to the first preset number of distances, and acquire target intentions according to the target labeled corpora; and determine, by counting, the main intention among the target intentions, and take the main intention as the intention information of the unlabeled corpus.
In some embodiments of the present invention, based on the foregoing scheme, the first computing unit may be further configured to: and comparing the distance with a first preset threshold value to obtain a target distance smaller than the first preset threshold value.
In some embodiments of the present invention, based on the foregoing solution, the corpus tagging device further includes: and the syntactic analysis module is used for carrying out syntactic analysis on the unmarked corpus so as to obtain nouns and noun phrases in the unmarked corpus, and taking the nouns and/or the noun phrases as the candidate slot position values.
In some embodiments of the present invention, based on the foregoing, the second vector quantization module is configured to: input the labeled slot position value and the candidate slot position value into the semantic representation model respectively, and map the labeled slot position value and the candidate slot position value into a vector space of a second preset dimension through the semantic representation model, so as to obtain the third vector and the fourth vector.
In some embodiments of the present invention, based on the foregoing solution, the slot information obtaining module includes: and the second calculating unit is used for acquiring the slot position information of the unlabeled corpus based on a proximity algorithm according to the third vector and the fourth vector.
In some embodiments of the present invention, the number of labeled slot position values and the number of candidate slot position values are both plural; based on the foregoing solution, the second computing unit is configured to: calculate a distance between any one of the fourth vectors and each of the third vectors; arrange the distances in order from small to large to form a sequence, and acquire a second preset number of distances in the sequence according to a second preset rule; determine target labeled slot position values according to the second preset number of distances; and determine, by counting, the main slot position value among the target labeled slot position values, and acquire target slot position information corresponding to the main slot position value, so as to use the target slot position information as the slot position information of the unlabeled corpus.
In some embodiments of the present invention, based on the foregoing scheme, the labeled corpora are manually labeled corpora, and the number of manually labeled corpora is smaller than the number of unlabeled corpora.
According to an aspect of the embodiments of the present invention, there is provided a computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the corpus tagging method as described in the above embodiments.
According to an aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; a storage device, configured to store one or more programs, which when executed by the one or more processors, cause the one or more processors to implement the corpus tagging method as described in the above embodiments.
In the technical solutions provided by some embodiments of the present invention, semantic compression is performed on the labeled corpus and the unlabeled corpus respectively to obtain a first vector and a second vector, and the intention information of the unlabeled corpus is then obtained according to the first vector and the second vector; semantic compression is then performed on the labeled slot position value in the labeled corpus and the candidate slot position value in the unlabeled corpus respectively to obtain a third vector and a fourth vector, and the slot position information of the unlabeled corpus is obtained according to the third vector and the fourth vector. According to the technical scheme, on the one hand, the intention information and the slot position information of the unlabeled corpora can be obtained in a semi-supervised manner, so that manual labeling is avoided, labeling cost is reduced, and labeling efficiency and the number of labeled corpora are improved; on the other hand, training a task-based dialog system with a large amount of labeled corpora can improve the accuracy and stability of the task-based dialog system, further improving the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which aspects of embodiments of the invention may be applied;
FIG. 2 is a flow chart of a corpus tagging method according to an embodiment of the present invention;
FIG. 3 schematically illustrates a flow diagram of intent tagging, according to one embodiment of the invention;
FIG. 4 is a schematic flow chart illustrating slot information annotation according to one embodiment of the present invention;
FIG. 5 schematically illustrates a structural diagram of a conventional task-based dialog system, in accordance with one embodiment of the present invention;
FIG. 6 schematically illustrates a block diagram of a task-based dialog system in an end-to-end format in accordance with an embodiment of the present invention;
FIG. 7 schematically illustrates a block diagram of a corpus tagging device according to an embodiment of the present invention;
FIG. 8 schematically illustrates a block diagram of a corpus tagging device according to an embodiment of the present invention;
FIG. 9 schematically illustrates a block diagram of a corpus tagging device according to an embodiment of the present invention;
FIG. 10 illustrates a schematic structural diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of the embodiments of the present invention can be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices (e.g., one or more of a smartphone 101, a tablet computer 102, and a portable computer 103 shown in fig. 1, and of course, a desktop computer, etc.), a network 104, and a server 105. The network 104 serves as a medium for providing communication links between terminal devices and the server 105. Network 104 may include various connection types, such as wired communication links, wireless communication links, and so forth.
It should be understood that the number of terminals, networks, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, and servers, as desired for an implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
In an embodiment of the present invention, after obtaining labeled corpora and unlabeled corpora, the terminal device 101 (which may also be terminal device 102 or 103) sends the labeled corpora and the unlabeled corpora to the server 105 through the network 104. The server 105 may be provided with a machine learning model, specifically a semantic representation model. The semantic representation model, trained on large-scale unlabeled corpora, may perform semantic compression on the received labeled corpora and unlabeled corpora to obtain a first vector corresponding to the labeled corpora and a second vector corresponding to the unlabeled corpora, and the intention information of the unlabeled corpora may be obtained according to the first vector and the second vector. Meanwhile, semantic compression may be performed, through the semantic representation model, on the labeled slot position value in the labeled corpus and the candidate slot position value in the unlabeled corpus to obtain a third vector corresponding to the labeled slot position value and a fourth vector corresponding to the candidate slot position value, and the slot position information of the unlabeled corpus may be obtained according to the third vector and the fourth vector, so that labeling of the unlabeled corpus is achieved and training of a dialog system is facilitated. According to the technical scheme of the embodiment of the invention, on the one hand, the intention information and the slot position information of the unlabeled corpora can be obtained in a semi-supervised manner, so that manual labeling is avoided, labeling cost is reduced, and labeling efficiency and the number of labeled corpora are improved; on the other hand, training a task-based dialog system with a large amount of labeled corpora can improve the accuracy and stability of the task-based dialog system, further improving the user experience.
It should be noted that the corpus tagging method provided by the embodiment of the present invention is generally executed by the server 105, and accordingly, the corpus tagging apparatus is generally disposed in the server 105. However, in other embodiments of the present invention, the terminal may also have a function similar to that of the server, so as to execute the corpus tagging scheme provided in the embodiments of the present invention.
In view of the problems in the related art, an embodiment of the present invention first provides a corpus tagging method, which can tag any corpus. The implementation of the technical solution of the embodiment of the present invention is described in detail below:
fig. 2 schematically shows a flowchart of a corpus tagging method according to an embodiment of the present invention. The method may be performed by a server, which may be the server shown in fig. 1. Referring to fig. 2, the corpus tagging method includes at least steps S210 to S250, which are described in detail as follows:
in step S210, the tagged corpus and the untagged corpus are obtained.
In an embodiment of the present invention, a large number of dialog corpora may be obtained through man-machine dialogs, a large number of texts may be obtained from the network as corpora, and corpora may of course also be obtained from communication between people in real life, and the like, which is not specifically limited in this embodiment of the present invention. The labeled corpus is a manually labeled corpus: after a large number of corpora are obtained, part of them can be labeled manually, for example by marking information such as the intention, slot type, and slot value of the sentences in the corpora. A dialog system mainly performs natural language processing on sentences input by a user, obtains the user's intention, identifies the slots in the sentences, determines a strategy according to the intention information and the slot information, and then generates natural language according to the dialog state, database query information, and the determined strategy to converse with the user; therefore, the embodiment of the present invention may label only the intention information and slot information of the unlabeled corpora.
In an embodiment of the present invention, the labeled corpus may be a manually labeled corpus directly obtained, for example, a corpus may be directly obtained as a labeled corpus, or an unlabeled corpus may be obtained and then labeled to form a labeled corpus. Further, the labeled corpus and the unlabeled corpus may be obtained simultaneously or separately, which is not specifically limited in this embodiment of the present invention.
In step S220, semantic compression is performed on the tagged corpus and the untagged corpus respectively to obtain a first vector corresponding to the tagged corpus and a second vector corresponding to the untagged corpus.
In an embodiment of the present invention, after the tagged corpus and the untagged corpus are obtained, the tagged corpus and the untagged corpus may be processed to vectorize the tagged corpus and the untagged corpus. In the embodiment of the present invention, the labeled corpus and the unlabeled corpus may be subjected to semantic compression through a machine learning model, specifically, the machine learning model may be a semantic representation model, which may convert the corpus from a natural language form into word vectors, and further label the unlabeled corpus according to the vectors corresponding to the corpus, so as to obtain a large amount of dialogue training corpus.
In an embodiment of the present invention, before the labeled corpus and the unlabeled corpus are processed by the semantic representation model, the semantic representation model may be trained on a large number of corpus samples. The semantic representation model may be word2vec, or a pre-trained language processing model such as ELMo (Embeddings from Language Models) or BERT (Bidirectional Encoder Representations from Transformers); by inputting the corpus samples into the semantic representation model, the words in the corpus samples may be vectorized and natural language converted into vectors. Because the word2vec model does not consider the order of words, the features it produces are discrete and sparse. In the ELMo model, the representation of each word is a function of the entire input sentence: a bidirectional LSTM model is trained on a large corpus with a language-model objective, and the LSTM is then used to generate word representations, so the word representations determined by the ELMo model take the surrounding textual context into account and the output is more accurate. Likewise, when the BERT model processes a word, it can consider the information of the words before and after it, so the semantics of the context can be captured. Therefore, compared with the word2vec model, the ELMo or BERT model can obtain more accurate word representations; and a pre-trained language processing model can be trained with unlabeled corpus samples.
In an embodiment of the present invention, after the training of the semantic representation model is completed, the tagged corpus and the untagged corpus obtained in step S210 may be subjected to semantic compression through the trained semantic representation model to vectorize the tagged corpus and the untagged corpus, and specifically, words in the tagged corpus and the untagged corpus may be respectively mapped into a vector space with a first preset dimension through the semantic representation model to obtain a first vector corresponding to the tagged corpus and a second vector corresponding to the untagged corpus. The first preset dimension is a parameter set when the semantic representation model is trained, and may be, for example, 64-dimensional, 128-dimensional, and the like, which is not specifically limited in the embodiment of the present invention.
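As a rough illustration of this vectorization step, the sketch below maps a sentence to a vector of a first preset dimension by averaging per-word vectors. The hash-based word vectors are only a deterministic placeholder for a trained semantic representation model (word2vec, ELMo, or BERT); all names and the dimension are illustrative assumptions, not the patent's actual model.

```python
import hashlib
import math

DIM = 64  # the "first preset dimension" (e.g. 64 or 128, as described above)

def word_vector(word):
    """Placeholder embedding lookup; a trained model would be used in practice."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()  # 32 bytes
    raw = (digest * (DIM // len(digest) + 1))[:DIM]         # stretch to DIM bytes
    return [b / 127.5 - 1.0 for b in raw]                   # scale into [-1, 1]

def semantic_compress(sentence):
    """Map a corpus sentence to one DIM-dimensional vector by mean pooling."""
    vecs = [word_vector(w) for w in sentence.lower().split()]
    pooled = [sum(col) / len(vecs) for col in zip(*vecs)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0     # L2-normalize
    return [x / norm for x in pooled]

first_vector = semantic_compress("how is the weather in beijing tomorrow")
second_vector = semantic_compress("will it rain in beijing tomorrow")
```

In a real system, `word_vector` would be replaced by the trained semantic representation model's lookup (or a contextual encoder applied to the whole sentence); the mean-pooling and normalization steps are one common way to obtain a single fixed-dimension sentence vector.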
In step S230, the intention information of the unlabeled corpus is obtained according to the first vector and the second vector.
In an embodiment of the present invention, there are multiple labeled corpora and multiple unlabeled corpora, so after the labeled corpora and the unlabeled corpora are converted from natural language form into vectors, there are multiple first vectors and multiple second vectors; after the first vectors and the second vectors are obtained, the intention of the unlabeled corpora can be labeled by calculating semantic similarity.
In one embodiment of the invention, semantic similarity may be calculated based on a proximity (nearest-neighbor) algorithm. Fig. 3 shows a schematic flow diagram of intent labeling; as shown in fig. 3, the flow mainly includes steps S301 to S304, specifically:
in step S301, calculating a distance between any one of the second vectors and each of the first vectors;
in an embodiment of the present invention, when calculating the semantic similarity, the semantic similarity may be compared by calculating a distance between two vectors, where the distance may be any type of distance, such as a euclidean distance, a cosine distance, a mahalanobis distance, and the like, and this is not limited in this embodiment of the present invention. In addition, because the unlabeled corpora are subjected to intention labeling, and the labeled corpora contain intention information, the distance between the second vector corresponding to each unlabeled corpora and the first vector corresponding to all labeled corpora can be calculated, and then the labeled corpora similar to the unlabeled corpora are judged according to the similarity, so that the intention labeling is performed on each unlabeled corpora according to the intention information of the similar labeled corpora.
In step S302, the distances are arranged in order from small to large to form a sequence, and a first preset number of distances in the sequence is obtained according to a first preset rule.
In an embodiment of the present invention, after the distances between each second vector and all first vectors are obtained, all the distances corresponding to each second vector may be arranged from small to large into a sequence, and a first preset number of distances in the sequence may be acquired according to a first preset rule. The first preset rule differs for distances obtained by different distance measures. For example, for the cosine measure, the closer the value is to 1, the closer the angle between the two vectors is to 0 degrees, that is, the two vectors are similar; the closer the value is to -1, the closer the angle is to 180 degrees, that is, the two vectors are dissimilar. Therefore, for a sequence formed by sorting cosine values from small to large, the first preset rule may be to take the first preset number of distances from the end of the sequence in reverse order, so as to obtain the first vectors with higher similarity to the second vector. The Euclidean distance and the Mahalanobis distance are never negative, so for a sequence formed by sorting these distances from small to large, the first preset rule may be to take the first preset number of distances from the front of the sequence, so as to obtain the first vectors with higher similarity to the second vector. It should be noted that the first preset number may be set according to actual needs, for example selecting the first 10 or first 20 first vectors with the highest similarity to the second vector; the specific value is not limited in the embodiment of the present invention.
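Steps S301 and S302 can be sketched as follows using the Euclidean measure; the function names, sample vectors, and the choice of k are illustrative assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_first_vectors(second_vector, first_vectors, k):
    """S301: distance from the second vector to every first vector.
    S302: sort the distances small-to-large and take the first preset
    number k from the front (the positive-order rule for Euclidean or
    Mahalanobis distances described above)."""
    distances = [(euclidean(second_vector, fv), idx)
                 for idx, fv in enumerate(first_vectors)]
    distances.sort()        # the small-to-large sequence
    return distances[:k]    # the first preset number of distances

first_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
second_vector = [0.1, 0.0]
top = nearest_first_vectors(second_vector, first_vectors, k=2)
```

Each returned pair carries the index of a first vector, which is what step S303 needs in order to look up the corresponding target labeled corpus.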
In step S303, a target markup corpus is determined according to the first preset number of distances, and a target intention is obtained according to the target markup corpus.
In an embodiment of the present invention, after the first preset number of distances are determined, the target labeled corpora corresponding to those distances may be acquired, and the target intentions corresponding to the target labeled corpora may be acquired. For example, for the unlabeled corpus "What is the weather in Beijing tomorrow?", the distances between it and the labeled corpora are calculated, and target labeled corpora such as "How is the weather in Beijing tomorrow?", "Will tomorrow be cloudy?", "Will it rain in Beijing tomorrow?", and "Which route should I take to Beijing tomorrow?" may be obtained; the target intentions can then be determined according to the target labeled corpora, namely weather query, weather query, weather query, and route query.
In step S304, the main intention among the target intentions is determined by counting, and the main intention is used as the intention information of the unlabeled corpus.
In an embodiment of the present invention, after the target intentions are determined, all the target intentions may be counted to find the most frequent one, which is the main intention among all the target intentions. In the example of step S303, "weather query" accounts for 75% and "route query" accounts for 25%, so "weather query" is the main intention. After the main intention is determined, it can be assigned to the unlabeled corpus as its intention information; that is, for the unlabeled corpus "What is the weather in Beijing tomorrow?", the corresponding intention information is "weather query".
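The counting in step S304 amounts to a majority vote over the target intentions of the nearest labeled corpora. A minimal sketch, using the worked example's votes (3 of 4 neighbors say "weather query"), might look like this:

```python
from collections import Counter

def main_intention(target_intentions):
    """Step S304: count the target intentions and return the dominant one."""
    return Counter(target_intentions).most_common(1)[0][0]

# The worked example above: 3 of 4 nearest neighbours vote "weather query".
votes = ["weather query", "weather query", "weather query", "route query"]
intention_info = main_intention(votes)
```

`intention_info` is then assigned to the unlabeled corpus as its intention information.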
In an embodiment of the present invention, after the distances between any second vector and all first vectors are obtained, all the distances may first be compared with a first preset threshold to obtain the target distances smaller than the first preset threshold; the target distances are then sorted from small to large to form a sequence, and a first preset number of distances are obtained from the sequence according to the first preset rule. Comparing all the distances with the first preset threshold reduces the amount of data to be sorted and improves data processing efficiency.
In step S204, a labeled slot position value in the labeled corpus and a candidate slot position value in the unlabeled corpus are obtained, and the labeled slot position value and the candidate slot position value are subjected to semantic compression, respectively, to obtain a third vector corresponding to the labeled slot position value and a fourth vector corresponding to the candidate slot position value.
In an embodiment of the present invention, after the intention information of the unlabeled corpus is labeled, the slot information of the unlabeled corpus needs to be labeled. Slot information refers to the task-related information categories and information values in a sentence. In the embodiment of the present invention, the slot information of the unlabeled corpus may be labeled based on the slot information in the labeled corpus. Before labeling, candidate slot position values may be obtained from the unlabeled corpus, and accurate slot position values with their corresponding information categories are then selected from the candidates. Specifically, the unlabeled corpus may be subjected to syntactic analysis, and the parts of speech of its words labeled, to obtain the nouns and noun phrases in the unlabeled corpus. Since slot position values are usually nouns or noun phrases in a sentence, the obtained nouns and/or noun phrases may be used as the candidate slot position values for subsequent screening and labeling.
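The candidate-extraction step can be sketched as follows, assuming the syntactic analyzer has already produced (word, part-of-speech) pairs; the tag set ("n" for noun) and the convention that a run of consecutive nouns also forms a noun-phrase candidate are illustrative assumptions, not specified by the patent.

```python
def candidate_slot_values(tagged_words):
    """tagged_words: list of (word, pos) pairs from a syntactic analyzer.
    Each noun becomes a candidate slot position value; a run of
    consecutive nouns additionally yields a noun-phrase candidate."""
    candidates, run = [], []
    for word, pos in tagged_words + [("", "EOS")]:  # sentinel flushes last run
        if pos == "n":
            run.append(word)
        else:
            if run:
                candidates.extend(run)                    # individual nouns
                if len(run) > 1:
                    candidates.append(" ".join(run))      # the noun phrase
                run = []
    return candidates


# For the running example "tomorrow Beijing weather how":
tagged = [("tomorrow", "n"), ("Beijing", "n"), ("weather", "n"), ("how", "adv")]
candidate_slot_values(tagged)
# -> ["tomorrow", "Beijing", "weather", "tomorrow Beijing weather"]
```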
In an embodiment of the present invention, the labeled slot position values in the labeled corpus may be mapped to a vector space of a second preset dimension through the trained semantic representation model to obtain the third vectors corresponding to the labeled slot position values; meanwhile, the candidate slot position values of the unlabeled corpus are mapped to the same vector space to obtain the fourth vectors corresponding to the candidate slot position values. The similarity between the labeled slot position values and the candidate slot position values can then be calculated based on a proximity algorithm, and the slot position values and slot position types of the unlabeled corpus labeled accordingly. It should be noted that the second preset dimension is also a parameter set when the semantic representation model is trained; it may be the same as or different from the first preset dimension, which is not specifically limited in the embodiment of the present invention.
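The trained semantic representation model produces the actual mapping; as a stand-in that only illustrates the interface (a fixed, configurable output dimension and a deterministic string-to-vector mapping), the hash-based sketch below can be used for wiring up the pipeline. Its vectors carry no semantics, so it is an assumption for illustration only, not a substitute for the learned model.

```python
import hashlib

def embed(text, dim=8):
    """Stand-in for the trained semantic representation model: map a slot
    position value (or corpus) to a vector of the preset dimension.
    Deterministic, so repeated calls agree; values lie in [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


vec = embed("tomorrow", dim=8)   # a "fourth vector" of the second preset dimension
```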
In an embodiment of the present invention, the number of labeled slot position values and the number of candidate slot position values are both plural. Fig. 4 shows a schematic diagram of the slot information labeling process; as shown in fig. 4, the process mainly includes steps S401 to S404, specifically:
in step S401, a distance between any one of the fourth vectors and each of the third vectors is calculated.
In an embodiment of the present invention, similar to step S301, the distances between the fourth vector corresponding to each candidate slot position value and the third vectors corresponding to all labeled slot position values may be calculated, and whether a candidate slot position value is similar to a labeled slot position value may be determined from the distance: if the distance between them is small and within a preset range, the candidate slot position value may be considered similar to that labeled slot position value. As before, the distance calculated in step S401 may be any one of the Euclidean distance, the cosine distance, the Mahalanobis distance, and the like, and its type may be the same as or different from the distance type used in step S301, which is not specifically limited in this embodiment of the present invention.
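Two of the candidate measures can be written out directly; the sketch below shows the Euclidean distance and the cosine measure for plain Python lists. Note that the document's "cosine distance" is used as a similarity in [-1, 1] (larger means more similar), which is what this function returns. The Mahalanobis distance additionally weights differences by the inverse covariance matrix of the data and is omitted here.

```python
import math

def euclidean(u, v):
    """Euclidean distance: non-negative, smaller = closer."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine measure in [-1, 1]: 1 means a 0-degree angle (similar),
    -1 means a 180-degree angle (dissimilar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


euclidean([0, 0], [3, 4])   # -> 5.0
cosine([1, 0], [1, 0])      # -> 1.0
```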
In step S402, the distances are arranged in order from small to large to form a sequence, and a second preset number of distances in the sequence are obtained according to a second preset rule.
In an embodiment of the present invention, after the distances between each fourth vector and all third vectors are obtained, all the distances may be arranged in order from small to large to form a sequence, and a second preset number of distances in the sequence are obtained according to a second preset rule. Similar to the description in step S302, the second preset rule differs for different distance types, but the obtained second preset number of distances all correspond to third vectors that are close to the fourth vector and within the preset range. It should be noted that the second preset number may be a value set according to actual needs, and it may be the same as or different from the first preset number, which is not specifically limited in the embodiment of the present invention.
In step S403, a target labeled slot position value is determined according to the second preset number of distances.
In an embodiment of the present invention, after the second preset number of distances are obtained, the labeled slot position values of the labeled corpora corresponding to those distances may be determined; these are the target labeled slot position values. Continuing the example in step S303, the nouns and noun phrases in the unlabeled corpus, that is, the candidate slot position values, are: tomorrow, Beijing, and weather. The labeled slot position values in the four semantically similar labeled corpora are, for example: tomorrow, Beijing, weather; tomorrow, cloudy; tomorrow, Beijing, weather; and tomorrow, Beijing. According to the distances between the candidate slot position values and the labeled slot position values, the target labeled slot position values can be determined to be tomorrow, weather, Beijing, and cloudy.
In step S404, the main slot position values among the target labeled slot position values are counted, and the target slot position information corresponding to the main slot position values is obtained, so that the target slot position information is used as the slot position information of the unlabeled corpus.
In an embodiment of the present invention, after the target labeled slot position values are determined, all of them may be counted to obtain the target labeled slot position values whose proportions meet a preset condition; these are the main slot position values among all target labeled slot position values. Continuing the above example, the proportion of "tomorrow" is the largest, "Beijing" is second, "weather" is third, and "cloudy" is fourth. If the preset condition is that the slot position values with the three largest proportions are the main slot position values, the main slot position values are determined to be tomorrow, Beijing, and weather. Further, the slot position categories corresponding to the main slot position values can be determined, namely a time attribute, a place attribute, and a weather attribute, and the slot position information of the unlabeled corpus can be determined according to the main slot position values and the target slot position information.
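With "the three largest proportions" as the preset condition, step S404's counting reduces to keeping the top-n most frequent target labeled slot position values; a minimal sketch (function and parameter names are illustrative):

```python
from collections import Counter

def main_slot_values(target_slot_values, top_n=3):
    """Keep the slot position values whose frequency ranks in the top_n,
    mirroring the 'three largest proportions' preset condition."""
    counts = Counter(target_slot_values)
    return [value for value, _ in counts.most_common(top_n)]


# Frequencies matching the example: tomorrow x4, Beijing x3, weather x2, cloudy x1
votes = ["tomorrow"] * 4 + ["Beijing"] * 3 + ["weather"] * 2 + ["cloudy"]
main_slot_values(votes)  # -> ["tomorrow", "Beijing", "weather"]
```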
Similarly, after the distances between any fourth vector and all third vectors are obtained, all the distances may be compared with a second preset threshold to obtain the target distances smaller than the second preset threshold; the target distances are then sorted from small to large to form a sequence, and a second preset number of distances are obtained from the sequence according to the second preset rule. Comparing all the distances with the second preset threshold reduces the amount of data to be sorted and improves data processing efficiency.
The embodiment of the invention can determine the intention information and the slot position information of the unlabeled corpus by a semi-supervised method; specifically, starting from a small-scale manually labeled corpus, it labels the unlabeled corpus on a large scale by using a semantic representation model and a similarity judgment method.
In an embodiment of the present invention, after a large number of labeled corpora are obtained, a dialog system can be trained on them. The dialog system here is mainly a task-based dialog system, which can be divided into the traditional task-based dialog system and the end-to-end task-based dialog system.
Fig. 5 is a schematic structural diagram of a traditional task-based dialog system. As shown in fig. 5, a traditional task-based dialog system 500 includes a natural language understanding module 501, a dialog state tracking module 502, a policy learning module 503, and a natural language generation module 504. The natural language understanding module 501 performs intention recognition, that is, it recognizes the task the user currently wants to perform, such as weather query or route navigation. For example, if a user inputs the sentence "Please tell me the weather in Los Angeles tomorrow", the user's intention is a weather query; the slot information contained in the sentence includes a time attribute with the value "tomorrow" and a place attribute with the value "Los Angeles", and the required information is the weather attribute. The dialog state tracking module 502 tracks the state across multiple rounds of dialog and updates the current dialog state according to the natural language understanding module 501's analysis of each user input, including the information provided by the system, the information required by the system, the information provided by the user, and the information required by the user; the system obtains the corresponding query result from the database according to the current information and sends it to the policy learning module 503. The policy learning module 503 selects a policy based on the received information, such as requesting information, providing information, or reporting a failed query. The natural language generation module 504 generates output in natural language form according to the policy selected by the policy learning module 503, the current dialog state, and the database query result.
For example, for the user input "Please tell me the weather tomorrow", the dialog policy is to request information: in the current dialog state, the information provided by the user includes a time attribute with the value "tomorrow", the information required by the user is the weather attribute, and the database query returns many matching results, so the system generates a question such as "Where would you like to query the weather?", guiding the user to provide more information to obtain an accurate query result.
Fig. 6 is a schematic structural diagram of an end-to-end task-based dialog system. As shown in fig. 6, the system is based on a neural network model: after receiving a natural language sentence input by the user, the system performs feature extraction on the sentence, interacts directly with the database through an attention mechanism, and finally produces the output through the neural network.
A large number of labeled corpora obtained by the corpus labeling method provided by the embodiment of the invention can be input into a traditional task-based dialog system or an end-to-end task-based dialog system for training, which improves the processing efficiency and accuracy of the dialog system and further improves the user experience.
In an embodiment of the present invention, the foregoing embodiments describe the server 105 executing the corpus labeling method; similarly, the corpus labeling method may also be executed by a terminal device, which may be the terminal device 101, 102, or 103 shown in fig. 1. Accordingly, a semantic representation model is provided in the terminal device 101. After receiving the labeled corpus and the unlabeled corpus, the terminal device 101 may perform semantic compression on the corpora or on the slot position values therein to obtain the corresponding vectors, judge semantic similarity according to the distances between the vectors, and then determine the intention information and slot position information of the unlabeled corpus according to the intention information and slot position information of the semantically similar labeled corpus. This realizes semi-supervised labeling of the unlabeled corpus, avoids the large amount of manpower and cost required to label a large unlabeled corpus manually, increases the quantity and speed of obtaining labeled corpora, and further improves the accuracy of the dialog system.
The following describes an embodiment of an apparatus of the present invention, which can be used to implement the corpus tagging method in the above embodiment of the present invention. For details that are not disclosed in the embodiments of the apparatus of the present invention, please refer to the embodiments of the corpus tagging method of the present invention.
Fig. 7 schematically shows a block diagram of a corpus tagging apparatus according to an embodiment of the present invention.
Referring to fig. 7, a corpus tagging apparatus 700 according to an embodiment of the present invention includes: the system comprises a corpus acquisition module 701, a first vector quantization module 702, an intention information acquisition module 703, a second vector quantization module 704 and a slot position information acquisition module 705.
The corpus acquiring module 701 is configured to acquire a labeled corpus and an unlabeled corpus; a first vector quantization module 702, configured to perform semantic compression on the tagged corpus and the untagged corpus respectively to obtain a first vector corresponding to the tagged corpus and a second vector corresponding to the untagged corpus; an intention information obtaining module 703, configured to obtain intention information of the unlabeled corpus according to the first vector and the second vector; a second vector quantization module 704, configured to obtain a labeled slot position value in the labeled corpus and a candidate slot position value in the unlabeled corpus, and perform semantic compression on the labeled slot position value and the candidate slot position value respectively to obtain a third vector corresponding to the labeled slot position value and a fourth vector corresponding to the candidate slot position value; a slot position information obtaining module 705, configured to obtain slot position information of the unlabeled corpus according to the third vector and the fourth vector.
Referring to fig. 8, the corpus tagging apparatus 700 according to an embodiment of the present invention further includes: the training module 706 is configured to obtain a non-labeled corpus sample, and train a semantic representation model to be trained through the non-labeled corpus sample, so that the semantic representation model performs semantic compression on the labeled corpus and the unlabeled corpus respectively.
In one embodiment of the present invention, the first vector quantization module 702 is configured to: respectively input the labeled corpus and the unlabeled corpus into the semantic representation model, and map words in the labeled corpus and the unlabeled corpus into a vector space of a first preset dimension through the semantic representation model, so as to obtain the first vector and the second vector.
In an embodiment of the present invention, the intention information obtaining module 703 includes: and the first calculating unit is used for acquiring the intention information of the unlabeled corpus based on a proximity algorithm according to the first vector and the second vector.
In one embodiment of the present invention, the number of the first vectors and the second vectors is plural; the first computing unit is configured to: calculate a distance between any one of the second vectors and each of the first vectors; arrange the distances in order from small to large to form a sequence, and acquire a first preset number of distances in the sequence according to a first preset rule; determine a target labeled corpus according to the first preset number of distances, and acquire a target intention according to the target labeled corpus; and count a main intention among the target intentions, and take the main intention as the intention information of the unlabeled corpus.
In an embodiment of the present invention, the first computing unit may be further configured to: and comparing the distance with a first preset threshold value to obtain a target distance smaller than the first preset threshold value.
Referring to fig. 9, the corpus tagging apparatus 700 according to an embodiment of the present invention further includes: a syntactic analysis module 707, configured to perform syntactic analysis on the unlabeled corpus to obtain nouns and noun phrases in the unlabeled corpus, and use the nouns and/or the noun phrases as the candidate slot position values.
In one embodiment of the present invention, the second vector quantization module 704 is configured to: respectively input the labeled slot position value and the candidate slot position value into the semantic representation model, and respectively map the labeled slot position value and the candidate slot position value into a vector space of a second preset dimension through the semantic representation model, so as to obtain the third vector and the fourth vector.
In an embodiment of the present invention, the slot information obtaining module 705 includes: and the second calculating unit is used for acquiring the slot position information of the unlabeled corpus based on a proximity algorithm according to the third vector and the fourth vector.
In one embodiment of the present invention, the number of the labeled slot position values and the number of the candidate slot position values are both plural; the second computing unit is configured to: calculate a distance between any one of the fourth vectors and each of the third vectors; arrange the distances in order from small to large to form a sequence, and acquire a second preset number of distances in the sequence according to a second preset rule; determine a target labeled slot position value according to the second preset number of distances; and count a main slot position value among the target labeled slot position values, and acquire target slot position information corresponding to the main slot position value, so as to use the target slot position information as the slot position information of the unlabeled corpus.
In an embodiment of the present invention, the second computing unit may be further configured to: and comparing the distance with a second preset threshold value to obtain a target distance smaller than the second preset threshold value.
In an embodiment of the present invention, the labeled corpus is a manually labeled corpus, and the number of manually labeled corpora is smaller than the number of unlabeled corpora.
FIG. 10 illustrates a schematic structural diagram of a computer system of an electronic device suitable for implementing an embodiment of the invention.
It should be noted that the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiment of the present invention.
As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU)1001 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for system operation are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004. An Input/Output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 1008 including a hard disk and the like; and a communication section 1009 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The driver 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is mounted into the storage section 1008 as necessary.
In particular, according to an embodiment of the present invention, the processes described below with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009 and/or installed from the removable medium 1011. When the computer program is executed by a Central Processing Unit (CPU)1001, various functions defined in the system of the present application are executed.
It should be noted that the computer readable medium shown in the embodiment of the present invention may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the invention, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiment of the present invention.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (15)

1. A corpus tagging method is characterized by comprising the following steps:
acquiring a labeled corpus and an unlabeled corpus;
semantic compression is respectively carried out on the labeled corpus and the unlabeled corpus to obtain a first vector corresponding to the labeled corpus and a second vector corresponding to the unlabeled corpus;
acquiring intention information of the unlabeled corpus according to the first vector and the second vector;
acquiring a labeling slot position value in the labeling corpus and a candidate slot position value in the non-labeling corpus, and performing semantic compression on the labeling slot position value and the candidate slot position value respectively to acquire a third vector corresponding to the labeling slot position value and a fourth vector corresponding to the candidate slot position value;
and acquiring the slot position information of the unmarked corpus according to the third vector and the fourth vector.
2. The corpus tagging method according to claim 1, wherein before semantic compression is performed on the tagged corpus and the untagged corpus, respectively, the method further comprises:
obtaining a non-labeled corpus sample, and training a semantic representation model to be trained based on the non-labeled corpus sample so as to enable the semantic representation model to perform semantic compression on the labeled corpus and the unlabeled corpus respectively.
3. The corpus tagging method of claim 2, wherein performing semantic compression on the tagged corpus and the untagged corpus respectively to obtain a first vector corresponding to the tagged corpus and a second vector corresponding to the untagged corpus comprises:
and respectively inputting the marked linguistic data and the unmarked linguistic data into the semantic representation model, and mapping words in the marked linguistic data and the unmarked linguistic data into a vector space with a first preset dimension through the semantic representation model so as to obtain the first vector and the second vector.
4. The corpus tagging method according to claim 1, wherein obtaining intent information of the unlabeled corpus according to the first vector and the second vector comprises:
and acquiring the intention information of the unlabeled corpus based on a proximity algorithm according to the first vector and the second vector.
5. The corpus tagging method according to claim 1, wherein the number of said first vector and said second vector is plural;
acquiring intention information of the unlabeled corpus based on a proximity algorithm according to the first vector and the second vector, wherein the intention information comprises:
calculating a distance between any one of the second vectors and each of the first vectors;
arranging the distances in a sequence from small to large to form a sequence, and acquiring a first preset number of distances in the sequence according to a first preset rule;
determining a target labeling corpus according to the first preset number of distances, and acquiring a target intention according to the target labeling corpus;
and counting a main intention among the target intentions, and taking the main intention as the intention information of the unlabeled corpus.
6. The corpus tagging method according to claim 5, wherein before arranging the distances in ascending order to form a sequence and acquiring a first preset number of distances from the sequence according to a first preset rule, the method further comprises:
comparing each distance with a first preset threshold to obtain target distances smaller than the first preset threshold.
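Claims 5 and 6 together describe a k-nearest-neighbour vote over the vectorized corpora. A minimal sketch, under assumptions the claims do not fix (Euclidean distance as the metric, concrete values for the preset number k and the preset threshold, and toy intents):

```python
import math
from collections import Counter

def knn_intent(second_vec, tagged, k=3, threshold=2.0):
    # tagged: list of (first_vector, intention) pairs from the tagged corpus.
    # Distance between the untagged vector and each tagged vector.
    distances = [(math.dist(second_vec, vec), intent) for vec, intent in tagged]
    # Claim 6: keep only target distances below the first preset threshold.
    distances = [d for d in distances if d[0] < threshold]
    # Claim 5: sort ascending and keep the first preset number (k) of distances.
    nearest = sorted(distances)[:k]
    if not nearest:
        return None  # no tagged corpus close enough to propagate an intention
    # Count intentions of the target tagged corpora; the main intention wins.
    counts = Counter(intent for _, intent in nearest)
    return counts.most_common(1)[0][0]

tagged = [([0.0, 0.0], "play_music"), ([0.1, 0.0], "play_music"),
          ([5.0, 5.0], "set_alarm")]
print(knn_intent([0.05, 0.0], tagged))  # nearest neighbours vote "play_music"
```

The threshold filter keeps an untagged sentence from inheriting an intention when nothing in the tagged set is actually similar to it, which is the point of claim 6.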
7. The corpus tagging method according to claim 1, wherein before obtaining the tagged slot position value in the tagged corpus and the candidate slot position value in the untagged corpus, the method further comprises:
performing syntactic analysis on the untagged corpus to obtain nouns and noun phrases in the untagged corpus, and taking the nouns and/or the noun phrases as the candidate slot position values.
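Claim 7 requires only that syntactic analysis yield nouns and noun phrases; it does not name a parser. In the sketch below, a tiny hand-written noun lexicon stands in for a real part-of-speech tagger (the lexicon, the example sentences, and the rule that consecutive nouns form a phrase are all illustrative assumptions):

```python
# Hypothetical stand-in for a real syntactic parser: a tiny
# lexicon marking which tokens count as nouns.
NOUN_LEXICON = {"jazz", "music", "alarm", "clock", "song"}

def candidate_slot_values(sentence):
    # Runs of consecutive nouns become noun phrases; both the phrases
    # and the single nouns serve as candidate slot position values.
    tokens = sentence.lower().split()
    candidates, phrase = [], []
    for token in tokens:
        if token in NOUN_LEXICON:
            phrase.append(token)
        elif phrase:
            candidates.append(" ".join(phrase))
            phrase = []
    if phrase:
        candidates.append(" ".join(phrase))
    # Also keep the individual nouns inside multi-word phrases.
    for c in list(candidates):
        parts = c.split()
        if len(parts) > 1:
            candidates.extend(parts)
    return candidates

print(candidate_slot_values("play some jazz music now"))
```

A production system would obtain the same candidates from a dependency or constituency parse; the output here is just the candidate set that the later slot-vectorization step consumes.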
8. The corpus tagging method according to claim 2, wherein the obtaining of the tagged slot position value in the tagged corpus and the candidate slot position value in the untagged corpus and the semantic compression of the tagged slot position value and the candidate slot position value respectively to obtain the third vector corresponding to the tagged slot position value and the fourth vector corresponding to the candidate slot position value comprises:
inputting the tagged slot position value and the candidate slot position value respectively into the semantic representation model, and mapping the tagged slot position value and the candidate slot position value into a vector space of a second preset dimension through the semantic representation model, so as to obtain the third vector and the fourth vector.
9. The corpus tagging method according to claim 1, wherein obtaining the slot position information of the untagged corpus according to the third vector and the fourth vector comprises:
acquiring the slot position information of the untagged corpus based on a proximity algorithm according to the third vector and the fourth vector.
10. The corpus tagging method according to claim 9, wherein there are a plurality of the tagged slot position values and a plurality of the candidate slot position values;
acquiring the slot position information of the untagged corpus based on a proximity algorithm according to the third vector and the fourth vector comprises:
calculating a distance between any one of the fourth vectors and each of the third vectors;
arranging the distances in ascending order to form a sequence, and acquiring a second preset number of distances from the sequence according to a second preset rule;
determining target tagged slot position values according to the second preset number of distances;
counting the target tagged slot position values to determine a main slot position value, and acquiring target slot position information corresponding to the main slot position value, so as to take the target slot position information as the slot position information of the untagged corpus.
11. The corpus tagging method according to claim 10, wherein before arranging the distances in ascending order to form a sequence and acquiring a second preset number of distances from the sequence according to a second preset rule, the method further comprises:
comparing each distance with a second preset threshold to obtain target distances smaller than the second preset threshold.
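Claims 10 and 11 apply the same nearest-neighbour scheme to slot values, with one extra step: the main slot value is mapped to its slot information (its slot type), which then labels the untagged corpus. A sketch under assumed specifics (Euclidean distance, concrete k and threshold, and an invented toy slot-type mapping):

```python
import math
from collections import Counter

def knn_slot_info(fourth_vec, tagged_slots, slot_info, k=3, threshold=1.0):
    # tagged_slots: list of (third_vector, tagged_slot_value) pairs.
    # slot_info: map from a tagged slot value to its slot information.
    distances = sorted(
        (math.dist(fourth_vec, vec), value)
        for vec, value in tagged_slots
        # Claim 11: discard distances not below the second preset threshold.
        if math.dist(fourth_vec, vec) < threshold
    )
    # Claim 10: keep the second preset number (k) of nearest tagged slot values.
    nearest = [value for _, value in distances[:k]]
    if not nearest:
        return None
    # The most frequent value among the neighbours is the main slot value.
    main_value = Counter(nearest).most_common(1)[0][0]
    # Its slot information labels the candidate from the untagged corpus.
    return slot_info[main_value]

tagged_slots = [([0.0], "jazz"), ([0.1], "rock"), ([0.9], "7 am")]
slot_info = {"jazz": "music_genre", "rock": "music_genre", "7 am": "time"}
print(knn_slot_info([0.05], tagged_slots, slot_info, k=2))
```

Together with the intention step, this yields both labels the patent needs for an untagged sentence without any additional manual annotation.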
12. The corpus tagging method according to claim 1, wherein the tagged corpus is a manually tagged corpus, and the number of manually tagged corpora is smaller than the number of untagged corpora.
13. A corpus tagging device, comprising:
a corpus acquiring module, configured to acquire a tagged corpus and an untagged corpus;
a first vectorization module, configured to perform semantic compression on the tagged corpus and the untagged corpus respectively to obtain a first vector corresponding to the tagged corpus and a second vector corresponding to the untagged corpus;
an intention information acquiring module, configured to acquire intention information of the untagged corpus according to the first vector and the second vector;
a second vectorization module, configured to acquire a tagged slot position value in the tagged corpus and a candidate slot position value in the untagged corpus, and perform semantic compression on the tagged slot position value and the candidate slot position value respectively to obtain a third vector corresponding to the tagged slot position value and a fourth vector corresponding to the candidate slot position value;
and a slot position information acquiring module, configured to acquire slot position information of the untagged corpus according to the third vector and the fourth vector.
14. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the corpus tagging method according to any one of claims 1 to 12.
15. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the corpus tagging method according to any one of claims 1 to 12.
CN201910482496.3A 2019-06-04 2019-06-04 Corpus labeling method and device, computer storage medium and electronic equipment Pending CN112036186A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910482496.3A CN112036186A (en) 2019-06-04 2019-06-04 Corpus labeling method and device, computer storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910482496.3A CN112036186A (en) 2019-06-04 2019-06-04 Corpus labeling method and device, computer storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112036186A true CN112036186A (en) 2020-12-04

Family

ID=73576147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910482496.3A Pending CN112036186A (en) 2019-06-04 2019-06-04 Corpus labeling method and device, computer storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112036186A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221644A (en) * 2021-04-06 2021-08-06 珠海远光移动互联科技有限公司 Slot position word recognition method and device, storage medium and electronic equipment
CN113656534A (en) * 2021-08-26 2021-11-16 北京百度网讯科技有限公司 Corpus expansion method, apparatus, device and medium
CN114398943A (en) * 2021-12-09 2022-04-26 北京百度网讯科技有限公司 Sample enhancement method and device thereof


Similar Documents

Publication Publication Date Title
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN108763510B (en) Intention recognition method, device, equipment and storage medium
Rei et al. Zero-shot sequence labeling: Transferring knowledge from sentences to tokens
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
Fu et al. CRNN: a joint neural network for redundancy detection
CN112699686B (en) Semantic understanding method, device, equipment and medium based on task type dialogue system
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN111738016A (en) Multi-intention recognition method and related equipment
CN113743099B (en) System, method, medium and terminal for extracting terms based on self-attention mechanism
CN112036186A (en) Corpus labeling method and device, computer storage medium and electronic equipment
CN113326702B (en) Semantic recognition method, semantic recognition device, electronic equipment and storage medium
US20230094730A1 (en) Model training method and method for human-machine interaction
CN114003682A (en) Text classification method, device, equipment and storage medium
CN113705207A (en) Grammar error recognition method and device
Kim et al. Domainless adaptation by constrained decoding on a schema lattice
CN110750967B (en) Pronunciation labeling method and device, computer equipment and storage medium
CN110210035B (en) Sequence labeling method and device and training method of sequence labeling model
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium
CN113705222B (en) Training method and device for slot identification model and slot filling method and device
CN113362809B (en) Voice recognition method and device and electronic equipment
CN113177406B (en) Text processing method, text processing device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination