CN113821593A - Corpus processing method, related device and equipment - Google Patents

Corpus processing method, related device and equipment

Info

Publication number
CN113821593A
Authority
CN
China
Prior art keywords
corpus
expanded
candidate
semantic
corpora
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110774306.2A
Other languages
Chinese (zh)
Inventor
王明 (Wang Ming)
包恒耀 (Bao Hengyao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110774306.2A
Publication of CN113821593A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data; Database structures therefor; File system structures therefor
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Abstract

The embodiments of the application disclose a corpus processing method and a related device and equipment, which are used to sufficiently expand a corpus to be expanded so as to meet the demand of model training for a sufficient quantity of corpora. The method in the embodiments of the application comprises the following steps: obtaining a corpus to be expanded; obtaining K candidate corpora according to the corpus to be expanded; inputting the K candidate corpora and the corpus to be expanded into a semantic recognition model to obtain K semantic recognition results, wherein each semantic recognition result is a similarity score or a similarity classification, the similarity score representing the semantic similarity between a candidate corpus and the corpus to be expanded, and the similarity classification representing the category to which the semantics shared by a candidate corpus and the corpus to be expanded belong; and, if at least one semantic recognition result among the K semantic recognition results satisfies a corpus extraction condition, determining the candidate corpus corresponding to that semantic recognition result as a target corpus, so as to obtain at least one target corpus belonging to the corpus to be expanded.

Description

Corpus processing method, related device and equipment
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a corpus processing method, a related device and equipment.
Background
With the popularization of artificial intelligence, more and more AI technologies bring convenience to people's lives. For example, a user inputs request sentences through an intelligent assistant; the assistant parses and processes the request sentences and passes the processing results to downstream services, which make the corresponding feedback, thereby completing an interaction with the user's speech.
However, existing semantic parsing platforms usually parse sentences with an intention classification model, which is trained on a certain amount of corpora. In practice, often only a small number of seed corpora can be provided, which is insufficient to support normal training of the model, or the trained model overfits the training data, so the model generalizes poorly.
Disclosure of Invention
The embodiments of the application provide a corpus processing method and a related device and equipment. A large number of candidate corpora are obtained by mining based on the corpus to be expanded, and target corpora semantically similar to the corpus to be expanded are further mined from those candidates, so that the corpus to be expanded is sufficiently expanded and the demand of model training for a sufficient quantity of corpora is met.
In view of the above, an aspect of the present application provides a corpus processing method, including:
obtaining a corpus to be expanded;
acquiring K candidate corpora according to the corpus to be expanded, wherein the semantic similarity between each candidate corpus and the corpus to be expanded is greater than or equal to a similarity threshold, and K is an integer greater than 1;
inputting the K candidate corpora and the corpus to be expanded into a semantic recognition model to obtain K semantic recognition results, wherein each semantic recognition result is a similarity score or a similarity classification, the similarity score represents the semantic similarity between a candidate corpus and the corpus to be expanded, and the similarity classification represents the category to which the semantics between a candidate corpus and the corpus to be expanded belong;
and if at least one semantic recognition result in the K semantic recognition results meets the corpus extraction condition, determining the candidate corpus corresponding to the at least one semantic recognition result as the target corpus to obtain at least one target corpus belonging to the corpus to be expanded.
Another aspect of the present application provides a corpus processing apparatus, including:
the obtaining unit is used for obtaining the corpus to be expanded;
the acquiring unit is further used for acquiring K candidate corpora according to the corpus to be expanded, wherein the semantic similarity between each candidate corpus and the corpus to be expanded is greater than or equal to a similarity threshold, and K is an integer greater than 1;
the processing unit is used for inputting the K candidate corpora and the corpus to be expanded into the semantic recognition model to obtain K semantic recognition results, wherein each semantic recognition result is a similarity score or a similarity classification, the similarity score represents the semantic similarity between a candidate corpus and the corpus to be expanded, and the similarity classification represents the category to which the semantics between a candidate corpus and the corpus to be expanded belong;
and the determining unit is used for determining the candidate corpus corresponding to at least one semantic recognition result as the target corpus if at least one semantic recognition result in the K semantic recognition results meets the corpus extraction condition so as to obtain at least one target corpus belonging to the corpus to be expanded.
In one possible design, in one implementation of another aspect of an embodiment of the present application,
the obtaining unit is further used for obtaining a corpus sample set, wherein the corpus sample set comprises a positive sample corpus and a negative sample corpus, and the positive sample corpus corresponds to an annotation label;
the processing unit is further used for performing feature extraction on the positive sample corpus to obtain positive sample corpus features and performing feature extraction on the negative sample corpus to obtain negative sample corpus features;
the processing unit is also used for inputting the positive sample corpus features and the negative sample corpus features into the semantic recognition model to obtain a semantic prediction result;
and the processing unit is further used for training the semantic recognition model according to the semantic prediction result and the annotation label.
In one possible design, in an implementation manner of another aspect of the embodiment of the present application, the processing unit may be specifically configured to:
respectively performing word segmentation processing on the positive sample corpus and the negative sample corpus to obtain at least two positive sample words to be processed and at least two negative sample words to be processed;
converting at least two positive sample words to be processed and at least two negative sample words to be processed into at least two positive sample word vectors and at least two negative sample word vectors;
and carrying out vector splicing on at least two positive sample word vectors to obtain positive sample corpus characteristics, and carrying out vector splicing on at least two negative sample word vectors to obtain negative sample corpus characteristics.
In one possible design, in one implementation of another aspect of an embodiment of the present application,
the processing unit is further used for translating the corpus to be expanded into N language corpora corresponding to N languages if none of the K semantic recognition results meets the corpus extraction condition;
and the processing unit is further used for translating the N language corpora back into at least N back-translated corpora according to the language of the corpus to be expanded.
In one possible design, in one implementation of another aspect of an embodiment of the present application,
the determining unit is further used for determining at least one target corpus and the at least N back-translated corpora as a plurality of corpora to be annotated;
the processing unit is further configured to perform slot matching on each corpus to be annotated to obtain a slot matching result corresponding to each corpus to be annotated, where the slot matching result is a matching similarity score that represents the semantic similarity between a preset slot and each word to be annotated in each corpus to be annotated;
the determining unit is further configured to determine, if the matching similarity score is greater than or equal to a preset matching threshold, the preset slot corresponding to the matching similarity score as a target slot, and determine the word to be annotated corresponding to the matching similarity score as a target slot value, where the target slot is used to represent an attribute of the target slot value;
and the processing unit is further used for de-duplicating the corpora to be annotated and their corresponding target slots and target slot values to obtain the target annotated corpora.
In one possible design, in one implementation of another aspect of an embodiment of the present application,
the processing unit is further configured to classify the K candidate corpora to obtain a first candidate corpus set and a second candidate corpus set if none of the K semantic recognition results meets the corpus extraction condition, where the first candidate corpus set includes i first candidate corpora, the second candidate corpus set includes j second candidate corpora, and i and j are integers greater than 1 and smaller than K;
and the processing unit is further used for combining the i first candidate corpora and the j second candidate corpora pairwise to obtain i × j target corpus pairs.
In one possible design, in an implementation manner of another aspect of the embodiment of the present application, the processing unit may be specifically configured to:
if each semantic recognition result is a similarity score, determining a candidate corpus corresponding to the similarity score which is greater than or equal to a preset similarity threshold as a target corpus;
and if each semantic recognition result is a similarity classification, determining the candidate corpus whose classification probability is greater than or equal to a preset classification probability threshold as the target corpus.
Another aspect of the present application provides a computer device, including: a memory, a transceiver, a processor, and a bus system;
wherein, the memory is used for storing programs;
the processor, when executing the program in the memory, implements the methods as described above;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
Another aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-described aspects.
In another aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the network device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the network device to perform the method provided by the above aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
A corpus to be expanded is obtained; K candidate corpora are obtained according to the corpus to be expanded; the candidate corpora and the corpus to be expanded are input into a semantic recognition model to obtain K semantic recognition results; and when at least one semantic recognition result among the K semantic recognition results satisfies the corpus extraction condition, the candidate corpus corresponding to that result is determined as a target corpus, so that at least one target corpus belonging to the corpus to be expanded is obtained. In this manner, a large number of candidate corpora are obtained by mining based on the corpus to be expanded, and target corpora semantically similar to the corpus to be expanded are further mined from those candidates. More target corpora can therefore be expanded on the basis of the corpus to be expanded, enriching it, so that the corpus set is sufficiently expanded and the demand of model training for a sufficient quantity of corpora is met.
Drawings
FIG. 1 is a schematic diagram of an architecture of a corpus control system in an embodiment of the present application;
FIG. 2 is a schematic diagram of an embodiment of a method for corpus processing in an embodiment of the present application;
FIG. 3 is a schematic diagram of another embodiment of a method for corpus processing in an embodiment of the present application;
FIG. 4 is a schematic diagram of another embodiment of a method for corpus processing in an embodiment of the present application;
FIG. 5 is a schematic diagram of another embodiment of a method for corpus processing in an embodiment of the present application;
FIG. 6 is a schematic diagram of another embodiment of a method for corpus processing in an embodiment of the present application;
FIG. 7 is a schematic diagram of another embodiment of a method for corpus processing in an embodiment of the present application;
FIG. 8 is a schematic diagram of another embodiment of a method for corpus processing in an embodiment of the present application;
FIG. 9 is a schematic flow chart illustrating a method of corpus processing in an embodiment of the present application;
FIG. 10 is a schematic diagram of a model training principle of the corpus processing method in the embodiment of the present application;
FIG. 11 is a diagram illustrating a corpus tagging interface of a corpus processing method according to an embodiment of the present application;
FIG. 12 is a schematic diagram illustrating a corpus parsing principle of a corpus processing method according to an embodiment of the present application;
FIG. 13 is a schematic diagram of an embodiment of a corpus processing device in an embodiment of the present application;
FIG. 14 is a schematic diagram of an embodiment of a computer device in the embodiment of the present application.
Detailed Description
The embodiment of the application provides a corpus processing method, which mines a large number of candidate corpora based on a corpus to be expanded and further mines, from those candidates, target corpora semantically similar to the corpus to be expanded, so as to sufficiently expand the corpus to be expanded and thereby meet the demand of model training for a sufficient quantity of corpora.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that the corpus processing method provided by the present application may be applied in scenarios of intelligent voice interaction through sentence parsing. For example, a smart speaker parses a received request sentence to play or pause music; a smart television parses a received request sentence to turn on or switch a television channel; a smart watch parses a received request sentence to set an alarm. In each of the above scenarios, the solution provided in the prior art is to parse the sentence through an intention classification model, which needs to be trained with a certain amount of training corpora. However, often only a small number of seed corpora can be provided, which is insufficient to support normal training of the model, or the trained model overfits the training data, so the model generalizes poorly.
To solve the above problem, the present application provides a corpus processing method applied to the corpus control system shown in fig. 1. Referring to fig. 1, a schematic structural diagram of the corpus control system in the embodiment of the present application: the server obtains a corpus to be expanded provided by a client, obtains K candidate corpora according to the corpus to be expanded, and inputs the K candidate corpora and the corpus to be expanded into a semantic recognition model to obtain K semantic recognition results. When at least one semantic recognition result among the K results satisfies the corpus extraction condition, the candidate corpus corresponding to that result is determined as a target corpus, so that at least one target corpus belonging to the corpus to be expanded is obtained. In this manner, a large number of candidate corpora are obtained by mining based on the corpus to be expanded, target corpora semantically similar to the corpus to be expanded are further mined from those candidates, and more target corpora can be expanded on the basis of the corpus to be expanded, so that the corpus set is sufficiently expanded and the demand of model training for a sufficient quantity of corpora is met.
It is understood that the client and the server are communicatively connected. One server is shown in fig. 1, but in an actual scenario multiple servers may participate, particularly in scenarios of multi-model training interaction; the number of servers depends on the actual scenario and is not limited here.
It should be noted that, in this embodiment, the server or the transaction server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The client may be, but is not limited to, a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like.
To solve the above problem, the present application proposes a corpus processing method, which is generally performed by a server or a terminal device; accordingly, the corpus processing apparatus is generally disposed in the server or the terminal device.
It can be understood that, in the corpus processing method, related device, and equipment disclosed in the present application, multiple servers/terminal devices may be combined into a blockchain, with the servers/terminal devices serving as nodes on the blockchain. In practical applications, data can be shared between nodes on the blockchain, and the corpus set and the corpus to be expanded can be stored on each node.
Referring to fig. 2, the corpus processing method in the present application is described below; an embodiment of the corpus processing method in the embodiment of the present application includes:
in step S101, a corpus to be expanded is obtained;
In this embodiment, the corpus to be expanded generally refers to a skill corpus. As shown in fig. 11, when the skills of a dialogue system are created, the creator is required to provide a certain amount of corpora. In actual use, however, especially when a third-party creator builds skills independently, usually only a very small number of skill corpora are provided, such as "is it appropriate to move house today", "what is appropriate to do today", or "what can be done today". Training the semantic recognition model on such a small corpus tends to overfit the model, so skills created on this basis perform poorly in use. This embodiment therefore acquires, from the database, each skill corpus in the small set of skill corpora corresponding to each skill as a corpus to be expanded; the acquired corpora to be expanded can subsequently be mined to enrich the skill corpora, so that the demand of model training for a sufficient quantity of corpora is met to a certain extent.
A skill refers to an abstraction of a specific capability in a task-based dialogue system. It may be embodied, for example, as a music skill, which represents that the dialogue system can understand a short text (query) related to music, perform operations such as domain classification and parameter extraction on the short text, express the key information in the short text as structured information, and then pass this information to downstream services so that the dialogue system can make the corresponding feedback, thereby implementing an interaction process between the dialogue system and the user's speech. It may also be embodied as other skills, such as a fault skill, a telephone skill, or a time skill, which is not specifically limited here.
The short text can be embodied as a request sentence input by the user and is usually used to represent the user's expected intention, such as "first and third rains", "tell me the story of the Foolish Old Man Who Moved the Mountains", or "I want to see nothing in the movie", and the like.
Specifically, as shown in fig. 9, in the present embodiment each skill corpus in the small set of skill corpora corresponding to each skill is acquired from the database as a corpus to be expanded, so that the corpora to be expanded can subsequently be mined through natural language processing (NLP) to enrich the skill corpora.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this area therefore involves natural language, i.e., the language people use daily, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, instruction mapping, and the like.
In step S102, K corpus candidates are obtained according to the corpus to be expanded, where semantic similarity between each corpus candidate and the corpus to be expanded is greater than or equal to a similarity threshold, and K is an integer greater than 1;
In this embodiment, after the corpus to be expanded is obtained, as shown in fig. 9, candidate corpora may be retrieved from the index library of a distributed full-text search engine (ES, e.g., Elasticsearch). Specifically, the semantic similarity between the corpus to be expanded and each corpus in the index library may be computed, and candidate corpora highly similar to the corpus to be expanded are then screened out by comparing the semantic similarity with a preset similarity threshold; for example, a corpus whose semantic similarity is greater than or equal to the similarity threshold is determined as a candidate corpus. The similarity threshold may be set according to actual application requirements and is not specifically limited here.
A candidate corpus can be understood as a related corpus obtained from the index library from the viewpoint of literal hits: it may be partially similar to the corpus to be expanded at the literal level, but is not necessarily similar to it in semantics. The candidate corpora can thus be understood to include target corpora that are semantically similar to the corpus to be expanded, together with a large number of corpora that are semantically related to the corpus to be expanded but not necessarily semantically similar.
The semantic similarity between the corpus to be expanded and the corpora in the index library may be computed using, for example, the Hamming distance, cosine similarity, or the Pearson correlation coefficient; other measures, such as the Euclidean distance or the Manhattan distance, may also be used, which is not limited here.
Taking cosine similarity as an example, the semantic similarity between the corpus to be expanded and a corpus in the index library can be computed as follows: the feature words in the corpus to be expanded are extracted and converted into a text vector, the text vector of each corpus in the index library is obtained in the same way, and the text vector of the corpus to be expanded and the text vector of each corpus in the index library are then substituted into the cosine-similarity formula, yielding the semantic similarity between the corpus to be expanded and each corpus in the index library.
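The following is a minimal sketch of this computation, assuming a simple bag-of-words representation over a fixed list of feature words (the vocabulary and the word-level tokenization are illustrative assumptions, not the patent's feature extraction):

```python
import math
from collections import Counter

def text_vector(text, vocabulary):
    # Bag-of-words text vector over a fixed vocabulary of feature words.
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

vocabulary = ["what", "is", "suitable", "to", "do", "today", "can", "be", "done"]
seed = text_vector("what is suitable to do today", vocabulary)
candidate = text_vector("what can be done today", vocabulary)
print(cosine_similarity(seed, candidate))  # compared against the similarity threshold
```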
Specifically, after the corpus to be expanded is obtained, a large number of corpora whose semantic similarity to the corpus to be expanded is greater than or equal to the similarity threshold can be retrieved from the index library through the search service provided by the ES, and the retrieved corpora are then determined as the candidate corpora.
For example, suppose the corpus to be expanded is "what is suitable to do today"; the candidate corpora retrieved from the index library through the ES may be "what can be done today", "what is hot today", "what happened today", and so on.
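A sketch of this retrieval step is shown below, assuming the elasticsearch-py client (8.x style); the index name, field name, and value of K are illustrative assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def retrieve_candidates(seed_corpus, k=50):
    # Full-text relevance match against the index library: literal hits
    # that may or may not be semantically similar to the seed corpus.
    response = es.search(
        index="corpus_index",
        query={"match": {"text": seed_corpus}},
        size=k,
    )
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]

candidates = retrieve_candidates("what is suitable to do today")
```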
In step S103, the K candidate corpora and the corpus to be expanded are input into the semantic recognition model to obtain K semantic recognition results, where each semantic recognition result is a similarity score or a similarity classification, the similarity score representing the semantic similarity between a candidate corpus and the corpus to be expanded, and the similarity classification representing the category to which the semantics between a candidate corpus and the corpus to be expanded belong;
In this embodiment, the candidate corpora include both target corpora that are semantically similar to the corpus to be expanded and a large number of corpora that are semantically related to it but not necessarily similar. After the candidate corpora are obtained, the K candidate corpora and the corpus to be expanded may therefore be input into the semantic recognition model to obtain K semantic recognition results, from which a plurality of corpora semantically similar to the corpus to be expanded can subsequently be obtained accurately. The corpus to be expanded is thus meaningfully extended and enriched, so that the demand of model training for a sufficient quantity of corpora is met to a certain extent.
The semantic recognition model may specifically be a logistic regression (LR) model or a linear regression model, and other semantic recognition models, such as a recurrent neural network model or a feed-forward neural network model, may also be used; this is not limited here.
Specifically, as shown in fig. 9, after the K candidate corpora are obtained, the K candidate corpora and the corpus to be expanded may be input into the semantic recognition model. For example, inputting them into a logistic regression model yields the class probability of the category to which the semantics shared by a candidate corpus and the corpus to be expanded belong, i.e., a similarity classification; inputting them into a linear regression model scores the semantic similarity between a candidate corpus and the corpus to be expanded, i.e., a similarity score. The K candidate corpora can subsequently be filtered according to the similarity scores or similarity classifications to accurately obtain a plurality of corpora semantically similar to the corpus to be expanded, so that the corpus to be expanded is enriched and the demand of model training for a sufficient quantity of corpora is met to a certain extent.
For example, suppose the corpus to be expanded is "what song is suitable for listening to on rainy days" and the three candidate corpora are "music related to rainy days", "listening to songs on rainy days", and "what happens on rainy days". Inputting the candidate corpora and the corpus to be expanded into the logistic regression model yields semantic recognition results containing the categories to which the semantics shared by the corpus to be expanded and each candidate corpus belong, e.g., class probabilities of 0.91, 0.74, and 0.48, respectively.
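The following sketch illustrates this scoring step with scikit-learn's logistic regression; the pair-feature construction and dimensions are placeholder assumptions (random vectors standing in for the extracted corpus features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder features: one row per (seed corpus, candidate corpus) pair.
X_train = rng.normal(size=(200, 16))
y_train = rng.integers(0, 2, size=200)  # 1 = semantically similar, 0 = not

model = LogisticRegression().fit(X_train, y_train)

pair_features = rng.normal(size=(3, 16))          # e.g. three candidate pairs
probs = model.predict_proba(pair_features)[:, 1]  # class probabilities
print(probs)  # analogous to the 0.91 / 0.74 / 0.48 example above
```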
In step S104, if at least one semantic recognition result among the K semantic recognition results satisfies the corpus extraction condition, the candidate corpus corresponding to the at least one semantic recognition result is determined as a target corpus, so as to obtain at least one target corpus belonging to the corpus to be expanded.
In this embodiment, after the K semantic recognition results are obtained, the results satisfying the corpus extraction condition may be screened out of them, i.e., at least one of the K semantic recognition results satisfies the corpus extraction condition. The candidate corpus corresponding to each such result is a candidate, screened from the K candidate corpora, that has a high semantic similarity to the corpus to be expanded, namely a target corpus. The corpus to be expanded can thus be meaningfully extended while being enriched, so that the demand of model training for a sufficient quantity of corpora is met to a certain extent.
Screening the semantic recognition results that satisfy the corpus extraction condition from the K results can take the following form. If the results are K similarity classifications, i.e., K class probabilities representing the categories to which the semantics shared by each candidate corpus and the corpus to be expanded belong, a class probability greater than or equal to a preset probability threshold may be determined as a result satisfying the corpus extraction condition, while a class probability below the threshold does not satisfy it; the probability threshold may be set according to actual application requirements, e.g., 0.7, and is not specifically limited here. Alternatively, a class probability approaching 1 may be taken as satisfying the condition and one approaching 0 as not satisfying it. Other forms of expression are also possible and are not specifically limited here.
Specifically, as shown in fig. 9, after the K semantic recognition results are obtained, the K candidate corpora may be filtered according to them so as to screen out the results satisfying the corpus extraction condition. For example, several results with higher probability values may be selected from the K results, and the candidate corpora corresponding to those results determined as target corpora expressing semantics similar to the corpus to be expanded. The corpus to be expanded is thereby enriched, so that the demand of model training for a sufficient quantity of corpora is met to a certain extent.
For example, suppose the class probabilities of the categories to which the semantics shared by the corpus to be expanded "what song is suitable for listening to on rainy days" and the three candidate corpora "music related to rainy days", "listening to songs on rainy days", and "what happens on rainy days" belong are 0.91, 0.74, and 0.48, respectively, and that the preset probability threshold is 0.7. Comparing these class probabilities with the preset threshold, 0.91 and 0.74 clearly exceed 0.7, so "music related to rainy days" and "listening to songs on rainy days" can be determined as target corpora of the corpus "what song is suitable for listening to on rainy days".
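A minimal sketch of this extraction condition, using the class probabilities and the 0.7 threshold from the example above:

```python
candidates = [
    "music related to rainy days",
    "listening to songs on rainy days",
    "what happens on rainy days",
]
probabilities = [0.91, 0.74, 0.48]
PROBABILITY_THRESHOLD = 0.7

# Keep every candidate whose class probability meets the threshold.
target_corpora = [c for c, p in zip(candidates, probabilities)
                  if p >= PROBABILITY_THRESHOLD]
print(target_corpora)  # the first two candidates become target corpora
```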
The embodiment of the application thus provides a corpus processing method in which a large number of candidate corpora are obtained by mining based on the corpus to be expanded, and target corpora semantically similar to the corpus to be expanded are further mined from those candidates. A larger number of target corpora can therefore be expanded on the basis of the corpus to be expanded, so that the corpus set is sufficiently expanded and the demand of model training for a sufficient quantity of corpora is met.
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the method for corpus processing provided in the embodiment of the present application, as shown in fig. 3, the method further includes:
in step S301, a corpus sample set is obtained, where the corpus sample set includes a positive sample corpus and a negative sample corpus, and the positive sample corpus corresponds to a label;
in step S302, feature extraction is performed on the positive sample corpus to obtain positive sample corpus features, and feature extraction is performed on the negative sample corpus to obtain negative sample corpus features;
in step S303, the positive sample corpus features and the negative sample corpus features are input to a semantic recognition model to obtain a semantic prediction result;
in step S304, a semantic recognition model is trained according to the semantic prediction result and the label.
In this embodiment, after the K candidate corpora are obtained, in order to accurately obtain from them multiple target corpora similar to the semantics expressed by the corpus to be expanded, a corpus sample set is obtained, its features are extracted, and the semantic recognition model is then trained using the extracted features and annotation labels. This improves the learning capability of the semantic recognition model, so that when the K candidate corpora and the corpus to be expanded are subsequently input into the model optimized through training, more accurate semantic recognition results can be obtained.
The corpus sample set is the sample set used for training the semantic recognition model and includes positive sample corpora and negative sample corpora. A positive sample corpus is a seed corpus with an annotation label, i.e., a skill corpus, where the annotation label is a label capable of representing the semantic attributes of the corpus; a negative sample corpus is a common negative sample, which can be understood as a corpus unrelated or semantically dissimilar to the positive sample corpora. It can further be understood that, in actual use, since users can provide only a few skill corpora, the positive sample corpora are usually fewer than the negative sample corpora.
As shown in fig. 9, the feature extraction of the positive and negative sample corpora may specifically use a natural language understanding (NLU) service to extract features from the positive sample corpora and the negative sample corpora respectively. Natural language understanding techniques cover multiple areas, including sentence detection, word segmentation, part-of-speech tagging, syntactic parsing, text classification/clustering, text sentiment, information extraction/automatic summarization, machine translation, automatic question answering, text generation, and the like.
The processing of inputting the positive and negative sample corpus features into the semantic recognition model to obtain the semantic prediction result is similar to the processing in step S103 of inputting the K candidate corpora and the corpus to be expanded into the semantic recognition model to obtain the K semantic recognition results, and is not repeated here.
Training the semantic recognition model according to the semantic prediction result and the annotation label can specifically be based on a cross-entropy loss function, with the semantic prediction result and the annotation label used for back-propagation iterative training of the model.
It should be noted that, when extracting the features of the positive and negative sample corpora, considering complex features may make feature extraction time-consuming, which lengthens model training and makes the whole target-corpus parsing process complex and slow. The feature extraction of the corpus sample set, the training of the semantic recognition model, and similar processes may therefore be performed offline in this embodiment, reducing the load that would be incurred by performing them online.
Specifically, as shown in fig. 10, the corpus sample set used for training the semantic recognition model may be obtained from a database. NLU techniques may then be used to extract features from the positive and negative sample corpora in the set, so as to accurately and sufficiently obtain positive and negative sample corpus features usable as model inputs. These features are input into the semantic recognition model to obtain a semantic prediction result, and, based on a cross-entropy loss function, back-propagation iterative training is performed using the semantic prediction result and the annotation label. The parameter values are adjusted according to the errors in the back-propagation process, and the above process is iterated until convergence, thereby optimizing the model parameters.
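A sketch of this offline training loop is given below, using PyTorch and assuming a simple feed-forward classifier in place of the unspecified model architecture (feature dimension, batch size, and epoch count are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # cross-entropy loss function

features = torch.randn(64, 768)      # positive/negative sample corpus features
labels = torch.randint(0, 2, (64,))  # annotation labels (1 = positive sample)

for epoch in range(10):  # in practice, iterate until convergence
    optimizer.zero_grad()
    logits = model(features)        # semantic prediction result
    loss = loss_fn(logits, labels)
    loss.backward()                 # back-propagate the errors
    optimizer.step()                # adjust the parameter values
```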
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the method for corpus processing according to the embodiment of the present application, as shown in fig. 4, the performing feature extraction on the positive sample corpus to obtain positive sample corpus features, and performing feature extraction on the negative sample corpus to obtain negative sample corpus features includes:
in step S401, performing word segmentation processing on the positive sample corpus and the negative sample corpus respectively to obtain at least two positive sample words to be processed and at least two negative sample words to be processed;
in step S402, converting at least two positive sample words to be processed and at least two negative sample words to be processed into at least two positive sample word vectors and at least two negative sample word vectors;
in step S403, vector splicing is performed on at least two positive sample word vectors to obtain positive sample corpus features, and vector splicing is performed on at least two negative sample word vectors to obtain negative sample corpus features.
In this embodiment, after the positive and negative sample corpora are obtained, they may be preprocessed; specifically, punctuation removal and word segmentation may be performed on them. Word segmentation aims to split a continuous sentence into its constituent word units, so that understanding the corpus is converted into processing of words, improving corpus-processing efficiency. In this embodiment, a general bigram-based model may be used to segment a sentence to obtain at least two positive sample words to be processed and at least two negative sample words to be processed.
Further, as shown in fig. 10, to let the computer better understand the features in the positive and negative sample corpora so that the model can be trained better on the extracted features, this embodiment converts the at least two positive sample words to be processed and the at least two negative sample words to be processed into numeric features usable for machine learning, i.e., at least two positive sample word vectors and at least two negative sample word vectors. Specifically, the words to be processed are input into various word-vector extraction models to obtain word vectors of various dimensions. For example, when input into a Bidirectional Encoder Representations from Transformers (BERT) model, words can be converted into vectors representing basic features while the context features of the whole sentence are taken into account to a certain extent; when input into the fast text classification model fastText, the entities and some specific attributes in the words can be represented by word vectors, with the semantic features of the whole sentence taken into account to a certain extent. Other word-vector extraction models, such as ELMo, word2vec, or GloVe, may also be used and are not limited here.
Further, after the various word-vector extraction models convert the positive and negative sample words to be processed into different vector representations, at least two positive sample word vectors can be concatenated to obtain the positive sample corpus features, and at least two negative sample word vectors concatenated to obtain the negative sample corpus features.
Specifically, after the positive and negative sample corpora are obtained, to let the semantic recognition model recognize and process them more accurately and quickly, the corpora can be converted into words through word segmentation, improving processing efficiency. The resulting positive and negative sample words to be processed are input into the BERT model and the fastText model respectively, yielding vector representations of several dimensions for each positive and negative sample word. The vectors of all positive sample words are then concatenated, as are those of all negative sample words, giving positive and negative sample corpus features of increased feature dimensionality. The model can thus learn features of different dimensions more fully and more quickly, achieving more accurate prediction.
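The sketch below illustrates this feature-construction pipeline, with jieba standing in for the bigram-based segmenter and publicly available BERT and fastText models standing in for the patent's word-vector extraction models (all model names and files are assumptions):

```python
import jieba
import numpy as np
import torch
import fasttext
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")
ft = fasttext.load_model("cc.zh.300.bin")  # pretrained fastText word vectors

def corpus_features(sentence):
    words = list(jieba.cut(sentence))  # word segmentation into word units
    with torch.no_grad():
        enc = tokenizer(sentence, return_tensors="pt")
        # Mean-pooled BERT vector: context features of the whole sentence.
        bert_vec = bert(**enc).last_hidden_state.mean(dim=1).squeeze(0).numpy()
    # Averaged fastText vectors: word-level semantic features.
    ft_vec = np.mean([ft.get_word_vector(w) for w in words], axis=0)
    # Vector splicing: concatenate the two representations.
    return np.concatenate([bert_vec, ft_vec])
```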
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the method for corpus processing provided in the embodiment of the present application, as shown in fig. 5, the method further includes:
in step S501, if none of the K semantic recognition results satisfies the corpus extraction condition, the corpus to be expanded is translated into N language corpora corresponding to N languages;
in step S502, the N language corpora are translated back into at least N back-translated corpora according to the language of the corpus to be expanded.
In this embodiment, when none of the K semantic recognition results satisfies the corpus extraction condition, a large number of back-translated corpora semantically similar to the corpus to be expanded may be obtained through back-translation to enhance the corpus to be expanded, enriching it so that the demand of model training for a sufficient quantity of corpora is met to a certain extent.
Specifically, as shown in fig. 9, after the corpus to be expanded is obtained, it may be translated into N language corpora according to N predetermined languages; this can be understood as converting the corpus to be expanded into N language representations, for example converting a Chinese corpus into English, German, Spanish, French, Russian, Korean, and Japanese representations. The N language corpora are then translated back according to the language of the corpus to be expanded, which can be understood as converting the N language representations back into one language representation; for example, the English corpus is translated back into Chinese, yielding one or more Chinese corpora similar or identical to the corpus to be expanded. The N language corpora can thus be translated into at least N back-translated corpora similar to the corpus to be expanded, enriching it so that the demand of model training for a sufficient quantity of corpora is met to a certain extent.
For example, suppose the Chinese corpus to be expanded, meaning "beautiful life", is translated into the English corpus "Beautiful Life", and the English corpus "Beautiful Life" is then translated back into Chinese; at least three back-translated corpora semantically similar to the corpus to be expanded, i.e., different Chinese renderings of "beautiful life", can be obtained.
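A sketch of this back-translation procedure follows; the translate() helper is hypothetical and would wrap whatever machine-translation service or model is available:

```python
def translate(text, source_lang, target_lang):
    # Hypothetical helper: wrap an MT service or model here.
    raise NotImplementedError

def back_translate(seed_corpus,
                   pivot_langs=("en", "de", "es", "fr", "ru", "ko", "ja")):
    back_translations = set()
    for lang in pivot_langs:
        pivot = translate(seed_corpus, "zh", lang)  # corpus -> pivot language
        back = translate(pivot, lang, "zh")         # pivot -> original language
        if back != seed_corpus:
            back_translations.add(back)             # keep distinct variants
    return back_translations
```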
It should be noted that this embodiment may also adopt synonym replacement to obtain a large number of synonymous corpora semantically similar to the corpus to be expanded, further enhancing and enriching it.
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the method for corpus processing provided in the embodiment of the present application, as shown in fig. 6, the method further includes:
in step S601, at least one target corpus and the at least N back-translated corpora are determined as a plurality of corpora to be annotated;
in step S602, slot matching is performed on each corpus to be annotated to obtain a slot matching result corresponding to each corpus to be annotated, where the slot matching result is a matching similarity score that represents the semantic similarity between a preset slot and each word to be annotated in each corpus to be annotated;
in step S603, if the matching similarity score is greater than or equal to the preset matching threshold, determining a preset slot corresponding to the matching similarity score as a target slot, and determining a word to be annotated corresponding to the matching similarity score as a target slot value, where the target slot is used to represent an attribute of the target slot value;
in step S604, the corpora to be annotated and their corresponding target slots and target slot values are de-duplicated to obtain the target annotated corpora.
In this embodiment, after at least one target corpus and at least N back-translated corpora are obtained, slot matching may be performed on them; this can be understood as annotating them with semantic information, similar to label annotation, which converts unstructured corpora into structured ones. Slot annotation helps the model learn better during training or prediction, improving its prediction effect, while reducing the workload of manual annotation and the labor cost, thereby improving corpus-processing efficiency to a certain extent. A slot refers to a clearly defined attribute of an entity; it is usually used in task-based dialogue systems, represents the slot design under a specific intention, and can express the important information in a query. For example, when the skill corpus of a music skill is "I want to listen to Raining by Zhang San", the entity "Zhang San" can be represented by the slot "singer", e.g., "singer" is "Zhang San", and "Raining" can be represented by the slot "song", e.g., "song" is "Raining".
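As a minimal sketch, the structured form described above might look as follows (the field names are illustrative, not the patent's schema):

```python
parsed_query = {
    "query": "I want to listen to Raining by Zhang San",
    "skill": "music",
    "slots": [
        {"slot": "singer", "value": "Zhang San"},
        {"slot": "song", "value": "Raining"},
    ],
}
```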
Specifically, as shown in fig. 9, after the target corpora or back-translated corpora are obtained, in order for the model to recognize or parse the corpora better and more accurately, this embodiment may take at least one target corpus and at least N back-translated corpora as the corpora to be annotated. The matching similarity between each word in each corpus to be annotated and each preset slot in a preset slot library is then computed, yielding a matching similarity score between each word and each preset slot. As shown in fig. 11, a preset slot is configured according to the entity library and represents a clearly defined attribute of an entity; it may be embodied, for example, as "time" or "birth year", or as another preset slot such as "departure point", "destination", or "singer", which is not specifically limited here. A preset slot whose matching similarity score with a word is greater than or equal to the preset matching threshold can be understood as matching that word; the preset slot is then determined as a target slot and the word as a target slot value. The attribute of the slot value is thereby clearly represented through the slot, and the annotated slots and slot values help the semantic recognition model learn better, improving its recognition performance.
Furthermore, the corpora to be annotated and their corresponding target slots and target slot values can be de-duplicated and sorted to obtain clean and orderly target annotated corpora, making the target annotated corpora meaningful while avoiding redundant duplicates and wasted resources.
For example, suppose a target corpus is "query the flight from Guangzhou to Beijing on October 4th". Performing slot matching on the words to be annotated "query / October 4th / from / Guangzhou / to / Beijing / flight" yields the corresponding target slots and target slot values, such as target slot "departure time" with target slot value "October 4th", target slot "departure place" with target slot value "Guangzhou", and target slot "destination" with target slot value "Beijing".
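The sketch below illustrates this slot-matching step; the similarity function is a placeholder for the real matching-similarity score, and the slot library and threshold are illustrative assumptions:

```python
def match_similarity(word, slot_example):
    # Placeholder scorer; in practice an embedding-based similarity.
    return 1.0 if word == slot_example else 0.0

SLOT_LIBRARY = {
    "departure time": ["October 4th"],
    "departure place": ["Guangzhou"],
    "destination": ["Beijing"],
}
MATCH_THRESHOLD = 0.8

def annotate(words_to_annotate):
    annotations = []
    for word in words_to_annotate:
        for slot, examples in SLOT_LIBRARY.items():
            score = max(match_similarity(word, ex) for ex in examples)
            if score >= MATCH_THRESHOLD:  # the preset slot matches the word
                annotations.append({"slot": slot, "value": word})
    return annotations

words = ["query", "October 4th", "from", "Guangzhou", "to", "Beijing", "flight"]
print(annotate(words))  # target slots and target slot values
```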
It should be noted that, as shown in fig. 12, after the target annotated corpora are obtained, the skill corpora required for skill creation can be fully expanded. The corpus sample set augmented with the target annotated corpora can then be used to iteratively train the corpus recognition model, after which the model can be brought online; the trained model is used to parse newly received queries, improving the ability to parse new skills.
For example, as shown in table 1, when a test set of five skills such as "calendar search" is parsed with method A, e.g., a template-matching method, the precision P of method A is very high but the recall R is very low, leading to an unsatisfactory F1 value. Method B adds the corpora to be expanded and their corresponding target annotated corpora to the corpus sample set and trains the corpus recognition model on it; when tested on the same five-skill test set, the recall is greatly improved, and the overall F1 value rises substantially.
TABLE 1 (published as an image in the original document; the per-skill precision, recall, and F1 values are not reproducible as text)
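Although Table 1 survives only as an image, the qualitative claim is easy to verify: F1 is the harmonic mean of precision and recall, so a very low recall caps F1 no matter how high precision is. A minimal check, with made-up numbers standing in for the unavailable table values:

    def f1(p, r):
        # Harmonic mean of precision and recall.
        return 2 * p * r / (p + r) if (p + r) else 0.0

    print(f1(0.95, 0.20))  # high-precision, low-recall profile -> about 0.33
    print(f1(0.90, 0.85))  # balanced profile -> about 0.87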
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the method for corpus processing provided in the embodiment of the present application, as shown in fig. 7, the method further includes:
in step S701, if no semantic recognition result in the K semantic recognition results satisfies the corpus extraction condition, classifying the K candidate corpora to obtain a first candidate corpus set and a second candidate corpus set, where the first candidate corpus set includes i first candidate corpora, the second candidate corpus set includes j second candidate corpora, and i and j are integers greater than 1 and smaller than K;
in step S702, the i first candidate corpora and the j second candidate corpora are combined pairwise to obtain i × j target corpus pairs.
In this embodiment, when no semantic recognition result among the K semantic recognition results satisfies the corpus extraction condition, a corpus clustering approach may be adopted: the K candidate corpora are first divided into a first candidate corpus set and a second candidate corpus set, and the corpora across the two sets are then combined pairwise. This yields multiple target corpus pairs similar to the candidate corpora, that is, pairs similar to the corpus to be expanded, which enriches the corpus to be expanded and thereby satisfies, to a certain extent, the demand of model training for corpus quantity.
The first candidate corpus set can be understood as the set of corpora similar to the corpus to be expanded; that is, the i first candidate corpora it contains are similar to the corpus to be expanded, for example having the same word count or sharing more than half of the total words. Conversely, the second candidate corpus set can be understood as the set of corpora not similar to the corpus to be expanded; that is, the j second candidate corpora it contains are not similar to the corpus to be expanded, for example having a different word count or sharing only a few words.
Specifically, after the K candidate corpora similar to the corpus to be expanded are obtained, they may be classified, for example by the number of shared words, by word count, by another criterion, or by a combination of several criteria, which is not specifically limited here. This produces a first candidate corpus set similar to the corpus to be expanded and a second candidate corpus set not similar to it. Then, by combining the i first candidate corpora in the first set pairwise with the j second candidate corpora in the second set, i × j target corpus pairs can be obtained, which is more than the original K candidate corpora. In this way, more target corpus pairs similar to the corpus to be expanded can be mined, enriching the corpus to be expanded and satisfying, to a certain extent, the demand of model training for corpus quantity.
For example, assume there are 5 candidate corpora, and suppose a candidate whose word count equals that of the corpus to be expanded and which shares more than half of the total words with it is taken as a first candidate corpus, while a candidate that fails these conditions is taken as a second candidate corpus. If this yields 2 first candidate corpora and 3 second candidate corpora, combining each of the 2 first candidates with each of the 3 second candidates produces 6 target corpus pairs, as shown in the sketch below.
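A minimal sketch of this pairwise combination, assuming the two sets have already been built; the example sentences are placeholders:

    from itertools import product

    # i = 2 first candidate corpora, j = 3 second candidate corpora.
    first_set = ["song to hear on rainy days", "music for a rainy day"]
    second_set = ["what to do on rainy days", "rainy day plans", "rain sounds"]

    # Cross the two sets to obtain i * j target corpus pairs.
    target_pairs = list(product(first_set, second_set))
    print(len(target_pairs))  # 6 == 2 * 3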
Optionally, on the basis of the embodiment corresponding to fig. 2, in another optional embodiment of the method for corpus processing according to the embodiment of the present application, as shown in fig. 8, if at least one semantic recognition result in the K semantic recognition results meets a corpus extraction condition, determining a corpus candidate corresponding to the at least one semantic recognition result as a target corpus to obtain at least one target corpus belonging to a corpus to be expanded, includes:
in step S801, if each semantic recognition result is a similarity score, determining a candidate corpus corresponding to the similarity score greater than or equal to a preset similarity threshold as a target corpus;
in step S802, if each semantic recognition result is a similarity classification, the candidate corpus whose classification probability is greater than or equal to a preset classification probability threshold is determined as the target corpus.
In this embodiment, after the K semantic recognition results are obtained, the target corpora can be extracted quickly and accurately as follows. When each semantic recognition result is a similarity score, the candidate corpus whose similarity score is greater than or equal to a preset similarity threshold may be determined as a target corpus; the preset similarity threshold is set according to practical application requirements and is not specifically limited here. Alternatively, when each semantic recognition result is a similarity classification, the candidate corpus whose classification probability is greater than or equal to a preset classification probability threshold may be determined as a target corpus; the preset classification probability threshold is likewise set according to practical application requirements and is not specifically limited here. In this way, target corpora similar to the corpus to be expanded can be screened out quickly, improving the processing efficiency of the corpus to be expanded to a certain extent.
Specifically, when the K semantic recognition results are K similarity scores, the larger the similarity score, the higher the similarity between the candidate corpus and the corpus to be expanded. Each similarity score may therefore be compared with the preset similarity threshold, and each candidate corpus whose score is greater than or equal to the threshold is determined as a target corpus similar to the corpus to be expanded. Alternatively, when the K semantic recognition results are K similarity classifications, each classification indicates the category to which the semantics between the candidate corpus and the corpus to be expanded belong and can be represented as a category probability: the larger the probability, the more similar the categories. Each category probability may therefore be compared with the preset classification probability threshold, and each candidate corpus whose probability is greater than or equal to the threshold is determined as a target corpus similar to the corpus to be expanded.
For example, assume the similarity scores between the corpus to be expanded "what song is suitable for listening in rainy days" and the 3 candidate corpora "music related to rainy days", "song desired to be listened to in rainy days", and "what is suitable for doing in rainy days" are 94, 91, and 37, respectively, and assume the preset similarity threshold is 72. Comparing these scores with the threshold, the scores 94 and 91 clearly exceed 72, so the candidate corpora "music related to rainy days" and "song desired to be listened to in rainy days" are determined as target corpora of the corpus to be expanded "what song is suitable for listening in rainy days".
Likewise, assume the category probabilities of the categories to which the semantics between the corpus to be expanded "what song is suitable for listening in rainy days" and the same 3 candidate corpora belong are 0.91, 0.89, and 0.39, respectively, and assume the preset classification probability threshold is 0.7. Comparing these probabilities with the threshold, the probabilities 0.91 and 0.89 clearly exceed 0.7, so the candidate corpora "music related to rainy days" and "song desired to be listened to in rainy days" are determined as target corpora of the corpus to be expanded "what song is suitable for listening in rainy days".
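Both branches of steps S801 and S802 reduce to the same threshold filter. A minimal sketch, reusing the scores and thresholds from the two examples above:

    def select_targets(candidates, results, threshold):
        # Keep candidates whose similarity score (or classification
        # probability) is greater than or equal to the threshold.
        return [c for c, s in zip(candidates, results) if s >= threshold]

    candidates = ["music related to rainy days",
                  "song desired to be listened to in rainy days",
                  "what is suitable for doing in rainy days"]

    print(select_targets(candidates, [94, 91, 37], 72))         # similarity scores
    print(select_targets(candidates, [0.91, 0.89, 0.39], 0.7))  # class probabilities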
Referring to fig. 13, fig. 13 is a schematic diagram of an embodiment of a corpus processing apparatus 20 in the present application, which includes:
an obtaining unit 201, configured to obtain a corpus to be expanded;
the obtaining unit 201 is further configured to obtain K candidate corpora according to the corpora to be expanded, where semantic similarity between each candidate corpus and the corpora to be expanded is greater than or equal to a similarity threshold, and K is an integer greater than 1;
the processing unit 202 is configured to input the K candidate corpora and the corpus to be expanded into the semantic identification model to obtain K semantic identification results, where each semantic identification result is a similarity score or a similarity classification, the similarity score represents a semantic similarity between the candidate corpus and the corpus to be expanded, and the similarity classification represents a category to which semantics between the candidate corpus and the corpus to be expanded belong;
the determining unit 203 is configured to determine, if at least one semantic recognition result in the K semantic recognition results meets the corpus extraction condition, a corpus candidate corresponding to the at least one semantic recognition result as a target corpus to obtain at least one target corpus belonging to the corpus to be expanded.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the apparatus for corpus processing provided in the embodiment of the present application,
the obtaining unit 201 is further configured to obtain a corpus sample set, where the corpus sample set includes a positive sample corpus and a negative sample corpus, and the positive sample corpus corresponds to the tagging label;
the processing unit 202 is further configured to perform feature extraction on the positive sample corpus to obtain positive sample corpus features, and perform feature extraction on the negative sample corpus to obtain negative sample corpus features;
the processing unit 202 is further configured to input the positive sample corpus features and the negative sample corpus features to the semantic identification model to obtain a semantic prediction result;
the processing unit 202 is further configured to train the semantic recognition model according to the semantic prediction result and the label.
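The training flow of these four units can be sketched end to end. This is only an illustration under stated assumptions: a generic off-the-shelf binary classifier stands in for the semantic recognition model, and extract_features is a placeholder for the word-vector pipeline detailed just below.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def extract_features(corpus_pair):
        # Placeholder for segmentation -> word vectors -> concatenation;
        # a seeded random projection stands in so the sketch runs.
        seed = abs(hash(corpus_pair)) % (2 ** 32)
        return np.random.default_rng(seed).normal(size=16)

    positive_pairs = [("rainy day songs", "music for rainy days")]  # annotation label 1
    negative_pairs = [("rainy day songs", "flight to beijing")]     # label 0

    X = np.stack([extract_features(p) for p in positive_pairs + negative_pairs])
    y = np.array([1] * len(positive_pairs) + [0] * len(negative_pairs))

    # Stands in for training the semantic recognition model on the
    # semantic prediction results versus the annotation labels.
    model = LogisticRegression().fit(X, y)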
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the apparatus for corpus processing provided in the embodiment of the present application, the processing unit 202 may be specifically configured to:
respectively performing word segmentation processing on the positive sample corpus and the negative sample corpus to obtain at least two positive sample words to be processed and at least two negative sample words to be processed;
converting at least two positive sample words to be processed and at least two negative sample words to be processed into at least two positive sample word vectors and at least two negative sample word vectors;
and carrying out vector splicing on at least two positive sample word vectors to obtain positive sample corpus characteristics, and carrying out vector splicing on at least two negative sample word vectors to obtain negative sample corpus characteristics.
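A minimal sketch of this segmentation-and-splicing pipeline; the whitespace segmenter, the toy 4-dimensional vocabulary, and the zero vector for unknown words are assumptions for illustration:

    import numpy as np

    EMBED = {  # assumed toy embedding table
        "rainy": np.array([0.1, 0.3, 0.0, 0.5]),
        "day": np.array([0.2, 0.1, 0.4, 0.0]),
        "songs": np.array([0.7, 0.0, 0.2, 0.1]),
    }
    UNK = np.zeros(4)  # fallback vector for out-of-vocabulary words

    def segment(corpus):
        # Whitespace split stands in for a real word segmenter.
        return corpus.split()

    def corpus_features(corpus):
        # Word segmentation -> word vectors -> vector splicing.
        vectors = [EMBED.get(w, UNK) for w in segment(corpus)]
        return np.concatenate(vectors)

    print(corpus_features("rainy day songs").shape)  # (12,)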
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the apparatus for corpus processing provided in the embodiment of the present application,
the processing unit 202 is further configured to, if no semantic recognition result in the K semantic recognition results meets the corpus extraction condition, translate the corpus to be expanded into N language corpora corresponding to the N languages;
the processing unit 202 is further configured to translate the N language corpora back into at least N retranslated corpora according to the language of the corpus to be expanded.
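A minimal sketch of this back-translation expansion; translate is a hypothetical stand-in, since the embodiment does not name any particular machine translation service or API:

    def translate(text, src, dst):
        # Hypothetical hook: plug in any machine translation backend here.
        raise NotImplementedError

    def back_translate(corpus, src_lang, pivot_langs):
        # Translate into N pivot languages, then back into the source
        # language, yielding at least N retranslated corpora.
        retranslated = []
        for lang in pivot_langs:
            pivot = translate(corpus, src=src_lang, dst=lang)
            retranslated.append(translate(pivot, src=lang, dst=src_lang))
        return retranslated

    # back_translate("what song is suitable for listening in rainy days",
    #                "zh", ["en", "fr", "ja"])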
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the apparatus for corpus processing provided in the embodiment of the present application,
the determining unit 203 is further configured to determine at least one target corpus and at least N retranslation corpora as a plurality of to-be-annotated corpora;
the processing unit 202 is further configured to perform slot matching on each corpus to be annotated to obtain a slot matching result corresponding to each corpus to be annotated, where the slot matching result is a matching similarity score, and the matching similarity score represents a semantic similarity between a preset slot and each term to be annotated in each corpus to be annotated;
the determining unit 203 is further configured to determine, if the matching similarity score is greater than or equal to a preset matching threshold, a preset slot corresponding to the matching similarity score as a target slot, and determine, as a target slot value, a word to be annotated corresponding to the matching similarity score, where the target slot is used to represent an attribute of the target slot value;
the processing unit 202 is further configured to perform de-duplication processing on the corpus to be annotated, the target slot position corresponding to the corpus to be annotated, and the target slot position value to obtain the target annotation corpus.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the apparatus for corpus processing provided in the embodiment of the present application,
the processing unit 202 is further configured to classify the K candidate corpora if no semantic recognition result in the K semantic recognition results meets the corpus extraction condition, so as to obtain a first candidate corpus set and a second candidate corpus set, where the first candidate corpus set includes i first candidate corpora, the second candidate corpus set includes j second candidate corpora, and i and j are integers greater than 1 and smaller than K;
the processing unit 202 is further configured to combine each two of the i first candidate corpuses and the j second candidate corpuses, so as to obtain i × j target corpus pairs.
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the apparatus for corpus processing provided in the embodiment of the present application, the processing unit 202 may be specifically configured to:
if each semantic recognition result is a similarity score, determining a candidate corpus corresponding to the similarity score which is greater than or equal to a preset similarity threshold as a target corpus;
and if each semantic recognition result is a similarity classification, determining the candidate corpus corresponding to the similarity classification which is greater than or equal to a preset classification probability threshold value as the target corpus.
Another exemplary computer device is provided. As shown in fig. 14, fig. 14 is a schematic structural diagram of a computer device provided in this embodiment. The computer device 300 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 310 (e.g., one or more processors), a memory 320, and one or more storage media 330 (e.g., one or more mass storage devices) storing an application 331 or data 332. The memory 320 and the storage medium 330 may be transient or persistent storage. The program stored on the storage medium 330 may include one or more modules (not shown), each of which may include a series of instruction operations on the computer device 300. Still further, the central processing unit 310 may be configured to communicate with the storage medium 330 to execute the series of instruction operations in the storage medium 330 on the computer device 300.
The computer device 300 may also include one or more power supplies 340, one or more wired or wireless network interfaces 350, one or more input-output interfaces 360, and/or one or more operating systems 333, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The computer device 300 described above is also used to perform the steps in the embodiments corresponding to fig. 2 to 8.
Another aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the steps in the method as described in the embodiments shown in fig. 2 to 8.
Another aspect of the application provides a computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to perform the steps of the method as described in the embodiments shown in fig. 2 to 8.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes beyond the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (10)

1. A corpus processing method, comprising:
obtaining a corpus to be expanded;
acquiring K candidate corpora according to the corpora to be expanded, wherein the semantic similarity between each candidate corpora and the corpora to be expanded is greater than or equal to a similarity threshold, and K is an integer greater than 1;
inputting the K candidate corpora and the corpus to be expanded into a semantic recognition model to obtain K semantic recognition results, wherein each semantic recognition result is a similarity score or a similarity classification, the similarity score represents the semantic similarity between the candidate corpora and the corpus to be expanded, and the similarity classification represents the category to which the semantics between the candidate corpora and the corpus to be expanded belong;
if at least one semantic recognition result in the K semantic recognition results meets the corpus extraction condition, determining the candidate corpus corresponding to the at least one semantic recognition result as a target corpus to obtain at least one target corpus belonging to the corpus to be expanded.
2. The method of claim 1, further comprising:
obtaining a corpus sample set, wherein the corpus sample set comprises a positive sample corpus and a negative sample corpus, and the positive sample corpus corresponds to a labeling label;
performing feature extraction on the positive sample corpus to obtain positive sample corpus features, and performing feature extraction on the negative sample corpus to obtain negative sample corpus features;
inputting the positive sample corpus features and the negative sample corpus features into the semantic recognition model to obtain a semantic prediction result;
and training the semantic recognition model according to the semantic prediction result and the labeling label.
3. The method according to claim 2, wherein said extracting features of the positive sample corpus to obtain positive sample corpus features and extracting features of the negative sample corpus to obtain negative sample corpus features comprises:
performing word segmentation processing on the positive sample corpus and the negative sample corpus respectively to obtain at least two to-be-processed positive sample words and at least two to-be-processed negative sample words;
converting the at least two positive sample words to be processed and the at least two negative sample words to be processed into at least two positive sample word vectors and at least two negative sample word vectors;
and carrying out vector splicing on the at least two positive sample word vectors to obtain the positive sample corpus characteristics, and carrying out vector splicing on the at least two negative sample word vectors to obtain the negative sample corpus characteristics.
4. The method according to claim 1, wherein after inputting the K corpus candidates and the corpus to be expanded into the semantic recognition model to obtain K semantic recognition results, the method further comprises:
if no semantic recognition result in the K semantic recognition results meets the corpus extraction condition, translating the corpus to be expanded into N language corpora corresponding to N languages;
and translating the N language corpora into at least N retranslated corpora according to the language of the corpus to be expanded.
5. The method according to claim 4, wherein after said translating the N language corpora into at least N retranslated corpora according to the language of the corpus to be expanded, the method further comprises:
determining the at least one target corpus and the at least N retranslated corpora as a plurality of corpora to be labeled;
performing slot position matching on each linguistic data to be annotated to obtain a slot position matching result corresponding to each linguistic data to be annotated, wherein the slot position matching result is a matching similarity score which represents the semantic similarity between a preset slot position and each word to be annotated in each linguistic data to be annotated;
if the matching similarity score is larger than or equal to a preset matching threshold value, determining a preset slot position corresponding to the matching similarity score as a target slot position, and determining a word to be annotated corresponding to the matching similarity score as a target slot position value, wherein the target slot position is used for representing the attribute of the target slot position value;
and performing de-duplication processing on the linguistic data to be labeled, the target slot position corresponding to the linguistic data to be labeled and the target slot position value to obtain the target labeling linguistic data.
6. The method according to claim 1, wherein after inputting the K corpus candidates and the corpus to be expanded into the semantic recognition model to obtain K semantic recognition results, the method further comprises:
if no semantic recognition result in the K semantic recognition results meets the corpus extraction condition, classifying the K candidate corpora to obtain a first candidate corpus set and a second candidate corpus set, wherein the first candidate corpus set comprises i first candidate corpora, the second candidate corpus set comprises j second candidate corpora, and i and j are integers which are larger than 1 and smaller than K;
and combining the i first candidate corpora and the j second candidate corpora pairwise respectively to obtain i x j target corpus pairs.
7. The method according to claim 1, wherein if at least one semantic recognition result among the K semantic recognition results satisfies a corpus extraction condition, determining a corpus candidate corresponding to the at least one semantic recognition result as a target corpus to obtain at least one target corpus belonging to the corpus to be expanded, including:
if each semantic recognition result is a similarity score, determining a candidate corpus corresponding to the similarity score which is greater than or equal to a preset similarity threshold as the target corpus;
and if each semantic recognition result is a similarity classification, determining the candidate corpus whose classification probability is larger than or equal to a preset classification probability threshold as the target corpus.
8. A corpus processing apparatus, comprising:
the obtaining unit is used for obtaining the corpus to be expanded;
the obtaining unit is further configured to obtain K candidate corpora according to the corpus to be expanded, where semantic similarity between each candidate corpus and the corpus to be expanded is greater than or equal to a similarity threshold, and K is an integer greater than 1;
the processing unit is used for inputting the K candidate corpora and the corpus to be expanded into a semantic recognition model to obtain K semantic recognition results, wherein each semantic recognition result is a similarity score or a similarity classification, the similarity score represents the semantic similarity between the candidate corpora and the corpus to be expanded, and the similarity classification represents the category to which the semantics between the candidate corpora and the corpus to be expanded belong;
and the determining unit is used for determining the candidate corpus corresponding to the at least one semantic recognition result as a target corpus if at least one semantic recognition result in the K semantic recognition results meets corpus extraction conditions, so as to obtain at least one target corpus belonging to the corpus to be expanded.
9. A computer device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor, when executing the program in the memory, implementing the method of any one of claims 1 to 7;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
10. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1 to 7.
CN202110774306.2A 2021-07-08 2021-07-08 Corpus processing method, related device and equipment Pending CN113821593A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110774306.2A CN113821593A (en) 2021-07-08 2021-07-08 Corpus processing method, related device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110774306.2A CN113821593A (en) 2021-07-08 2021-07-08 Corpus processing method, related device and equipment

Publications (1)

Publication Number Publication Date
CN113821593A true CN113821593A (en) 2021-12-21

Family

ID=78924139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110774306.2A Pending CN113821593A (en) 2021-07-08 2021-07-08 Corpus processing method, related device and equipment

Country Status (1)

Country Link
CN (1) CN113821593A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049884A (en) * 2022-01-11 2022-02-15 广州小鹏汽车科技有限公司 Voice interaction method, vehicle and computer-readable storage medium
CN115879458A (en) * 2022-04-08 2023-03-31 北京中关村科金技术有限公司 Corpus expansion method, apparatus and storage medium
CN116167455A (en) * 2022-12-27 2023-05-26 北京百度网讯科技有限公司 Model training and data deduplication method, device, equipment and storage medium
CN116167455B (en) * 2022-12-27 2023-12-22 北京百度网讯科技有限公司 Model training and data deduplication method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
EP3648099B1 (en) Voice recognition method, device, apparatus, and storage medium
US10657325B2 (en) Method for parsing query based on artificial intelligence and computer device
Chen et al. Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing
US11816441B2 (en) Device and method for machine reading comprehension question and answer
CN111753060A (en) Information retrieval method, device, equipment and computer readable storage medium
US11521603B2 (en) Automatically generating conference minutes
CN113821593A (en) Corpus processing method, related device and equipment
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN103154936A (en) Methods and systems for automated text correction
US11875585B2 (en) Semantic cluster formation in deep learning intelligent assistants
CN102214189B (en) Data mining-based word usage knowledge acquisition system and method
CN112528001B (en) Information query method and device and electronic equipment
CN111666764B (en) Automatic abstracting method and device based on XLNet
US11514034B2 (en) Conversion of natural language query
US20220414463A1 (en) Automated troubleshooter
CN114003682A (en) Text classification method, device, equipment and storage medium
Hassani et al. LVTIA: A new method for keyphrase extraction from scientific video lectures
El Janati et al. Adaptive e-learning AI-powered chatbot based on multimedia indexing
Nehar et al. Rational kernels for Arabic root extraction and text classification
WO2019163642A1 (en) Summary evaluation device, method, program, and storage medium
CN113868389A (en) Data query method and device based on natural language text and computer equipment
WO2008017188A1 (en) System and method for making teaching material of language class
Sarkar et al. Bengali noun phrase chunking based on conditional random fields
Wu et al. Research on Intelligent Retrieval Model of Multilingual Text Information in Corpus
Che et al. A Chinese text correction and intention identification method for speech interactive context

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination