CN113554107A - Corpus generating method, apparatus, device, storage medium and program product - Google Patents

Corpus generating method, apparatus, device, storage medium and program product

Info

Publication number
CN113554107A
CN113554107A
Authority
CN
China
Prior art keywords
corpus
word vector
word
initial
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110857190.9A
Other languages
Chinese (zh)
Inventor
秦行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
ICBC Technology Co Ltd
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
ICBC Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC, ICBC Technology Co Ltd filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110857190.9A priority Critical patent/CN113554107A/en
Publication of CN113554107A publication Critical patent/CN113554107A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The disclosure provides a corpus generation method applicable to the technical field of artificial intelligence. The method comprises the following steps: acquiring an initial corpus; performing word segmentation on the initial corpus, and determining a word vector for each word in the segmented initial corpus to obtain at least one first word vector; acquiring, from a preset word vector set, word vectors whose similarity to the at least one first word vector satisfies a preset condition; and performing word vector replacement on the initial corpus using the word vectors acquired from the word vector set to generate a target corpus set, wherein in the generated target corpus set the word order of at least one corpus differs from the word order of the initial corpus. The disclosure also provides a corpus generation apparatus, device, storage medium and program product.

Description

Corpus generating method, apparatus, device, storage medium and program product
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a corpus generation method, apparatus, device, storage medium, and program product.
Background
In natural language processing, a semantic similarity model is an algorithm for finding sentences that are highly similar to a given sentence, and it is commonly used in scenarios such as question-answering robots.
At present, a semantic similarity model is trained on a large number of manually labeled corpora. These corpora usually consist of a large number of sentence pairs together with manually labeled similarities, such as the pair "how to pay social insurance" and "how is social insurance paid".
However, a semantic similarity model requires a huge amount of corpora, usually at least 10 GB of data, so collecting a large number of corpora becomes a difficult problem.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a corpus generation method, apparatus, device, storage medium, and program product capable of quickly generating a large number of corpora.
According to a first aspect of the present disclosure, there is provided a corpus generation method, including:
acquiring an initial corpus;
performing word segmentation on the initial corpus, and determining a word vector of each word from the initial corpus after word segmentation to obtain at least one first word vector;
acquiring, from a preset word vector set, word vectors whose similarity to the at least one first word vector satisfies a preset condition;
and performing word vector replacement on the initial corpus using word vectors acquired from the word vector set to generate a target corpus set, wherein in the generated target corpus set the word order of at least one corpus differs from the word order of the initial corpus.
According to an embodiment of the present disclosure, the preset condition includes a first preset similarity and a second preset similarity, and the step of obtaining, from a preset word vector set, a word vector whose similarity with the at least one first word vector satisfies the preset condition includes:
obtaining, from the word vector set, word vectors whose similarity to the at least one first word vector is greater than a first preset similarity, to obtain at least one second word vector; and
obtaining, from the word vector set, word vectors whose similarity to the at least one first word vector is less than a second preset similarity, to obtain at least one third word vector;
the step of performing word vector replacement on the initial corpus to generate a target corpus by using the word vectors acquired from the word vector set includes:
and replacing the first word vector in the initial corpus by using the at least one second word vector and the at least one third word vector to generate a target corpus set.
According to an embodiment of the present disclosure, the step of replacing the first word vector in the initial corpus with the at least one second word vector and the at least one third word vector to generate the target corpus includes:
replacing at least one first word vector in the initial corpus with the at least one second word vector to obtain a first corpus set, and replacing at least one first word vector in the initial corpus with the at least one third word vector to obtain a second corpus set;
performing word order exchange on each corpus in the first corpus set to obtain a third corpus set, and performing word order exchange on each corpus in the second corpus set to obtain a fourth corpus set, wherein each corpus before the word order exchange has the same semantics as the corresponding corpus after the exchange;
and forming the target corpus set from the initial corpus, the corpora in the first corpus set, the corpora in the second corpus set, the corpora in the third corpus set and the corpora in the fourth corpus set.
According to an embodiment of the present disclosure, the step of replacing the first word vector in the initial corpus with the at least one second word vector and the at least one third word vector to generate the target corpus includes:
performing word order exchange on the initial corpus, wherein the initial corpus before the word order exchange has the same semantics as the initial corpus after the exchange;
replacing at least one first word vector in the initial corpus before the word order exchange, and at least one first word vector in the initial corpus after the exchange, with the at least one second word vector to obtain a fifth corpus set;
replacing at least one first word vector in the initial corpus before the word order exchange, and at least one first word vector in the initial corpus after the exchange, with the at least one third word vector to obtain a sixth corpus set;
and forming the target corpus set from the initial corpus before the word order exchange, the initial corpus after the word order exchange, the corpora in the fifth corpus set and the corpora in the sixth corpus set.
According to an embodiment of the present disclosure, the step of segmenting the initial corpus and determining a word vector of each word from the segmented initial corpus to obtain at least one first word vector includes:
determining a part of speech of each of the at least one first word vector;
the step of replacing at least one first word vector in the initial corpus with the at least one second word vector comprises:
for each of the at least one first word vector, determining a second word vector matching the part of speech of the first word vector from the at least one second word vector, and replacing the first word vector with the second word vector;
the step of replacing at least one first word vector in the initial corpus with the at least one third word vector comprises:
for each of the at least one first word vector, a third word vector that matches the part of speech of the first word vector is determined from the at least one third word vector, and the first word vector is replaced with the third word vector.
According to an embodiment of the present disclosure, the corpus generating method further includes a step of establishing the word vector set, and the step of establishing the word vector set includes:
acquiring text data;
performing word segmentation on the text data, and determining a word vector of each word in the text data after word segmentation;
and forming the word vector set by using the word vector of each word in the text data.
A second aspect of the present disclosure provides an apparatus for generating a corpus, including:
the first obtaining module is used for obtaining the initial corpus;
the processing module is used for segmenting the initial corpus and determining a word vector of each word from the segmented initial corpus to obtain at least one first word vector;
the second acquisition module is used for acquiring word vectors of which the similarity with the at least one first word vector meets a preset condition from a preset word vector set;
and the corpus generating module is configured to replace a first word vector in the initial corpus with a word vector acquired from the word vector set to generate a target corpus set, wherein the word order of at least one corpus in the generated target corpus set differs from the word order of the initial corpus.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; a memory for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors are caused to execute the method for generating a corpus as described above.
The fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions, which when executed by a processor, cause the processor to execute the above-mentioned corpus generation method.
The fifth aspect of the present disclosure also provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the corpus generating method described above.
One or more of the above-described embodiments may provide the following advantages or benefits:
by adopting the corpus generation method of the embodiments of the disclosure, a large number of corpora (namely, the target corpus set) can be automatically generated from the initial corpus through word vector replacement and similar operations; the process is efficient and not error-prone. Moreover, because the word order of the initial corpus can be adjusted when the target corpus set is generated, the generated corpora are more varied, which facilitates the training of a similarity model.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which proceeds with reference to the accompanying drawings, in which:
FIG. 1 is a diagram schematically illustrating an application scenario of a corpus generation method, apparatus, device, storage medium and program product according to an embodiment of the present disclosure;
FIG. 2 schematically shows one of the flow charts of a corpus generation method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a second flowchart of a corpus generation method according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart for segmenting the initial corpus in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow diagram for obtaining a word vector from a set of word vectors according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow diagram for generating a target corpus in accordance with an embodiment of the present disclosure;
FIG. 7a is a schematic diagram illustrating one of the detailed flow charts for generating a target corpus according to an embodiment of the present disclosure;
FIG. 7b schematically illustrates a second specific flowchart for generating a target corpus according to an embodiment of the present disclosure;
fig. 8 is a block diagram schematically illustrating a structure of a corpus generation apparatus according to an embodiment of the present disclosure;
fig. 9 schematically shows a block diagram of an electronic device adapted to implement a corpus generation method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
It should be noted that, in the technical solution of the present disclosure, the acquisition, storage, application, and the like of the personal information of the related users all conform to the relevant laws and regulations, and necessary security measures are taken without violating public order and good customs.
At present, in the related art, a corpus for training a similarity model is usually obtained by manually collecting or rewriting existing text information, such as users' questions, messages, and feedback on a system. This manual work is inefficient and error-prone.
In view of this, an embodiment of the present disclosure provides a corpus generation method in the field of artificial intelligence, which specifically includes: acquiring an initial corpus; performing word segmentation on the initial corpus, and determining a word vector for each word in the segmented initial corpus to obtain at least one first word vector; obtaining, from a preset word vector set, word vectors whose similarity to the at least one first word vector satisfies a preset condition; and performing word vector replacement on the initial corpus using the word vectors obtained from the word vector set to generate a target corpus set, wherein in the generated target corpus set the word order of at least one corpus differs from the word order of the initial corpus.
Compared with manual corpus collection in the related art, the corpus generation method of the embodiments of the disclosure can automatically generate a large number of corpora (namely, the target corpus set) from the initial corpus through word vector replacement and similar operations; it is efficient and not error-prone. By also adjusting the word order of the initial corpus when generating the target corpus set, it produces a richer variety of corpora, which facilitates the training of a similarity model.
Fig. 1 schematically illustrates an application scenario diagram of a corpus generation method, apparatus, device, storage medium and program product according to an embodiment of the present disclosure, and as shown in fig. 1, an application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the corpus generation method provided by the embodiment of the present disclosure may be generally executed by the server 105. Accordingly, the apparatus for generating corpus provided by the embodiment of the present disclosure may be generally disposed in the server 105. The corpus generation method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the apparatus for generating corpus provided in the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The corpus generation method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 7b based on the scenario described in fig. 1.
Fig. 2 schematically shows one of flowcharts of a corpus generation method according to an embodiment of the present disclosure, and as shown in fig. 2, the corpus generation method of this embodiment includes steps S210 to S240.
In step S210, an initial corpus is obtained.
In the disclosed embodiments, a corpus is a piece of linguistic data. A corpus is the basic unit of a corpus set and may include text information; for example, "help me turn on the air conditioner in the vehicle" is a corpus. The initial corpus may be a corpus entered by a user. The embodiments of the disclosure generate a target corpus set from the initial corpus; when a user inputs several initial corpora, a target corpus set corresponding to each initial corpus can be generated, and these corpus sets can form a corpus collection applicable to tasks such as model training, for example semantic similarity model training.
In step S220, the initial corpus is segmented, and a word vector of each word is determined from the segmented initial corpus to obtain at least one first word vector.
Word segmentation is the process of splitting a corpus into target words, i.e., recombining a continuous character sequence (the corpus) into a word sequence according to a certain standard. For example, segmenting the initial corpus "how to turn on the air conditioner in the vehicle" yields the five target words (how to) (turn on) (vehicle) (inside) (air conditioner). A word vector, i.e., the vector representation of a word, may be calculated by, for example, the word2vec algorithm or other algorithms that will occur to those skilled in the art, which is not limited herein. Performing word vector calculation on the five target words yields five first word vectors.
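The segmentation and word-vector lookup of step S220 can be sketched as follows. This is a minimal, stdlib-only Python illustration: whitespace splitting stands in for a real segmenter such as jieba, and the small EMBEDDINGS table is a hypothetical stand-in for vectors trained with word2vec.

```python
# Sketch of step S220: segment a corpus, then look up one word vector per word.
# EMBEDDINGS is an illustrative toy table, not trained embeddings.
EMBEDDINGS = {
    "how": [0.1, 0.2], "to": [0.0, 0.1], "turn": [0.3, 0.1],
    "on": [0.2, 0.0], "the": [0.0, 0.0], "air": [0.5, 0.4],
    "conditioner": [0.6, 0.5], "in": [0.1, 0.0], "vehicle": [0.4, 0.2],
}

def first_word_vectors(corpus: str) -> list:
    """Return (word, vector) pairs: the 'first word vectors' of the corpus."""
    words = corpus.lower().split()  # toy word segmentation
    return [(w, EMBEDDINGS[w]) for w in words if w in EMBEDDINGS]

pairs = first_word_vectors("how to turn on the air conditioner in the vehicle")
```

A real pipeline would segment Chinese text with jieba and fetch vectors from a trained word2vec model; the structure of the output, a list of (word, vector) pairs, is the same.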
In step S230, a word vector whose similarity to at least one first word vector satisfies a preset condition is obtained from a preset set of word vectors.
In the embodiment of the present disclosure, the word vector set may be built from existing text information; the specific manner is described in detail below and not repeated here. The preset condition may be determined according to actual needs. For example, the preset condition may be that the similarity between a word vector and the first word vector is higher than a certain set value, i.e., word vectors similar to the first word vector are found in the word vector set; or that the similarity is lower than a certain set value, i.e., word vectors dissimilar to the first word vector are found. For example, suppose the word vector set contains the word vectors of the three words "television", "humidifier" and "air conditioner" (for convenience of description, the word vector of "X" is hereinafter denoted "Xa"; for example, the word vector of "air conditioner" is "air conditioner a"). For the first word vector of the word "air conditioner" among the five target words above, the similar word vector in the set is "air conditioner a", and the dissimilar word vectors are "television a" and "humidifier a".
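The similarity filtering of step S230 can be sketched with a cosine-similarity check over a toy vector set. The two-dimensional vectors below are invented for illustration; a real system would compare trained word2vec embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy word vector set; the signs are chosen so that "air conditioner"
# is similar to the query and the other two words are not.
VECTOR_SET = {"television": [1.0, -1.0], "humidifier": [-1.0, 1.0],
              "air conditioner": [1.0, 1.0]}

def matching_vectors(first_vec, threshold=0.5, similar=True):
    """Words whose similarity to first_vec is above (or, if similar=False,
    not above) the threshold -- the 'preset condition' of step S230."""
    return [w for w, v in VECTOR_SET.items()
            if (cosine(first_vec, v) > threshold) == similar]
```

With a first word vector close to "air conditioner", the similar set contains only "air conditioner" while "television" and "humidifier" fall into the dissimilar set, mirroring the example in the text.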
In step S240, word vector replacement is performed on the initial corpus using the word vectors obtained from the word vector set to generate a target corpus set, wherein in the generated target corpus set the word order of at least one corpus differs from the word order of the initial corpus.
In this embodiment of the present disclosure, any first word vector in the initial corpus may be replaced by a word vector, obtained from the word vector set, whose similarity to that first word vector satisfies the preset condition. For example, the initial corpus is "how to turn on an air conditioner in a vehicle", and the word vectors acquired from the word vector set whose similarity to the first word vector "air conditioner a" satisfies the preset condition are "air conditioner a", "television a" and "humidifier a". The corpora obtained after performing word vector replacement on the initial corpus are then: "how to turn on an in-vehicle air conditioner", "how to turn on an in-vehicle television", and "how to turn on an in-vehicle humidifier".
In the embodiment of the present disclosure, the word order may be adjusted in addition to the word vector replacement. For example, for the three corpora "how to turn on the in-car air conditioner", "how to turn on the in-car television" and "how to turn on the in-car humidifier", the corpora obtained after the word order adjustment may be "the in-car air conditioner, how to turn it on", "the in-car television, how to turn it on" and "the in-car humidifier, how to turn it on".
The corpora generated above constitute the target corpus set; that is, the target corpus set may include: "how to turn on the in-car air conditioner", "how to turn on the in-car television", "how to turn on the in-car humidifier", "the in-car air conditioner, how to turn it on", "the in-car television, how to turn it on" and "the in-car humidifier, how to turn it on". In other words, the embodiment of the present disclosure generates five new corpora from one initial corpus.
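The replace-then-reorder expansion described above can be sketched as follows. The reordering here is a deliberately naive "move the last word to the front" rule, an assumption for illustration only; a real implementation would use syntactic labels (as described later) to produce semantics-preserving word order exchanges.

```python
def generate_target_corpus(initial: str, target_word: str, replacements: list) -> list:
    """Sketch of step S240: substitute target_word with each candidate,
    then add a naively reordered variant of every replaced corpus."""
    replaced = [initial.replace(target_word, r) for r in replacements]

    def reorder(sentence: str) -> str:
        # toy word order exchange: move the final word to the front
        words = sentence.split()
        return " ".join([words[-1]] + words[:-1])

    return replaced + [reorder(s) for s in replaced]

corpus = generate_target_corpus(
    "how to turn on the in-car air conditioner",
    "air conditioner",
    ["air conditioner", "television", "humidifier"])
```

One initial corpus thus expands into six corpora (three replacements plus three reordered variants), matching the five-new-corpora count in the text once the unchanged original is discounted.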
By adopting the corpus generation method of the embodiments of the disclosure, a large number of corpora (namely, the target corpus set) can be automatically generated from the initial corpus through word vector replacement and similar operations; the process is efficient and not error-prone. Moreover, because the word order of the initial corpus can be adjusted when the target corpus set is generated, the generated corpora are more varied, which facilitates the training of a similarity model.
The corpus generation method according to the embodiment of the present disclosure will be further described with reference to figs. 2 to 7b. The method comprises two schemes: the first performs word vector replacement first and then word order exchange; the second performs word order exchange first and then word vector replacement. The first scheme is described below.
Specifically, fig. 3 schematically shows a second flowchart of the corpus generation method according to the embodiment of the present disclosure, and as shown in fig. 3, the corpus generation method according to the embodiment of the present disclosure further includes a step of establishing a word vector set, and the step of establishing the word vector set includes steps S310 to S330.
In step S310, text data is acquired.
In the embodiment of the present disclosure, the text data may be obtained in various ways: for example, by a web crawler, i.e., a program or script that automatically captures web information according to certain rules; by downloading from a data platform; or by manual collection. The manner of acquiring the text data is not particularly limited in the present application.
In step S320, the text data is segmented, and a word vector of each word in the segmented text data is determined.
In the embodiment of the present disclosure, the text data may be segmented by a word segmentation tool, for example the jieba word segmentation tool, although other word segmentation tools that will occur to those skilled in the art may also be used, which is not limited herein. The word vector of each word may be determined by the word2vec algorithm, although other algorithms that will occur to those skilled in the art may also be used, which is likewise not limited herein.
In step S330, a word vector set is formed with the word vector of each word in the text data.
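Steps S310 to S330 can be sketched as below. To keep the example self-contained, toy co-occurrence counts stand in for word2vec training and whitespace splitting stands in for jieba segmentation; both substitutions are assumptions for illustration.

```python
def build_word_vector_set(texts, window=2):
    """Sketch of steps S310-S330: segment each text and build one vector
    per word. Co-occurrence counts within a context window stand in for
    word2vec training."""
    vocab = sorted({w for t in texts for w in t.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for t in texts:
        words = t.split()  # toy word segmentation
        for i, w in enumerate(words):
            # count each neighbor within the window as a vector component
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    vectors[w][index[words[j]]] += 1.0
    return vectors

vecs = build_word_vector_set(["turn on the air conditioner",
                              "turn on the television"])
```

The returned dictionary, word to vector, is the "word vector set" that the replacement steps later query.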
Fig. 4 schematically illustrates a flowchart of segmenting words in an initial corpus according to an embodiment of the present disclosure, and as shown in fig. 4, in some specific embodiments, step S220 includes step S221.
In step S221, a part of speech of each of the at least one first word vector is determined.
In the embodiment of the disclosure, a word segmentation tool can assign a part-of-speech tag to each word while performing word segmentation. The word classes of modern Chinese can be divided into four major classes: content words, function words, interjections and onomatopoeia. Content words (words with actual meaning that can independently serve as sentence components, i.e., words with both lexical and grammatical meaning) comprise nominal words, predicates, modifying words and pronouns: nominal words include nouns, numerals and measure words; predicates include verbs and adjectives; modifying words include attributive words and adverbs. Pronouns are content words independent of the other three groups; their main function is substitution, and they can substitute for nouns, numerals, measure words, verbs, adjectives and adverbs, with different grammatical functions depending on what they substitute. Function words comprise relational words and auxiliary words: relational words include conjunctions and prepositions, and auxiliary words include particles and modal words. Onomatopoeia and interjections are special words that are neither content words nor function words; they are characterized by usually having no structural relationship with other words in a sentence. Parts of speech are classified according to the grammatical function of a word; that is, the part of speech reflects a word's grammatical features. For example, after the segmentation result (how) (open) (car) (inside) (air conditioner) is assigned part-of-speech tags, it becomes (how [r]) (open [v]) (car [n]) (inside [n]) (air conditioner [n]), where r denotes a pronoun, v a verb and n a noun.
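The part-of-speech-matched replacement that later steps rely on can be sketched as follows: a candidate word replaces a target word only when their tags agree. POS_TAGS is a hypothetical lexicon; in a real system the tags would come from a segmenter such as jieba.

```python
# Hypothetical part-of-speech lexicon: v = verb, n = noun, d = adverb.
POS_TAGS = {"open": "v", "close": "v", "car": "n", "air conditioner": "n",
            "television": "n", "quickly": "d"}

def pos_matched(word: str, candidates: list) -> list:
    """Keep only candidates whose part of speech matches that of `word`."""
    tag = POS_TAGS.get(word)
    return [c for c in candidates if POS_TAGS.get(c) == tag]
```

Filtering this way prevents, for example, the noun "air conditioner" from being replaced by the adverb "quickly", keeping the generated corpora grammatical.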
Optionally, in the embodiment of the present disclosure, the syntax of the initial corpus may be further analyzed by a syntactic analysis tool to assign syntactic labels to the initial corpus, for use in the word order exchange in a subsequent step. For example, the syntactic analysis tool may be the LTP tool. The syntactic labels may mainly include: subject-verb (SBV), verb-object (VOB), indirect object (IOB), fronting object (FOB), double (DBL), attribute (ATT), adverbial (ADV), complement (CMP), coordinate (COO), preposition-object (POB), left adjunct (LAD), right adjunct (RAD), independent structure (IS), and head (HED). For example, assigning syntactic labels to the part-of-speech tagging result (how [r]) (open [v]) (car [n]) (inside [n]) (air conditioner [n]) yields (how [r] [HED]) (open [v] [VOB]) (car [n] [ATT]) (inside [n] [ATT]) (air conditioner [n] [VOB]).
Fig. 5 schematically shows a flowchart for obtaining a word vector from a word vector set according to an embodiment of the present disclosure, and as shown in fig. 5, step S230 includes steps S231 to S232.
In step S231, a word vector having a similarity greater than a first preset similarity with at least one first word vector is obtained from the word vector set to obtain at least one second word vector.
In step S232, a word vector having a similarity smaller than a second preset similarity with at least one first word vector is obtained from the word vector set, so as to obtain at least one third word vector.
In the embodiment of the present disclosure, word2vec may be adopted to convert each word in the initial corpus into a first word vector, and the similarity between each first word vector and each word vector in the word vector set may then be calculated. For example, if the first word vector corresponds to "air conditioner" and the word vector set contains "television", "humidifier" and "air-conditioning", the word2vec similarity between "humidifier" and "air conditioner" may be 0.035, that between "television" and "air conditioner" may be 0.413, and that between "air-conditioning" and "air conditioner" may be 0.928.
Optionally, the first preset similarity may be set to 0.5 or above, so that each second word vector is a word vector similar to a first word vector; the second preset similarity may be set to 0.5 or below, so that each third word vector is a word vector dissimilar to a first word vector.
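Steps S231 and S232 can be sketched together with a word2vec-style cosine similarity. The three-dimensional vectors below are toy stand-ins for real embeddings, not actual word2vec output, and both thresholds take the 0.5 value suggested above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_by_similarity(first_vec, vector_set, hi=0.5, lo=0.5):
    """Partition a word-vector set into similar (second) word vectors and
    dissimilar (third) word vectors relative to one first word vector."""
    similar, dissimilar = [], []
    for word, vec in vector_set.items():
        sim = cosine(first_vec, vec)
        if sim > hi:
            similar.append(word)       # candidate second word vector
        elif sim < lo:
            dissimilar.append(word)    # candidate third word vector
    return similar, dissimilar

# Toy 3-dimensional embeddings standing in for word2vec output.
first = [1.0, 0.2, 0.0]                       # "air conditioner"
vectors = {
    "air-conditioning": [0.9, 0.3, 0.1],      # near-synonym -> high similarity
    "television": [0.2, 0.95, 0.0],           # related but different appliance
    "humidifier": [0.0, 0.1, 1.0],            # unrelated -> low similarity
}
similar, dissimilar = split_by_similarity(first, vectors)
print(similar, dissimilar)
# ['air-conditioning'] ['television', 'humidifier']
```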
Fig. 6 schematically shows a flowchart of generating a target corpus according to an embodiment of the present disclosure, and as shown in fig. 6, step S240 includes step S241.
In step S241, the first word vector in the initial corpus is replaced by at least one second word vector and at least one third word vector to generate a target corpus.
In the embodiment of the present disclosure, the second word vector and the third word vector may be used to replace the first word vector in the initial corpus; or replacing a part of the first word vectors in the initial corpus with the second word vectors, and replacing another part of the first word vectors in the initial corpus with the third word vectors. Optionally, in this embodiment of the present disclosure, a manner of replacing the first word vector in the initial corpus with the second word vector and the third word vector, respectively, is adopted.
Specifically, fig. 7a schematically illustrates one of the specific flowcharts for generating the target corpus according to the embodiment of the disclosure, and as shown in fig. 7a, in some specific embodiments, the step S241 includes steps S2411 to S2415.
In step S2411, at least one first word vector in the initial corpus is replaced with at least one second word vector to obtain a first corpus.
In step S2412, at least one first word vector in the initial corpus is replaced with at least one third word vector to obtain a second corpus.
In the embodiment of the disclosure, since the second word vectors are similar to the first word vectors, the corpora in the first corpus are similar to the initial corpus; for example, if the initial corpus is "how to turn on the in-vehicle air conditioner", a corpus in the first corpus may be "how to switch on the in-vehicle air-conditioning". The third word vectors are dissimilar to the first word vectors, so the corpora in the second corpus are dissimilar to the initial corpus; for example, for the initial corpus "how to turn on the in-vehicle air conditioner", a corpus in the second corpus may be "how to turn on the in-vehicle television". Thus, the embodiment of the present disclosure can generate both similar positive corpora (the first corpus) and dissimilar negative corpora (the second corpus) from the initial corpus, so that the generated corpora are richer.
It should be noted that the embodiment of the present disclosure only exemplarily describes replacing "air conditioner" in the initial corpus with the synonym "air-conditioning"; other words in the initial corpus may likewise be replaced, for example replacing "turn on" with "switch on", and so on, which the embodiment of the present disclosure does not enumerate further. Optionally, in the embodiment of the present disclosure, when the initial corpus is obtained, part-of-speech tags are assigned to the word vectors in the initial corpus by using the jieba word segmentation tool, so that the word vectors in the initial corpus can be replaced according to part of speech.
In some embodiments, step S2411 includes: for each of at least one first word vector, a second word vector matching the part of speech of the first word vector is determined from at least one second word vector, and the first word vector is replaced with the second word vector.
In the embodiment of the present disclosure, a second word vector matching the part of speech of a first word vector may refer to a second word vector in the at least one second word vector having the same part of speech as that first word vector. Illustratively, the initial corpus is "how to turn on the in-vehicle air conditioner", and the at least one second word vector includes "how about", "switch on", "start" and "air-conditioning". "How about" has the same part of speech as the first word vector "how" in the initial corpus, so "how" may be replaced with "how about"; "switch on" and "start" have the same part of speech as the first word vector "turn on", so "turn on" may be replaced with "switch on" or "start"; "air-conditioning" has the same part of speech as the first word vector "air conditioner", so "air conditioner" may be replaced with "air-conditioning". Therefore, the corpora generated in the first corpus may include "how about switching on the in-vehicle air-conditioning" and the like.
In some embodiments, step S2412 includes: for each of the at least one first word vector, a third word vector is determined from the at least one third word vector that matches the part of speech of the first word vector, and the first word vector is replaced with the third word vector.
In the embodiment of the present disclosure, a third word vector matching the part of speech of a first word vector may refer to a third word vector in the at least one third word vector having the same part of speech as that first word vector. Illustratively, the initial corpus is "how to turn on the in-vehicle air conditioner", and the at least one third word vector includes "turn off" and "television", where "turn off" has the same part of speech as the first word vector "turn on" in the initial corpus, and "television" has the same part of speech as the first word vector "air conditioner" in the initial corpus; thus, the corpora generated in the second corpus may include "how to turn off the in-vehicle television" and the like.
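The part-of-speech-matched replacement of steps S2411 and S2412 can be sketched as follows. The token/POS pairs and the candidate replacements are illustrative assumptions, not output of the actual tool chain; each candidate is applied only when its part of speech matches the token it replaces.

```python
# Initial corpus as (word, part-of-speech) pairs: r = pronoun, v = verb, n = noun.
initial = [("how", "r"), ("turn on", "v"),
           ("in-vehicle", "n"), ("air conditioner", "n")]

def replace_by_pos(tokens, candidates):
    """candidates maps an original word to (replacement, replacement_pos).
    A replacement is applied only when its part of speech matches the
    original token's; otherwise the token is kept unchanged."""
    out = []
    for word, pos in tokens:
        repl, repl_pos = candidates.get(word, (None, None))
        out.append(repl if repl_pos == pos else word)
    return " ".join(out)

# Similar (second) word vectors yield a positive corpus;
# dissimilar (third) word vectors yield a negative corpus.
positive = replace_by_pos(initial, {"turn on": ("switch on", "v"),
                                    "air conditioner": ("air-conditioning", "n")})
negative = replace_by_pos(initial, {"turn on": ("turn off", "v"),
                                    "air conditioner": ("television", "n")})
print(positive)  # how switch on in-vehicle air-conditioning
print(negative)  # how turn off in-vehicle television
```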
It should be noted that, in the embodiment of the present disclosure, when performing word vector replacement on the initial corpus, all word vectors in the initial corpus may be replaced, or partial word vectors may also be replaced, which may be determined specifically according to actual needs, and is not limited herein.
In step S2413, performing order swapping on each corpus in the first corpus to obtain a third corpus.
In step S2414, performing order swapping on each corpus in the second corpus to obtain a fourth corpus.
In step S2413 and step S2414, the semantic meanings of the corpus before the language order exchange and the semantic meanings of the corpus after the language order exchange are the same.
In the embodiment of the present disclosure, the semantics refers to the meaning of a corpus as a sentence, and the word order refers to the sequence in which the words of the corpus are arranged according to certain rules. Since syntactic labels may be assigned to the initial corpus in step S221, the first corpus and the second corpus generated from the initial corpus may also carry corresponding syntactic labels, and according to these syntactic labels the word order can be exchanged without changing the semantics. For example, if the initial corpus is "how to turn on the in-vehicle air conditioner" and the first corpus includes "how to switch on the in-vehicle air-conditioning", exchanging the word order without changing the semantics may yield "the in-vehicle air-conditioning, how to switch it on". As described above, the second corpus may include "how to turn off the in-vehicle television", and exchanging the word order without changing the semantics may yield "the in-vehicle television, how to turn it off".
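One possible realization of a semantics-preserving word-order exchange guided by syntactic labels is to topicalize the object phrase, i.e. move the final VOB head together with its immediately preceding ATT modifiers to the front of the sentence. The rule below is an assumption for illustration, not the patent's prescribed algorithm; the labels follow the LTP convention cited earlier.

```python
def topicalize_object(tagged):
    """tagged: list of (word, dependency_label) pairs.
    Moves the object phrase (last VOB token plus its preceding ATT
    modifiers) to the front, returning the reordered word list."""
    # Find the last VOB token: the head of the object phrase.
    i = max(idx for idx, (_, dep) in enumerate(tagged) if dep == "VOB")
    # Extend the phrase leftward over contiguous ATT modifiers.
    j = i
    while j > 0 and tagged[j - 1][1] == "ATT":
        j -= 1
    phrase = [w for w, _ in tagged[j:i + 1]]
    rest = [w for w, _ in tagged[:j]] + [w for w, _ in tagged[i + 1:]]
    return phrase + rest

# Labelled sentence from the example above (assumed labelling).
sentence = [("how", "HED"), ("turn on", "VOB"),
            ("in-vehicle", "ATT"), ("air conditioner", "VOB")]
print(" ".join(topicalize_object(sentence)))
# in-vehicle air conditioner how turn on
```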
In step S2415, a target corpus is formed by the initial corpus, the corpus (positive corpus) in the first corpus, the corpus (negative corpus) in the second corpus, the corpus (positive corpus) in the third corpus, and the corpus (negative corpus) in the fourth corpus.
As described above, the initial corpus is "how to turn on the in-vehicle air conditioner"; corpora such as "how to switch on the in-vehicle air-conditioning", "how to turn off the in-vehicle television", "the in-vehicle air-conditioning, how to switch it on" and "the in-vehicle television, how to turn it off" can be obtained through steps S2411 to S2414, and these corpora together form the final target corpus through step S2415.
In summary, the embodiment of the present disclosure can automatically generate positive and negative corpora from an initial corpus and adjust the word order of the corpora. Compared with the manual corpus-collection approach in the related art, the embodiment of the present disclosure offers higher working efficiency and richer corpus types. Moreover, the method is highly reusable within the same business field, where it can be replicated and applied at scale. For a different business field, new corpora for that field can be generated simply by updating the word vector set, so the method can be conveniently and quickly applied to other business fields and has strong extensibility.
A second scheme of the corpus generation method according to the embodiment of the present disclosure is described below, and specifically, in step S210 to step S230, the embodiment of the present disclosure is the same as the foregoing embodiment, and therefore, no further description is provided herein. Referring to step S240, a second solution of the embodiment of the disclosure is specifically described below, fig. 7b schematically illustrates a second specific flowchart of generating a target corpus according to the embodiment of the disclosure, and as shown in fig. 7b, in other specific embodiments, step S241 includes steps S2416 to S2419.
In step S2416, language order exchange is performed on the initial corpus, wherein the semantics of the initial corpus before the language order exchange is the same as the semantics of the initial corpus after the language order exchange.
In step S2417, the at least one second word vector is used to replace at least one first word vector in the initial corpus before the language order exchange, and to replace at least one first word vector in the initial corpus after the language order exchange, so as to obtain a fifth corpus.
In step S2418, the at least one third word vector is used to replace at least one first word vector in the initial corpus before the language order exchange, and to replace at least one first word vector in the initial corpus after the language order exchange, so as to obtain a sixth corpus.
In step S2419, the initial corpus before the language order exchange, the initial corpus after the language order exchange, the corpus in the fifth corpus and the corpus in the sixth corpus constitute a target corpus.
For example, the initial corpus is "how to turn on the in-vehicle air conditioner". In step S2416, language order exchange is performed on the initial corpus to obtain "the in-vehicle air conditioner, how to turn it on". In step S2417, after word vector replacement is performed on the corpus "how to turn on the in-vehicle air conditioner" and the corpus "the in-vehicle air conditioner, how to turn it on", the corpora "how to switch on the in-vehicle air-conditioning" and "the in-vehicle air-conditioning, how to switch it on" (corpora in the fifth corpus) can be obtained. In step S2418, after word vector replacement is performed on the same two corpora, the corpora "how to turn off the in-vehicle television" and "the in-vehicle television, how to turn it off" (corpora in the sixth corpus) can be obtained. It should be noted that, in the embodiment of the present disclosure, the specific steps of the language order exchange and the word vector replacement follow those described in the foregoing embodiment, and are therefore not detailed again here.
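The second scheme's pipeline (exchange the word order first, then apply replacements to both word orders) can be sketched end to end. All word lists and the swapped sentence are illustrative assumptions; hyphenated multi-word tokens are used so that whitespace splitting suffices.

```python
def substitute(sentence, mapping):
    """Replace whole space-separated tokens according to mapping."""
    return " ".join(mapping.get(w, w) for w in sentence.split())

# Step S2416: the initial corpus and its (assumed) semantics-preserving swap.
initial = "how turn-on in-vehicle air-conditioner"
swapped = "in-vehicle air-conditioner how turn-on"

# Similar (second) and dissimilar (third) word vectors as replacement maps.
positive_map = {"turn-on": "switch-on", "air-conditioner": "air-conditioning"}
negative_map = {"turn-on": "turn-off", "air-conditioner": "television"}

# Steps S2417/S2418: replace in both the original and the swapped order.
fifth_corpus = [substitute(s, positive_map) for s in (initial, swapped)]
sixth_corpus = [substitute(s, negative_map) for s in (initial, swapped)]

# Step S2419: both word orders plus both replacement sets form the target set.
target_corpus = [initial, swapped] + fifth_corpus + sixth_corpus
print(len(target_corpus))  # 6
```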
Based on the corpus generating method, the disclosure also provides a corpus generating device. The apparatus will be described in detail below with reference to fig. 8.
Fig. 8 schematically shows a block diagram of a corpus generation apparatus according to an embodiment of the present disclosure.
As shown in fig. 8, the apparatus 800 for generating corpus according to this embodiment includes a first obtaining module 810, a processing module 820, a second obtaining module 830, and a corpus generating module 840.
The first obtaining module 810 is configured to obtain an initial corpus. In an embodiment, the first obtaining module 810 may be configured to perform the step S210 described above, which is not described herein again.
The processing module 820 is configured to perform word segmentation on the initial corpus, and determine a word vector of each word from the segmented initial corpus to obtain at least one first word vector. In an embodiment, the processing module 820 may be configured to perform the step S220 described above, which is not described herein again.
The second obtaining module 830 is configured to obtain, from a preset word vector set, a word vector whose similarity to at least one first word vector meets a preset condition. In an embodiment, the second obtaining module 830 may be configured to perform the step S230 described above, and is not described herein again.
The corpus generating module 840 is configured to replace a first word vector in the initial corpus with a word vector obtained from the word vector set to generate a target corpus, where a language order of at least one corpus is different from a language order of the initial corpus in the generated target corpus. In an embodiment, the corpus generating module 840 may be configured to perform the step S240 described above, which is not described herein again.
By adopting the corpus generation device of the embodiment of the disclosure, a large amount of corpora (namely, the target corpus) can be automatically generated through word vector replacement and other modes according to the initial corpus, the working efficiency is high, errors are not easy to occur, and the embodiment of the disclosure can also adjust the language order of the initial corpus when the target corpus is generated, so that the generated corpus in the target corpus is richer in type, and the training of a similarity model is facilitated.
According to an embodiment of the present disclosure, any multiple modules of the first obtaining module 810, the processing module 820, the second obtaining module 830 and the corpus generating module 840 may be combined and implemented in one module, or any one module thereof may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the first obtaining module 810, the processing module 820, the second obtaining module 830 and the corpus generating module 840 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of three implementations of software, hardware and firmware, or implemented by a suitable combination of any of them. Alternatively, at least one of the first obtaining module 810, the processing module 820, the second obtaining module 830 and the corpus generating module 840 may be at least partially implemented as a computer program module, which, when executed, may perform a corresponding function.
Fig. 9 schematically shows a block diagram of an electronic device adapted to implement a corpus generation method according to an embodiment of the present disclosure.
As shown in fig. 9, an electronic apparatus 900 according to an embodiment of the present disclosure includes a processor 901 which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. Processor 901 may comprise, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 901 may also include on-board memory for caching purposes. The processor 901 may include a single processing unit or a plurality of processing units for performing different actions of the corpus generation method according to the embodiment of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic apparatus 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. The processor 901 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the programs may also be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various steps of the corpus generation method according to the embodiments of the present disclosure by executing programs stored in the one or more memories.
Electronic device 900 may also include input/output (I/O) interface 905, input/output (I/O) interface 905 also connected to bus 904, according to an embodiment of the present disclosure. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement a corpus generating method according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 902 and/or the RAM 903 described above and/or one or more memories other than the ROM 902 and the RAM 903.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated in the flow chart. When the computer program product runs in a computer system, the program code is used for causing the computer system to realize the corpus generating method provided by the embodiment of the disclosure.
The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 901. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed in the form of a signal on a network medium, and downloaded and installed through the communication section 909 and/or installed from the removable medium 911. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The computer program, when executed by the processor 901, performs the above-described functions defined in the system of the embodiment of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high level procedural and/or object oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. A corpus generation method is characterized by comprising the following steps:
acquiring an initial corpus;
performing word segmentation on the initial corpus, and determining a word vector of each word from the initial corpus after word segmentation to obtain at least one first word vector;
acquiring a word vector with the similarity to the at least one first word vector meeting a preset condition from a preset word vector set;
and performing word vector replacement on the initial corpus by using word vectors acquired from the word vector set to generate a target corpus, wherein the language order of at least one corpus is different from the language order of the initial corpus in the generated target corpus.
2. The generation method according to claim 1, wherein the preset condition includes a first preset similarity and a second preset similarity, and the step of obtaining, from a preset word vector set, a word vector whose similarity with the at least one first word vector satisfies the preset condition includes:
obtaining word vectors with the similarity to the at least one first word vector larger than a first preset similarity from the word vector set to obtain at least one second word vector; and,
obtaining word vectors with the similarity to the at least one first word vector being smaller than a second preset similarity from the word vector set to obtain at least one third word vector;
the step of performing word vector replacement on the initial corpus to generate a target corpus by using the word vectors acquired from the word vector set includes:
and replacing the first word vector in the initial corpus by using the at least one second word vector and the at least one third word vector to generate a target corpus set.
3. The method according to claim 2, wherein the step of replacing the first word vector in the initial corpus with the at least one second word vector and the at least one third word vector to generate the target corpus comprises:
replacing at least one first word vector in the initial corpus with the at least one second word vector to obtain a first corpus, and replacing at least one first word vector in the initial corpus with the at least one third word vector to obtain a second corpus;
performing language order exchange on each language material in the first language material set to obtain a third language material set, and performing language order exchange on each language material in the second language material set to obtain a fourth language material set, wherein the semantics of the language material before the language order exchange and the language material after the language order exchange are the same;
and forming the target corpus by the initial corpus, the corpus in the first corpus set, the corpus in the second corpus set, the corpus in the third corpus set and the corpus in the fourth corpus set.
4. The method according to claim 2, wherein the step of replacing the first word vector in the initial corpus with the at least one second word vector and the at least one third word vector to generate the target corpus comprises:
performing language order exchange on the initial corpus, wherein the initial corpus before the language order exchange has the same semantics as the initial corpus after the language order exchange;
replacing at least one first word vector in the initial corpus before sequence conversion by using the at least one second word vector, and replacing at least one first word vector in the initial corpus after sequence conversion to obtain a fifth corpus;
replacing at least one first word vector in the initial corpus before sequence conversion by using the at least one third word vector, and replacing at least one first word vector in the initial corpus after sequence conversion to obtain a sixth corpus;
and the initial corpus before the language order exchange, the initial corpus after the language order exchange, the corpus in the fifth corpus set and the corpus in the sixth corpus set form the target corpus set.
5. The method according to claim 3 or 4, wherein the step of segmenting the initial corpus and determining a word vector of each word from the segmented initial corpus to obtain at least one first word vector comprises:
determining a part of speech of each of the at least one first word vector;
the step of replacing at least one first word vector in the initial corpus with the at least one second word vector comprises:
for each of the at least one first word vector, determining a second word vector matching the part of speech of the first word vector from the at least one second word vector, and replacing the first word vector with the second word vector;
the step of replacing at least one first word vector in the initial corpus with the at least one third word vector comprises:
for each of the at least one first word vector, a third word vector that matches the part of speech of the first word vector is determined from the at least one third word vector, and the first word vector is replaced with the third word vector.
6. The method according to claim 1, wherein the corpus generating method further comprises a step of establishing the word vector set, the step comprising:
acquiring text data;
performing word segmentation on the text data, and determining a word vector of each word in the text data after word segmentation;
and forming the word vector set by using the word vector of each word in the text data.
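A minimal sketch of establishing the word vector set in claim 6, assuming whitespace segmentation as a stand-in for a real (e.g. Chinese) word segmenter and simple co-occurrence counts as the word vectors; a production system would more likely train embeddings such as word2vec. All names here are illustrative.

```python
from collections import Counter

def segment(text):
    # Whitespace segmentation stands in for a real word segmenter.
    return text.split()

def build_word_vector_set(texts, window=1):
    # One co-occurrence-count vector per word, indexed over the sorted vocabulary.
    vocab = sorted({w for t in texts for w in segment(t)})
    counts = {w: Counter() for w in vocab}
    for text in texts:
        tokens = segment(text)
        for i, w in enumerate(tokens):
            # Count neighbours within the context window around position i.
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[w][tokens[j]] += 1
    return {w: [counts[w][v] for v in vocab] for w in vocab}

vectors = build_word_vector_set(["open the account", "close the account"])
```

Each word in the text data gets one vector, and together they form the word vector set from which similar words are later retrieved.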
7. An apparatus for generating a corpus, comprising:
a first obtaining module, configured to obtain an initial corpus;
a processing module, configured to segment the initial corpus and determine a word vector of each word from the segmented initial corpus to obtain at least one first word vector;
a second obtaining module, configured to obtain, from a preset word vector set, word vectors whose similarity to the at least one first word vector meets a preset condition;
and a corpus generating module, configured to replace a first word vector in the initial corpus with a word vector obtained from the word vector set to generate a target corpus set, wherein a word order of at least one corpus in the generated target corpus set differs from the word order of the initial corpus.
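The "preset condition" used by the second obtaining module is not fixed by the claims; a common choice is a cosine-similarity threshold, sketched below with stdlib-only code. The threshold value and all names are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(first_vector, vector_set, threshold=0.9):
    # Return words whose similarity to the first word vector meets the
    # preset condition (here: cosine similarity at or above a threshold).
    return [w for w, v in vector_set.items() if cosine(first_vector, v) >= threshold]

hits = retrieve([1.0, 0.0], {"near": [0.9, 0.1], "far": [0.0, 1.0]})
```

A symmetric "dissimilar" retrieval (for the third word vectors of the dependent claims) could use the same function with the condition inverted, e.g. similarity below a low threshold.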
8. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of generating a corpus according to any of claims 1 to 6.
9. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform a method of generating a corpus according to any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements a method of generating a corpus according to any one of claims 1 to 6.
CN202110857190.9A 2021-07-28 2021-07-28 Corpus generating method, apparatus, device, storage medium and program product Pending CN113554107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110857190.9A CN113554107A (en) 2021-07-28 2021-07-28 Corpus generating method, apparatus, device, storage medium and program product


Publications (1)

Publication Number Publication Date
CN113554107A true CN113554107A (en) 2021-10-26

Family

ID=78133078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110857190.9A Pending CN113554107A (en) 2021-07-28 2021-07-28 Corpus generating method, apparatus, device, storage medium and program product

Country Status (1)

Country Link
CN (1) CN113554107A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309280A (en) * 2019-05-27 2019-10-08 重庆小雨点小额贷款有限公司 A kind of corpus expansion method and relevant device
CN110334197A (en) * 2019-06-28 2019-10-15 科大讯飞股份有限公司 Corpus processing method and relevant apparatus
CN111191032A (en) * 2019-12-24 2020-05-22 深圳追一科技有限公司 Corpus expansion method and device, computer equipment and storage medium
CN111695359A (en) * 2020-06-12 2020-09-22 腾讯科技(深圳)有限公司 Method and device for generating word vector, computer storage medium and electronic equipment
CN111709234A (en) * 2020-05-28 2020-09-25 北京百度网讯科技有限公司 Training method and device of text processing model and electronic equipment
CN111737464A (en) * 2020-06-12 2020-10-02 网易(杭州)网络有限公司 Text classification method and device and electronic equipment
CN112686033A (en) * 2021-01-15 2021-04-20 上海明略人工智能(集团)有限公司 Word vector generation method and similar word determination method
CN112990290A (en) * 2021-03-10 2021-06-18 平安科技(深圳)有限公司 Sample data generation method, device, equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHI, Kailun: "Research and Implementation of Semantic Similarity Combining a Knowledge Base and a Corpus", China Master's Theses Full-text Database, Information Science and Technology *

Similar Documents

Publication Publication Date Title
US20210081611A1 (en) Methods and systems for language-agnostic machine learning in natural language processing using feature extraction
US10936956B2 (en) Cognitive question answering pipeline blending
CN112136125A (en) Training data extension for natural language classification
US11681932B2 (en) Cognitive question answering pipeline calibrating
US20220027707A1 (en) Subgraph guided knowledge graph question generation
US10019672B2 (en) Generating responses to electronic communications with a question answering system
CN111898643B (en) Semantic matching method and device
US11586940B2 (en) Generating answers to text input in an electronic communication tool with a question answering system
US11106873B2 (en) Context-based translation retrieval via multilingual space
US11709893B2 (en) Search method, electronic device and storage medium
US11663255B2 (en) Automatic collaboration between distinct responsive devices
Mitrovic et al. Teaching database design with constraint-based tutors
CN112528654A (en) Natural language processing method and device and electronic equipment
US9372930B2 (en) Generating a supplemental description of an entity
US11704493B2 (en) Neural parser for snippets of dynamic virtual assistant conversation
CN117520514A (en) Question-answering task processing method, device, equipment and readable storage medium
US20230142351A1 (en) Methods and systems for searching and retrieving information
CN113554107A (en) Corpus generating method, apparatus, device, storage medium and program product
US20140114645A1 (en) Information management systems and methods
Demberg et al. Search challenges in natural language generation with complex optimization objectives
CN111459959B (en) Method and apparatus for updating event sets
CN115576435B (en) Intention processing method and related device
US20220335090A1 (en) Identifying equivalent technical terms in different documents
US11475875B2 (en) Method and system for implementing language neutral virtual assistant
Ferri et al. A grammar inference approach for language self-adaptation and evolution in digital ecosystems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211026