CN109241286B

CN109241286B - Method and device for generating text

Info

Publication number: CN109241286B
Application number: CN201811109660.8A
Authority: CN
Inventors: 余路; 史南胜; 李廷
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd; Shanghai Xiaodu Technology Co Ltd
Priority date: 2018-09-21
Filing date: 2018-09-21
Publication date: 2020-03-17
Anticipated expiration: 2038-09-21
Also published as: CN109241286A

Abstract

The embodiments of the application disclose a method and a device for generating text. One embodiment of the method comprises: acquiring a target text; performing word segmentation processing on the target text and determining the part of speech of at least one obtained word; determining a target part-of-speech sequence formed by the parts of speech of the at least one word according to the position of the at least one word in the target text; and generating a generalized text of the target text according to a preset binary sequence group set and the target part-of-speech sequence, wherein the binary sequence group set comprises at least one binary sequence group, the binary sequence group comprises at least one binary sequence, and the binary sequence comprises at least one part of speech and at least one word. This implementation can expand the target text to obtain generalized text with the same meaning as the target text, so that the accuracy with which an artificial intelligence device recognizes user language can be improved.

Description

Method and device for generating text

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a method and a device for generating texts.

Background

With the development of artificial intelligence, artificial intelligence devices have more and more functions. Users are also demanding more and more of artificial intelligence devices, especially in terms of human-computer interaction, where they want artificial intelligence devices to react correctly to various forms of speech interaction. For example, in a scenario where a hotel uses an artificial intelligence device, a user wants to learn about the hotel's rooms through the artificial intelligence device. Some users may say "how many rooms there are in the hotel," and some users may say "there are several rooms in the hotel." Technicians desire that artificial intelligence devices recognize various forms of voice interaction to satisfy users with different habits.

Disclosure of Invention

The embodiment of the application provides a method and a device for generating a text.

In a first aspect, an embodiment of the present application provides a method for generating a text, including: acquiring a target text; performing word segmentation processing on the target text and determining the part of speech of at least one obtained word; determining a target part-of-speech sequence formed by the parts of speech of the at least one word according to the position of the at least one word in the target text; and generating a generalized text of the target text according to a preset binary sequence group set and the target part-of-speech sequence, wherein the binary sequence group set comprises at least one binary sequence group, the binary sequence group comprises at least one binary sequence, and the binary sequence comprises at least one part of speech and at least one word.

In some embodiments, the order of parts of speech of the binary sequences in the same binary sequence group is the same; and generating a generalized text of the target text according to a preset binary sequence group set and the target part-of-speech sequence, including: determining the binary sequence group in the binary sequence group set, wherein the sequence of the part of speech in the binary sequence is the same as the sequence of the part of speech in the target part of speech sequence, and the binary sequence group is a first binary sequence group; and generating the generalized text of the target text according to the words in at least one binary sequence included in the first binary sequence group.

In some embodiments, at least two binary sequence groups having a correlation exist in the binary sequence group set; and generating a generalized text of the target text according to a preset binary sequence group set and the target part-of-speech sequence, including: determining the binary sequence group which is in the correlation relationship with the first binary sequence group in the binary sequence group set as a second binary sequence group; and generating the generalized text of the target text according to the words in at least one binary sequence included in the second binary sequence group.

In some embodiments, the generating a generalized text of the target text according to a preset binary sequence group set and the target part-of-speech sequence includes: determining a binary sequence group in the binary sequence group set that is different from the first binary sequence group and is in a correlation relationship with the second binary sequence group as a third binary sequence group; and generating the generalized text of the target text according to the words in at least one binary sequence included in the third binary sequence group.

In some embodiments, the above method further comprises: determining co-occurrence of words in the generated generalized text; and deleting the generalized texts with the co-occurrence degree smaller than a first preset threshold value.

In some embodiments, the above-mentioned set of binary sequence groups is obtained by the following steps: acquiring a text pair set, wherein a text pair in the text pair set comprises a first language text and a second language text obtained by translating the first language text; determining a text pair subset having the same first language text in the text pair set, wherein the second language text in the text pair subset is different from each other; generating a binary sequence corresponding to the second language text according to different words in the two second language texts and the part of speech of the words in the two second language texts for any two second language texts in the text pair subset; and clustering the generated binary sequence to obtain the binary sequence group set.

In some embodiments, the clustering the generated binary sequences to obtain the binary sequence group set includes: determining vectors of the words included in the generated binary sequences; determining at least two binary sequences with the same word sequence in the generated binary sequences; determining the similarity of any two binary sequences in the at least two binary sequences according to the vectors of the words included in the at least two binary sequences; and determining at least two binary sequences belonging to the same binary sequence group in the at least two binary sequences according to the determined similarity.

In some embodiments, the above method further comprises: for the binary sequences in the binary sequence group, determining a second language text corresponding to the binary sequences as an index text; determining the index text and other second language texts in the text subset to which the index text belongs as related texts; determining a binary sequence group to which a binary sequence corresponding to the related text belongs as a related binary sequence group; and determining that the binary sequence group to which the index text belongs and the related binary sequence group are in a correlation relationship.

In a second aspect, an embodiment of the present application provides an apparatus for generating text, including: a target text acquisition unit configured to acquire a target text; a word segmentation processing unit configured to perform word segmentation processing on the target text and determine the part of speech of at least one obtained word; a part-of-speech sequence determination unit configured to determine a target part-of-speech sequence formed by the parts of speech of the at least one word according to the position of the at least one word in the target text; and a generalized text generating unit configured to generate a generalized text of the target text according to a preset binary sequence group set and the target part-of-speech sequence, wherein the binary sequence group set comprises at least one binary sequence group, the binary sequence group comprises at least one binary sequence, and the binary sequence comprises at least one part of speech and at least one word.

In some embodiments, the order of parts of speech of the binary sequences in the same binary sequence group is the same; and the generalized text generating unit is further configured to: determining the binary sequence group in the binary sequence group set, wherein the sequence of the part of speech in the binary sequence is the same as the sequence of the part of speech in the target part of speech sequence, and the binary sequence group is a first binary sequence group; and generating the generalized text of the target text according to the words in at least one binary sequence included in the first binary sequence group.

In some embodiments, at least two binary sequence groups having a correlation exist in the binary sequence group set; and the generalized text generating unit is further configured to: determining the binary sequence group which is in the correlation relationship with the first binary sequence group in the binary sequence group set as a second binary sequence group; and generating the generalized text of the target text according to the words in at least one binary sequence included in the second binary sequence group.

In some embodiments, the generalized text generating unit is further configured to: determine a binary sequence group in the binary sequence group set that is different from the first binary sequence group and is in a correlation relationship with the second binary sequence group as a third binary sequence group; and generate the generalized text of the target text according to the words in at least one binary sequence included in the third binary sequence group.

In some embodiments, the apparatus further comprises a deletion unit configured to: determining co-occurrence of words in the generated generalized text; and deleting the generalized texts with the co-occurrence degree smaller than a first preset threshold value.

In some embodiments, the apparatus further includes a set determining unit, and the set determining unit includes: a text pair set acquisition module configured to acquire a text pair set, wherein a text pair in the text pair set comprises a first language text and a second language text obtained by translating the first language text; a text pair subset determining module configured to determine a text pair subset having the same first language text in the text pair set, wherein the second language texts in the text pair subset are different from each other; a binary sequence generating module configured to generate, for any two second language texts in the text pair subset, binary sequences corresponding to the second language texts according to the differing words in the two second language texts and the parts of speech of the words in the two second language texts; and a binary sequence clustering module configured to cluster the generated binary sequences to obtain the binary sequence group set.

In some embodiments, the binary sequence clustering module is further configured to: determine vectors of the words included in the generated binary sequences; determine at least two binary sequences with the same word sequence in the generated binary sequences; determine the similarity of any two binary sequences in the at least two binary sequences according to the vectors of the words included in the at least two binary sequences; and determine at least two binary sequences belonging to the same binary sequence group in the at least two binary sequences according to the determined similarity.

In some embodiments, the set determining unit further includes a correlation determining module configured to: for the binary sequences in the binary sequence group, determining a second language text corresponding to the binary sequences as an index text; determining the index text and other second language texts in the text subset to which the index text belongs as related texts; determining a binary sequence group to which a binary sequence corresponding to the related text belongs as a related binary sequence group; and determining that the binary sequence group to which the index text belongs and the related binary sequence group are in a correlation relationship.

In a third aspect, an embodiment of the present application provides a server, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the embodiments of the first aspect.

In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the method as described in any one of the embodiments of the first aspect.

The method and the device for generating text provided by the above embodiments of the present application may first obtain the target text, then perform word segmentation on the target text to obtain at least two words and determine the parts of speech of the obtained words. A target part-of-speech sequence formed by the parts of speech of the at least two words is then determined according to the positions of the at least two words in the target text. Finally, a generalized text of the target text is generated according to a preset binary sequence group set and the target part-of-speech sequence. The method and the device can expand the target text to obtain generalized text with the same meaning as the target text, so that the accuracy with which an artificial intelligence device recognizes user language can be improved.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a method for generating text in accordance with the present application;

FIG. 3 is a schematic illustration of an application scenario of a method for generating text according to the present application;

FIG. 4 is a flow diagram of yet another embodiment of a method for generating text in accordance with the present application;

FIG. 5 is a schematic diagram illustrating one embodiment of an apparatus for generating text according to the present application;

FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for generating text or the apparatus for generating text of the present application may be applied.

As shown in FIG. 1, the system architecture 100 may include artificial intelligence devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium used to provide communication links between the artificial intelligence devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user may interact with the artificial intelligence devices 101, 102, 103, for example, by talking to the artificial intelligence devices 101, 102, 103. The artificial intelligence devices 101, 102, 103 may interact with the server 105 over the network 104 to receive or send messages or the like. Various communication client applications, such as a language recognition application, a speech synthesis application, an audio playing application, etc., may be installed on the artificial intelligence devices 101, 102, 103.

The artificial intelligence devices 101, 102, 103 may be hardware or software. When the artificial intelligence devices 101, 102, 103 are hardware, they can be various electronic devices that support artificial intelligence, including but not limited to artificial intelligence speakers, artificial intelligence televisions, artificial intelligence robots, and the like. When the artificial intelligence devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.

The server 105 may be a server providing various services, such as a background server providing support for the artificial intelligence devices 101, 102, 103. The background server may analyze and otherwise process acquired data such as the target text, and feed back a processing result (e.g., a generalized text) to the artificial intelligence devices 101, 102, 103.

The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not particularly limited herein.

It should be noted that the method for generating text provided by the embodiment of the present application is generally performed by the server 105, and accordingly, the apparatus for generating text is generally disposed in the server 105. It is to be understood that the system architecture 100 described above may also be configured without the artificial intelligence devices 101, 102, 103 and the network 104.

It should be understood that the number of artificial intelligence devices, networks, and servers in FIG. 1 is merely illustrative. There may be any number of artificial intelligence devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating text in accordance with the present application is shown. The method for generating the text of the embodiment comprises the following steps:

Step 201, obtaining a target text.

In the present embodiment, an execution subject (e.g., the server 105 shown in FIG. 1) of the method for generating text may acquire a target text in various ways. For example, the execution subject may obtain the target text from an artificial intelligence device (e.g., the artificial intelligence devices 101, 102, 103 shown in FIG. 1). The artificial intelligence device may convert the words spoken by the user into text and send the generated text as the target text to the execution subject. The execution subject may also acquire a locally stored text and take the acquired text as the target text. The target text may be a text including at least one word. The target text may be text in various languages.

Step 202, performing word segmentation processing on the target text and determining the part of speech of at least one obtained word.

After the execution subject obtains the target text, it may perform word segmentation processing on the target text. Specifically, the execution subject may adopt various existing word segmentation algorithms (such as a natural language processing algorithm) to segment the target text. At least one word is obtained after the target text is segmented. The execution subject may then determine the part of speech of the resulting at least one word. Parts of speech are used herein to describe characteristics of a word. Parts of speech may include nouns, verbs, adjectives, and so forth. For example, the target text is "how many employees the company has", and the words "company", "has", "how many" and "employees" are obtained after the word segmentation processing is performed on the target text. The part of speech of "company" is noun, the part of speech of "has" is verb, the part of speech of "how many" is pronoun, and the part of speech of "employees" is noun.

Step 203, determining a target part-of-speech sequence formed by the parts-of-speech of the at least one word according to the position of the at least one word in the target text.

After determining the parts of speech of the words obtained by word segmentation, the execution subject may determine a target part-of-speech sequence formed by the parts of speech of the at least one word according to the position of the at least one word in the target text. For example, after the part of speech of each word in the target text "how many employees the company has" is determined, the target part-of-speech sequence "noun verb pronoun noun" can be obtained. It is understood that the parts of speech may be represented by English abbreviations in a specific application. For example, n. denotes a noun, v. denotes a verb, and pron. denotes a pronoun. The target part-of-speech sequence may then also be denoted as "n. v. pron. n.".
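
Steps 202 and 203 can be illustrated with a minimal Python sketch. The whitespace tokenizer and the small `POS_LEXICON` dictionary are assumptions that stand in for a real word segmentation and part-of-speech tagging algorithm, and the spelling `how_many` is a hypothetical convenience so that the two-word English gloss is handled as a single word.

```python
# Hypothetical toy lexicon mapping each word to a part-of-speech abbreviation.
POS_LEXICON = {
    "company": "n.",
    "has": "v.",
    "how_many": "pron.",
    "employees": "n.",
}


def segment(text):
    """Stand-in for a word segmentation algorithm: split on whitespace."""
    return text.split()


def target_pos_sequence(text):
    """Return the words and their parts of speech, ordered by word position."""
    words = segment(text)
    tags = [POS_LEXICON[word] for word in words]
    return words, tags


words, tags = target_pos_sequence("company has how_many employees")
print(words)
print(" ".join(tags))
```

Keeping the words and the parts of speech in two position-aligned lists makes the later same-position replacement a simple index lookup.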

Step 204, generating a generalized text of the target text according to the preset binary sequence group set and the target part-of-speech sequence.

In this embodiment, the execution subject may store the binary sequence group set in advance. The binary sequence group set comprises at least one binary sequence group, the binary sequence group comprises at least one binary sequence, and the binary sequence comprises at least one part of speech and at least one word. For example, the binary sequence may be "noun verb pronoun/how many noun", where "how many" is the word, "pronoun" is the part of speech of the word "how many", and "noun" and "verb" are the other parts of speech.

The execution subject can generate a generalized text of the target text according to the binary sequence group set and the target part-of-speech sequence. In this embodiment, the generalized text refers to a text that has the same meaning as the target text but a different expression. Specifically, the execution subject may determine, from the binary sequence group set, a binary sequence group whose part-of-speech order is the same as that of the target part-of-speech sequence, and then replace the words at the same positions in the target text with the words included in that binary sequence group to generate a generalized text of the target text. For example, the execution subject determines that the part-of-speech order of one binary sequence group in the binary sequence group set is the same as the part-of-speech order of the target part-of-speech sequence, namely "noun verb pronoun noun". If one binary sequence in that group is "noun verb pronoun/several noun", the execution subject can replace "how many" in the target text with "several", resulting in the generalized text "the company has several employees". Alternatively, the execution subject may replace words at the same positions in the target text with words from a plurality of binary sequences to generate the generalized text. For example, if another binary sequence in the binary sequence group is "noun verb/includes pronoun noun", the execution subject may generate the generalized text "how many employees the company includes" and may also generate the generalized text "the company includes several employees".
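
The same-position replacement described for step 204 can be sketched as follows. This is only an illustration under stated assumptions: the tuple-of-pairs encoding of a binary sequence, the `GROUP` contents, and the function names are hypothetical and are not prescribed by the embodiment.

```python
from itertools import product

# A binary sequence is modelled as a tuple of (part of speech, word) pairs.
# The word is None at positions where only the part of speech is recorded.
GROUP = [
    (("n.", None), ("v.", None), ("pron.", "several"), ("n.", None)),
    (("n.", None), ("v.", "includes"), ("pron.", None), ("n.", None)),
]


def pos_order(binary_sequence):
    """Return the part-of-speech order of one binary sequence."""
    return tuple(pos for pos, _ in binary_sequence)


def generalize(words, tags, group):
    """Replace same-position words using one or more binary sequences."""
    matching = [seq for seq in group if pos_order(seq) == tuple(tags)]
    # For every position, collect the original word plus each substitute.
    options = []
    for i, word in enumerate(words):
        choices = [word]
        for seq in matching:
            substitute = seq[i][1]
            if substitute is not None and substitute not in choices:
                choices.append(substitute)
        options.append(choices)
    target = " ".join(words)
    texts = [" ".join(combo) for combo in product(*options)]
    # The unchanged target text is not counted as a generalized text.
    return [text for text in texts if text != target]


result = generalize(
    ["company", "has", "how many", "employees"],
    ["n.", "v.", "pron.", "n."],
    GROUP,
)
for text in result:
    print(text)
```

Taking the Cartesian product over the per-position choices covers both the single-sequence replacement and the combination of words from several binary sequences.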

With continued reference to FIG. 3, FIG. 3 is a schematic diagram of an application scenario of the method for generating text according to the present embodiment. In the application scenario of FIG. 3, the server obtains the target text "how many employees the company has" and, after segmenting the target text, obtains the words "company", "has", "how many" and "employees". The parts of speech of the words are determined to be noun, verb, pronoun and noun respectively, giving the target part-of-speech sequence "noun verb pronoun noun". Combining this with the binary sequence group set yields the generalized text "the company has several employees". The generalized text generated in this application scenario can be sent to an intelligent robot so that the intelligent robot can better recognize the user's language, and can also be sent to a technician for building a language model for the same purpose.

The method for generating text provided by the above embodiment of the present application may first obtain a target text, then perform word segmentation on the target text to obtain at least one word and determine the part of speech of the obtained at least one word. A target part-of-speech sequence formed by the parts of speech of the at least one word is then determined according to the position of the at least one word in the target text. Finally, a generalized text of the target text is generated according to a preset binary sequence group set and the target part-of-speech sequence. The method of the embodiment can expand the target text to obtain generalized text with the same meaning as the target text, so that the accuracy with which an artificial intelligence device recognizes user language can be improved.

In some optional implementations of this embodiment, the sequence of parts of speech of the binary sequences in the same binary sequence group is the same. The step 204 may be specifically implemented by the following steps not shown in fig. 2: determining a binary sequence group in the binary sequence group set, wherein the sequence of the part of speech in the binary sequence is the same as the sequence of the part of speech in the target part of speech sequence, and the binary sequence group is a first binary sequence group; and generating a generalized text of the target text according to the words in at least one binary sequence included in the first binary sequence group.

In this implementation, the execution subject may first determine, in the binary sequence group set, a binary sequence group whose part-of-speech order is the same as that of the target part-of-speech sequence, and mark that binary sequence group as the first binary sequence group. The execution subject may then replace the words at the same positions in the target text with the words included in each binary sequence in the first binary sequence group to generate a generalized text of the target text. It will be appreciated that the generated generalized text includes at least one word that is different from the words in the target text.

In some optional implementations of this embodiment, at least two binary sequence groups having a correlation relationship exist in the binary sequence group set. In this embodiment, if the texts corresponding to the binary sequences in two binary sequence groups express the same meaning, but their part-of-speech orders and/or numbers of parts of speech are different, the relationship between the two binary sequence groups can be determined to be a correlation relationship. For example, the binary sequence "noun verb pronoun/how many noun" corresponds to the text "how many employees the company has", and the binary sequence "noun auxiliary noun verb pronoun/several" corresponds to the text "several employees the company has". The two texts express the same meaning, but the number of words, the part-of-speech order and the number of parts of speech they include are different. There is then a correlation relationship between the binary sequence group to which the binary sequence "noun verb pronoun/how many noun" belongs and the binary sequence group to which the binary sequence "noun auxiliary noun verb pronoun/several" belongs. It is understood that each binary sequence group in the binary sequence group set can be labeled in advance with the binary sequence groups that are correlated with it.

The step 204 may be implemented by the following steps not shown in fig. 2: determining a binary sequence group which is in a correlation relation with the first binary sequence group in the binary sequence group set as a second binary sequence group; and generating a generalized text of the target text according to the words in at least one binary sequence included in the second binary sequence group.

The execution subject may determine, in the binary sequence group set, a binary sequence group that is in a correlation relationship with the first binary sequence group, and mark it as the second binary sequence group. The execution subject may then replace the words with the same part of speech in the target text with the words in at least one binary sequence included in the second binary sequence group to obtain the generalized text. For example, if the second binary sequence group includes the binary sequence "noun auxiliary noun verb pronoun/several", the execution subject can replace the pronoun "how many" in the target text with the pronoun "several" to obtain the generalized text.

In some optional implementations of this embodiment, the step 204 may also be implemented by the following steps not shown in FIG. 2: determining a binary sequence group in the binary sequence group set that is different from the first binary sequence group and is in a correlation relationship with the second binary sequence group as a third binary sequence group; and generating a generalized text of the target text according to the words in at least one binary sequence included in the third binary sequence group.

In this implementation, the execution subject may further determine, in the binary sequence group set, a binary sequence group that is in a correlation relationship with the second binary sequence group and is different from the first binary sequence group, and take it as the third binary sequence group. The execution subject may then replace the words with the same part of speech in the target text with the words in at least one binary sequence included in the third binary sequence group to obtain the generalized text.
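
The lookup of the first, second and third binary sequence groups can be sketched as below. The group identifiers, the `aux.` abbreviation, and the `CORRELATED` table are hypothetical illustration data; the embodiment only states that correlation relationships between groups may be labeled in advance.

```python
# Hypothetical group identifiers mapped to their shared part-of-speech order.
GROUP_POS_ORDER = {
    "g1": ("n.", "v.", "pron.", "n."),
    "g2": ("n.", "aux.", "n.", "v.", "pron."),
    "g3": ("pron.", "n.", "v."),
}

# Correlation relationships labeled in advance (assumed to be symmetric).
CORRELATED = {
    "g1": {"g2"},
    "g2": {"g1", "g3"},
    "g3": {"g2"},
}


def find_groups(target_tags):
    """Return the ids of the first, second and third binary sequence groups."""
    target = tuple(target_tags)
    # First groups: same part-of-speech order as the target sequence.
    first = sorted(g for g, order in GROUP_POS_ORDER.items() if order == target)
    # Second groups: correlated with a first group.
    second = sorted(
        {r for g in first for r in CORRELATED.get(g, set())} - set(first)
    )
    # Third groups: correlated with a second group, excluding first groups.
    third = sorted(
        {r for g in second for r in CORRELATED.get(g, set())}
        - set(first)
        - set(second)
    )
    return first, second, third


first, second, third = find_groups(["n.", "v.", "pron.", "n."])
print(first, second, third)
```

Words taken from the second and third groups would then replace the words with the same part of speech in the target text, as the paragraphs above describe.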

In some optional implementations of the embodiment, in order to improve the accuracy of the generated generalized text, the executing main body may further perform the following steps: determining co-occurrence of words in the generated generalized text; and deleting the generalized texts with the co-occurrence degree smaller than a first preset threshold value.

In this implementation, co-occurrence may refer to words in the generalized text appearing in the same sentence, the same paragraph, or the same article. The degree of co-occurrence may be the product of the following parameters: the probability of the occurrence of the first word in the generalized text, the probability of the occurrence of the second word given the occurrence of the first word, the probability of the occurrence of the third word given the occurrence of the first word and the second word, and so on, up to the probability of the occurrence of the last word given the occurrence of all preceding words.

For example, if the generalized text is "Zhang San visits newborn", the execution subject may first determine the probability that "Zhang San" occurs in a preset information set. The information set may be a set of topics of web pages, a set of articles, or the like. Assuming that the information set includes 10000 pieces of information, 100 of which include "Zhang San", the probability of occurrence of "Zhang San" is 1%. The execution subject may then determine the probability that "visits" occurs in the pieces of information that include "Zhang San". Assuming that 20 of the above 100 pieces of information include "visits", the probability of occurrence of "visits", given the occurrence of "Zhang San", is 20%. Then, the execution subject may determine in the same way that the probability of "newborn" occurring after "visits", given the occurrence of "Zhang San" and "visits", is 50%. The degree of co-occurrence is then 1% × 20% × 50% = 0.1%.

It is understood that the degree of co-occurrence here can be used to represent the probability that the generalized text is a normal sentence, i.e., the probability that the generalized text is not an ill-formed sentence.

After determining the degree of co-occurrence, the execution subject may delete the generalized texts whose degree of co-occurrence is smaller than the first preset threshold. In this implementation, a degree of co-occurrence smaller than the first preset threshold indicates that the generalized text may be an abnormal or ill-formed sentence. The execution subject may delete such generalized texts to improve the accuracy of the generalized texts.
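The computation and the filtering step can be sketched as follows. This is a simplified illustration under stated assumptions: each piece of information is modeled as a set of words, probabilities are estimated from counts as in the example above, and word order within a piece of information is ignored. The function names are hypothetical.

```python
from typing import List, Sequence, Set


def co_occurrence_degree(words: Sequence[str], documents: List[Set[str]]) -> float:
    """Product of P(w1), P(w2 | w1), ... estimated from document counts."""
    degree = 1.0
    matching = documents  # pieces of information containing all previous words
    for word in words:
        if not matching:
            return 0.0
        containing = [doc for doc in matching if word in doc]
        degree *= len(containing) / len(matching)
        matching = containing
    return degree


def filter_generalized(texts: List[Sequence[str]], documents: List[Set[str]],
                       threshold: float) -> List[Sequence[str]]:
    """Delete generalized texts whose degree is smaller than the threshold."""
    return [t for t in texts if co_occurrence_degree(t, documents) >= threshold]


# Information set mirroring the example: 10000 pieces, 100 with "Zhang San",
# 20 of those with "visits", 10 of those with "newborn".
documents = ([{"Zhang San", "visits", "newborn"}] * 10
             + [{"Zhang San", "visits"}] * 10
             + [{"Zhang San"}] * 80
             + [set()] * 9900)
degree = co_occurrence_degree(["Zhang San", "visits", "newborn"], documents)
print(f"{degree:.4%}")
```

The threshold value is left to the caller because the application only states that it is preset.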

With continued reference to FIG. 4, a flow 400 of one embodiment of determining the binary sequence group set in the method for generating text according to the present application is shown. As shown in fig. 4, in this embodiment, the binary sequence group set may be determined by the following steps:

Step 401, acquiring a text pair set.

In this embodiment, each text pair in the text pair set includes a first language text and a second language text obtained by translating the first language text. For example, the first language text in a text pair is the English sentence "How many employees are there in your company", and the second language text is a Chinese translation of that sentence, rendered here in English as "how many employees the company has". Another text pair may include the same first language text together with a different Chinese translation, rendered here as "several employees the company has".

In a specific implementation, the execution subject may determine the text pair set from movie subtitles acquired from a preset website. For example, the execution subject may acquire multiple versions of translated subtitles for the same movie from a preset website, and take each English sentence together with its corresponding Chinese translation as a text pair.

Step 402, determining text pair subsets having the same first language text in the text pair set.

The execution subject may further determine, from the text pair set, text pair subsets whose text pairs have the same first language text. The second language texts in a text pair subset are different from each other. For example, a text pair whose first language text is "How many employees are there in your company" and another text pair with the same first language text but a different second language text may belong to the same text pair subset.

In a specific implementation, the execution subject may, across multiple versions of subtitles, collect the different Chinese translations corresponding to the same English sentence at the same time point to obtain a text pair subset. Alternatively, the execution subject may collect the different Chinese translations corresponding to the same English sentence within the same version of subtitles to obtain a text pair subset.
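Steps 401 and 402 can be sketched as a simple grouping by first language text. The sketch is illustrative only: text pairs are assumed to be (first language text, second language text) tuples, the function name `build_subsets` is hypothetical, and subsets with fewer than two distinct second language texts are dropped on the assumption that step 403 needs at least two texts to compare.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_subsets(text_pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group second language texts by their shared first language text."""
    by_first: Dict[str, List[str]] = defaultdict(list)
    for first, second in text_pairs:
        # Second language texts within a subset must differ from each other.
        if second not in by_first[first]:
            by_first[first].append(second)
    # Assumption: a subset is only useful if it holds at least two translations.
    return {first: seconds for first, seconds in by_first.items() if len(seconds) >= 2}


english = "How many employees are there in your company"
pairs = [
    (english, "how many employees the company has"),
    (english, "several employees the company has"),
    (english, "how many employees the company has"),  # duplicate translation
    ("Hello", "hi"),
]
print(build_subsets(pairs))
```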

Step 403, for any two second language texts in the text pair subset, generating binary sequences corresponding to the second language texts according to the words that differ between the two second language texts and the parts of speech of the words in the two second language texts.

For any two second language texts in each text pair subset, the execution subject may first compare the two second language texts to determine the words that differ between them. The parts of speech of the words included in the two second language texts are then determined, and a binary sequence corresponding to each second language text is generated according to the differing words and the part of speech of each word. For example, the two second language texts are "how many employees the company has" and "several employees the company has". By comparison, the words "how many" and "several" differ. Meanwhile, the parts of speech of the words in the second language text "how many employees the company has" are determined, in the word order of the original Chinese text, to be noun, verb, pronoun and noun, and the parts of speech of the words in the second language text "several employees the company has" are likewise determined to be noun, verb, pronoun and noun. The execution subject may generate the binary sequence "noun verb pronoun/how many noun" and the binary sequence "noun verb pronoun/several noun". The binary sequence "noun verb pronoun/how many noun" corresponds to the second language text "how many employees the company has", and the binary sequence "noun verb pronoun/several noun" corresponds to the second language text "several employees the company has".

Step 404, clustering the generated binary sequences to obtain a binary sequence group set.
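The generation of the two binary sequences in this example can be sketched as follows. The sketch assumes that word segmentation and part-of-speech tagging have already produced (word, part of speech) pairs, and it models a binary sequence as a list of (part of speech, word) pairs where the word is kept only at positions whose word differs between the two texts. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

TaggedText = List[Tuple[str, str]]                   # (word, part of speech)
BinarySequence = List[Tuple[str, Optional[str]]]     # (part of speech, word or None)


def make_binary_sequences(text_a: TaggedText,
                          text_b: TaggedText) -> Tuple[BinarySequence, BinarySequence]:
    """Build one binary sequence per text, keeping only the differing words."""
    words_a = {word for word, _ in text_a}
    words_b = {word for word, _ in text_b}
    seq_a = [(pos, word if word not in words_b else None) for word, pos in text_a]
    seq_b = [(pos, word if word not in words_a else None) for word, pos in text_b]
    return seq_a, seq_b


def render(seq: BinarySequence) -> str:
    """Format a binary sequence in the 'pos pos/word pos' notation of the text."""
    return " ".join(pos if word is None else f"{pos}/{word}" for pos, word in seq)


text_a = [("company", "noun"), ("has", "verb"), ("how many", "pronoun"), ("employees", "noun")]
text_b = [("company", "noun"), ("has", "verb"), ("several", "pronoun"), ("employees", "noun")]
seq_a, seq_b = make_binary_sequences(text_a, text_b)
print(render(seq_a))
print(render(seq_b))
```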

After obtaining each binary sequence, the execution agent may perform clustering on each binary sequence to obtain a binary sequence group set. Specifically, the execution subject may divide the binary sequences having the same part-of-speech order but including different words into the same binary sequence group. Alternatively, the executing agent may divide the binary sequences having the same part of speech or the same word into the same binary sequence group.
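The first clustering rule above, which places binary sequences with the same part-of-speech order but different words in the same group, can be sketched as a grouping keyed on the part-of-speech order. The data model, with a binary sequence as a list of (part of speech, word or None) pairs, and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

BinarySequence = List[Tuple[str, Optional[str]]]  # (part of speech, word or None)


def group_by_pos_order(sequences: List[BinarySequence]) -> List[List[BinarySequence]]:
    """Put binary sequences with the same part-of-speech order in one group."""
    groups: Dict[Tuple[str, ...], List[BinarySequence]] = defaultdict(list)
    for seq in sequences:
        key = tuple(pos for pos, _ in seq)
        # Skip exact duplicates so that a group holds sequences with different words.
        if seq not in groups[key]:
            groups[key].append(seq)
    return list(groups.values())


s1 = [("noun", None), ("verb", None), ("pronoun", "how many"), ("noun", None)]
s2 = [("noun", None), ("verb", None), ("pronoun", "several"), ("noun", None)]
s3 = [("noun", None), ("auxiliary", None), ("noun", None), ("verb", None), ("pronoun", "several")]
print(group_by_pos_order([s1, s2, s3, s1]))
```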

In some optional implementations of this embodiment, the execution subject may implement the clustering of the binary sequences by the following steps not shown in fig. 4:

First, vectors of the words included in the generated binary sequences are determined.

The execution subject may use various algorithms to determine the vectors of the words included in each binary sequence. For example, the execution subject may use the word2vec algorithm to determine the word vectors. word2vec is a family of models, created by a research team led by Tomas Mikolov, that maps words to vectors.

Secondly, at least two binary sequences with the same word sequence in the generated binary sequences are determined.

Thirdly, according to the vectors of the words included in the at least two binary sequences, the similarity of any two binary sequences in the at least two binary sequences is determined.

Among the determined at least two binary sequences, the execution subject may determine the similarity of any two binary sequences according to the vectors of the words included in each binary sequence. The similarity can be represented by a cosine distance, a Manhattan distance, or the like.

And finally, determining at least two binary sequences belonging to the same binary sequence group in the at least two binary sequences according to the determined similarity.

After the similarity of every two binary sequences in the at least two binary sequences is determined, the binary sequences with the similarity larger than a second preset threshold value can be divided into the same binary sequence group. It can be understood that the similarity between any two binary sequences in the same binary sequence group is greater than a second preset threshold.
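The similarity-based grouping can be sketched as below. Several points are assumptions made for illustration and are not stated in the application: the word vectors are toy values standing in for word2vec output, the vector of a binary sequence is taken as the average of the vectors of the words it carries, cosine similarity is used, and a sequence joins a group only if its similarity to every member exceeds the second preset threshold, so that any two sequences in a group satisfy the threshold.

```python
import math
from typing import Dict, List, Optional, Tuple

BinarySequence = List[Tuple[str, Optional[str]]]  # (part of speech, word or None)


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sequence_vector(seq: BinarySequence, word_vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the words carried by the sequence (assumed non-empty)."""
    vectors = [word_vectors[word] for _, word in seq if word is not None]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_by_similarity(sequences: List[BinarySequence],
                          word_vectors: Dict[str, List[float]],
                          threshold: float) -> List[List[BinarySequence]]:
    groups: List[List[BinarySequence]] = []
    for seq in sequences:
        vec = sequence_vector(seq, word_vectors)
        for group in groups:
            # Join only if similar enough to every member of the group.
            if all(cosine_similarity(vec, sequence_vector(m, word_vectors)) > threshold
                   for m in group):
                group.append(seq)
                break
        else:
            groups.append([seq])
    return groups


toy_vectors = {"how many": [1.0, 0.1], "several": [0.9, 0.2], "where": [0.0, 1.0]}
q1 = [("noun", None), ("verb", None), ("pronoun", "how many"), ("noun", None)]
q2 = [("noun", None), ("verb", None), ("pronoun", "several"), ("noun", None)]
q3 = [("noun", None), ("verb", None), ("pronoun", "where"), ("noun", None)]
print(cluster_by_similarity([q1, q2, q3], toy_vectors, 0.8))
```

The greedy pass depends on input order; a different clustering procedure could be substituted as long as the pairwise threshold condition holds within each group.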

In some optional implementations of this embodiment, the execution subject may further determine the correlation relationships between the binary sequence groups through the following steps not shown in fig. 4:

First, for a binary sequence in a binary sequence group, the second language text corresponding to the binary sequence is determined as an index text.

After determining the binary sequence group set, for each binary sequence in each binary sequence group, the execution subject may determine the second language text corresponding to the binary sequence as the index text.

Secondly, the other second language texts in the text pair subset to which the index text belongs are determined as related texts.

The execution subject may determine the text pair subset to which the index text belongs, and then determine the other second language texts in that text pair subset as related texts.

Thirdly, the binary sequence group to which the binary sequence corresponding to a related text belongs is determined as a related binary sequence group.

After determining the related texts, for each related text, the execution subject may determine the binary sequence to which the related text corresponds, then determine the binary sequence group to which each such binary sequence belongs, and determine each of these binary sequence groups as a related binary sequence group.

Finally, the binary sequence group to which the index text belongs and the related binary sequence group are determined to be in a correlation relationship.

The execution subject may determine that the binary sequence group to which the index text belongs and the related binary sequence group are in a correlation relationship.
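The four steps above can be sketched as follows. The sketch assumes, for illustration only, that each text pair subset is given as a list of second language texts and that a dictionary `group_of_text` maps each second language text to the identifier of the binary sequence group containing its binary sequence. A group is not recorded as related to itself, which is a simplification chosen here.

```python
from collections import defaultdict
from typing import Dict, List, Set


def correlate_groups(subsets: List[List[str]],
                     group_of_text: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each group identifier to the identifiers of its correlated groups."""
    related: Dict[str, Set[str]] = defaultdict(set)
    for subset in subsets:
        for index_text in subset:
            index_group = group_of_text.get(index_text)
            if index_group is None:
                continue
            # Related texts: the other second language texts in the same subset.
            for other in subset:
                if other == index_text:
                    continue
                other_group = group_of_text.get(other)
                if other_group is not None and other_group != index_group:
                    related[index_group].add(other_group)
    return dict(related)


subsets = [["t1", "t2", "t3"]]
group_of_text = {"t1": "A", "t2": "A", "t3": "B"}
print(correlate_groups(subsets, group_of_text))
```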

The method for generating the text can generate a plurality of binary sequence groups, and the text expressions corresponding to the binary sequences in each binary sequence group have the same meaning, so that the method can be conveniently used for generalizing the target text.

With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating a text, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.

As shown in fig. 5, the apparatus 500 for generating text of the present embodiment includes: a target text acquisition unit 501, a word segmentation processing unit 502, a part-of-speech sequence determination unit 503, and a generalized text generation unit 504.

The target text acquiring unit 501 is configured to acquire a target text.

A word segmentation processing unit 502 configured to perform word segmentation processing on the target text and determine a part of speech of the obtained at least one word.

A part-of-speech sequence determination unit 503 configured to determine a target part-of-speech sequence formed by parts-of-speech of the at least one word according to a position of the at least one word in the target text.

The generalized text generating unit 504 is configured to generate a generalized text of the target text according to a preset set of binary sequence groups and the target part-of-speech sequence, where the set of binary sequence groups includes at least one binary sequence group, the binary sequence group includes at least one binary sequence, and the binary sequence includes at least one part-of-speech and at least one word.

In some optional implementations of this embodiment, the sequence of parts of speech of the binary sequences in the same binary sequence group is the same. The generalized text generating unit 504 may be further configured to: determining a binary sequence group in the binary sequence group set, wherein the sequence of the part of speech in the binary sequence is the same as the sequence of the part of speech in the target part of speech sequence, and the binary sequence group is a first binary sequence group; and generating a generalized text of the target text according to the words in at least one binary sequence included in the first binary sequence group.

In some optional implementations of this embodiment, at least two binary-sequence groups having a correlation exist in the set of binary-sequence groups. The generalized text generating unit 504 may be further configured to: determining a binary sequence group which is in a correlation relation with the first binary sequence group in the binary sequence group set as a second binary sequence group; and generating a generalized text of the target text according to the words in at least one binary sequence included in the second binary sequence group.

In some optional implementations of this embodiment, the generalized text generating unit 504 may be further configured to: determine, in the binary sequence group set, a binary sequence group that is in a correlation relationship with the second binary sequence group and is different from the first binary sequence group as a third binary sequence group; and generate a generalized text of the target text according to the words in at least one binary sequence included in the third binary sequence group.

In some optional implementations of this embodiment, the apparatus 500 may further include a deleting unit not shown in fig. 5, where the deleting unit is configured to: determining co-occurrence of words in the generated generalized text; and deleting the generalized texts with the co-occurrence degree smaller than a first preset threshold value.

In some optional implementations of this embodiment, the apparatus 500 may further include a set determining unit not shown in fig. 5. The set determining unit may further include a text pair set acquisition module, a text pair subset determination module, a binary sequence generation module, and a binary sequence clustering module.

The text pair set acquisition module is configured to acquire a text pair set, where the text pairs in the text pair set include first language texts and second language texts obtained by translating the first language texts.

The text pair subset determination module is configured to determine, in the text pair set, text pair subsets having the same first language text, where the second language texts in a text pair subset are different from each other.

The binary sequence generation module is configured to generate, for any two second language texts in a text pair subset, binary sequences corresponding to the second language texts according to the words that differ between the two second language texts and the parts of speech of the words in the two second language texts.

The binary sequence clustering module is configured to cluster the generated binary sequences to obtain a binary sequence group set.

In some optional implementations of this embodiment, the binary sequence clustering module is further configured to: determine vectors of the words included in the generated binary sequences; determine at least two binary sequences with the same word sequence in the generated binary sequences; determine the similarity of any two binary sequences in the at least two binary sequences according to the vectors of the words included in the at least two binary sequences; and determine, according to the determined similarity, at least two binary sequences belonging to the same binary sequence group in the at least two binary sequences.

In some optional implementations of this embodiment, the set determining unit further includes a correlation determining module configured to: for a binary sequence in a binary sequence group, determine the second language text corresponding to the binary sequence as an index text; determine the other second language texts in the text pair subset to which the index text belongs as related texts; determine the binary sequence group to which the binary sequence corresponding to a related text belongs as a related binary sequence group; and determine that the binary sequence group to which the index text belongs and the related binary sequence group are in a correlation relationship.

The apparatus for generating a text provided in the foregoing embodiments of the present application may first acquire a target text. And then, performing word segmentation on the target text to obtain at least two words, and determining the parts of speech of the obtained at least two words. And then determining a target part-of-speech sequence formed by the parts-of-speech of the at least two words according to the positions of the at least two words in the target text. And finally, generating a generalized text of the target text according to a preset binary sequence group set and the target part-of-speech sequence. The device of the embodiment can expand the target text to obtain different texts with the same expression significance as the target text, thereby improving the accuracy of recognizing the user language by the artificial intelligence product.

It should be understood that units 501 to 504, which are recited in the apparatus 500 for generating text, correspond to respective steps in the method described with reference to fig. 2, respectively. Thus, the operations and features described above for the method for generating text are equally applicable to the apparatus 500 and the units contained therein and will not be described in detail here.

Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing a server according to embodiments of the present application. The server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed into the storage section 608 as necessary.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601.

It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a target text acquisition unit, a word segmentation processing unit, a part-of-speech sequence determination unit, and a generalized text generation unit. The names of these units do not in some cases constitute a limitation to the unit itself, and for example, the target text acquisition unit may also be described as "a unit that acquires a target text".

As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquiring a target text; performing word segmentation processing on the target text and determining the part of speech of at least one obtained word; determining a target part-of-speech sequence formed by parts-of-speech of at least one word according to the position of the at least one word in the target text; and generating a generalized text of the target text according to a preset binary sequence set and the target part-of-speech sequence, wherein the binary sequence set comprises at least one binary sequence set, the binary sequence set comprises at least one binary sequence, and the binary sequence comprises at least one part-of-speech and at least one word.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A method for generating text, comprising:

acquiring a target text;

performing word segmentation processing on the target text and determining the part of speech of at least one obtained word;

determining a target part-of-speech sequence formed by parts-of-speech of the at least one word according to the position of the at least one word in the target text;

generating a generalized text of the target text according to the part-of-speech order and the included words of the binary sequences in a preset binary sequence group set, and the part-of-speech order in the target part-of-speech sequence, wherein the binary sequence group set comprises at least one binary sequence group, the binary sequence group comprises at least one binary sequence, the binary sequence comprises at least one part of speech and at least one word, and the binary sequences in the same binary sequence group have the same part-of-speech order and include different words;

generating a generalized text of the target text according to the sequence of parts of speech in the binary sequences in the preset binary sequence group set, the included words and the sequence of parts of speech in the target part of speech sequence, wherein the generating comprises:

determining a binary sequence group in the binary sequence group set, wherein the sequence of the part of speech in the binary sequence is the same as the sequence of the part of speech in the target part of speech sequence, and the binary sequence group is a first binary sequence group;

replacing the words at the same position in the target text with the words in at least one binary sequence included in the first binary sequence group to generate a generalized text of the target text.

2. The method according to claim 1, wherein at least two binary-sequence groups having a correlation relationship exist in the set of binary-sequence groups; and

generating a generalized text of the target text according to a preset binary sequence set and the target part-of-speech sequence, wherein the generating comprises:

determining a binary sequence group which is in a correlation relationship with the first binary sequence group in the binary sequence group set as a second binary sequence group;

and generating a generalized text of the target text according to the words in at least one binary sequence included in the second binary sequence group.

3. The method of claim 2, wherein the generating the generalized text of the target text according to the preset set of bigram sequence groups and the target part-of-speech sequence comprises:

determining, in the binary sequence group set, a binary sequence group that is in a correlation relationship with the second binary sequence group and is different from the first binary sequence group as a third binary sequence group;

and generating a generalized text of the target text according to the words in at least one binary sequence included in the third binary sequence group.

4. The method according to any one of claims 1-3, wherein the method further comprises:

determining co-occurrence of words in the generated generalized text;

and deleting the generalized texts with the co-occurrence degree smaller than a first preset threshold value.

5. The method according to any one of claims 1-3, wherein the binary sequence group set is obtained by:

acquiring a text pair set, wherein a text pair in the text pair set comprises a first language text and a second language text obtained by translating the first language text;

determining a text pair subset having the same first language text in the text pair set, wherein the second language text in the text pair subset is different from each other;

generating a binary sequence corresponding to the second language text according to different words in the two second language texts and the part of speech of the words in the two second language texts for any two second language texts in the text pair subset;

and clustering the generated binary sequence to obtain the binary sequence group set.

6. The method of claim 5, wherein the clustering the generated binary sequences to obtain the set of binary sequence groups comprises:

determining vectors of the words included in the generated binary sequences;

determining at least two binary sequences with the same word sequence in the generated binary sequences;

determining the similarity of any two binary sequences in the at least two binary sequences according to vectors of words included in the at least two binary sequences;

and determining at least two binary sequences belonging to the same binary sequence group in the at least two binary sequences according to the determined similarity.

7. The method of claim 6, wherein the method further comprises:

for the binary sequences in the binary sequence group, determining a second language text corresponding to the binary sequences as an index text;

determining the other second language texts in the text pair subset to which the index text belongs as related texts;

determining a binary sequence group to which a binary sequence corresponding to the related text belongs as a related binary sequence group;

and determining that the binary sequence group to which the index text belongs and the related binary sequence group are in a correlation relationship.

8. An apparatus for generating text, comprising:

a target text acquisition unit configured to acquire a target text;

the word segmentation processing unit is configured to perform word segmentation processing on the target text and determine the part of speech of at least one obtained word;

a part-of-speech sequence determination unit configured to determine a target part-of-speech sequence formed by parts of speech of the at least one word according to a position of the at least one word in the target text;

a generalized text generating unit configured to generate a generalized text of the target text according to the part-of-speech order and the included words of the binary sequences in a preset binary sequence group set and the part-of-speech order of the target part-of-speech sequence, wherein the binary sequence group set includes at least one binary sequence group, each binary sequence includes at least one part of speech and at least one word, and the binary sequences in the same binary sequence group have the same part-of-speech order and different included words;

the generalized text generation unit is further configured to:

9. The apparatus of claim 8, wherein at least two binary sequence groups in the binary sequence group set have a correlation relationship; and

the generalized text generating unit is further configured to:

10. The apparatus of claim 9, wherein the generalized text generating unit is further configured to:

11. The apparatus according to any one of claims 8-10, wherein the apparatus further comprises a deletion unit configured to:

determining co-occurrence of words in the generated generalized text;

12. The apparatus according to any one of claims 8-10, wherein the apparatus further comprises a set determination unit comprising:

a text pair set acquisition module configured to acquire a text pair set, wherein text pairs in the text pair set comprise a first language text and a second language text obtained by translating the first language text;

a text pair subset determination module configured to determine, in the text pair set, a text pair subset whose text pairs have the same first language text, wherein the second language texts in the text pair subset differ from one another;

a binary sequence generating module configured to, for any two second language texts in the text pair subset, generate binary sequences corresponding to the two second language texts according to the words that differ between the two second language texts and the parts of speech of the words in the two second language texts;

and a binary sequence clustering module configured to cluster the generated binary sequences to obtain the binary sequence group set.
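As a non-limiting illustration of what the text pair subset determination module might do, the following Python sketch groups text pairs by their first language text and keeps the distinct second language texts of each group. The function name `text_pair_subsets` is hypothetical, and discarding first language texts that have fewer than two distinct translations is an assumption made because the later steps operate on pairs of second language texts.

```python
from collections import defaultdict
from itertools import combinations


def text_pair_subsets(text_pairs):
    """Map each first language text to its distinct second language texts.

    text_pairs: iterable of (first_language_text, second_language_text).
    """
    by_first = defaultdict(list)
    for first, second in text_pairs:
        if second not in by_first[first]:  # keep second language texts distinct
            by_first[first].append(second)
    # Assumption: a subset is only useful if it offers at least one pair to compare.
    return {first: seconds for first, seconds in by_first.items() if len(seconds) >= 2}


def candidate_text_pairs(subsets):
    """Enumerate every unordered pair of second language texts within each subset."""
    for seconds in subsets.values():
        yield from combinations(seconds, 2)
```

Each pair yielded by `candidate_text_pairs` would then be segmented, tagged, and passed to the binary sequence generating module.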

13. The apparatus of claim 12, wherein the binary sequence clustering module is further configured to:

determining vectors of the words included in the generated binary sequences;

14. The apparatus of claim 12, wherein the set determination unit further comprises a correlation determination module configured to:

15. A server, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

16. A computer-readable medium storing a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.