CN113177399B - Text processing method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113177399B
CN113177399B (application CN202110451337.4A)
Authority
CN
China
Prior art keywords
text
texts
real
comment
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110451337.4A
Other languages
Chinese (zh)
Other versions
CN113177399A (en)
Inventor
浦东旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202110451337.4A priority Critical patent/CN113177399B/en
Publication of CN113177399A publication Critical patent/CN113177399A/en
Application granted granted Critical
Publication of CN113177399B publication Critical patent/CN113177399B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The application provides a text processing method and device, an electronic device, and a storage medium, relating to the field of text processing. The text processing method includes: extracting features from an input text to obtain the features of the input text; selecting n real texts from a pre-created database according to the features of the input text, where the database stores features of a plurality of real texts and features of the comment texts corresponding to each real text; and selecting m comment texts from the comment texts corresponding to the n real texts as target comment texts, according to the features of the input text and the features of those comment texts; n is an integer greater than or equal to 1, and m is an integer greater than 1. The method and device can generate corresponding comment texts for input texts whose topic is not predefined, and can reduce the time cost of text generation.

Description

Text processing method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of text processing, and in particular to a text processing method, a text processing device, an electronic device, and a storage medium.
Background
Natural language processing (NLP) is an important direction in computer science and artificial intelligence that enables natural communication between humans and machines: it converts natural-language corpora into digital information, yielding machine-recognizable representations.
Natural language processing techniques typically rely on neural network models. Natural language generation, and text generation in particular, is usually limited to specific topics. In current text generation techniques, a text generation model with a neural network structure must be trained on a corpus under a preset topic; when the model receives an input text under that preset topic, it can output a corresponding output text under the same topic.
That is, most current neural-network text generation models require the topic of the text to be defined in advance to guarantee training accuracy, and cannot generate corresponding output text for input text whose topic is unrestricted. Moreover, such models are usually complex, and their training and parameter tuning take a long time, so the time cost of building a text generation model is high.
Disclosure of Invention
The object of the invention is to overcome the above defects in the prior art by providing a text processing method, device, electronic device, and storage medium that can generate comment texts for input texts with undefined topics while reducing the time cost of text generation.
In order to achieve the above purpose, the technical scheme adopted by the embodiment of the invention is as follows:
in a first aspect, an embodiment of the present invention provides a text processing method, including:
extracting features of an input text to obtain the features of the input text;
according to the characteristics of the input text, selecting n real texts from a pre-created database; the database stores: characteristics of a plurality of real texts and characteristics of comment texts corresponding to each real text;
selecting m comment texts from the comment texts corresponding to the n real texts as target comment texts according to the characteristics of the input texts and the characteristics of the comment texts corresponding to the n real texts; wherein n is an integer greater than or equal to 1, and m is an integer greater than 1.
Optionally, the extracting the features of the input text to obtain the features of the input text includes:
acquiring a plurality of target words in the input text;
mapping the target words according to a pre-constructed corpus to obtain the characteristics of the target words, wherein the corpus comprises characteristics of a plurality of basic words, and the characteristic of each target word is the characteristic of the target basic word matched with that target word in the corpus;
and obtaining the characteristics of the input text according to the characteristics of the target words.
Optionally, the obtaining the characteristics of the input text according to the characteristics of the target words includes:
and carrying out weighted sum operation on the characteristics of the target words by adopting the preset weights of the target words to obtain the characteristics of the input text.
Optionally, before performing the weighted sum operation on the characteristics of the target words using their preset weights to obtain the characteristics of the input text, the method further includes:
and determining the preset weight of each target word according to the inverse document frequency index of the target basic word matched with each target word in the corpus.
Optionally, the plurality of basic words in the corpus originate from a plurality of documents, and the method further comprises:
calculating an inverse document frequency index for each basic word according to the word frequency of the basic word in the corpus, the total number of documents containing the basic word among the plurality of documents, and a preset scale factor.
Optionally, acquiring the plurality of target words in the input text includes:
word segmentation is carried out on the input text to obtain a plurality of initial words;
and processing the plurality of initial words to remove stop words and/or fixed combination words in the plurality of initial words so as to obtain the plurality of target words.
Optionally, the method further comprises:
if no target basic word matching any of the target words exists in the corpus, determining that the input text is a semantics-free input text; and
randomly selecting a comment text from a preset comment text library as the target comment text.
Optionally, before mapping the plurality of target words according to the pre-constructed corpus to obtain the features of the plurality of target words, the method further includes:
and processing a plurality of documents in a preset data source by adopting a preset word vector model to obtain the characteristics of the plurality of basic words.
Optionally, before selecting n pieces of real text corresponding to the input text from a pre-created database according to the characteristics of the input text, the method further includes:
crawling social text content from a preset network platform;
extracting features of the multiple real texts in the social text content to obtain features of the multiple real texts;
and extracting features of comment texts corresponding to each real text in the social text content to obtain features of the comment texts corresponding to each real text.
Optionally, the selecting n pieces of real text from a pre-created database according to the characteristics of the input text includes:
and selecting the n pieces of real texts which are most relevant to the characteristics of the input text from the database according to the characteristics of the input text.
Optionally, the selecting m comment texts from the comment texts corresponding to the n real texts according to the characteristics of the input text and the characteristics of the comment texts corresponding to the n real texts, includes:
respectively calculating the similarity of the characteristics of the input text and the characteristics of comment texts corresponding to the n real texts;
and selecting, according to the calculated similarities, the m comment texts with the highest similarity from the comment texts corresponding to the n real texts as the target comment texts.
In a second aspect, an embodiment of the present application further provides a text processing apparatus, including:
the feature extraction module is used for extracting features of the input text to obtain the features of the input text;
the first selection module is used for selecting n real texts from a pre-created database according to the characteristics of the input text; the database stores: characteristics of a plurality of real texts and characteristics of comment texts corresponding to each real text;
the second selecting module is used for selecting m comment texts from the comment texts corresponding to the n real texts according to the characteristics of the input text and the characteristics of the comment texts corresponding to the n real texts; wherein n is an integer greater than or equal to 1, and m is an integer greater than 1.
In a third aspect, embodiments of the present application further provide an electronic device, including a memory and a processor, wherein the memory stores a computer program executable by the processor, and the processor, when executing the computer program, implements any of the text processing methods provided in the first aspect.
In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored, the computer program, when read and executed, implementing any of the text processing methods provided in the first aspect.
The beneficial effects of this application are:
with the text processing method and device, electronic device, and storage medium, the features of an input text can be obtained by feature extraction, and n real texts are selected from a pre-created database according to those features; the database stores the features of a plurality of real texts and the features of the comment texts corresponding to each real text, and m comment texts are selected as target comment texts from the comment texts corresponding to the n real texts according to the features of the input text and the features of those comment texts. Because the method is implemented without a neural network, the time cost of training and tuning a neural network model is greatly reduced, the topic of the input text is not restricted, and comment texts can be generated from the pre-built database, improving the method's applicability to input texts with undefined topics. At the same time, because the target comment texts are selected from the comments on real texts, they are more authentic: the target comments perceived by the user are closer to real comments, which improves user engagement and enriches the virtual social scene.
Second, with this text processing method, comment generation for an input text is not limited to a single output; one-to-many generation is supported, that is, multiple target comment texts can be selected, which enriches the generated comment texts.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting its scope; a person skilled in the art may obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a flow chart of a text processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for obtaining characteristics of an input text in a text processing method according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for determining word weights in a text processing method according to an embodiment of the present application;
fig. 4 is a flowchart of a method in the case that input text has no semantics in the text processing method provided in the embodiment of the present application;
FIG. 5 is a flowchart of a method for constructing a database in a text processing method according to an embodiment of the present application;
fig. 6 is a scene block diagram of a text processing method according to an embodiment of the present application;
fig. 7 is a schematic diagram of a text processing device according to an embodiment of the present application;
fig. 8 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention.
The embodiments of the application can be applied to a virtual social scene, in which a corresponding comment text is automatically generated based on an input text, thereby simulating real comments, improving user engagement, and enriching the virtual social scene. The virtual social scene may be, for example, a virtual social scene in a game: when an input text entered by a game player is received, the text processing method provided by the embodiments of the application can be executed to generate a corresponding target comment text based on the input text. Of course, the in-game virtual social scene is only one possible application scenario of the method; in other applications, the text processing method can be executed wherever social simulation is involved.
In conventional technology, text generation based on an input text is usually implemented with a pre-trained neural network model, whose training and parameter tuning are time-consuming. In the text processing method provided by the application, a corresponding target comment text can instead be generated from the input text and a pre-constructed database, so no pre-trained neural network model is needed, and the time cost of training and tuning one is avoided.
The text processing method provided in the present application is illustrated below through several examples.
Fig. 1 is a flow chart of a text processing method according to an embodiment of the present invention. The method may be executed by a client device supporting virtual social interaction, or by a server supporting virtual social interaction; the process executed by the server is similar to that executed by the client device and is not repeated herein.
As shown in fig. 1, the text processing method may include:
s101, extracting features of the input text to obtain the features of the input text.
In one possible implementation, feature extraction may be performed on the input text at the word level to obtain the features of a plurality of words in the input text, and the feature of the input text is then computed from these word features. The feature of each word can be represented by a feature vector, that is, the feature of each word is its feature vector; the feature of the input text obtained from the feature vectors of the plurality of words is then the feature value of the input text.
The feature of each word may be obtained with a preset word vector tool: according to each word, the feature of the basic word matching that word is selected from the features of a plurality of basic words obtained in advance and used as the feature of the word. The feature obtained for each word is therefore actually the feature of a basic word, which is more reliable and effectively avoids inaccurate features caused by non-standard wording in the input text.
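As a minimal sketch of this word-level scheme, the snippet below maps each word to the vector of its matching base word and combines the matches into a text-level feature. The table `BASE_WORD_VECTORS` and its values are purely illustrative (in practice they would come from a pre-trained word vector model), and the unweighted mean is just one simple combination; the description later refines it with a weighted sum.

```python
import numpy as np

# Toy base-word vector table; words and values are illustrative only. In
# practice these vectors would come from a pre-trained word vector model.
BASE_WORD_VECTORS = {
    "game":   np.array([0.9, 0.1, 0.0]),
    "player": np.array([0.7, 0.3, 0.1]),
    "fun":    np.array([0.2, 0.8, 0.5]),
}

def word_features(words):
    """Map each word to the feature vector of its matching base word,
    skipping words that match no base word."""
    return [BASE_WORD_VECTORS[w] for w in words if w in BASE_WORD_VECTORS]

def text_feature(words):
    """One simple way to combine word vectors into a text-level feature:
    an unweighted mean over the matched word vectors."""
    vecs = word_features(words)
    if not vecs:
        return None  # no word matched a base word
    return np.mean(np.vstack(vecs), axis=0)
```

Because every word feature is looked up from the base-word table rather than computed from the raw input, a misspelled or non-standard word simply fails to match and is skipped, which is the robustness property the paragraph above describes.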
In another possible implementation manner, a preset feature extraction tool may be used to perform feature extraction on the input text, so as to obtain features of the input text. The predetermined feature extraction tool may be, for example, a feature extraction model of a neural network structure.
Of course, other manners may be used to perform feature extraction on the input text, which is not limited by the embodiments of the present application.
In a possible example scenario, the input text may be a text input by the user through a text input interface, or may be an input text obtained by converting a voice input by the user, which is not limited in the embodiment of the present application.
S102, selecting n real texts from a pre-created database according to the characteristics of the input text.
The database stores: characteristics of a plurality of real texts and characteristics of comment texts corresponding to each real text.
The database may, for example, contain features of real texts obtained by processing text content from a real social scene, together with the features of the comment texts corresponding to those real texts. The text content in the real social scene may be, for example, content published by preset users on a preset social platform. Text content from a real social scene, i.e., text content for which comment text exists, may be referred to as social text content.
The plurality of real texts may be a plurality of text sentences from text content in the real social scene; the comment text corresponding to each piece of real text can be the comment text corresponding to each piece of real text in the real social scene.
In a possible example scenario, the input text may be text whose length is less than or equal to a preset length threshold, e.g., text not exceeding a preset number of characters, which may be called short text; the preset number of characters may be, for example, 160. Accordingly, each real text in the database, and its corresponding comment texts, may also be short texts.
Under the condition that the characteristics of the input text are obtained, the characteristics of the input text and the characteristics of the plurality of real texts can be compared, and n real texts can be selected from the plurality of real texts according to the characteristic comparison result of the input text and the plurality of real texts. The n pieces of real texts may be real texts satisfying a preset feature condition among the plurality of real texts. n may be a preset integer greater than or equal to 1.
For example, the n pieces of real text that are most relevant to the characteristics of the input text may be selected from the database based on the characteristics of the input text.
In this implementation, a feature correlation of the input text and the plurality of real texts may be calculated, and n real texts most relevant to the feature of the input text may be selected from the plurality of real texts according to the calculated feature correlation.
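The selection step above can be sketched as a ranking by feature correlation. The snippet uses cosine similarity as the correlation measure, which is an assumption: the patent says only "feature correlation" without fixing a formula.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_top_n_real_texts(input_feature, real_text_features, n):
    """Return the indices of the n stored real texts whose features are
    most correlated (here: most cosine-similar) with the input feature."""
    scores = [cosine_similarity(input_feature, f) for f in real_text_features]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]
```

For a large database, a brute-force scan like this would typically be replaced by an approximate nearest-neighbor index, but the ranking logic is the same.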
S103, selecting m comment texts from the comment texts corresponding to the n real texts as target comment texts according to the characteristics of the input text and the characteristics of the comment texts corresponding to the n real texts.
Wherein m is an integer greater than 1.
Selecting n real texts from the plurality of real texts means selecting, from the plurality of real texts, the real texts that match the input text. Since each real text in the database has corresponding comment texts, once the n real texts are selected, the comment texts corresponding to them are determined.
In a possible implementation manner, the characteristics of the input text and the characteristics of all comment texts corresponding to n real texts are compared, so that the characteristic comparison result of all comment texts corresponding to the input text and the n real texts is obtained, and m comment texts are selected from all comment texts corresponding to the n real texts. The m comment texts can be comment texts meeting preset characteristic conditions in all comment texts corresponding to the n real texts.
For example, similarity between the features of the input text and the features of the comment texts corresponding to the n real texts is calculated respectively; and selecting m comment texts with highest similarity from the comment texts corresponding to the n real texts according to the calculated similarity, and taking the m comment texts with highest similarity as target comment texts.
In a specific implementation example, similarity sorting can be performed on comment texts corresponding to n real texts according to the calculated similarity, and m comment texts with highest similarity are selected as target comment texts according to a similarity sorting result.
After the m comment texts are selected, they can be used as the target comment texts corresponding to the input text. In a specific application scenario, the target comment texts may be output for display on a community interface, or converted to speech to obtain and output target comment speech. By outputting the target comment texts, real comments on the input text can be simulated.
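The comment-selection step in S103 can be sketched as a second similarity ranking, this time over the comments pooled from the n selected real texts. As before, cosine similarity stands in for the unspecified similarity measure, and the comment texts and vectors below are toy data.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_target_comments(input_feature, comments, m):
    """comments: list of (comment_text, feature_vector) pairs pooled from
    the comments of the n selected real texts. Returns the m comment
    texts whose features are most similar to the input feature."""
    ranked = sorted(comments,
                    key=lambda c: cosine_similarity(input_feature, c[1]),
                    reverse=True)
    return [text for text, _ in ranked[:m]]
```

Since m is greater than 1, the function returns several comments at once, matching the one-to-many generation the description emphasizes.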
In the text processing method provided by this embodiment, the features of the input text can be obtained by feature extraction, and n real texts are selected from a pre-created database according to those features; the database stores the features of a plurality of real texts and the features of the comment texts corresponding to each real text, and m comment texts are selected as target comment texts from the comment texts corresponding to the n real texts according to the features of the input text and the features of those comment texts. Because the method is implemented without a neural network, the time cost of training and tuning a neural network model is greatly reduced, the topic of the input text is not restricted, and comment texts can be generated from the pre-built database, improving the method's applicability to input texts with undefined topics. At the same time, because the target comment texts are selected from the comments on real texts, they are more authentic: the target comments perceived by the user are closer to real comments, which improves user engagement and enriches the virtual social scene.
Second, with this text processing method, comment generation for an input text is not limited to a single output; one-to-many generation is supported, that is, multiple target comment texts can be selected, which enriches the generated comment texts.
On the basis of the text processing method shown in fig. 1, the embodiment of the application may also provide an exemplary feature extraction manner of the input text in the text processing method. Fig. 2 is a flowchart of a method for obtaining characteristics of an input text in a text processing method according to an embodiment of the present application. As shown in fig. 2, the extracting features of the input text in S101 may include:
s201, acquiring a plurality of target words in the input text.
In one possible implementation, the input text may be segmented into a plurality of initial words, and the plurality of initial words are screened to obtain the plurality of target words. A preset word segmentation tool may be used to segment the input text.
For example, in one implementation example, the input text may be segmented to obtain a plurality of initial terms; and processing the plurality of initial words to remove stop words and/or fixed combination words in the plurality of initial words so as to obtain the plurality of target words.
For example, stop words among the initial words can be removed by stop-word processing, and fixed combination words can be removed by fixed-combination (special word) processing of the initial words. A stop word may be, for example, a word in the stop-word bank of a preset language; a fixed combination may include, for example, a person's name, a place name, or another fixed word combination.
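The two-step extraction above (segment, then screen) can be sketched as follows. This is a toy illustration: a whitespace split stands in for a real word segmentation tool, and the stop-word and fixed-combination sets are invented examples.

```python
# Illustrative stand-ins for a real stop-word bank and fixed-combination list.
STOP_WORDS = {"the", "a", "of", "is"}
FIXED_COMBINATIONS = {"Tokyo"}  # e.g. place names excluded from target words

def extract_target_words(text):
    """Step 1: segment the input text into initial words.
    Step 2: drop stop words and fixed-combination words."""
    initial_words = text.split()
    return [w for w in initial_words
            if w not in STOP_WORDS and w not in FIXED_COMBINATIONS]
```

For Chinese text, step 1 would use a dedicated segmentation tool rather than whitespace splitting, but the screening step is unchanged.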
S202, mapping the plurality of target words according to a pre-constructed corpus to obtain the characteristics of the plurality of target words.
Wherein the corpus comprises: features of a plurality of base words; the features of each target word are the features of the target base word in the corpus that match the each target word.
For example, the mapping may be performed in the following manner:
According to the target words, it can be determined whether a basic word identical to each target word exists in the corpus. If so, that basic word is the target basic word matched with the target word, and its characteristics in the corpus can be used as the characteristics of the target word.
The feature of each basic word in the corpus may be represented by a feature vector, so that performing step S202 on the input text yields a feature matrix composed of the feature vectors of the plurality of target words. The mapping may therefore also be called a feature matrix mapping, or matrix mapping.
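The matrix mapping can be sketched as below: matched base-word vectors are stacked row by row into the feature matrix, and an input with no matches is flagged as semantics-free. The corpus contents are toy values for illustration.

```python
import numpy as np

# Toy corpus mapping base words to feature vectors (illustrative values).
CORPUS = {
    "game": np.array([1.0, 0.0]),
    "fun":  np.array([0.0, 1.0]),
}

def map_to_feature_matrix(target_words):
    """Stack the matched base-word vectors into a feature matrix, one row
    per matched target word; return None when no target word matches any
    base word (the input is then treated as having no semantics)."""
    rows = [CORPUS[w] for w in target_words if w in CORPUS]
    if not rows:
        return None
    return np.vstack(rows)
```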
If basic words corresponding to the target words exist in the corpus, the input text is determined to be a text with semantics; otherwise, if no basic word corresponding to any target word exists in the corpus, the input text can be determined to be a text without semantics.
S203, obtaining the characteristics of the input text according to the characteristics of the target words.
In a possible implementation manner, the features of the target words can be accumulated to obtain the features of the input text; other operations may also be performed on the characteristics of the plurality of target words to obtain the characteristics of the input text.
For example, the preset weights of the target words may be used to perform a weighted sum operation on the features of the target words, so as to obtain the features of the input text. Each target word has a preset weight, which may be different for different words.
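The weighted sum operation is a one-liner in matrix form. As a sketch: each row of the stacked word vectors is scaled by that word's preset weight (which, per the later sections, would be derived from its inverse document frequency index) and the rows are summed.

```python
import numpy as np

def weighted_text_feature(word_vectors, weights):
    """Weighted sum of word feature vectors: sum_i weights[i] * word_vectors[i].
    In this method the weight of each word would come from the inverse
    document frequency index of its matched base word."""
    return np.asarray(weights, dtype=float) @ np.vstack(word_vectors)
```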
In this embodiment, the characteristics of the target words in the input text are obtained by feature mapping against the pre-constructed corpus, and the characteristics of the input text are then derived from them. Because the characteristics of the input text thus come from the characteristics of basic words, they are more reliable and accurate, and the extracted characteristics more faithfully reflect the semantics intended by the user.
Optionally, in a possible implementation example, this embodiment may further provide a method for determining word weights in the text processing method. Fig. 3 is a flowchart of a method for determining word weights in a text processing method according to an embodiment of the present application. As shown in fig. 3, before the characteristics of the target words are weighted and summed using their preset weights to obtain the characteristics of the input text, the above method may further include:
s301, determining preset weights of the target words according to inverse document frequency indexes of the target basic words matched with the target words in the corpus.
In one possible implementation, the inverse document frequency index of the target basic word in the corpus can be used directly as the preset weight of each target word; alternatively, the preset weight of each target word can be obtained from the inverse document frequency index of the target basic word using a preset calculation formula.
The corpus stores not only the characteristics of the plurality of basic words but also their inverse document frequency indexes. The inverse document frequency index of each basic word may be determined based on the word frequency of the word, and may therefore be expressed as a term frequency-inverse document frequency (TF-IDF) index.
In the case that the target base word matched with each target word is determined from the corpus, the inverse document frequency index of the target base word can be determined from the corpus.
According to the method, the preset weight of each target word can be determined according to the inverse document frequency index of the target basic word in the corpus, and then the characteristics of the target words are weighted and calculated according to the preset weight of each target word, so that the characteristics of the obtained input text are more accurate and more reliable.
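The weighting described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `corpus` and `text_feature` are assumptions, and the corpus is modeled as a mapping from each basic word to a pair of (feature vector, inverse document frequency index), with the IDF used directly as the preset weight per S301.

```python
# Hypothetical sketch: `corpus` maps each basic word to (feature_vector, idf).
# The IDF index serves as the preset weight, as in S301.
def text_feature(target_words, corpus):
    vectors, weights = [], []
    for word in target_words:
        if word in corpus:                  # matched target basic word
            vec, idf = corpus[word]
            vectors.append(vec)
            weights.append(idf)
    if not vectors:
        return None                         # no match: semantic-free input text
    dim = len(vectors[0])
    # Weighted sum of word features yields the input-text feature.
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
```

For instance, with a two-word corpus `{"hello": ([1.0, 0.0], 2.0), "world": ([0.0, 1.0], 1.0)}`, the text `["hello", "world"]` maps to the feature `[2.0, 1.0]`.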
Optionally, the plurality of base terms in the corpus as shown above originate from a plurality of documents.
In S301, before determining the preset weight of each target word according to the inverse document frequency index of the target base word matched with each target word in the corpus, the method may further include:
S301a, calculating an inverse document frequency index of each basic word according to the word frequency of each basic word in the corpus, the total number of the documents with each basic word in the plurality of documents and a preset scale factor.
For example, the inverse document frequency index of each basic word may be calculated using the following formula:

idf_factor = math.log(consts.IDF_FACTOR + total_doc_count / (frequency + 1)) + 1 - consts.IDF_FACTOR

where idf_factor represents the inverse document frequency index of each basic word, frequency is the word frequency of each basic word, total_doc_count is the total number of documents containing each basic word, and consts.IDF_FACTOR is the preset scale factor.

The specific value of consts.IDF_FACTOR can be determined by repeated tuning in the course of experiments.
According to the method provided by this embodiment, when the inverse document frequency index of each basic word is calculated, introducing the preset scale factor inside the logarithm flattens the slope of the calculation formula, which otherwise grows steeply near the Y axis; subtracting the same factor outside the logarithm translates the values back so that results around x = 1 (which would otherwise be pushed negative) remain usable. The inverse document frequency indexes obtained in this way make the differences between different basic words more pronounced and increase the degree of distinction between them.
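The calculation of S301a can be sketched as below. The published formula is garbled, so the exact form here is a reconstruction and should be treated as an assumption; the value of the scale factor is likewise hypothetical, since the patent only says it is tuned experimentally.

```python
import math

# Hypothetical preset scale factor; the patent determines it by repeated tuning.
IDF_FACTOR = 1.2

def idf(frequency, total_doc_count, factor=IDF_FACTOR):
    # frequency: word frequency of the basic word in the corpus
    # total_doc_count: total number of documents containing the basic word
    # Adding `factor` inside the log flattens the curve near the Y axis;
    # subtracting it outside translates the values back.
    return math.log(factor + total_doc_count / (frequency + 1)) + 1 - factor
```

With `factor = 1.0` the formula reduces to the familiar smoothed form `log(1 + N/(f+1))`, and rarer words (lower frequency) receive larger indexes.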
Optionally, based on any one of the methods shown above, the embodiment of the present application may further provide a comment text processing method in a case where the input text has no semantics. Fig. 4 is a flowchart of a method for text processing in the case of no semantic input text according to an embodiment of the present application. As shown in fig. 4, the method may further include:
S401, if no target basic word matching each target word exists in the corpus, determining that the input text is a semantic-free input text.
For example, when the input text is "……", it can be determined by mapping against the corpus that no target basic word matching any word of the input text exists in the corpus, so the input text can be determined to be: semantic-free input text.
Of course, in other scene examples, other ways may be used to perform semantic discrimination on the input text, and in the embodiment of the present application, the semantic discrimination is implemented by combining with a corpus, so that the semantic discrimination on the input text is more accurate.
S402, randomly selecting comment texts from a preset comment text library to serve as target comment texts.
The preset comment text library may store a plurality of preset comment texts; a comment text in the library may be a comment text on a real text, or a machine-generated semantic-free comment text.
When the input text is determined to have no semantics, a comment text is randomly selected from the preset comment text library and output as the target comment text.
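Steps S401–S402 amount to a simple fallback, sketched below under assumed names (`pick_comment`, `comment_library` are illustrative; the patent fixes no API):

```python
import random

def pick_comment(target_words, corpus, comment_library, rng=random):
    # S401: no target word matches any basic word -> semantic-free input text.
    if not any(word in corpus for word in target_words):
        # S402: random choice from the preset comment text library.
        return rng.choice(comment_library)
    return None  # semantics present: handled by the similarity pipeline instead
```

A seeded `random.Random` instance can be passed as `rng` when reproducible selection is needed.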
The embodiment also provides comment text generation under the condition that the input text has no semantics, so that the corresponding comment text can be output aiming at the input text of the user, and the interactivity and the richness of the virtual social contact are improved.
Based on the method shown in any of the foregoing embodiments, before mapping, in the step S202, the plurality of target words according to a pre-constructed corpus to obtain features of the plurality of target words, the method may further include:
and processing a plurality of documents in a preset data source by adopting a preset word vector model (Word2vec) to obtain the features of the plurality of basic words.
The plurality of documents may be documents in a preset language from the preset data source, for example a plurality of documents in Chinese, such as 320,000 Chinese documents. The preset data source may be a preset open-source corpus.
In the method provided by the embodiment, the characteristics of the plurality of basic words can be obtained based on the plurality of documents in the preset data source, so that the corpus is generated, the creation of the corpus is realized, and the richness of the basic words in the corpus is ensured.
For the text processing method shown in any of the above embodiments, the embodiment of the present application may further provide a method for constructing a database having features of a real text and features of corresponding comment text. Fig. 5 is a flowchart of a method for constructing a database in a text processing method according to an embodiment of the present application. As shown in fig. 5, before selecting n pieces of real text corresponding to the input text from the pre-created database according to the characteristics of the input text in the above-mentioned method S102, the method may further include:
s501, crawling social text content from a preset network platform.
The preset network platform may be a preset social platform, or social function modules in other network platforms, such as a scoring or message function module. The social text content may be: social text content associated with users meeting preset user levels in a preset network platform, or text content with the number of comments or forwarding number meeting preset conditions.
S502, extracting features of the multiple real texts in the social text content to obtain features of the multiple real texts.
S503, extracting features of comment texts corresponding to each real text in the social text content to obtain features of the comment texts corresponding to each real text.
In the specific implementation process, the feature extraction of each real text and the feature extraction of each comment text are similar to the feature extraction of the input text described above, and are not repeated here.
According to the method provided by the embodiment, the real text and the corresponding comment text in the social text content can be respectively subjected to feature extraction by crawling the social text content from the preset network platform, so that the features of the real text and the features of the comment text are obtained, the features of the real text and the comment text in the database can be more real, and the authenticity of the target comment text generated based on the database can be effectively ensured.
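Steps S501–S503 can be sketched as follows. The record layout and the names `build_database` and `extract` are assumptions for illustration; `extract` stands in for the same feature-extraction pipeline used for the input text.

```python
# Hypothetical sketch: each crawled real text is stored together with its
# feature and the features of its comment texts (S502 and S503).
def build_database(crawled, extract):
    database = []
    for real_text, comments in crawled:
        database.append({
            "text": real_text,
            "feature": extract(real_text),
            "comments": [{"text": c, "feature": extract(c)} for c in comments],
        })
    return database
```

Given crawled content such as `[("good game", ["gg", "well played"])]` and any feature extractor, this yields one database record per real text with nested per-comment features.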
For a clear understanding of the text processing method provided in the present application, an explanation is given below by way of a specific example. Fig. 6 is a schematic view of a scenario of the text processing method provided in the embodiment of the present application. As shown in fig. 6, a preset word vector model may be adopted in advance to process documents in a preset data source, so as to obtain a corpus, that is, the corpus including the features of a plurality of basic words. Feature extraction may also be performed on a plurality of real texts and the comment texts corresponding to the real texts, so as to obtain the features of the plurality of real texts and the features of the comment texts corresponding to each real text, thereby constructing a database of the features of the real texts and the features of the corresponding comment texts.
In a specific application process, word segmentation processing can be performed on the acquired input text, and stopping word processing and special word processing can be performed on a plurality of initial words after word segmentation to obtain a word set comprising a plurality of target words. Under the condition that the target words are obtained, the target words can be subjected to matrix mapping according to a pre-created corpus to obtain a feature matrix of the target words, wherein the feature matrix is formed by: the feature vectors of the plurality of target words.
Under the condition that the feature matrix of the target words is obtained, the feature of the input text can be obtained based on the feature vectors of the target words, then n real texts which are most similar to the input text are selected from a plurality of real texts in a preset database according to the feature of the input text and the feature of the real texts in the preset database, and m comment texts which are most similar to the input text are determined from the comment texts corresponding to the n real texts to be output as target comment texts corresponding to the input text according to the feature of the real texts and the feature of the comment texts corresponding to the n real texts.
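The two-stage selection described above can be sketched as below. The patent only says "most similar", so the use of cosine similarity is an assumption, as are the function names; the database records follow the hypothetical layout of features per real text and per comment.

```python
import heapq
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors (one assumed metric).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_comments(input_feat, database, n, m):
    # Stage 1: n real texts most similar to the input text.
    top_real = heapq.nlargest(
        n, database, key=lambda r: cosine(input_feat, r["feature"]))
    # Stage 2: m most similar comments among those real texts' comments.
    candidates = [c for r in top_real for c in r["comments"]]
    return [c["text"] for c in heapq.nlargest(
        m, candidates, key=lambda c: cosine(input_feat, c["feature"]))]
```

`heapq.nlargest` avoids a full sort when n and m are small relative to the database size, which fits the top-n / top-m selection described here.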
The following describes a device, an apparatus, a storage medium, etc. for executing the text processing method provided in the present application, and specific implementation processes and technical effects of the device and the apparatus and the storage medium are referred to above, and are not described in detail below.
Fig. 7 is a schematic diagram of a text processing device according to an embodiment of the present application, and as shown in fig. 7, the text processing device 700 may include:
the feature extraction module 701 is configured to perform feature extraction on an input text to obtain features of the input text;
a first selection module 702, configured to select n pieces of real text from a database created in advance according to the features of the input text; the database stores: characteristics of a plurality of real texts and characteristics of comment texts corresponding to each real text;
a second selecting module 703, configured to select m comment texts from the comment texts corresponding to the n real texts as target comment texts according to the features of the input text and the features of the comment texts corresponding to the n real texts; wherein n is an integer greater than or equal to 1, and m is an integer greater than 1.
Optionally, the feature extraction module 701 is specifically configured to obtain a plurality of target words in the input text; mapping the target words according to a pre-constructed corpus, so as to obtain the characteristics of the target words, wherein the corpus comprises: features of a plurality of base words; the characteristics of each target word are the characteristics of target basic words matched with each target word in the corpus; and obtaining the characteristics of the input text according to the characteristics of the target words.
Optionally, the feature extraction module 701 is specifically configured to perform weighted sum operation on features of the plurality of target words by using preset weights of the plurality of target words, so as to obtain features of the input text.
Optionally, the text processing device 700 may further include:
the first determining module is used for determining preset weight of each target word according to the inverse document frequency index of the target basic word matched with each target word in the corpus.
Optionally, the text processing device 700 may further include:
the calculation module is used for calculating the inverse document frequency index of each basic word according to the word frequency of each basic word in the corpus, the total number of the documents with each basic word in the plurality of documents and a preset scale factor.
Optionally, the text processing device 700 may further include:
the word segmentation module is used for segmenting the input text to obtain a plurality of initial words;
the processing module is used for processing the plurality of initial words so as to remove stop words and/or fixed combination words in the plurality of initial words and obtain a plurality of target words.
Optionally, the text processing device 700 may further include:
the second determining module is used for determining that the input text is a semantic-free input text if the target basic words matched with each target word do not exist in the corpus;
And the third selection module is used for randomly selecting comment texts from a preset comment text library to serve as target comment texts.
Optionally, the text processing device 700 may further include:
the first creating module is used for processing a plurality of documents in a preset data source by adopting a preset word vector model to obtain the characteristics of a plurality of basic words, so as to realize the creation of a corpus.
Optionally, the text processing device 700 may further include:
the second creation module is used for crawling social text content from a preset network platform; extracting features of a plurality of real texts in the social text content to obtain features of the plurality of real texts; feature extraction is carried out on comment texts corresponding to each real text in the social text content, so that features of the comment texts corresponding to each real text are obtained, and the creation of a preset database is realized.
Optionally, the first selecting module is specifically configured to select, according to the characteristics of the input text, n pieces of real text that are most relevant to the characteristics of the input text from the database.
Optionally, the second selection module is specifically configured to calculate similarity between features of the input text and features of comment texts corresponding to the n pieces of real texts; and selecting m comment texts with highest similarity from the comment texts corresponding to the n real texts according to the calculated similarity, and taking the m comment texts with highest similarity as target comment texts.
The foregoing apparatus is used for executing the method provided in the foregoing embodiment, and its implementation principle and technical effects are similar, and are not described herein again.
The above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC), one or more digital signal processors (Digital Signal Processor, DSP), or one or more field programmable gate arrays (Field Programmable Gate Array, FPGA), or the like. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU), or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
Fig. 8 is a schematic diagram of an electronic device provided in an embodiment of the present application, where the electronic device may be a computing device or a server with text processing functions.
The electronic device 800 includes: memory 801, and processor 802. The memory 801 and the processor 802 are connected by a bus.
The memory 801 is used for storing a program, and the processor 802 calls the program stored in the memory 801 to execute the above-described method embodiment. The specific implementation manner and the technical effect are similar, and are not repeated here.
Optionally, the present invention also provides a program product, such as a computer readable storage medium, comprising a program for performing the above-described method embodiments when being executed by a processor.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods according to the embodiments of the invention. The aforementioned storage medium includes: a U disk, a mobile hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or the like.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes or substitutions are covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A text processing method, comprising:
extracting features of an input text to obtain the features of the input text; the input text is input text in a virtual social scene;
according to the characteristics of the input text, selecting n real texts from a pre-created database; the database stores: characteristics of a plurality of real texts and characteristics of comment texts corresponding to each real text; the real texts are a plurality of text sentences from text contents in a real social scene, and comment texts corresponding to each real text are comment texts corresponding to each real text in the real social scene;
selecting m comment texts from the comment texts corresponding to the n real texts as target comment texts according to the characteristics of the input texts and the characteristics of the comment texts corresponding to the n real texts; wherein n is an integer greater than or equal to 1, and m is an integer greater than 1.
2. The method of claim 1, wherein the feature extraction of the input text to obtain the features of the input text comprises:
Acquiring a plurality of target words in the input text;
mapping the target words according to a pre-constructed corpus, so as to obtain the characteristics of the target words, wherein the corpus comprises: features of a plurality of base words; the characteristics of each target word are the characteristics of target basic words matched with each target word in the corpus;
and obtaining the characteristics of the input text according to the characteristics of the target words.
3. The method of claim 2, wherein deriving the characteristics of the input text from the characteristics of the plurality of target words comprises:
and carrying out weighted sum operation on the characteristics of the target words by adopting the preset weights of the target words to obtain the characteristics of the input text.
4. The method of claim 3, wherein the weighting and summing the characteristics of the plurality of target words with the preset weights of the plurality of target words, before obtaining the characteristics of the input text, further comprises:
and determining the preset weight of each target word according to the inverse document frequency index of the target basic word matched with each target word in the corpus.
5. The method of claim 4, wherein the plurality of base terms in the corpus originate from a plurality of documents; the method further comprises the steps of:
and calculating an inverse document frequency index of each basic word according to the word frequency of each basic word in the corpus, the total number of the documents with each basic word in the plurality of documents and a preset scale factor.
6. The method of claim 2, wherein the obtaining a plurality of target words in the input text comprises:
word segmentation is carried out on the input text to obtain a plurality of initial words;
and processing the plurality of initial words to remove stop words and/or fixed combination words in the plurality of initial words so as to obtain the plurality of target words.
7. The method according to claim 2, wherein the method further comprises:
if the target basic words matched with each target word do not exist in the corpus, determining the input text as semantic-free input text;
Randomly selecting comment texts from a preset comment text library to serve as target comment texts.
8. The method of claim 2, wherein before mapping the plurality of target words according to the pre-constructed corpus to obtain features of the plurality of target words, the method further comprises:
and processing a plurality of documents in a preset data source by adopting a preset word vector model to obtain the characteristics of the plurality of basic words.
9. The method according to claim 1, wherein before selecting n pieces of real text corresponding to the input text from a pre-created database according to the characteristics of the input text, the method further comprises:
crawling social text content from a preset network platform;
extracting features of the multiple real texts in the social text content to obtain features of the multiple real texts;
and extracting features of comment texts corresponding to each real text in the social text content to obtain features of the comment texts corresponding to each real text.
10. The method of claim 1, wherein selecting n pieces of real text from a pre-created database according to the characteristics of the input text comprises:
And selecting the n pieces of real texts which are most relevant to the characteristics of the input text from the database according to the characteristics of the input text.
11. The method according to any one of claims 1-10, wherein selecting m pieces of comment text as target comment text from the n pieces of comment text corresponding to real text according to the characteristics of the input text and the characteristics of the comment text corresponding to the n pieces of real text, includes:
respectively calculating the similarity of the characteristics of the input text and the characteristics of comment texts corresponding to the n real texts;
and selecting m comment texts with highest similarity from the comment texts corresponding to the n real texts according to the calculated similarity, and taking the m comment texts with highest similarity as target comment texts.
12. A text processing apparatus, comprising:
the feature extraction module is used for extracting features of the input text to obtain the features of the input text; the input text is input text in a virtual social scene;
the first selection module is used for selecting n real texts from a pre-created database according to the characteristics of the input text; the database stores: characteristics of a plurality of real texts and characteristics of comment texts corresponding to each real text; the real texts are a plurality of text sentences from text contents in a real social scene, and comment texts corresponding to each real text are comment texts corresponding to each real text in the real social scene;
The second selecting module is used for selecting m comment texts from the comment texts corresponding to the n real texts as target comment texts according to the features of the input text and the features of the comment texts corresponding to the n real texts; wherein n is an integer greater than or equal to 1, and m is an integer greater than 1.
13. An electronic device, comprising: a memory and a processor, the memory storing a computer program executable by the processor, the processor implementing the text processing method of any of the preceding claims 1-11 when the computer program is executed.
14. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when read and executed, implements the text processing method according to any of the preceding claims 1-11.
CN202110451337.4A 2021-04-25 2021-04-25 Text processing method, device, electronic equipment and storage medium Active CN113177399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110451337.4A CN113177399B (en) 2021-04-25 2021-04-25 Text processing method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113177399A CN113177399A (en) 2021-07-27
CN113177399B true CN113177399B (en) 2024-02-06

Family

ID=76926050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110451337.4A Active CN113177399B (en) 2021-04-25 2021-04-25 Text processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113177399B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176764A1 (en) * 2017-03-30 2018-10-04 联想(北京)有限公司 Data processing method and apparatus, and electronic device
CN109885770A (en) * 2019-02-20 2019-06-14 杭州威佩网络科技有限公司 A kind of information recommendation method, device, electronic equipment and storage medium
CN110851650A (en) * 2019-11-11 2020-02-28 腾讯科技(深圳)有限公司 Comment output method and device and computer storage medium
CN111126063A (en) * 2019-12-26 2020-05-08 北京百度网讯科技有限公司 Text quality evaluation method and device
CN112667780A (en) * 2020-12-31 2021-04-16 上海众源网络有限公司 Comment information generation method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN113177399A (en) 2021-07-27

Similar Documents

Publication Publication Date Title
US11106714B2 (en) Summary generating apparatus, summary generating method and computer program
US11087092B2 (en) Agent persona grounded chit-chat generation framework
US10592607B2 (en) Iterative alternating neural attention for machine reading
CN108463815A (en) The name Entity recognition of chat data
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
CN109284502B (en) Text similarity calculation method and device, electronic equipment and storage medium
JP2010537286A (en) Creating an area dictionary
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN110110332B (en) Text abstract generation method and equipment
CN111694937A (en) Interviewing method and device based on artificial intelligence, computer equipment and storage medium
Cecillon et al. Abusive language detection in online conversations by combining content-and graph-based features
CN111767394A (en) Abstract extraction method and device based on artificial intelligence expert system
CN112328735A (en) Hot topic determination method and device and terminal equipment
EP4060548A1 (en) Method and device for presenting prompt information and storage medium
CN114416926A (en) Keyword matching method and device, computing equipment and computer readable storage medium
CN113934834A (en) Question matching method, device, equipment and storage medium
CN112507721B (en) Method, apparatus, device and computer readable storage medium for generating text theme
CN110738056A (en) Method and apparatus for generating information
CN112307738A (en) Method and device for processing text
CN110457707B (en) Method and device for extracting real word keywords, electronic equipment and readable storage medium
CN113177399B (en) Text processing method, device, electronic equipment and storage medium
CN110895656B (en) Text similarity calculation method and device, electronic equipment and storage medium
EP2915067A1 (en) Text analysis
CN111401070B (en) Word meaning similarity determining method and device, electronic equipment and storage medium
JP5824429B2 (en) Spam account score calculation apparatus, spam account score calculation method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant