CN113221553A - Text processing method, device and equipment and readable storage medium - Google Patents

Text processing method, device and equipment and readable storage medium Download PDF

Info

Publication number
CN113221553A
Authority
CN
China
Prior art keywords
sub
vector
participle
participles
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010070866.5A
Other languages
Chinese (zh)
Inventor
王业全
魏望
王爱华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010070866.5A priority Critical patent/CN113221553A/en
Publication of CN113221553A publication Critical patent/CN113221553A/en
Pending legal-status Critical Current

Links

Images

Abstract

The embodiment of the application discloses a text processing method, apparatus, device and readable storage medium, wherein the method comprises the following steps: acquiring a participle in a text; segmenting the participle to obtain at least two sub-participles corresponding to the participle; acquiring sub-vectors respectively corresponding to the at least two sub-participles; and encoding and combining the sub-vectors respectively corresponding to the at least two sub-participles according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector. The recombined vector is used for representing the participle and for performing text processing on the text. By the method and the device, the accuracy of the word vector of a participle can be improved.

Description

Text processing method, device and equipment and readable storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a text processing method, apparatus, device, and readable storage medium.
Background
With the rapid development of internet technology and artificial intelligence technology, Natural Language Processing (NLP) applications, such as extracting keywords from texts and analyzing the sentiment of product reviews, have sprung up rapidly. Whichever scenario is involved, these applications are built on text information, so accurate extraction of text context information is the first and necessary step of natural language processing.
In both supervised and unsupervised learning methods, the extraction of text context information is mainly based on word vector representations, so the text needs to be segmented into words. In the prior art, when a text is divided into words, the words are directly encoded, and the semantics in the text are then represented according to the obtained word vectors. However, the word vectors generated by direct encoding represent the words only coarsely; in particular, when a word contains too many characters, such as "late spring shoots", the word granularity is coarse, so the word vector corresponding to the word cannot necessarily represent the semantics of the word accurately, leading to a wrong understanding of the text.
Disclosure of Invention
The embodiment of the application provides a text processing method, a text processing device, text processing equipment and a computer-readable storage medium, which can improve the accuracy of word vectors of word segmentation.
An embodiment of the present application provides a text processing method, including:
acquiring word segmentation in a text;
segmenting the word segmentation to obtain at least two sub-word segmentations corresponding to the word segmentation;
acquiring subvectors corresponding to the at least two sub-participles respectively;
coding and combining the sub-vectors respectively corresponding to the at least two sub-participles according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector; the restructuring vector is used for representing the participle and is used for performing text processing on the text.
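The four claimed steps can be sketched as follows. This is an illustrative toy only: the greedy splitter, the two-dimensional sub-vectors and the element-wise mean are all assumptions made for the sketch; the embodiments use a pre-trained bidirectional language model as the actual encoder.

```python
def split_into_subwords(word, vocab):
    """Greedy longest-match segmentation of a participle into sub-participles."""
    pieces, i = [], 0
    while i < len(word):
        for end in range(len(word), i, -1):  # try the longest candidate first
            piece = word[i:end]
            if piece in vocab or end == i + 1:  # fall back to a single char
                pieces.append(piece)
                i = end
                break
    return pieces

def combine_subvectors(subvectors):
    """Element-wise mean as one possible 'encoding and combining' operation."""
    dim = len(subvectors[0])
    return [sum(v[i] for v in subvectors) / len(subvectors) for i in range(dim)]

# Step 1: a participle from the text; Step 2: segment it into sub-participles.
vocab = {"te", "nce", "nt"}
subwords = split_into_subwords("tencent", vocab)
# Step 3: hypothetical sub-vectors for the sub-participles.
subvecs = {"te": [1.0, 0.0], "nce": [0.0, 1.0], "nt": [0.5, 0.5]}
# Step 4: combine the sub-vectors in text order into one recombined vector.
recombined = combine_subvectors([subvecs[s] for s in subwords])
print(subwords, recombined)
```

The recombined vector then stands in for the whole participle in any downstream text-processing task.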
An embodiment of the present application provides a text processing apparatus in one aspect, including:
the first acquisition module is used for acquiring word segmentation in the text;
the segmentation module is used for segmenting the participles to obtain at least two sub-participles corresponding to the participles;
the second obtaining module is used for obtaining the sub-vectors corresponding to the at least two sub-participles respectively;
the recombination module is used for coding and combining the sub-vectors respectively corresponding to the at least two sub-participles according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector; the restructuring vector is used for representing the participle and is used for performing text processing on the text.
Wherein, the above-mentioned recombination module includes:
a generating sequence unit, configured to generate a sub-participle sequence including the at least two sub-participles according to distribution positions of the at least two sub-participles in the text;
a sequence dividing unit for dividing the sub-participle sequence into sub-sequences to be recombined for combining the participles;
and the encoding combination unit is used for encoding and combining the sub-vectors corresponding to the sub-participles in the subsequence to be recombined to obtain the recombined vector.
Wherein, the sequence dividing unit includes:
the quantity determining subunit is used for acquiring the number of segmentation segments of the participle and determining the quantity of the target participle corresponding to the participle according to the number of the segmentation segments;
a dividing sequence sub-unit, configured to divide the sub-word segmentation sequence into the sub-sequences to be recombined according to the number of the target sub-words; the number of the sub-participles in the subsequence to be recombined is equal to the number of the target sub-participles.
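The division by target sub-participle count can be sketched as below; the function name and the segment counts are illustrative assumptions, not names from the embodiments.

```python
def divide_by_segment_counts(subword_seq, segment_counts):
    """Divide the flat sub-participle sequence into subsequences to be
    recombined, one per original participle, using each participle's
    segmentation-segment count."""
    subsequences, pos = [], 0
    for count in segment_counts:
        subsequences.append(subword_seq[pos:pos + count])
        pos += count
    return subsequences

# 'tencent' was cut into 3 pieces and 'advertising' into 4 pieces.
seq = ["te", "nce", "nt", "ad", "ver", "tis", "ing"]
print(divide_by_segment_counts(seq, [3, 4]))
```

Each resulting subsequence holds exactly the target number of sub-participles for its participle, as the claim requires.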
Wherein, the sub-participle sequence comprises a sub-participle t_j, and j is a positive integer less than or equal to the number of sub-participles in the sub-participle sequence;
the above-mentioned sequence dividing unit includes:
an obtaining to-be-evaluated subunit, configured to obtain the sub-participle t_j in the sub-participle sequence and take the sub-participle t_j as a participle to be evaluated;
an updating to-be-evaluated subunit, configured to obtain the sub-participle t_{j+1} in the sub-participle sequence if the participle to be evaluated does not belong to a legal participle, and update the participle to be evaluated according to the sub-participle t_j and the sub-participle t_{j+1};
a determining to-be-recombined subunit, configured to determine the sub-participle t_j and the sub-participle t_{j+1} as the above-mentioned subsequence to be recombined if the updated participle to be evaluated belongs to a legal participle; wherein t_{j-1} is empty, or t_{j-1} belongs to the previous subsequence to be recombined.
Wherein, the sequence dividing unit further comprises:
a determining to-be-evaluated subunit, configured to obtain a dictionary database, determine that the participle to be evaluated belongs to the legal participle if the participle to be evaluated exists in the dictionary database, and divide the participle into tjDetermining the sequence of the to-be-recombined subsequence;
and the determining subunit is also used for determining that the participle to be evaluated does not belong to the legal participle if the participle to be evaluated does not exist in the dictionary database.
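The evaluation loop above can be sketched as a dictionary-driven grouping of sub-participles; the function and variable names are hypothetical, and the dictionary database is stubbed with a small set.

```python
def group_subwords(subwords, dictionary):
    """Grow a candidate participle sub-word by sub-word (the t_j / t_{j+1}
    evaluation loop); emit a subsequence to be recombined once the candidate
    appears in the dictionary database, i.e. becomes a legal participle."""
    groups, candidate = [], []
    for sw in subwords:
        candidate.append(sw)
        if "".join(candidate) in dictionary:  # legal participle found
            groups.append(candidate)
            candidate = []
    if candidate:  # trailing sub-words that never formed a legal participle
        groups.append(candidate)
    return groups

dictionary = {"tencent", "advertising"}
subs = ["te", "nce", "nt", "ad", "ver", "tis", "ing"]
print(group_subwords(subs, dictionary))
```

Because a group is only closed when the concatenation is a legal participle, the preceding sub-participle t_{j-1} is always either empty or already assigned to the previous subsequence, matching the claim.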
The sub-participles in the subsequence to be recombined comprise a first sub-participle and a second sub-participle, the sub-vector corresponding to the first sub-participle is a first sub-vector, and the sub-vector corresponding to the second sub-participle is a second sub-vector;
the code combining unit includes:
a first value obtaining subunit, configured to obtain a value P_i in the first sub-vector and obtain a value Q_i in the second sub-vector; i is the dimension index of the first sub-vector and the second sub-vector, and i is a positive integer; the value P_i is the value corresponding to the i-th dimension in the first sub-vector, and the value Q_i is the value corresponding to the i-th dimension in the second sub-vector;
an operation value subunit, configured to perform a numerical operation on the value P_i and the value Q_i to generate a value R_i;
a first recombined vector subunit, configured to generate the recombined vector according to the value R_i.
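The dimension-wise numerical operation on P_i and Q_i can be sketched as follows; averaging is only one possible choice of operation, assumed here for illustration.

```python
def combine_by_operation(p, q, op=lambda a, b: (a + b) / 2):
    """Apply a numerical operation to P_i and Q_i per dimension i,
    producing the values R_i of the recombined vector."""
    assert len(p) == len(q)  # both sub-vectors share the same dimensionality
    return [op(pi, qi) for pi, qi in zip(p, q)]

r = combine_by_operation([0.25, 0.75, 0.5], [0.75, 0.25, 0.5])
print(r)
```

Any associative dimension-wise operation (sum, mean, product) fits this claim; swapping the `op` argument changes the combination rule without touching the rest of the pipeline.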
The sub-participles in the subsequence to be recombined comprise a first sub-participle and a second sub-participle, the sub-vector corresponding to the first sub-participle is a first sub-vector, and the sub-vector corresponding to the second sub-participle is a second sub-vector;
the code combining unit includes:
a second value obtaining subunit, configured to obtain a value P_i in the first sub-vector and obtain a value Q_i in the second sub-vector; i is the dimension index of the first sub-vector and the second sub-vector, and i is a positive integer; the value P_i is the value corresponding to the i-th dimension in the first sub-vector, and the value Q_i is the value corresponding to the i-th dimension in the second sub-vector;
a value selecting subunit, configured to select the maximum value or the minimum value from the value P_i and the value Q_i as a value M_i;
a second recombined vector subunit, configured to generate the recombined vector according to the value M_i.
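The max/min selection variant is a dimension-wise pooling; a minimal sketch (function name assumed):

```python
def combine_by_pooling(p, q, use_max=True):
    """Select the maximum (or minimum) of P_i and Q_i per dimension i
    as the value M_i of the recombined vector."""
    pick = max if use_max else min
    return [pick(pi, qi) for pi, qi in zip(p, q)]

print(combine_by_pooling([0.25, 0.75], [0.5, 0.5]))            # max pooling
print(combine_by_pooling([0.25, 0.75], [0.5, 0.5], False))     # min pooling
```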
Wherein the at least two sub-participles comprise a sub-participle S_n, and n is a positive integer less than or equal to the number of the at least two sub-participles;
the second obtaining module includes:
an initial sub-vector obtaining unit, configured to obtain a first initial sub-vector corresponding to the sub-participle S_n and obtain a second initial sub-vector corresponding to the remaining sub-participles; the remaining sub-participles include the sub-participles, among the at least two sub-participles, other than the sub-participle S_n;
a fused initial sub-vector unit, configured to perform vector fusion on the first initial sub-vector and the second initial sub-vector to obtain a fused sub-vector;
a weighted initial sub-vector unit, configured to perform weighted fusion on the fused sub-vector and the first initial sub-vector to obtain the sub-vector corresponding to the sub-participle S_n.
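The fuse-then-weight step for the sub-participle S_n can be sketched as below; the mean fusion and the weighting factor `alpha` are assumptions, since the claim leaves the exact fusion and weighting schemes open.

```python
def contextual_subvector(first, others, alpha=0.5):
    """Fuse the remaining sub-participles' initial sub-vectors into one
    vector (here: element-wise mean), then take a weighted combination
    with the first initial sub-vector of S_n."""
    dim = len(first)
    fused = [sum(v[i] for v in others) / len(others) for i in range(dim)]
    return [alpha * first[i] + (1 - alpha) * fused[i] for i in range(dim)]

# S_n's initial sub-vector, plus two remaining sub-participles' vectors.
out = contextual_subvector([1.0, 0.0], [[0.0, 1.0], [0.0, 0.0]])
print(out)
```

The resulting sub-vector mixes S_n's own initial representation with context from the remaining sub-participles, which is what makes the per-sub-participle vectors context-aware before recombination.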
The at least two sub-participles are single characters in the word segmentation system, or roots composed of at least two characters.
Wherein, the above-mentioned apparatus further includes:
a module for determining a criticality probability, configured to determine, according to the restructured vector, a criticality probability of the participle corresponding to the restructured vector in the text;
a keyword determining module, configured to determine keywords in the text according to the criticality probability; wherein the word segmentation includes the keyword.
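A criticality-probability scorer over the recombined vectors could look like the sketch below; the logistic scorer, the weight vector and the 0.5 threshold are all hypothetical, since the embodiments do not fix a particular classifier.

```python
import math

def keyword_probabilities(word_vectors, weights, bias=0.0):
    """Assign each recombined vector a criticality probability via a
    logistic score (one hypothetical realization of the claim)."""
    probs = []
    for v in word_vectors:
        z = sum(wi * vi for wi, vi in zip(weights, v)) + bias
        probs.append(1.0 / (1.0 + math.exp(-z)))
    return probs

# Two participles' recombined vectors; the first scores as more critical.
probs = keyword_probabilities([[2.0, 0.0], [-2.0, 0.0]], [1.0, 0.0])
print(probs)
```

Participles whose probability exceeds a chosen threshold are then reported as the keywords of the text.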
Wherein the recombination vector comprises a first recombination vector and a second recombination vector;
further comprising:
a similarity obtaining module, configured to obtain a similarity between the first recombined vector and the second recombined vector;
and the similar word segmentation determining module is used for determining the word segmentation corresponding to the first recombination vector and the word segmentation corresponding to the second recombination vector as the semantic similar word segmentation cluster of the text if the similarity is greater than a similarity threshold value.
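One standard way to realize the similarity comparison between two recombined vectors is cosine similarity; the metric and the threshold value are assumptions for the sketch, as the claim only requires some similarity measure and threshold.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two recombined vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

THRESHOLD = 0.9  # hypothetical similarity threshold
v1, v2 = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # parallel -> similarity ~ 1.0
if cosine_similarity(v1, v2) > THRESHOLD:
    print("participles belong to the same semantic-similarity cluster")
```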
Wherein, the above-mentioned apparatus further includes:
a part-of-speech recognition module for recognizing the part-of-speech type corresponding to the participle according to the recombined vector;
and a part-of-speech tag adding module, configured to obtain a part-of-speech tag corresponding to the part-of-speech type, and add the part-of-speech tag corresponding to the participle to a corresponding position in the text.
One aspect of the present application provides a computer device, comprising: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is used for providing a data communication function, the memory is used for storing a computer program, and the processor is used for calling the computer program to execute the method in the embodiment of the present application.
An aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions, which, when executed by a processor, perform a method as in the embodiments of the present application.
The embodiment of the application obtains a participle in a text; segments the participle to obtain at least two sub-participles corresponding to the participle; obtains sub-vectors respectively corresponding to the at least two sub-participles; and encodes and combines the sub-vectors respectively corresponding to the at least two sub-participles according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector, where the recombined vector is used for representing the participle and for performing text processing on the text. In this way, the at least two sub-participles corresponding to the text are obtained by further segmenting the participles in the text; the at least two sub-participles are then encoded to obtain sub-vectors fused with context information, the sub-vector representations are encoded according to the distribution positions of the at least two sub-participles in the text to obtain the recombined vectors corresponding to the text, and various text processing can finally be performed on the text according to the recombined vectors. Because the granularity of the sub-participles is smaller, the sub-vectors corresponding to the sub-participles are more accurate, so the accuracy of the recombined vectors can be improved; that is, the semantics of the participles can be represented more accurately through the recombined vectors. Moreover, the distribution position of each sub-participle in the text is considered during recombination, which ensures that the recombined vector of a participle is related to the sub-participles it contains; that is, errors in extracting boundary information can be avoided, further ensuring the accuracy of the recombined vector.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1a is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 1b is a schematic structural diagram of a text processing module according to an embodiment of the present disclosure;
FIG. 1c is a schematic diagram of a text hierarchy provided in an embodiment of the present application;
fig. 2a is a schematic view of a text processing scenario provided in an embodiment of the present application;
fig. 2b is a schematic view of a text processing scenario provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a text processing method according to an embodiment of the present application;
fig. 4 is a schematic view of a text processing scenario provided in an embodiment of the present application;
fig. 5a is a schematic view of a text processing scenario provided in an embodiment of the present application;
fig. 5b is a schematic view of a text processing scenario provided in an embodiment of the present application;
fig. 6 is a schematic flowchart of a text processing method according to an embodiment of the present application;
fig. 7a is a schematic view of a scene of a candidate word according to an embodiment of the present application;
fig. 7b is a scene schematic diagram of a candidate word according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Please refer to fig. 1a, which is a schematic diagram of a system architecture according to an embodiment of the present application. The server 10a provides services for a user terminal cluster, which may include: user terminal 10b, user terminal 10c, …, user terminal 10d. When the user terminal 10d (or the user terminal 10b or the user terminal 10c) acquires a text input by a user and needs to perform text processing on the text, for example, text emotion classification, text keyword extraction, or text word clustering, the text may be sent to the server 10a. Please also refer to fig. 1b, which is a schematic structural diagram of a text processing module provided in the embodiment of the present application. For ease of understanding and description, the following key terms are first briefly described.
1. Word segmentation system: the granularity of text generally contains the levels of characters, words, sentences, paragraphs, chapters, and the like. Characters may be called single characters, or simply characters; words are also referred to simply as words. For example, for the short text "Artificial intelligence is an important component of computer science", the character level is: "human", "artificial", "intelligence", "energy", "yes", "count", "calculation", "machine", "science", "learning", "heavy", "important", "composition", "part"; the word level is: "artificial", "intelligent", "yes", "computer", "scientific", "important", "composition", "part"; and the sentence level is the entire short text. The paragraph level and chapter level are broader definitions. The word segmentation system may segment the input text into character-level or word-level representations, for example via a standard analyzer (StandardAnalyzer), a stop-word analyzer (StopAnalyzer), a whitespace analyzer (WhitespaceAnalyzer), and the like.
2. Subword: the common levels of text were introduced above, but there is also a level between single characters and words, called the subword level. It is not obvious in Chinese, where it does not differ from single characters, but it is obvious in English; for example, "tencent" is cut into "te", "##nce" and "##nt". These 3 pieces are neither character level nor word level; they belong to the subword level. The minimum segmentation granularity of text is the subword level.
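The "tencent" example above follows the WordPiece-style convention where non-initial pieces carry a "##" continuation prefix. A minimal greedy longest-match tokenizer sketch, using a toy vocabulary rather than a real BERT vocabulary:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match subword segmentation; non-initial pieces
    carry the '##' continuation prefix, as in BERT-style tokenizers."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation piece inside a word
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary piece matched at this position
        start = end
    return pieces

vocab = {"te", "##nce", "##nt"}
print(wordpiece_tokenize("tencent", vocab))
```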
3. Dictionary database: refers to a large set of words or phrases constructed in advance for text processing. The words in the dictionary database may include Chinese text (including different dialect texts from various places), English text, and the like. The dictionary database may be created by dividing the text information in any one or more databases into phrases according to common punctuation marks, such as "artificial intelligence", "the People's Republic of China", and the like, which are composed of words; the common punctuation marks may be the pause sign, comma, semicolon, colon, period, the interval sign ("-") in English, and the like. It should be noted that in practical applications, punctuation that may be contained within phrases cannot be used to segment documents, such as dashes used for connecting contextual statements.
4. Word segmentation encoding network model: a bidirectional language model obtained through pre-training, including but not limited to the ELMo network (Embeddings from Language Models) and the BERT network (Bidirectional Encoder Representations from Transformers). In the application, the word segmentation encoding network model needs to be trained in advance based on a dictionary database to generate a deep bidirectional language model; by means of the internal structure of the model, the word vectors or word embeddings corresponding to an input text are learned. Word embedding refers to automatically learning a mapping from data in the input space to a distributed representation space, which reduces the amount of data required for training; it can also be understood as embedding a high-dimensional space into a low-dimensional space, with each word or phrase mapped to a vector over the real numbers. The ELMo model represents word vectors by linear combinations between layers based on a bidirectional language model. The BERT model, based on the Transformer bidirectional encoder representation, is an advanced text representation method that can represent the input text as a low-dimensional dense vector. In computer science, a low-dimensional dense vector generally does not exceed one thousand dimensions, and each element in the vector is not 0 but a decimal between 0 and 1; a corresponding high-dimensional sparse vector generally exceeds one thousand dimensions, and most elements in the vector are 0. Unlike other recent language representation models, BERT aims to pre-train deep bidirectional representations by jointly conditioning on context in all layers, so that the pre-trained BERT representation can be fine-tuned for the text through an additional output layer, thereby meeting text processing requirements.
As shown in fig. 1b, the text processing module 400 provided in this embodiment can be divided into the following four layers.
First, the layer 100a is input. The user terminal (also including the user terminal 10d, the user terminal 10b, or the user terminal 10c in fig. 1 a) sends the text to be processed to the server (also including the server 10a in fig. 1 a), where the text to be processed is "Tencent, advertisement, title, advertising", it should be noted that the "text to be processed" mentioned in this application is a word or phrase text that has been segmented by a segmentation system, for example, the original text is "Tencent advertisement is a department of Tencent corporation", and the text to be processed "Tencent, advertisement, which is Tencent, a company, English, one, a department", is obtained by the segmentation system using the above-mentioned standard segmentation technology, so that the text is not repeated.
Second, the first transport layer 200a. After receiving the text to be processed, the server transmits it to the word segmentation device; after segmentation, the minimum segmentation granularity of the text, namely the subword level, is obtained. That is, the text to be processed is segmented into the subword sequence 100b, in which each Chinese participle is split into its single characters and each English participle is split into subwords, e.g. "te ##nce ##nt" and "ad ##ver ##tis ##ing", where "te", "nce", "nt", "ad", "ver", "tis" and "ing" are all subwords.
Third, the encoding layer 500. The text to be processed and the subword sequence 100b are both natural language, not computer language, so the encoding layer 500 converts the subwords into subword vectors. It first obtains the initial subword vector corresponding to each subword in the subword sequence 100b, and then generates the initial subword vector sequence 100c from the initial subword vectors of the multiple subwords. The encoding layer 500 encodes the initial subword vector sequence 100c through the semantic extraction model 300 to obtain the subword sense vector corresponding to each subword in the subword sequence 100b, and then generates the subword sense vector sequence 100d from the multiple subword sense vectors.
Fourth, the second transport layer 200b. Since the encoding layer 500 represents text at the subword level, it outputs a vector representation for each subword; that is, it can only output vector representations for "te", "##nce" and "##nt", and cannot output a vector representation for "tencent" (word level). This poses a challenge for natural language processing tasks, because in practical applications tasks operate on texts at the character or word level. For example, for keyword extraction, the text to be processed is segmented into "te ##nce ##nt, ad ##ver ##tis ##ing, is, the, de ##part ##ment, of, te ##nce ##nt, com ##pany", so the boundary information of words is lost during encoding; that is, "tencent" is itself one word, yet a wrong keyword extraction result such as "tencent adver" may be obtained. Therefore, the second transport layer 200b is added in the present application to recombine the subword sense vector sequence 100d: for example, obtain the number of segmentation segments, determine the number of subword sense vectors belonging to each participle, recombine the subword sense vectors according to the distribution position of each subword in the text to obtain the recombined vector corresponding to each participle, and then combine the recombined vectors into the recombined word vector sequence 100e according to the distribution position of each participle in the text. Through this process, a word-level sequence representation can be recovered on the basis of the subword-level text vector representation, and vector representations for subsequent tasks can then be produced. This avoids the problem of word boundary errors while effectively exploiting the better subword encoding brought by large-scale pre-training of the BERT model.
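The recombination performed by the second transport layer can be sketched as below; mean pooling over each participle's subword sense vectors is one assumed combination rule, and the vectors are toy values.

```python
def recombine(subword_vectors, segment_sizes):
    """Recombine a flat sequence of subword sense vectors into one vector
    per original participle, given how many subwords each participle was
    split into (mean pooling as an illustrative combination)."""
    word_vectors, pos = [], 0
    for size in segment_sizes:
        group = subword_vectors[pos:pos + size]
        dim = len(group[0])
        word_vectors.append(
            [sum(v[i] for v in group) / size for i in range(dim)])
        pos += size
    return word_vectors

# 'tencent' -> 3 subwords, 'is' -> 1 subword (hypothetical sense vectors).
vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.25, 0.75]]
print(recombine(vecs, [3, 1]))
```

Because the segment sizes come from the original word segmentation, word boundaries are preserved exactly, which is the point of this layer.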
The text processing module 400 mentioned above can also be considered as a modified BERT module.
Further, please refer to fig. 1c, which is a schematic diagram of a text hierarchy provided in the embodiment of the present application. As shown in fig. 1c, in a foreign language such as English, the character level is a single letter, i.e. "T, e, n, …" in fig. 1c; a subword is composed of one or more characters, i.e. "Te ##nce ##nt" in fig. 1c; and a word is a complete structure, the most important and direct level for composing text. It should be noted that, in Chinese, characters and subwords are both regarded as single characters, and the word level can be regarded as words; a phrase is a combination consisting of multiple characters or words, such as "Tencent advertising"; and the sentence level is a complete text, such as "Tencent advertising is the department of Tencent company".
The user terminal may include a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a smart speaker, a mobile internet device (MID), a POS (Point of Sale) device, a wearable device (e.g., a smart watch or smart bracelet), and the like.
Further, please refer to fig. 2a, which is a schematic view of a text processing scenario provided in an embodiment of the present application. For ease of distinguishing, when a text is cut into a word sequence, commas represent the cut marks; for example, the text "artificial intelligence is an important component of computer science" is cut into the word sequence "artificial intelligence, is, computer science, important, composition, part". When words are divided into subwords, the cut marks are represented by the pound sign "##"; for example, each word in the word sequence "artificial, intelligent, computer, scientific, important, composition, part" (single characters in the original Chinese) is split into its component subwords joined by "##", giving the subword sequence. As shown in fig. 2a, the user terminal 10d sends the text to the server 10a. After the server 10a receives the text, the text is first segmented, that is, divided into words or phrases by the word segmentation system to obtain the word sequence 20a, namely the sequence "A, B, C, …, D"; the word sequence 20a is then input into the text processing module 400 shown in fig. 1b. The text processing module 400 first performs subword segmentation on the participles in the word sequence 20a; as shown in fig. 2a, the participle A is segmented into subwords A1 and A2, the participle B into subwords B1 and B2, the participle C into subwords C1, C2, …, and the participle D into subwords D1 and D2, yielding the subword sequence 20b "A1 ##A2, B1 ##B2, C1 ##C2, D1 ##D2" according to the distribution positions of the subwords in the text. Each subword in the subword sequence 20b is converted into a one-dimensional subword vector, for example by querying a word vector table; that is, a corresponding original subword vector is generated for each subword in the subword sequence 20b, giving the original subword vector sequence 200a corresponding to the subword sequence 20b. The original subword vector sequence 200a is then input into the BERT model for further encoding to obtain the subword sense vector sequence 200b containing the contextual semantic information of the text.
If the subword sense vector sequence 200b is directly used for a specific task on the text, a boundary error problem may occur. For example, keyword extraction from a text relies on word-level vector representations: if extraction is performed on subword-level encodings, a keyword such as "Tencent science" may instead yield an erroneous keyword such as "science", resulting in a boundary error. Therefore, after the BERT model output, the subword sense vector sequence 200b is recombined, as shown in fig. 2a, to obtain a recombined vector sequence 200c, in which the recombined vectors correspond one-to-one to the words in the word sequence 20a. Finally, according to the practical application, the recombined vector sequence 200c is input into a specific natural language processing task, and an ideal result can be obtained. Please refer to fig. 2b together, which is a schematic view of a text processing scenario provided in an embodiment of the present application. After the server 10a obtains the text 10b and obtains the recombined vector sequence 200c through the process detailed above, the recombined vector sequence 200c is input into the neural network model 20, which may include, but is not limited to, a Long Short-Term Memory network (LSTM), a Convolutional Neural Network (CNN), and the like, and the keywords in the text 10b can then be obtained. As shown in fig. 2b, the keywords in the text 10b are "china" and "science and technology", and the text 10b together with the keywords "china" and "science and technology" is sent to the user terminal 10d; after receiving the keywords transmitted from the server 10a, the user terminal 10d may display them on the interface in the form of text or voice.
Further, please refer to fig. 3, which is a flowchart illustrating a text processing method according to an embodiment of the present application. As shown in fig. 3, the method may include:
and step S101, acquiring word segmentation in the text.
Specifically, a participle as used herein may be a word in a foreign language, a single Chinese character, or a phrase. The participles in a text form the word sequence produced by the word segmentation system, that is, the words or phrases form a sequence according to their distribution positions in the text.
And S102, segmenting the participles to obtain at least two sub-participles corresponding to the participles.
Specifically, since the levels corresponding to participles are words, phrases, and the like, the segmentation granularity is large, and the precision is coarse when a participle is converted into a participle vector according to a word vector table. Therefore, the participles are further segmented before the text is encoded, so as to obtain a smaller segmentation granularity, namely the subword level. For example, the text "artificial intelligence is an important component of computer science" is first segmented into the word sequence "artificial intelligence, is, computer science, important, component, part"; the word sequence is then input into a text processing module, such as the text processing module 400 in fig. 1b, which segments each word into a subword sequence, e.g., "artificial ##intelligence, is, computer ##science, important, compo ##nent, part".
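As a hedged illustration of this word-to-subword cut: the application does not fix a particular subword algorithm, so the greedy longest-match (WordPiece-style) segmenter below, together with its toy vocabulary and the `wordpiece_split` helper name, is an assumption, not the patent's method.

```python
# Illustrative sketch only: greedy longest-match (WordPiece-style) subword
# segmentation; vocabulary and helper name are assumptions.

def wordpiece_split(word, vocab):
    """Split `word` into subwords by greedy longest match; '##' marks continuations."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:                      # unknown character: emit it alone
            piece = word[start] if start == 0 else "##" + word[start]
            end = start + 1
        pieces.append(piece)
        start = end
    return pieces

vocab = {"te", "##nce", "##nt", "ad", "##ver", "##tis", "##ing"}
print(wordpiece_split("tencent", vocab))      # ['te', '##nce', '##nt']
print(wordpiece_split("advertising", vocab))  # ['ad', '##ver', '##tis', '##ing']
```

With a real vocabulary this produces exactly the "te ##nce ##nt"-style sequences shown in the figures.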
Step S103, obtaining the subvectors corresponding to the at least two sub-participles respectively.
Specifically, the at least two sub-participles include a sub-participle S_n, where n is a positive integer less than or equal to the number of the at least two sub-participles. A first initial sub-vector corresponding to the sub-participle S_n is obtained, and second initial sub-vectors corresponding to the remaining sub-participles are obtained; the remaining sub-participles are the sub-participles among the at least two sub-participles other than the sub-participle S_n. Vector fusion is performed on the first initial sub-vector and the second initial sub-vectors to obtain a fused sub-vector; weighted fusion is then performed on the fused sub-vector and the first initial sub-vector to obtain the sub-vector corresponding to the sub-participle S_n.
Please refer to fig. 4, which is a schematic view of a text processing scenario provided in an embodiment of the present application. As shown in fig. 4, the second subword in the subword sequence 200 is taken as the target subword 2002 (corresponding to the above-mentioned sub-participle S_n), and a subword vector representation corresponding to each subword in the subword sequence 200 is obtained. In practice, the input to the BERT model contains two parts in addition to each original subword vector in the subword sequence 200: 1. a text vector, whose value is learned automatically during model training, is used to characterize the global semantic information of the text and is fused with the semantic information of the subwords; 2. a position vector: because subwords appearing at different positions of a text carry different semantic information (for example, "I love you" versus "you love me"), the BERT model adds a different vector to subwords at different positions to distinguish them. Finally, the sum of the original subword vector, the text vector, and the position vector is used as the BERT model input. To distinguish them from the subword sense vectors mentioned later, this sequence of subword vectors is still referred to as the subword original vector sequence 300, in which the target subword original vector 3002a is the first initial sub-vector of the target subword 2002.
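The three-part BERT input just described reduces to an element-wise sum; a minimal sketch, in which the dimensions and random embedding values are illustrative assumptions:

```python
# Sketch under assumptions: each position's BERT input is the element-wise sum
# of its original subword vector, a shared text vector, and a position vector.
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 4, 8
subword_vecs  = rng.normal(size=(seq_len, dim))  # original subword vectors
text_vec      = rng.normal(size=dim)             # one global text vector, shared by all positions
position_vecs = rng.normal(size=(seq_len, dim))  # a distinct vector per position

bert_input = subword_vecs + text_vec + position_vecs  # text_vec broadcasts over positions
```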
Similarity calculation is performed between the target subword original vector 3002a and each sub-vector in the subword original vector sequence 300; common similarity functions include the dot product, concatenation, a perceptron, and the like. This yields the fused sub-vector 600, which represents the vector weights corresponding to the target subword original vector 3002a and the other subword original vectors respectively. Each vector weight in the fused sub-vector 600 is then weighted and summed with the corresponding sub-vector in the subword original vector sequence 300 to obtain the target subword vector 3002b, that is, the sub-vector corresponding to the sub-participle S_n.
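A hedged sketch of this weighting step, fixing the dot product as the similarity function and a softmax to normalize the weights (the text names dot product, concatenation, and a perceptron as options but prescribes none of them):

```python
# Assumption-laden sketch: dot-product similarity + softmax weights, then a
# weighted sum over all subword original vectors yields the target subword vector.
import numpy as np

def fuse_subvector(target, all_vecs):
    scores = all_vecs @ target               # similarity of target to every subword vector
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ all_vecs                # weighted summation -> fused representation

vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy subword original vectors
fused = fuse_subvector(vecs[1], vecs)                  # vector for the second subword
```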
Step S104, coding and combining the sub-vectors respectively corresponding to the at least two sub-participles according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector; the restructuring vector is used for representing the participle and is used for performing text processing on the text.
Specifically, a sub-participle sequence including the at least two sub-participles is generated according to the distribution positions of the at least two sub-participles in the text; the number of segmentation segments of the participle is acquired, and the target sub-participle number corresponding to the participle is determined according to the number of segmentation segments; the sub-participle sequence is divided into subsequences to be recombined according to the target sub-participle number, where the number of sub-participles in a subsequence to be recombined is equal to the target sub-participle number. The sub-participles in the subsequence to be recombined include a first sub-participle and a second sub-participle; the sub-vector corresponding to the first sub-participle is a first sub-vector, and the sub-vector corresponding to the second sub-participle is a second sub-vector. A value P_i in the first sub-vector is obtained, and a value Q_i in the second sub-vector is obtained, where i is a dimension index of the first sub-vector and the second sub-vector and i is a positive integer; the value P_i is the value corresponding to the i-th dimension of the first sub-vector, and the value Q_i is the value corresponding to the i-th dimension of the second sub-vector. A numerical operation is performed on the value P_i and the value Q_i to generate a value R_i, and the recombined vector is generated according to the values R_i.
Optionally, the sub-participle sequence includes a sub-participle t_j, where j is a positive integer less than or equal to the number of sub-participles in the sub-participle sequence. The sub-participle t_j in the sub-participle sequence is obtained and taken as the participle to be evaluated. If the participle to be evaluated does not belong to the legal participles, the sub-participle t_{j+1} in the sub-participle sequence is obtained, and the participle to be evaluated is updated according to the sub-participle t_j and the sub-participle t_{j+1}. If the updated participle to be evaluated belongs to the legal participles, it is determined that the sub-participle t_j and the sub-participle t_{j+1} form the subsequence to be recombined; here, t_{j-1} is empty, or t_{j-1} belongs to the previous subsequence to be recombined. A dictionary database is obtained; if the participle to be evaluated is stored in the dictionary database, it is determined that the participle to be evaluated belongs to the legal participles, and the sub-participle t_j is assigned to the subsequence to be recombined; if the participle to be evaluated does not exist in the dictionary database, it is determined that the participle to be evaluated does not belong to the legal participles.
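The dictionary-driven grouping just described can be sketched as follows; the toy dictionary, the merging rule for "##" prefixes, and the absence of a fallback for groups that never match are all simplifying assumptions:

```python
# Sketch: grow a group of consecutive subwords until the merged string appears
# in the dictionary database (a legal participle), then start the next group.

def regroup(subwords, dictionary):
    groups, current = [], []
    for sw in subwords:
        current.append(sw)
        merged = "".join(p.lstrip("#") for p in current)  # drop '##' continuation marks
        if merged in dictionary:      # legal participle -> close this subsequence
            groups.append(current)
            current = []
    if current:                       # leftover subwords that never matched
        groups.append(current)
    return groups

dictionary = {"tencent", "advertising"}
subs = ["te", "##nce", "##nt", "ad", "##ver", "##tis", "##ing"]
print(regroup(subs, dictionary))
# [['te', '##nce', '##nt'], ['ad', '##ver', '##tis', '##ing']]
```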
Optionally, a value P_i in the first sub-vector is obtained, and a value Q_i in the second sub-vector is obtained, where i is a dimension index of the first sub-vector and the second sub-vector and i is a positive integer; the value P_i is the value corresponding to the i-th dimension of the first sub-vector, and the value Q_i is the value corresponding to the i-th dimension of the second sub-vector. The maximum or minimum of the value P_i and the value Q_i is selected as a value M_i, and the recombined vector is generated according to the values M_i.
Please refer to fig. 5a, which is a schematic view of a text processing scenario provided in an embodiment of the present application. As shown in fig. 5a, each word in the word sequence 20a is segmented to obtain the subword sequence 20b; both the word sequence 20a and the subword sequence 20b are ordered according to the distribution positions of their units (words and subwords) in the text, so the sequences can be recombined in unit order. Referring to fig. 5a, after the participle A is segmented, two subwords are generated: subword a1 and subword a2; after the participle B is segmented, subword b1 and subword b2 are generated; after the participle C is segmented, subword c1 and subword c2 are generated; after the participle D is segmented, subword d1 and subword d2 are generated. For example, a word sequence such as "Tencent, advertising, Tengxun, guanggao" may be segmented into "te ##nce ##nt, ad ##ver ##tis ##ing, Teng ##xun, guang ##gao", with the words divided into 3, 4, 2, and 2 segments respectively. Thus, after the number of segmentation segments of each participle is acquired, the target sub-participle number corresponding to the participle can be determined according to the number of segments, and the subword sequence 20b can then be divided into the subsequences 20c to be recombined according to the target sub-participle number.
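Splitting the subword sequence by the known per-word segment counts is then a simple slicing job; a minimal sketch, with the helper name and example values assumed:

```python
# Sketch: the number of cut segments per participle gives the target subword
# count, so the flat subword sequence can be sliced group by group.

def split_by_counts(subwords, counts):
    groups, pos = [], 0
    for n in counts:                       # one slice per participle
        groups.append(subwords[pos:pos + n])
        pos += n
    return groups

subs = ["te", "##nce", "##nt", "ad", "##ver", "##tis", "##ing", "Teng", "##xun"]
print(split_by_counts(subs, [3, 4, 2]))
# [['te', '##nce', '##nt'], ['ad', '##ver', '##tis', '##ing'], ['Teng', '##xun']]
```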
Optionally, please refer to fig. 5b together, which is a schematic view of a text processing scenario provided in an embodiment of the present application. After the subword sequence 20b is obtained, one or more subwords are taken as a subword group to be evaluated (namely, a participle to be evaluated). As shown in fig. 5b, (a1, a2) is taken as the subword group to be evaluated, and the participle to be evaluated generated by recombining (a1, a2) is checked against the dictionary database 10b. If the participle to be evaluated generated by recombining (a1, a2) exists in the dictionary database 10b, it is determined to belong to the legal participles, and (a1, a2) is determined as a subsequence to be recombined; if the participle to be evaluated does not exist in the dictionary database 10b, it is determined not to belong to the legal participles. In that case, as shown in fig. 5b, the next subword b1 is appended to (a1, a2) to form the subword group to be evaluated (a1, a2, b1), and the participle to be evaluated generated by recombining (a1, a2, b1) is checked against the dictionary database 10b, so that whether the subword group (a1, a2, b1) is a subsequence to be recombined can be determined.
After the subsequence 20c to be recombined is obtained, the subwords in the subsequence 20c are coded and combined. The coding-combination modes fall into two categories: numerical operation and numerical comparison. Referring to fig. 5a again, assuming that the vector corresponding to subword a1 is (1, 1, …, 0) and the vector corresponding to subword a2 is (2, 2, …, 4), the vector corresponding to the participle A, namely (3, 3, …, 4), is obtained by adding the values of the two vectors dimension by dimension. Besides the addition operation illustrated above, when the vectors corresponding to multiple subwords are coded and combined, a subtraction operation, an averaging operation, or a combination of two or more operations may be performed, which is not limited in this application.
When the vectors corresponding to the subwords are coded and combined by numerical comparison, for example by selecting the maximum value, the vector (1, 1, …, 0) corresponding to subword a1 and the vector (2, 2, …, 4) corresponding to subword a2 are compared dimension by dimension for the maximum value, and the vector generated for the participle A is (2, 2, …, 4).
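Both coding-combination families can be sketched with the example values above, taking element-wise addition as the numerical operation and element-wise maximum as the numerical comparison:

```python
# Sketch of the two combination modes on the example subvectors from the text.
import numpy as np

a1 = np.array([1, 1, 0])            # subvector of subword a1
a2 = np.array([2, 2, 4])            # subvector of subword a2

combined_sum = a1 + a2              # numerical operation  -> vector for participle A
combined_max = np.maximum(a1, a2)   # numerical comparison -> alternative vector
print(combined_sum.tolist())        # [3, 3, 4]
print(combined_max.tolist())        # [2, 2, 4]
```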
It should be noted that the vector values mentioned above are merely exemplary and do not represent actual values.
In the embodiment of the application, participles in a text are acquired; each participle is segmented to obtain at least two sub-participles corresponding to the participle; sub-vectors respectively corresponding to the at least two sub-participles are acquired; and the sub-vectors are coded and combined according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector, which is used for representing the participle and for performing text processing on the text. In this way, the participles in the text are further segmented to obtain the sub-participles; the sub-participles are then encoded to obtain sub-vectors fused with context information, and the sub-vector representations are coded and combined according to the distribution positions of the sub-participles in the text to obtain the recombined vectors corresponding to the text, based on which various kinds of text processing can finally be performed. Because the granularity of the sub-participles is smaller, the sub-vectors corresponding to the sub-participles are more accurate, and the accuracy of the word vectors of the participles can thereby be improved.
Further, please refer to fig. 6, which is a flowchart illustrating a text processing method according to an embodiment of the present application. As shown in fig. 6, the method may include:
step S201, acquiring word segmentation in the text.
Step S202, segmenting the word segmentation to obtain at least two sub-word segmentations corresponding to the word segmentation.
Step S203, obtaining the subvectors corresponding to the at least two sub-participles respectively.
Step S204, coding and combining the sub-vectors respectively corresponding to the at least two sub-participles according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector; the restructuring vector is used for representing the participle and is used for performing text processing on the text.
For specific implementation processes of step S201 to step S204, reference may be made to the description of step S101 to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.
Step S205, obtaining the criticality probability of the word segmentation corresponding to the restructured vector in the text according to the restructured vector.
Specifically, this embodiment can be applied to scenarios such as advertising and keyword prompting in an information retrieval system. Keywords play a significant role in a large number of situations, typically information retrieval systems (search engines, e-commerce search), recommendation systems (e-commerce platforms), advertising systems, dialog systems, and so forth. For example, a search engine may often provide candidate words; please refer to fig. 7a and fig. 7b together, which are schematic views of candidate-word scenarios provided in embodiments of the present application. Fig. 7a shows the prompt after "patent" is input in search application A, and fig. 7b shows the prompt after "patent" is input in search application B; the prompt words (candidate words) are all obtained by the keyword extraction technique.
Referring to fig. 2b again, after the recombined vector sequence 200c is obtained through steps S201 to S204, the recombined vector sequence 200c is input into the neural network model 20, which may include, but is not limited to, a Long Short-Term Memory network (LSTM), a Convolutional Neural Network (CNN), and the like, so that the criticality probabilities of the participles respectively corresponding to the recombined vectors can be obtained. For example, after the text "Chinese science and technology moves toward the world" is processed, the criticality probability of "Chinese" is 0.92, that of "science and technology" is 0.93, that of "moves" is 0.6, that of "toward" is 0.23, and that of "world" is 0.8.
Optionally, the neural network model 20 may include the text processing module 400 of fig. 1b. The training process of the neural network model 20 may be as follows: a sample text and the sample keywords of the sample text are acquired; the sample text is input into the neural network model 20; a recombined vector of each participle of the sample text is obtained through the text processing module 400 in the neural network model 20; the neural network model 20 then outputs predicted keywords corresponding to the sample text based on the recombined vectors; and the model parameters in the neural network model 20 are adjusted according to the errors between the predicted keywords and the sample keywords until the neural network model 20 converges. The subword sense vectors generated by the trained neural network model 20 can accurately express the semantic information of the text, so that the keywords in the text can be accurately extracted.
Step S206, determining keywords in the text according to the criticality probability; wherein the word segmentation includes the keyword.
Specifically, after the criticality probability corresponding to each participle is obtained, the keywords of the text may be determined according to the criticality probabilities; for example, for the text "Chinese science and technology moves toward the world", the top two keywords "Chinese" and "science and technology" can be determined according to the criticality probabilities.
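Selecting keywords from the criticality probabilities is a top-k ranking; a minimal sketch using the example probabilities quoted above (keeping the top two is an assumption taken from the example):

```python
# Sketch: rank participles by criticality probability and keep the top two.
probs = {"china": 0.92, "science and technology": 0.93,
         "moves": 0.6, "toward": 0.23, "world": 0.8}

top2 = sorted(probs, key=probs.get, reverse=True)[:2]
print(top2)  # ['science and technology', 'china']
```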
The recombined vectors output by the text processing module can be applied to different natural language processing tasks. Optionally, the recombined vectors include a first recombined vector and a second recombined vector; the similarity between the first recombined vector and the second recombined vector is obtained; and if the similarity is greater than a similarity threshold, the participle corresponding to the first recombined vector and the participle corresponding to the second recombined vector are determined to form a semantically similar participle cluster of the text. Practical application scenarios of this task include question answering (judging whether a question matches an answer), sentence matching (whether two sentences express the same meaning), and the like.
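A hedged sketch of this similarity branch, assuming cosine similarity and an illustrative 0.8 threshold (the application fixes neither the similarity function nor the threshold):

```python
# Sketch: cosine similarity between two recombined vectors; exceeding the
# threshold places both participles in one semantically similar cluster.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

v1 = np.array([1.0, 2.0, 3.0])   # first recombined vector (toy values)
v2 = np.array([1.1, 1.9, 3.2])   # second recombined vector (toy values)
similar = cosine(v1, v2) > 0.8   # True -> same semantic participle cluster
```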
Optionally, the part-of-speech type corresponding to each participle is identified according to its recombined vector; the part-of-speech tag corresponding to the part-of-speech type is acquired, and the part-of-speech tag corresponding to the participle is added at the corresponding position in the text. Since a word such as "lovely" can be used as a noun, a verb, or an adjective depending on the sentence, the part of speech of a participle is determined according to the context information.
Please refer to fig. 8, which is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application. The text processing means may be a computer program (including program code) running on a computer device, for example, the text processing means is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 8, the text processing apparatus 1 may include: a first acquisition module 11, a segmentation module 12, a second acquisition module 13 and a reassembly module 14.
The first obtaining module 11 is configured to obtain a word segmentation in a text;
a segmentation module 12, configured to segment the segmented word to obtain at least two sub-segmented words corresponding to the segmented word;
a second obtaining module 13, configured to obtain sub-vectors corresponding to the at least two sub-participles respectively;
a restructuring module 14, configured to perform coding combination on the subvectors corresponding to the at least two sub-participles respectively according to distribution positions of the at least two sub-participles in the text, so as to obtain a restructuring vector; the restructuring vector is used for representing the participle and is used for performing text processing on the text.
For specific functional implementation manners of the first obtaining module 11, the dividing module 12, the second obtaining module 13, and the recombining module 14, reference may be made to steps S101 to S104 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 8 again, the restructuring module 14 may include: a generation sequence unit 141, a division sequence unit 142, and an encoding combination unit 143.
A generating sequence unit 141, configured to generate a sub-participle sequence including the at least two sub-participles according to distribution positions of the at least two sub-participles in the text;
a dividing sequence unit 142, configured to divide the sub-word segmentation sequence into sub-sequences to be recombined, which are used to combine the word segments;
and the coding and combining unit 143 is configured to code and combine the subvectors corresponding to the sub-participles in the subsequence to be recombined to obtain the recombined vector.
The specific functional implementation manners of the sequence generating unit 141, the sequence dividing unit 142, and the code combining unit 143 may refer to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 8 again, the dividing sequence unit 142 may include: a number determining subunit 1421 and a sequence dividing subunit 1422.
a number determining subunit 1421, configured to acquire the number of segmentation segments of the participle and determine the target sub-participle number corresponding to the participle according to the number of segmentation segments;
a sequence dividing subunit 1422, configured to divide the sub-participle sequence into the subsequences to be recombined according to the target sub-participle number; the number of sub-participles in a subsequence to be recombined is equal to the target sub-participle number.
For specific functional implementation manners of the quantity determining subunit 1421 and the sequence dividing subunit 1422, reference may be made to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 8 again, the dividing sequence unit 142 may include: a to-be-evaluated acquiring subunit 1423, a to-be-evaluated updating subunit 1424, and a to-be-recombined determining subunit 1425.
a to-be-evaluated acquiring subunit 1423, configured to acquire the sub-participle t_j in the sub-participle sequence and take the sub-participle t_j as the participle to be evaluated;
a to-be-evaluated updating subunit 1424, configured to, if the participle to be evaluated does not belong to the legal participles, acquire the sub-participle t_{j+1} in the sub-participle sequence and update the participle to be evaluated according to the sub-participle t_j and the sub-participle t_{j+1};
a to-be-recombined determining subunit 1425, configured to, if the updated participle to be evaluated belongs to the legal participles, determine that the sub-participle t_j and the sub-participle t_{j+1} form the subsequence to be recombined; here, t_{j-1} is empty, or t_{j-1} belongs to the previous subsequence to be recombined.
For the specific functional implementation manner of obtaining the sub-unit to be evaluated 1423, updating the sub-unit to be evaluated 1424, and determining the sub-unit to be recombined 1425, reference may be made to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 8 again, in addition to the to-be-evaluated acquiring subunit 1423, the to-be-evaluated updating subunit 1424, and the to-be-recombined determining subunit 1425, the dividing sequence unit 142 may further include: a to-be-evaluated determining subunit 1426.
a to-be-evaluated determining subunit 1426, configured to acquire a dictionary database, determine, if the participle to be evaluated exists in the dictionary database, that the participle to be evaluated belongs to the legal participles, and assign the sub-participle t_j to the subsequence to be recombined;
the to-be-evaluated determining subunit 1426 is further configured to determine, if the participle to be evaluated does not exist in the dictionary database, that the participle to be evaluated does not belong to the legal participles.
For determining a specific function implementation manner of the sub-unit 1426 to be evaluated, reference may be made to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 8 again, the code combining unit 143 may include: a first value acquiring subunit 1431, a value operating subunit 1432, and a first recombined vector subunit 1433.
a first value acquiring subunit 1431, configured to acquire the value P_i in the first sub-vector and acquire the value Q_i in the second sub-vector, where i is a dimension index of the first sub-vector and the second sub-vector and i is a positive integer; the value P_i is the value corresponding to the i-th dimension of the first sub-vector, and the value Q_i is the value corresponding to the i-th dimension of the second sub-vector;
a value operating subunit 1432, configured to perform a numerical operation on the value P_i and the value Q_i to generate a value R_i;
a first recombined vector subunit 1433, configured to generate the recombined vector according to the values R_i.
The specific functional implementation manners of the first obtaining numerical value subunit 1431, the calculating numerical value subunit 1432, and the first regrouping vector subunit 1433 may refer to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.
Referring to fig. 8 again, the code combining unit 143 may include: a second obtaining value subunit 1434, a value selecting subunit 1435, and a second recombined vector subunit 1436.
A second obtaining value subunit 1434, configured to obtain the value Pi in the first sub-vector and obtain the value Qi in the second sub-vector, where i is a dimension index of the first sub-vector and the second sub-vector, and i is a positive integer; the value Pi is the value corresponding to the i-th dimension in the first sub-vector, and the value Qi is the value corresponding to the i-th dimension in the second sub-vector;
a value selecting subunit 1435, configured to select the maximum value or the minimum value from the value Pi and the value Qi as a value Mi; and
a second recombined vector subunit 1436, configured to generate the recombined vector according to the value Mi.
The specific functional implementations of the second obtaining value subunit 1434, the value selecting subunit 1435, and the second recombined vector subunit 1436 may refer to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.
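The selection-based variant can be sketched in the same style, again as an illustrative assumption rather than the patent's exact implementation: per dimension, either the maximum or the minimum of the two values is kept.

```python
# Sketch of subunits 1434-1436: for each dimension i, keep the maximum
# (or minimum) of the values Pi and Qi as the value Mi of the
# recombined vector.
def recombine_by_selection(first_sub_vector, second_sub_vector, use_max=True):
    pick = max if use_max else min
    return [pick(p, q) for p, q in zip(first_sub_vector, second_sub_vector)]

recombine_by_selection([1, 5], [4, 2])                 # max per dimension: [4, 5]
recombine_by_selection([1, 5], [4, 2], use_max=False)  # min per dimension: [1, 2]
```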
Referring to fig. 8 again, the second obtaining module 13 may include: an initial sub-vector obtaining unit 131, an initial sub-vector fusing unit 132, and an initial sub-vector weighting unit 133.
An initial sub-vector obtaining unit 131, configured to obtain a first initial sub-vector corresponding to the sub-participle Sn and obtain a second initial sub-vector corresponding to the remaining sub-participles, where the remaining sub-participles include the sub-participles, among the at least two sub-participles, other than the sub-participle Sn;
a fused initial sub-vector unit 132, configured to perform vector fusion on the first initial sub-vector and the second initial sub-vector to obtain a fused sub-vector;
a weighted initial sub-vector unit 133, configured to perform weighted fusion on the fused sub-vector and the first initial sub-vector to obtain the sub-vector corresponding to the sub-participle Sn.
For specific functional implementation manners of the initial sub-vector obtaining unit 131, the initial sub-vector fusing unit 132, and the initial sub-vector weighting unit 133, reference may be made to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
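The fusion performed by units 131-133 can be sketched as follows, under stated assumptions: the second initial sub-vector is taken as the element-wise mean of the remaining sub-participles' initial sub-vectors, "vector fusion" is element-wise addition, and the weighted fusion uses a single scalar weight alpha. None of these choices is fixed by the text; they are illustrative stand-ins.

```python
# Sketch of units 131-133 (assumed fusion scheme, see lead-in).
def fuse_sub_vector(first_initial, remaining_initials, alpha=0.5):
    dim = len(first_initial)
    # second initial sub-vector: element-wise mean over the remaining
    # sub-participles' initial sub-vectors (assumption)
    second = [sum(v[i] for v in remaining_initials) / len(remaining_initials)
              for i in range(dim)]
    # vector fusion of the first and second initial sub-vectors (assumption:
    # element-wise addition)
    fused = [f + s for f, s in zip(first_initial, second)]
    # weighted fusion of the fused sub-vector with the first initial sub-vector
    return [alpha * f + (1 - alpha) * p for f, p in zip(fused, first_initial)]
```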
Referring again to fig. 8, the text processing apparatus 1 may include the first obtaining module 11, the segmentation module 12, the second obtaining module 13, and the recombination module 14, and may further include: a criticality probability determining module 15 and a keyword determining module 16.
A criticality probability determining module 15, configured to determine, according to the recombined vector, the criticality probability in the text of the participle corresponding to the recombined vector;
a keyword determining module 16, configured to determine keywords in the text according to the criticality probability; wherein the word segmentation includes the keyword.
The specific functional implementations of the criticality probability determining module 15 and the keyword determining module 16 may refer to steps S205 to S206 in the embodiment corresponding to fig. 6, which are not described herein again.
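The keyword-extraction use of the recombined vectors can be sketched as follows. The criticality probabilities would come from a model applied to the recombined vectors; here they are passed in directly, and the threshold value is an illustrative assumption.

```python
# Sketch of modules 15-16: participles whose criticality probability
# exceeds a threshold are kept as keywords of the text.
def extract_keywords(participles, criticality_probs, threshold=0.5):
    return [w for w, p in zip(participles, criticality_probs) if p > threshold]

extract_keywords(["neural", "network", "the"], [0.9, 0.8, 0.1])  # ['neural', 'network']
```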
Referring again to fig. 8, the text processing apparatus 1 may include the first obtaining module 11, the segmentation module 12, the second obtaining module 13, and the recombination module 14, and may further include: a similarity obtaining module 17 and a similar participle determining module 18.
A similarity obtaining module 17, configured to obtain a similarity between the first recombined vector and the second recombined vector;
and a similar participle determining module 18, configured to determine, if the similarity is greater than a similarity threshold, the participle corresponding to the first recombined vector and the participle corresponding to the second recombined vector as a semantically similar participle cluster of the text.
The specific functional implementation manners of the similarity obtaining module 17 and the similar word segmentation determining module 18 may refer to step S206 in the embodiment corresponding to fig. 6, and are not described herein again.
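A minimal sketch of the similarity check performed by modules 17-18, assuming cosine similarity as the measure (the text does not fix the similarity function) and an illustrative threshold:

```python
import math

# Sketch of modules 17-18: two participles whose recombined vectors are
# similar enough form a semantically similar participle cluster.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def in_same_cluster(vec_u, vec_v, threshold=0.8):
    return cosine_similarity(vec_u, vec_v) > threshold
```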
Referring again to fig. 8, the text processing apparatus 1 may include the first obtaining module 11, the segmentation module 12, the second obtaining module 13, and the recombination module 14, and may further include: a part-of-speech recognition module 19 and a part-of-speech tagging module 20.
A part-of-speech recognition module 19, configured to recognize, according to the regrouping vector, a part-of-speech type corresponding to the segmented word;
a part-of-speech tagging module 20, configured to obtain a part-of-speech tag corresponding to the part-of-speech type, and add the part-of-speech tag corresponding to the participle at a corresponding position in the text.
The specific functional implementation manners of the part-of-speech recognizing module 19 and the part-of-speech tagging module 20 may refer to step S206 in the embodiment corresponding to fig. 6, which is not described herein again.
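The tagging step can be sketched as follows. The word/tag notation and the tag map are illustrative assumptions; the part-of-speech types would in practice be predicted from the recombined vectors.

```python
# Sketch of modules 19-20: a part-of-speech tag is looked up for each
# participle's predicted type and appended at the participle's position
# in the text.
def tag_parts_of_speech(participles, pos_types, tag_map):
    return " ".join(f"{word}/{tag_map[pos]}" for word, pos in zip(participles, pos_types))

tag_parts_of_speech(["dogs", "run"], ["noun", "verb"], {"noun": "n", "verb": "v"})
# 'dogs/n run/v'
```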
In the embodiments of the present application, participles in a text are acquired; each participle is segmented to obtain at least two sub-participles corresponding to the participle; sub-vectors respectively corresponding to the at least two sub-participles are acquired; and the sub-vectors are encoded and combined according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector, where the recombined vector is used to characterize the participle and to perform text processing on the text. In this way, the participles in the text are further segmented to obtain the at least two sub-participles; the sub-participles are then encoded to obtain sub-vectors fused with context information, and these sub-vector representations are encoded and combined according to the distribution positions of the sub-participles in the text to obtain the recombined vectors, according to which various kinds of text processing can finally be performed on the text. Because the granularity of the sub-participles is smaller, the sub-vectors corresponding to the sub-participles are more accurate, and the accuracy of the word vectors of the participles can therefore be improved.
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 9, the computer device 1000 may include a processor 1001, a network interface 1004, and a memory 1005, and may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and a keyboard (Keyboard), and optionally may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, for example, at least one disk memory. The memory 1005 may optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 9, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 9, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
acquiring word segmentation in a text;
segmenting the word segmentation to obtain at least two sub-word segmentations corresponding to the word segmentation;
acquiring subvectors corresponding to the at least two sub-participles respectively;
coding and combining the sub-vectors respectively corresponding to the at least two sub-participles according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector; the restructuring vector is used for representing the participle and is used for performing text processing on the text.
In an embodiment, when the processor 1001 performs encoding and combining on the sub-vectors respectively corresponding to the at least two sub-participles according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector, specifically perform the following steps:
generating a sub-word segmentation sequence containing the at least two sub-words according to the distribution positions of the at least two sub-words in the text;
dividing the sub-word segmentation sequence into sub-sequences to be recombined for combining into the word segmentation;
and coding and combining the subvectors corresponding to the sub-participles in the subsequence to be recombined to obtain the recombined vector.
In one embodiment, when the processor 1001 divides the sub-word segmentation sequence into sub-sequences to be recombined for combining into the word segmentation, the following steps are specifically performed:
acquiring the segmentation segment number of the participle, and determining the target sub-participle number corresponding to the participle according to the segmentation segment number;
dividing the sub-word segmentation sequence into the sub-sequences to be recombined according to the target sub-word segmentation quantity; the number of the sub-participles in the subsequence to be recombined is equal to the number of the target sub-participles.
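The fixed-count division described in the two steps above can be sketched as follows; the sub-participles here are illustrative English fragments.

```python
# Sketch of the fixed-count division: the sub-participle sequence is cut
# into consecutive to-be-recombined subsequences whose length equals the
# target number of sub-participles derived from the segmentation count.
def divide_by_target_count(sub_participles, target_count):
    return [sub_participles[i:i + target_count]
            for i in range(0, len(sub_participles), target_count)]

divide_by_target_count(["mo", "ther", "bo", "ard"], 2)  # [['mo', 'ther'], ['bo', 'ard']]
```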
In one embodiment, the sub-participle sequence comprises a sub-participle tj, where j is a positive integer less than or equal to the number of sub-participles in the sub-participle sequence;
when the processor 1001 divides the sub-participle sequence into to-be-recombined sub-sequences for combining into the participles, the following steps are specifically performed:
obtaining the sub-participle tj in the sub-participle sequence, and taking the sub-participle tj as a participle to be evaluated;
if the participle to be evaluated does not belong to the legal participle, acquiring a sub-participle tj+1 in the sub-participle sequence, and updating the participle to be evaluated according to the sub-participle tj and the sub-participle tj+1; and
if the updated participle to be evaluated belongs to a legal participle, determining the sub-participle tj and the sub-participle tj+1 as the to-be-recombined subsequence; wherein tj-1 is empty, or tj-1 belongs to the previous to-be-recombined subsequence.
In an embodiment, when the processor 1001 divides the sub-participle sequence into to-be-recombined sub-sequences for combining into the participles, the following steps are specifically performed:
obtaining a dictionary database; if the participle to be evaluated is stored in the dictionary database, determining that the participle to be evaluated belongs to the legal participle, and determining the sub-participle tj as the to-be-recombined subsequence;
and if the participle to be evaluated does not exist in the dictionary database, determining that the participle to be evaluated does not belong to the legal participle.
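The dictionary-driven division described in the steps above can be sketched as a greedy scan. This is an illustrative reading: sub-participles are accumulated into a candidate participle to be evaluated, and once the candidate exists in the dictionary database it is emitted as a to-be-recombined subsequence. The sketch assumes every candidate eventually completes a legal participle.

```python
# Sketch of the dictionary-based division of a sub-participle sequence
# into to-be-recombined subsequences.
def divide_by_dictionary(sub_participles, dictionary):
    subsequences, candidate = [], []
    for sub in sub_participles:
        candidate.append(sub)
        if "".join(candidate) in dictionary:  # candidate is a legal participle
            subsequences.append(candidate)
            candidate = []
    return subsequences

divide_by_dictionary(["mother", "board", "key", "board"], {"motherboard", "keyboard"})
# [['mother', 'board'], ['key', 'board']]
```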
In one embodiment, the sub-participles in the subsequence to be recombined include a first sub-participle and a second sub-participle, the sub-vector corresponding to the first sub-participle is a first sub-vector, and the sub-vector corresponding to the second sub-participle is a second sub-vector;
when the processor 1001 performs coding combination on the subvectors corresponding to the sub-participles in the subsequence to be recombined to obtain the recombined vector, the following steps are specifically performed:
obtaining the value Pi in the first sub-vector, and obtaining the value Qi in the second sub-vector, where i is a dimension index of the first sub-vector and the second sub-vector, and i is a positive integer; the value Pi is the value corresponding to the i-th dimension in the first sub-vector, and the value Qi is the value corresponding to the i-th dimension in the second sub-vector;
performing a numerical operation on the value Pi and the value Qi to generate a value Ri; and
generating the recombined vector according to the value Ri.
In one embodiment, the sub-participles in the subsequence to be recombined include a first sub-participle and a second sub-participle, the sub-vector corresponding to the first sub-participle is a first sub-vector, and the sub-vector corresponding to the second sub-participle is a second sub-vector;
when the processor 1001 performs coding combination on the subvectors corresponding to the sub-participles in the subsequence to be recombined to obtain the recombined vector, the following steps are specifically performed:
obtaining the value Pi in the first sub-vector, and obtaining the value Qi in the second sub-vector, where i is a dimension index of the first sub-vector and the second sub-vector, and i is a positive integer; the value Pi is the value corresponding to the i-th dimension in the first sub-vector, and the value Qi is the value corresponding to the i-th dimension in the second sub-vector;
selecting the maximum value or the minimum value from the value Pi and the value Qi as a value Mi; and
generating the recombined vector according to the value Mi.
In one embodiment, the at least two sub-participles comprise a sub-participle Sn, where n is a positive integer less than or equal to the number of the at least two sub-participles;
when the processor 1001 obtains the sub-vectors corresponding to the at least two sub-participles, the following steps are specifically performed:
obtaining a first initial sub-vector corresponding to the sub-participle Sn, and obtaining a second initial sub-vector corresponding to the remaining sub-participles, where the remaining sub-participles include the sub-participles, among the at least two sub-participles, other than the sub-participle Sn;
performing vector fusion on the first initial sub-vector and the second initial sub-vector to obtain a fused sub-vector;
performing weighted fusion on the fused sub-vector and the first initial sub-vector to obtain the sub-vector corresponding to the sub-participle Sn.
In one embodiment, the at least two sub-participles are characters in a participle system or roots composed of at least two characters.
In an embodiment, the processor 1001 further specifically executes the following steps:
obtaining the criticality probability of the word segmentation corresponding to the recombination vector in the text according to the recombination vector;
determining keywords in the text according to the criticality probability; wherein the word segmentation includes the keyword.
In one embodiment, the regrouping vector comprises a first regrouping vector and a second regrouping vector;
the processor 1001 further specifically executes the following steps:
obtaining the similarity between the first recombination vector and the second recombination vector;
and if the similarity is greater than a similarity threshold, determining the participle corresponding to the first recombined vector and the participle corresponding to the second recombined vector as a semantically similar participle cluster of the text.
In an embodiment, the processor 1001 further specifically executes the following steps:
identifying part-of-speech types corresponding to the participles according to the recombination vectors;
and acquiring part-of-speech tags corresponding to the part-of-speech types, and adding the part-of-speech tags corresponding to the participles at corresponding positions in the text.
In the embodiments of the present application, participles in a text are acquired; each participle is segmented to obtain at least two sub-participles corresponding to the participle; sub-vectors respectively corresponding to the at least two sub-participles are acquired; and the sub-vectors are encoded and combined according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector, where the recombined vector is used to characterize the participle and to perform text processing on the text. In this way, the participles in the text are further segmented to obtain the at least two sub-participles; the sub-participles are then encoded to obtain sub-vectors fused with context information, and these sub-vector representations are encoded and combined according to the distribution positions of the sub-participles in the text to obtain the recombined vectors, according to which various kinds of text processing can finally be performed on the text. Because the granularity of the sub-participles is smaller, the sub-vectors corresponding to the sub-participles are more accurate, and the accuracy of the word vectors of the participles can therefore be improved.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a processor, the text processing method provided in each step in fig. 3 and fig. 6 is implemented, which may specifically refer to the implementation manner provided in each step in fig. 3 or fig. 6, and is not described herein again.
The computer-readable storage medium may be the text processing apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash card (flash card), and the like, provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
The terms "first," "second," and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or apparatus that comprises a list of steps or elements is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, apparatus, product, or apparatus.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto, and all equivalent variations and modifications can be made to the present application.

Claims (15)

1. A method of text processing, the method comprising:
acquiring word segmentation in a text;
segmenting the word segmentation to obtain at least two sub-word segmentations corresponding to the word segmentation;
obtaining sub-vectors corresponding to the at least two sub-participles respectively;
coding and combining the sub-vectors respectively corresponding to the at least two sub-participles according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector; wherein the reformulation vector is used to characterize the participle and to text process the text.
2. The method according to claim 1, wherein said encoding and combining the sub-vectors respectively corresponding to the at least two sub-participles according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector comprises:
generating a sub-word segmentation sequence containing the at least two sub-words according to the distribution positions of the at least two sub-words in the text;
dividing the sub-word segmentation sequence into sub-sequences to be recombined for combining into the word segmentation;
and coding and combining the subvectors corresponding to the sub-participles in the subsequence to be recombined to obtain the recombined vector.
3. The method of claim 2, wherein the dividing the sub-participle sequence into to-be-recombined sub-sequences for combining into the participle comprises:
acquiring the segmentation segment number of the segmentation word, and determining the target sub-segmentation word number corresponding to the segmentation word according to the segmentation segment number;
dividing the sub-word segmentation sequence into the sub-sequences to be recombined according to the target sub-word segmentation quantity; and the number of the sub-participles in the subsequence to be recombined is equal to the number of the target sub-participles.
4. The method of claim 2, wherein the sub-participle sequence comprises a sub-participle tj, where j is a positive integer less than or equal to the number of sub-participles in the sub-participle sequence;
the dividing the sub-participle sequence into to-be-recombined sub-sequences for combining into the participle comprises:
obtaining the sub-participle tj in the sub-participle sequence, and taking the sub-participle tj as a participle to be evaluated;
if the participle to be evaluated does not belong to the legal participle, acquiring a sub-participle tj+1 in the sub-participle sequence, and updating the participle to be evaluated according to the sub-participle tj and the sub-participle tj+1; and
if the updated participle to be evaluated belongs to a legal participle, determining the sub-participle tj and the sub-participle tj+1 as the to-be-recombined subsequence; wherein tj-1 is empty, or tj-1 belongs to the previous to-be-recombined subsequence.
5. The method of claim 4, further comprising:
obtaining a dictionary database; if the participle to be evaluated exists in the dictionary database, determining that the participle to be evaluated belongs to the legal participle, and determining the sub-participle tj as the to-be-recombined subsequence;
and if the participle to be evaluated does not exist in the dictionary database, determining that the participle to be evaluated does not belong to the legal participle.
6. The method according to claim 2, wherein the sub-participles in the subsequence to be recombined include a first sub-participle and a second sub-participle, the sub-vector corresponding to the first sub-participle is a first sub-vector, and the sub-vector corresponding to the second sub-participle is a second sub-vector;
the coding combination of the subvectors corresponding to the sub-participles in the subsequence to be recombined to obtain the recombined vector comprises the following steps:
obtaining a value Pi in the first sub-vector, and obtaining a value Qi in the second sub-vector, where i is a dimension index of the first sub-vector and the second sub-vector, and i is a positive integer; the value Pi is the value corresponding to the i-th dimension in the first sub-vector, and the value Qi is the value corresponding to the i-th dimension in the second sub-vector;
performing a numerical operation on the value Pi and the value Qi to generate a value Ri; and
generating the recombined vector according to the value Ri.
7. The method according to claim 2, wherein the sub-participles in the subsequence to be recombined include a first sub-participle and a second sub-participle, the sub-vector corresponding to the first sub-participle is a first sub-vector, and the sub-vector corresponding to the second sub-participle is a second sub-vector;
the coding combination of the subvectors corresponding to the sub-participles in the subsequence to be recombined to obtain the recombined vector comprises the following steps:
obtaining a value Pi in the first sub-vector, and obtaining a value Qi in the second sub-vector, where i is a dimension index of the first sub-vector and the second sub-vector, and i is a positive integer; the value Pi is the value corresponding to the i-th dimension in the first sub-vector, and the value Qi is the value corresponding to the i-th dimension in the second sub-vector;
selecting the maximum value or the minimum value from the value Pi and the value Qi as a value Mi; and
generating the recombined vector according to the value Mi.
8. The method of claim 1, wherein the at least two sub-participles comprise a sub-participle Sn, where n is a positive integer less than or equal to the number of the at least two sub-participles;
the obtaining of the subvectors corresponding to the at least two sub-participles respectively includes:
obtaining a first initial sub-vector corresponding to the sub-participle Sn, and obtaining a second initial sub-vector corresponding to the remaining sub-participles, where the remaining sub-participles include the sub-participles, among the at least two sub-participles, other than the sub-participle Sn;
performing vector fusion on the first initial sub-vector and the second initial sub-vector to obtain a fused sub-vector;
performing weighted fusion on the fused sub-vector and the first initial sub-vector to obtain the sub-vector corresponding to the sub-participle Sn.
9. The method of claim 1, wherein the at least two sub-participles are characters in a participle system or roots consisting of at least two characters.
10. The method of claim 1, further comprising:
determining the criticality probability of the participles corresponding to the recombination vectors in the text according to the recombination vectors;
determining keywords in the text according to the criticality probability; wherein the segmentation includes the keyword.
11. The method of claim 1, wherein the recomposition vector comprises a first recomposition vector and a second recomposition vector;
further comprising:
obtaining the similarity between the first recombined vector and the second recombined vector; and if the similarity is greater than a similarity threshold, determining the participle corresponding to the first recombined vector and the participle corresponding to the second recombined vector as a semantically similar participle cluster of the text.
12. The method of claim 1, further comprising:
identifying part-of-speech types corresponding to the participles according to the recombined vectors;
and acquiring part-of-speech tags corresponding to the part-of-speech types, and adding the part-of-speech tags corresponding to the participles at corresponding positions in the text.
13. A text processing apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring word segmentation in the text;
the segmentation module is used for segmenting the participle to obtain at least two sub-participles corresponding to the participle;
the second obtaining module is used for obtaining the sub-vectors corresponding to the at least two sub-participles respectively;
the encoding module is used for encoding and combining the sub-vectors respectively corresponding to the at least two sub-participles according to the distribution positions of the at least two sub-participles in the text to obtain a recombined vector; wherein the reformulation vector is used to characterize the participle and to text process the text.
14. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide data communication functions, the memory is configured to store program code, and the processor is configured to call the program code to perform the steps of the method according to any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the steps of the method according to any one of claims 1 to 12.
CN202010070866.5A 2020-01-21 2020-01-21 Text processing method, device and equipment and readable storage medium Pending CN113221553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010070866.5A CN113221553A (en) 2020-01-21 2020-01-21 Text processing method, device and equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN113221553A true CN113221553A (en) 2021-08-06

Family

ID=77085455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010070866.5A Pending CN113221553A (en) 2020-01-21 2020-01-21 Text processing method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113221553A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319583A (en) * 2017-01-06 2018-07-24 光讯网络科技有限公司 Method and system for extracting knowledge from Chinese language material library
US20180246856A1 (en) * 2017-02-28 2018-08-30 Fujitsu Limited Analysis method and analysis device
US20190220514A1 (en) * 2017-02-23 2019-07-18 Tencent Technology (Shenzhen) Company Ltd Keyword extraction method, computer equipment and storage medium
CN110245353A (en) * 2019-06-20 2019-09-17 腾讯科技(深圳)有限公司 Natural language representation method, device, equipment and storage medium
CN110705302A (en) * 2019-10-11 2020-01-17 掌阅科技股份有限公司 Named entity recognition method, electronic device and computer storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
腾讯BUGLY (Tencent BUGLY): "图解BERT模型：从零开始构建BERT" ("The BERT Model Illustrated: Building BERT from Scratch"), pages 1 - 26, Retrieved from the Internet <URL:https://cloud.tencent.com/developer/article/1389555> *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2023038929A (en) * 2021-09-07 2023-03-17 株式会社リコー Machine reading comprehension method, device, and computer readable storage medium
JP7420180B2 (en) 2021-09-07 2024-01-23 株式会社リコー Machine-readable methods, devices, and computer-readable storage media
CN113961666A (en) * 2021-09-18 2022-01-21 腾讯科技(深圳)有限公司 Keyword recognition method, apparatus, device, medium, and computer program product
CN114742029A (en) * 2022-04-20 2022-07-12 中国传媒大学 Chinese text comparison method, storage medium and device

Similar Documents

Publication Publication Date Title
CN110162749B (en) Information extraction method, information extraction device, computer equipment and computer readable storage medium
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN109062893B (en) Commodity name identification method based on full-text attention mechanism
CN111695352A (en) Grading method and device based on semantic analysis, terminal equipment and storage medium
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN111931517B (en) Text translation method, device, electronic equipment and storage medium
CN111222305A (en) Information structuring method and device
CN109753661B (en) Machine reading understanding method, device, equipment and storage medium
CN111291195A (en) Data processing method, device, terminal and readable storage medium
US20230029759A1 (en) Method of classifying utterance emotion in dialogue using word-level emotion embedding based on semi-supervised learning and long short-term memory model
CN113221553A (en) Text processing method, device and equipment and readable storage medium
CN113705313A (en) Text recognition method, device, equipment and medium
CN113268609A (en) Dialog content recommendation method, device, equipment and medium based on knowledge graph
CN114626097A (en) Desensitization method, desensitization device, electronic apparatus, and storage medium
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN113705315A (en) Video processing method, device, equipment and storage medium
CN114490953B (en) Method for training event extraction model, method, device and medium for extracting event
CN114239547A (en) Statement generation method, electronic device and storage medium
CN116258137A (en) Text error correction method, device, equipment and storage medium
CN114528398A (en) Emotion prediction method and system based on interactive double-graph convolutional network
CN114492661A (en) Text data classification method and device, computer equipment and storage medium
US20220139386A1 (en) System and method for chinese punctuation restoration using sub-character information
CN113449081A (en) Text feature extraction method and device, computer equipment and storage medium
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN114398903B (en) Intention recognition method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40050594

Country of ref document: HK

SE01 Entry into force of request for substantive examination