CN114818736B - Text processing method, linking method and device for short text, and storage medium - Google Patents

Text processing method, linking method and device for short text, and storage medium

Info

Publication number
CN114818736B
CN114818736B (application number CN202210612667.1A; published as CN114818736A)
Authority
CN
China
Prior art keywords
word
text
disambiguation
sense
ambiguous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210612667.1A
Other languages
Chinese (zh)
Other versions
CN114818736A (en)
Inventor
林泽南
赵岷
傅瑜
张国鑫
秦华鹏
吕雅娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210612667.1A
Publication of CN114818736A
Application granted
Publication of CN114818736B
Active legal status
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a text processing method, a linking method for short text, an apparatus, a device, a storage medium, and a computer program product, and relates to artificial intelligence fields such as knowledge graphs, deep learning, and natural language processing. The implementation scheme is as follows: construct a text data set for each word sense item of an ambiguous word, wherein the ambiguous word corresponds to a plurality of word sense items; perform semantic analysis on short texts containing the ambiguous word to obtain a plurality of candidate disambiguation words; match each candidate disambiguation word with the text data set of each word sense item respectively; in response to a candidate disambiguation word successfully matching the text data set of only one word sense item, take the candidate disambiguation word as a target disambiguation word of the matched word sense item; and save the short text containing the target disambiguation word to the disambiguated text set of the matched word sense item. The resulting disambiguated text set is accurate and concise.

Description

Text processing method, linking method and device for short text, and storage medium
Technical Field
The present disclosure relates to artificial intelligence fields such as knowledge graphs, deep learning, and natural language processing, and in particular to a text processing method, a linking method for short text, an apparatus, a device, a storage medium, and a computer program product.
Background
In the field of natural language processing, short text generally refers to text of short length with few characters, such as search queries, dialogue content, and various titles. In many short-text application scenarios, the content of a short text can be understood quickly by means of linking. Most commonly, an entity mention in a short text is associated with an entity in a semantic knowledge base, thereby implementing entity linking.
Disclosure of Invention
The disclosure provides a text processing method, a linking method for short text, an apparatus, a device, a storage medium, and a computer program product. The resulting disambiguated text set is accurate and concise, which improves the linking efficiency for short texts.
According to a first aspect of the present disclosure, there is provided a text processing method, including: constructing a text data set for each word sense item of an ambiguous word, wherein the ambiguous word corresponds to a plurality of word sense items; performing semantic analysis on short texts containing the ambiguous word to obtain a plurality of candidate disambiguation words; matching each candidate disambiguation word with the text data set of each word sense item respectively; in response to a candidate disambiguation word successfully matching the text data set of only one word sense item, taking the candidate disambiguation word as a target disambiguation word of the matched word sense item; and saving the short text containing the target disambiguation word to the disambiguated text set of the matched word sense item.
According to a second aspect of the present disclosure, there is provided a linking method for short text, comprising: obtaining a short text to be processed, and determining a target ambiguous word from the short text to be processed; acquiring a plurality of word sense items of the target ambiguous word and the disambiguated text set of each word sense item, wherein the disambiguated text set is obtained by the text processing method provided in the first aspect; matching the short text to be processed with the disambiguated text set of each word sense item respectively; and determining a linking result for the target ambiguous word based on the matching results.
According to a third aspect of the present disclosure, there is provided a text processing apparatus comprising: a construction module configured to construct a text data set for each word sense item of an ambiguous word, wherein the ambiguous word corresponds to a plurality of word sense items; an analysis module configured to perform semantic analysis on short texts containing the ambiguous word to obtain a plurality of candidate disambiguation words; a matching module configured to match each candidate disambiguation word with the text data set of each word sense item respectively; a determination module configured to, in response to a candidate disambiguation word successfully matching the text data set of only one word sense item, take the candidate disambiguation word as the target disambiguation word of the matched word sense item; and a saving module configured to save the short text containing the target disambiguation word to the disambiguated text set of the matched word sense item.
According to a fourth aspect of the present disclosure, there is provided a linking device for short text, comprising: a first acquisition module configured to acquire a short text to be processed and determine a target ambiguous word from the short text to be processed; a second acquisition module configured to acquire a plurality of word sense items of the target ambiguous word and the disambiguated text set of each word sense item, wherein the disambiguated text set is obtained by the text processing apparatus provided in the third aspect; a text matching module configured to match the short text to be processed with the disambiguated text set of each word sense item respectively; and a linking module configured to determine a linking result for the target ambiguous word based on the matching results.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above text processing method or linking method for short text.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the above text processing method or linking method for short text.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above text processing method or linking method for short text.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram to which the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a text processing method according to the present disclosure;
FIG. 3 is a flow chart of another embodiment of a text processing method according to the present disclosure;
FIG. 4 is a flow chart of yet another embodiment of a text processing method according to the present disclosure;
FIG. 5 is a flow chart of one embodiment of a linking method for short text according to the present disclosure;
FIG. 6 is a flow chart of another embodiment of a linking method for short text according to the present disclosure;
FIG. 7 is a schematic diagram of a structure of one embodiment of a text processing device according to the present disclosure;
FIG. 8 is a schematic structural diagram of one embodiment of a linking device for short text according to the present disclosure;
fig. 9 is a block diagram of an electronic device used to implement a text processing method or a linking method for short text according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the text processing method and apparatus, or the linking method and device for short text, of the present disclosure may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103, or the interaction between the terminal devices 101, 102, 103 may be implemented via the server 105. Various client applications, such as a text processing application, a search application, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices including, but not limited to, smartphones, tablets, laptop computers, desktop computers, and the like. When they are software, they may be installed in the electronic devices listed above and implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is imposed here.
Server 105 may provide various text processing services. For example, the server 105 may analyze and process short text information acquired from the terminal devices 101, 102, 103 and generate processing results (e.g., linking the words in the short text).
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. When server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the text processing method or the linking method for short text provided in the embodiments of the present disclosure is generally performed by the server 105; accordingly, the text processing apparatus or the linking device for short text is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a text processing method according to the present disclosure is shown. The text processing method comprises the following steps:
step 201, a text data set is constructed for each word sense item of an ambiguous word, wherein the ambiguous word corresponds to a plurality of word sense items.
In this embodiment, the executing body of the text processing method (e.g., the server 105 shown in fig. 1) may first obtain an ambiguous word and a plurality of word sense items of the ambiguous word from a pre-established word sense knowledge base. An ambiguous word is a word that covers a plurality of meanings, each of which can serve as a word sense item. For example, the word "Li Bai" may cover three meanings: "the Tang dynasty poet Li Bai", "Li Bai, the wife of a host", and "the game character Li Bai", so that the word "Li Bai" corresponds to three word sense items, each with its own ID (identity) in the word sense knowledge base.
After determining the word sense items corresponding to each ambiguous word, the executing body may further construct a text data set for each word sense item, so that each word sense item corresponds to one text data set. The text data set may include various long and short texts whose contents describe the corresponding word sense item. For example, the text data set of the word sense item "the Tang dynasty poet Li Bai" may include related encyclopedia introductions, poetry analyses, and similar text content. In some optional implementations of this embodiment, the description text related to the word sense item in the word sense knowledge base may be used directly as its text data set, or the text data set may be further supplemented and refined by manual addition.
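As a rough sketch of step 201, the sense inventory and the per-sense text data sets can be held in plain dictionaries. All sense IDs, names, and texts below are invented for illustration; the patent draws them from a pre-established word sense knowledge base.

```python
# Hypothetical sense inventory: an ambiguous word maps to several word
# sense items (all names below are illustrative, not from the patent).
sense_inventory = {
    "Li Bai": {
        "sense_1": "the Tang dynasty poet Li Bai",
        "sense_2": "Li Bai, the wife of a host",
        "sense_3": "the game character Li Bai",
    }
}

# One text data set per word sense item, seeded from description text in
# the word sense knowledge base and optionally extended by hand.
text_datasets = {
    "sense_1": [
        "Li Bai was a Tang dynasty poet, author of many famous poems.",
        "Analysis of the poem Presented to Wang Lun, written by Li Bai.",
    ],
    "sense_2": ["Li Bai is known as the wife of a well-known host."],
    "sense_3": ["Li Bai is a playable assassin character in the game."],
}
```

This flat layout mirrors the one-to-one correspondence the text describes: each word sense item owns exactly one text data set.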
Step 202, carrying out semantic analysis on the short texts containing the ambiguous word to obtain a plurality of candidate disambiguation words.
In this embodiment, the executing body first obtains short texts containing the ambiguous word, for example search queries and article titles containing the ambiguous word "Li Bai"; the sources of these short texts may be the word sense knowledge base, or various web pages and applications. It then performs semantic analysis on the short texts, removes words unsuitable for use as disambiguation words, such as particles, interrogatives, modal words, and punctuation, and takes the remaining words in each short text as candidate disambiguation words.
It should be noted that the disambiguation words referred to in the embodiments of the present disclosure are words that can help disambiguate an ambiguous word and point it to a specific word sense item. For example, the disambiguation word "Wang Lun" can explicitly point the meaning of the ambiguous word "Li Bai" to the word sense item "the Tang dynasty poet Li Bai".
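A minimal sketch of step 202, assuming a simple whitespace tokenizer and a hand-made stop list; a production system would use a real segmenter and part-of-speech tags to drop particles, interrogatives, and modal words. The stop words and example text are illustrative only.

```python
import string

# Illustrative stop list; the patent's "words unsuitable for use as
# disambiguation words" would be identified by semantic analysis instead.
STOP_WORDS = {"who", "is", "the", "a", "of", "what", "how"}

def candidate_disambiguation_words(short_text, ambiguous_word):
    """Return the remaining words of `short_text` after removing
    punctuation, stop words, and the ambiguous word itself."""
    cleaned = short_text.translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and t != ambiguous_word]
```

For example, `candidate_disambiguation_words("who is the poet Libai?", "Libai")` keeps only `["poet"]` as a candidate.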
Step 203, each candidate disambiguation word is matched with the text data set of each word sense item.
In this embodiment, after obtaining the candidate disambiguation words, the executing body may further match each candidate disambiguation word with each text data set. In some optional matching modes, the candidate disambiguation word may be used as a search target for a direct string search in the text data set. In other optional matching modes, the texts in the text data set may first be semantically parsed, and the candidate disambiguation word then looked up in the parsed text data set. Whether the candidate disambiguation word matches the text data set can then be determined from the number of times it appears in the text data set. For example, a matching frequency threshold may be preset; if the candidate disambiguation word appears in the text data set more times than the threshold, the match is considered successful, otherwise it fails.
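The frequency-threshold variant of step 203 can be sketched as follows; the threshold value is an assumption for illustration, and plain substring counting stands in for the semantic parsing the text also allows.

```python
# Occurrences strictly above this preset threshold count as a match
# (the value 1 is an illustrative assumption, not from the patent).
MATCH_THRESHOLD = 1

def matches(candidate, text_dataset, threshold=MATCH_THRESHOLD):
    """True if `candidate` occurs in the sense item's text data set
    more times than the matching frequency threshold."""
    occurrences = sum(text.count(candidate) for text in text_dataset)
    return occurrences > threshold
```

A candidate that appears twice across the data set passes a threshold of 1; a candidate that never appears fails.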
In some optional implementations, all candidate disambiguation words in a short text may be judged together to determine whether the match with a text data set is successful. Specifically, after each candidate disambiguation word is matched with the text data set, the occurrence counts of all candidate disambiguation words from the same short text may be summed, and whether they match the text data set is then determined from the sum. In this case, all candidate disambiguation words in one short text share the same match result with respect to the text data set.
Step 204, in response to a candidate disambiguation word successfully matching the text data set of only one word sense item, taking the candidate disambiguation word as the target disambiguation word of the matched word sense item.
In this embodiment, statistics may further be gathered on how the candidate disambiguation words match the text data sets. If a candidate disambiguation word successfully matches exactly one text data set, it may be taken as a target disambiguation word of the corresponding word sense item, that is, the word sense item whose text data set was matched, and stored in that word sense item's set of disambiguation words.
Step 205, storing the short text containing the target disambiguation term into a disambiguation text set of the matched term.
In this embodiment, after determining the target disambiguation words, the executing body may further screen the short texts containing the ambiguous word obtained in step 202 to find those containing a target disambiguation word. The short texts obtained by this screening are then saved to the disambiguated text set of the word sense item corresponding to the target disambiguation word. In subsequent disambiguation, the disambiguated text set may be used in place of the word sense item's description text in the word sense knowledge base.
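The unique-match rule of steps 204-205 can be sketched together; all data names are illustrative, and substring containment stands in for the matching modes described above.

```python
def assign_target_words(candidates, datasets, short_texts):
    """Keep a candidate only when it matches the text data set of
    exactly one word sense item, then save the short texts containing
    that target word to the matched sense's disambiguated text set."""
    disambiguated = {sense: [] for sense in datasets}
    for cand in candidates:
        matched = [sense for sense, texts in datasets.items()
                   if any(cand in t for t in texts)]
        if len(matched) == 1:  # unique match -> target disambiguation word
            sense = matched[0]
            disambiguated[sense] += [st for st in short_texts if cand in st]
    return disambiguated
```

A candidate matching two senses' data sets is simply dropped, which is what keeps the resulting disambiguated text sets unambiguous.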
According to the text processing method provided by the embodiment of the present disclosure, a text data set is first constructed for each word sense item of an ambiguous word, wherein the ambiguous word corresponds to a plurality of word sense items; semantic analysis is then performed on short texts containing the ambiguous word to obtain a plurality of candidate disambiguation words; each candidate disambiguation word is matched with the text data set of each word sense item respectively; in response to a candidate disambiguation word successfully matching the text data set of only one word sense item, the candidate disambiguation word is taken as the target disambiguation word of the matched word sense item; and finally the short text containing the target disambiguation word is stored in the disambiguated text set of the matched word sense item. By matching the words in short texts against the text data sets of the word sense items, a disambiguated text set containing only short texts is generated for each word sense item, so that the text data in the disambiguated text set is accurate and concise.
With further continued reference to fig. 3, a flow 300 of another embodiment of a text processing method according to the present disclosure is shown. The text processing method comprises the following steps:
step 301, a text data set is constructed for each word sense item of an ambiguous word, wherein the ambiguous word corresponds to a plurality of word sense items.
Step 302, carrying out semantic analysis on the short texts containing the ambiguous word to obtain a plurality of candidate disambiguation words.
Step 303, matching each candidate disambiguation word with the text data set of each word sense item.
Step 304, in response to a candidate disambiguation word successfully matching the text data set of only one word sense item, taking the candidate disambiguation word as the target disambiguation word of the matched word sense item.
Step 305, storing the short text containing the target disambiguation term into a disambiguation text set of the matched term.
In this embodiment, the specific operations of steps 301-305 are described in detail in steps 201-205 of the embodiment shown in fig. 2 and are not repeated here.
Step 306, respectively performing a vector compression operation on the disambiguated text set of each word sense item to obtain a compressed average vector of the disambiguated text set, and taking the compressed average vector as the disambiguation vector of the corresponding word sense item.
In this embodiment, the executing body may further perform vector conversion and compression on each disambiguated text set, so that each word sense item corresponds to one disambiguation vector. The vector compression operation includes: converting each short text in the text set into a vector to obtain a plurality of text vectors; and performing weighted fitting on the plurality of text vectors to obtain a compressed average vector. Specifically, each short text in a disambiguated text set may first be converted into a text vector, for example using a pre-trained language model such as BERT (Bidirectional Encoder Representations from Transformers) or ERNIE, resulting in multiple text vectors. The multiple text vectors are then weighted and fitted to obtain a compressed average vector; for example, vector compression can be realized with Whitening or SimCSE (Simple Contrastive Learning of Sentence Embeddings), using all short texts under the word sense item as the training set from which the compressed average vector of the short text set is constructed. In this way each disambiguated text set can be converted into a compressed average vector, which can be used directly as the disambiguation vector of the word sense item corresponding to that disambiguated text set.
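A toy sketch of the compression in step 306. In the method above each short text would first be encoded with a pretrained model such as BERT or ERNIE; here hand-written toy vectors stand in for those encodings so that only the weighted-average step is visible. Uniform weights are assumed when none are supplied.

```python
def compressed_average(text_vectors, weights=None):
    """Weighted element-wise average of equal-length text vectors,
    yielding the 'compressed average vector' of a disambiguated set."""
    n = len(text_vectors)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weighting by assumption
    dim = len(text_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, text_vectors))
            for d in range(dim)]

# Two toy "text vectors" for short texts under one word sense item.
disambiguation_vector = compressed_average([[1.0, 0.0], [0.0, 1.0]])
```

The result has the same dimensionality as the individual text vectors and serves directly as the sense item's disambiguation vector.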
Step 307, configuring a word-sense-free text set for the ambiguous word, wherein the word-sense-free text set comprises a plurality of word-sense-free short texts, each containing the ambiguous word but no target disambiguation word.
In this embodiment, a word-sense-free text set may further be configured for the ambiguous word. The word-sense-free text set includes a plurality of word-sense-free short texts, that is, short texts that cannot be disambiguated and so cannot be mapped to any word sense item. For example, the two short texts "who is Li Bai" and "how old is Li Bai" contain the ambiguous word "Li Bai" but no target disambiguation word pointing to any of the word sense items "the Tang dynasty poet Li Bai", "Li Bai, the wife of a host", or "the game character Li Bai"; such short texts containing the ambiguous word but no target disambiguation word can be regarded as word-sense-free short texts.
The word-sense-free text set may be constructed by manual screening, or screened from all the short texts containing the ambiguous word obtained in step 302: if a short text containing the ambiguous word cannot be stored in the disambiguated text set of any word sense item, it may be stored directly in the word-sense-free text set of the corresponding ambiguous word.
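The automatic screening just described can be sketched as a routing function; target-word sets and texts are illustrative, and substring containment again stands in for full matching.

```python
def route_short_text(short_text, ambiguous_word, target_words_by_sense):
    """Route a short text to the matching sense item, to the
    word-sense-free set, or discard it (None) if the ambiguous word
    is absent. Sense labels here are illustrative."""
    if ambiguous_word not in short_text:
        return None
    for sense, words in target_words_by_sense.items():
        if any(w in short_text for w in words):
            return sense
    return "no_sense"  # contains the ambiguous word but no target word
```

A text like "who is Li Bai" lands in the word-sense-free set, while one mentioning a target disambiguation word such as "Wang Lun" routes to that sense.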
Step 308, performing a vector compression operation on the word-sense-free text set and taking the obtained compressed average vector as the disambiguation vector of the word-sense-free text set.
In this embodiment, the executing body may further perform vector conversion and compression on the word-sense-free text set, so that the word-sense-free text set also corresponds to a compressed average vector, that is, the disambiguation vector of the word-sense-free text set. The vector compression operation is as described in step 306 above, with the text set to be processed being the word-sense-free text set; the details of the compression are not repeated here.
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the text processing method in this embodiment may further perform a vector compression operation on each disambiguated text set after it is obtained, so as to generate a disambiguation vector for each word sense item, thereby improving the degree of distinction between word sense items. A word-sense-free text set and a corresponding disambiguation vector are also configured for the ambiguous word, which further improves the distinction between the word sense items and the word-sense-free case.
With further continued reference to fig. 4, a flow 400 of yet another embodiment of a text processing method according to the present disclosure is shown. The text processing method comprises the following steps:
Step 401, generating an initial text resource set.
In this embodiment, the executing body of the text processing method (for example, the server 105 shown in fig. 1) may first generate an initial text resource set, which may include a full-web search query set, various types of web page data such as encyclopedia entries, forums, and news information, and manually input disambiguation texts and disambiguation words.
Step 402, respectively acquiring text resources matched with each word sense item of the ambiguous word from the initial text resource set to obtain the text data set of each word sense item.
In this embodiment, the executing body may first determine each word sense item of the ambiguous word by using an existing word sense knowledge base, and then obtain text resources matching each word sense item from the initial text resource set. For example, keywords associated with a word sense item can be used to screen the search query set and the various web page data of the initial text resource set, so that web page content matching each word sense item is obtained as one component of the text data set; the encyclopedia page corresponding to the word sense item may also be obtained directly as another component of the text data set.
Step 403, screening short text containing ambiguous words from the initial text resource set.
In this embodiment, the executing body may use the ambiguous word itself as the search keyword and screen out the short texts containing the ambiguous word from the initial text resource set. These short texts include search queries, various titles (such as news titles and commodity titles), question descriptions on question-answering platforms, and the like. In this embodiment, a short text is typically no longer than 64 characters.
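The screening in step 403 can be sketched directly, with resource records as plain strings and the 64-character bound taken from the text above:

```python
MAX_SHORT_TEXT_LEN = 64  # bound stated in the embodiment above

def screen_short_texts(resources, ambiguous_word):
    """Keep resources that contain the ambiguous word and are short
    enough to count as short text."""
    return [r for r in resources
            if ambiguous_word in r and len(r) <= MAX_SHORT_TEXT_LEN]
```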
Step 404, carrying out text analysis on the short texts containing the ambiguous word to obtain a plurality of candidate disambiguation words.
Step 405, each candidate disambiguation word is matched with the text data set of each word sense item.
Step 406, in response to a candidate disambiguation word successfully matching the text data set of only one word sense item, taking the candidate disambiguation word as the target disambiguation word of the matched word sense item.
Step 407, storing the short text containing the target disambiguation term into a disambiguation text set of the matched term.
In this embodiment, the specific operations of steps 404-407 are described in detail in steps 202-205 of the embodiment shown in fig. 2 and are not repeated here.
Step 408, obtaining key text information from a text data set of a word sense item.
In this embodiment, the executing body may further obtain, from the text data set of a word sense item, key text information that is representative of the word sense, such as the abstract portion of an encyclopedia page or SPO triples with high confidence. An SPO triple consists of the subject, predicate, and object of a sentence.
Step 409, extracting expanded disambiguation words from the key text information, and splicing the expanded disambiguation words with the ambiguous word to obtain spliced short texts.
In this embodiment, expanded disambiguation words may further be extracted from the key text information; for example, keywords from the abstract portion and the object (the O value in an SPO triple) may be used as expanded disambiguation words. The executing body may then splice each expanded disambiguation word with the ambiguous word, e.g., directly concatenating the expanded disambiguation words with the ambiguous word respectively, to obtain a plurality of spliced short texts.
Step 410, storing the spliced short text to the disambiguated text set of the term.
In this embodiment, the resulting spliced short texts may be saved to the disambiguated text set of the word sense item determined in step 408 above. By mining expanded disambiguation words and using the spliced short texts obtained by splicing them with the ambiguous word, the disambiguated text set can be supplemented, effectively alleviating the problem of insufficient text in the disambiguated text set.
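The splicing of steps 409-410 is a direct concatenation; a one-line sketch, with a space separator assumed since the patent does not specify one:

```python
def build_spliced_texts(ambiguous_word, expanded_words):
    """Concatenate each mined expanded disambiguation word with the
    ambiguous word to form an extra short text (separator assumed)."""
    return [f"{ambiguous_word} {w}" for w in expanded_words]
```

Each spliced short text then joins the sense item's disambiguated text set alongside the naturally occurring short texts.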
In some alternative implementations, the text processing method 400 may further include steps 306-308 in fig. 3, which are described in detail in fig. 3 and are not repeated here.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the text processing method in this embodiment may first generate an initial text resource set, and then obtain the text data set of each word sense item and the short texts containing the ambiguous word from the initial text resource set for subsequent processing. Therefore, the obtained disambiguation words and disambiguated text sets do not depend on entries pre-recorded in the semantic knowledge base, which effectively expands the application range of the disambiguated text set.
With further continued reference to fig. 5, there is shown a flow 500 of one embodiment of a chain finger method for short text according to the present disclosure, the chain finger method comprising the steps of:
step 501, a short text to be processed is obtained, and a target ambiguous word is determined from the short text to be processed.
In this embodiment, the execution body of the chain finger method (e.g., the server 105 shown in fig. 1) may first obtain a short text to be processed, that is, a short text that currently needs to be analyzed and processed, for example, a search Query obtained from a search page, a question description obtained from a question-and-answer page, a commodity title obtained from a commodity page, and the like. A target ambiguous word to be disambiguated is then determined from the short text to be processed; the target ambiguous word may be determined according to the current task type. For example, if the current search Query is obtained through a website page of the movie and television category, vocabulary related to that category may be determined as the target ambiguous word.
It should be noted that, in this embodiment, the target ambiguous word may be not only an entity word but also a non-entity word, such as a concept word or any other vocabulary.
Step 502, obtaining a plurality of word sense items of the target ambiguous word and a disambiguated text set of each word sense item.
In this embodiment, the execution body may first obtain all word sense items of the target ambiguous word from an existing word sense knowledge base, and then further determine the disambiguated text set of each word sense item, where the disambiguated text set is obtained by the text processing method provided in any of figs. 2 to 4.
Step 503, matching the short text to be processed with the disambiguated text set of each word sense item.
In this embodiment, the execution body may match the short text to be processed with the disambiguated text set of each word sense item, respectively. Specifically, the short text to be processed can be directly searched in the disambiguated text set, and then the search result is used as a matching result; alternatively, the text similarity of the short text to be processed and each disambiguated text set may be calculated, and then the text similarity may be used as a matching result.
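The two matching strategies just described (direct lookup, and text similarity) can be sketched as follows; the character-level Jaccard overlap is an illustrative stand-in for whatever text similarity measure an implementation actually uses, and all names and sample data are hypothetical:

```python
def match_by_lookup(query, disambiguated_sets):
    """First strategy: search the short text directly in each disambiguated
    text set and use the occurrence count as the matching result."""
    return {sense: texts.count(query) for sense, texts in disambiguated_sets.items()}

def jaccard(a, b):
    """Character-level Jaccard overlap as a simple stand-in for text similarity."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def match_by_similarity(query, disambiguated_sets):
    """Second strategy: score each disambiguated text set by the best
    similarity of any of its member texts to the query."""
    return {sense: max((jaccard(query, t) for t in texts), default=0.0)
            for sense, texts in disambiguated_sets.items()}

sets = {"fruit": ["apple pie recipe", "eat an apple"],
        "company": ["apple releases phone"]}
counts = match_by_lookup("eat an apple", sets)
scores = match_by_similarity("eat an apple", sets)
```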
Step 504, determining a chain finger result for the target ambiguous word based on the matching result.
In this embodiment, the execution body may determine the chain finger result of the target ambiguous word according to the obtained matching result. For example, if the occurrence frequency of the short text to be processed in a certain disambiguated text set is greater than a predetermined frequency threshold, the word sense item of that disambiguated text set may be determined as the chain finger result of the target ambiguous word. Alternatively, the text similarities between the short text to be processed and each disambiguated text set may be ranked, and the word sense item of the disambiguated text set with the highest text similarity determined as the chain finger result of the target ambiguous word.
As can be seen from fig. 5, in the chain finger method for short text of the present embodiment, a short text to be processed is first obtained, and a target ambiguous word is determined from it; a plurality of word sense items of the target ambiguous word and the disambiguated text set of each word sense item are then acquired; the short text to be processed is matched with the disambiguated text set of each word sense item respectively; and finally a chain finger result for the target ambiguous word is determined based on the matching result. When the short text to be processed is disambiguated, the disambiguated text sets used also contain only short texts, without excessive redundant information, so that the disambiguation efficiency for short text is improved, as is the accuracy of word chain fingers in the short text.
With further continued reference to fig. 6, there is shown a flow 600 of another embodiment of a chain finger method for short text according to the present disclosure, the chain finger method comprising the steps of:
step 601, obtaining a short text to be processed, and determining a target ambiguous word from the short text to be processed.
Step 602, obtaining a plurality of word sense items of the target ambiguous word and a disambiguated text set of each word sense item.
In this embodiment, the above steps 601-602 are already described in detail in steps 501-502 in fig. 5, and are not described herein.
Step 603, obtaining a compressed average vector of the short text to be processed and a disambiguation vector of each word sense item.
In this embodiment, the execution body may perform a vector compression operation on the short text to be processed to obtain a corresponding compressed average vector. Meanwhile, the disambiguation vector of each word sense item may be obtained according to the text processing method provided in fig. 3; the specific vector compression method may also refer to the description of step 306 in fig. 3, which is not repeated herein.
Step 604, matching the compressed average vector of the short text to be processed with the disambiguation vector of each word sense item respectively.
In this embodiment, the execution body may calculate a vector similarity between the compressed average vector of the short text to be processed and the disambiguation vector of each word sense item, and use the calculation result as the first matching result. Vector similarity may be calculated using methods such as the Manhattan distance, the Euclidean distance, the cosine function, and the like.
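The three measures named above can be written out directly; this is a generic sketch (function names are illustrative), not the specific implementation of the disclosure:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def euclidean_distance(u, v):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

Note that the cosine function is a similarity (larger is closer), while the Manhattan and Euclidean measures are distances (smaller is closer), so an implementation must orient the comparison accordingly.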
Step 605, acquiring disambiguation vectors of the word-sense-free text set of the target ambiguous word.
In this embodiment, the execution body may further obtain a disambiguation vector of the word-sense-free text set of the target ambiguous word, where the disambiguation vector of the word-sense-free text set may be obtained according to the text processing method provided in fig. 3.
Step 606, matching the compressed average vector of the short text to be processed with the disambiguation vector of the word-sense-free text set.
In this embodiment, the execution body may calculate a vector similarity between the compressed average vector of the short text to be processed and the disambiguation vector of the word-free text set, and use the calculation result as the second matching result.
Step 607, judging whether the matching degree of the word-sense-free text set is better than the matching degree of all the disambiguated text sets.
In this embodiment, the execution body may determine, according to the first matching result and the second matching result, whether the matching degree of the word-sense-free text set is better than the matching degree of all the disambiguated text sets. Specifically, the determination may be made according to the calculated vector similarities: if the vector similarity calculated for the word-sense-free text set is the largest, its matching degree may be considered better than that of all the disambiguated text sets, and processing continues with step 609 below; otherwise, step 608 is performed.
Step 608, determining the word sense item corresponding to the disambiguation text set with the highest matching degree as a chain finger result of the target ambiguous word.
In this embodiment, if the matching degree between the word-sense-free text set and the short text to be processed is not the highest, it indicates that the target ambiguous word corresponds to a specific word sense item, so the word sense item corresponding to the disambiguated text set with the highest matching degree may be determined as the chain finger result of the target ambiguous word, that is, the target ambiguous word is associated with the ID of that word sense item in the semantic knowledge base.
Step 609, selecting a word sense item from a plurality of word sense items according to a preset rule, and taking the word sense item as a chain finger result of the target ambiguous word.
In this embodiment, if the matching degree between the word-sense-free text set and the short text to be processed is the highest, the target ambiguous word may not correspond to a specific word sense item. In this case, a word sense item may be selected, according to a preset rule, from the plurality of word sense items of the target ambiguous word; for example, the most commonly used word sense item of the ambiguous word may be selected as the chain finger result of the target ambiguous word.
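The decision logic of steps 607-609 can be condensed into one function; this sketch assumes similarity-style scores (larger is better) and takes the preset-rule fallback as an externally supplied most-common sense, with all names being illustrative:

```python
def resolve_chain_finger(sense_scores, no_sense_score, preset_sense):
    """Sketch of steps 607-609: if the word-sense-free text set matches best,
    fall back to the word sense item chosen by a preset rule (here passed in
    as preset_sense); otherwise link to the best-matching word sense item."""
    best_sense = max(sense_scores, key=sense_scores.get)
    if no_sense_score > sense_scores[best_sense]:  # step 607 -> step 609
        return preset_sense
    return best_sense                              # step 608

# A specific sense wins, so the word links to it.
linked = resolve_chain_finger({"fruit": 0.8, "company": 0.3},
                              no_sense_score=0.5, preset_sense="fruit")
```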
In some alternative implementations, the chain finger result of the target ambiguous word may also be determined directly as no value (Nil).
As can be seen from fig. 6, compared with the embodiment corresponding to fig. 5, the chain finger method for short text in this embodiment matches the compressed average vector of the short text to be processed with the disambiguation vector of each text set. Since the differences between the disambiguation vectors of the text sets are more pronounced, the obtained matching result is more representative, and the accuracy of the chain finger result for the target ambiguous word can be significantly improved. Furthermore, the no-value case of the target ambiguous word is supplementarily confirmed through the word-sense-free text set, further improving the accuracy and comprehensiveness of the chain finger result.
With further reference to fig. 7, the present disclosure provides an embodiment of a text processing apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 7, the text processing apparatus 700 of the present embodiment may include a construction module 701, a parsing module 702, a matching module 703, a determination module 704, and a saving module 705. Wherein the construction module 701 is configured to construct a text data set for each word sense item of an ambiguous word, wherein the ambiguous word corresponds to a plurality of word sense items; the parsing module 702 is configured to perform semantic parsing on the short text containing the ambiguous word to obtain a plurality of candidate disambiguation words; the matching module 703 is configured to match each candidate disambiguation word with the text data set of each word sense item respectively; the determination module 704 is configured to, in response to a candidate disambiguation word successfully matching only the text data set of one of the word sense items, take the candidate disambiguation word as a target disambiguation word of the matched word sense item; and the saving module 705 is configured to save the short text containing the target disambiguation word to the disambiguated text set of the matched word sense item.
In the present embodiment, in the text processing apparatus 700: specific processing of the building module 701, the parsing module 702, the matching module 703, the determining module 704 and the saving module 705 and technical effects thereof may refer to the relevant descriptions of steps 201 to 205 in the corresponding embodiment of fig. 2, and are not repeated herein.
In some alternative implementations of the present embodiment, the processing apparatus 700 further includes: the first compression module is configured to respectively execute vector compression operation on the disambiguation text set of each word sense item to obtain a compressed average vector of the disambiguation text set as a disambiguation vector of the corresponding word sense item; wherein the vector compression operation includes: vector conversion is carried out on each short text in the text set to obtain a plurality of text vectors; and carrying out weighted fitting on the plurality of text vectors to obtain a compressed average vector.
In some alternative implementations of the present embodiment, the processing apparatus 700 further includes: the configuration module is configured to configure a word-sense-free text set for the ambiguous words, wherein the word-sense-free text set comprises a plurality of word-sense-free short texts, and the word-sense-free short texts contain ambiguous words and do not contain target disambiguation words; and the second compression module is configured to execute vector compression operation on the word-free text set and take the obtained compressed average vector as a disambiguation vector of the word-free text set.
In some alternative implementations of the present embodiment, the processing apparatus 700 further includes: the information acquisition module acquires key text information from a text data set of a word sense item; the splicing module is configured to extract the expanded disambiguation words from the key text information and splice the expanded disambiguation words and the ambiguous words to obtain spliced short text; and a text saving module configured to save the spliced short text to the disambiguated text set of the term.
In some alternative implementations of the present embodiment, the construction module 701 includes: a generation unit configured to generate an initial text resource set; and an acquisition unit configured to acquire, from the initial text resource set, text resources matched with each word sense item of the ambiguous word, to obtain the text data set of each word sense item.
In some alternative implementations of the present embodiment, the parsing module 702 includes: a screening unit configured to screen short texts containing the ambiguous word from the initial text resource set; and a parsing unit configured to perform text parsing on the short texts containing the ambiguous word to obtain a plurality of candidate disambiguation words.
With further reference to fig. 8, the present disclosure provides one embodiment of a chain finger device for short text, which corresponds to the method embodiment shown in fig. 5, which is particularly applicable in a variety of electronic devices.
As shown in fig. 8, the chain finger device 800 for short text of the present embodiment may include a first obtaining module 801 configured to obtain a short text to be processed, and determine a target ambiguous word from the short text to be processed; a second obtaining module 802 configured to obtain a plurality of word sense items of the target ambiguous word, and a disambiguated text set for each word sense item, wherein the disambiguated text set is obtained by the text processing apparatus provided in fig. 7 above; a text matching module 803 configured to match the short text to be processed with the disambiguated text set of each term, respectively; the chain finger module 804 is configured to determine a chain finger result for the target ambiguous word based on the matching result.
In this embodiment, the specific processing of each module in the chain finger device 800 for short text and the technical effects thereof may refer to the related descriptions of steps 501-504 in the corresponding embodiment of fig. 5, and are not repeated here.
In some optional implementations of the present embodiment, the text matching module includes: a vector acquisition unit configured to acquire a compressed average vector of the short text to be processed, and a disambiguation vector of each word sense item; and a text matching unit configured to match the compressed average vector of the short text to be processed with the disambiguation vector of each word sense item respectively.
In some alternative implementations of the present embodiment, the apparatus 800 further includes: a third obtaining module configured to obtain a disambiguation vector of the word-sense-free text set of the target ambiguous word; and a second matching module configured to match the compressed average vector of the short text to be processed with the disambiguation vector of the word-sense-free text set. The chain finger module 804 includes: a judging unit configured to judge whether the matching degree of the word-sense-free text set is better than the matching degree of all the disambiguated text sets; a first chain finger unit configured to, if not, determine the word sense item corresponding to the disambiguated text set with the highest matching degree as a chain finger result of the target ambiguous word; and a second chain finger unit configured to, if so, select one word sense item from the plurality of word sense items according to a preset rule as the chain finger result of the target ambiguous word.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; the output unit 907, for example, various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as a text processing method or a chain finger method for short text. For example, in some embodiments, the text processing method or the chain finger method for short text may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the text processing method or the chain finger method for short text described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the text processing method or the chain finger method for short text in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a server of a distributed system or a server that incorporates a blockchain. The server may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
In the technical scheme of the disclosure, the acquisition, storage, application and other processing of the user personal information involved all conform to the provisions of relevant laws and regulations, and do not violate public order and good customs.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (20)

1. A text processing method, the method comprising:
constructing a text data set for each word sense item of an ambiguous word, wherein the ambiguous word corresponds to a plurality of word sense items;
carrying out semantic analysis on the short text containing the ambiguous words to obtain a plurality of candidate disambiguation words, wherein the disambiguation words are used for pointing the ambiguous words to specific word sense items;
matching each candidate disambiguation word with the text data set of each word sense item respectively;
in response to a candidate disambiguation word successfully matching the text data set of only one word sense item, taking the candidate disambiguation word as a target disambiguation word of the matched word sense item;
and storing the short text containing the target disambiguation term into a disambiguation text set of the matched term.
2. The method of claim 1, further comprising:
vector compression operation is respectively carried out on the disambiguation text set of each word sense item, so that a compressed average vector of the disambiguation text set is obtained and is used as a disambiguation vector of the corresponding word sense item;
wherein the vector compression operation includes:
vector conversion is carried out on each short text in the text set to obtain a plurality of text vectors;
and carrying out weighted fitting on the text vectors to obtain a compressed average vector.
3. The method of claim 2, further comprising:
configuring a word-free text set for the ambiguous word, wherein the word-free text set comprises a plurality of word-free short texts, and the word-free short texts contain the ambiguous word and do not contain the target disambiguation word;
and executing the vector compression operation on the word-free text set, and taking the obtained compressed average vector as a disambiguation vector of the word-free text set.
4. The method of claim 1, further comprising:
acquiring key text information from a text data set of a word sense item;
extracting an expanded disambiguation word from the key text information, and splicing the expanded disambiguation word and the ambiguous word to obtain a spliced short text;
and storing the spliced short text to a disambiguated text set of the word sense item.
5. The method of any of claims 1-4, wherein said constructing a text data set for each word sense item of an ambiguous word comprises:
generating an initial text resource set;
and respectively acquiring text resources matched with each word sense item of the ambiguous word from the initial text resource set to obtain a text data set of each word sense item.
6. The method of claim 5, wherein the parsing of the short text containing the ambiguous word to obtain a plurality of candidate disambiguation words comprises:
screening short texts containing the ambiguous words from the initial text resource set;
and carrying out text analysis on the short text containing the ambiguous words to obtain the plurality of candidate disambiguation words.
7. A chain finger method for short text, the method comprising:
obtaining a short text to be processed, and determining a target ambiguous word from the short text to be processed;
acquiring a plurality of word sense items of the target ambiguous word and a disambiguated text set of each word sense item, wherein the disambiguated text set is obtained by the text processing method according to any one of claims 1 to 6;
matching the short text to be processed with the disambiguated text set of each word sense item respectively;
and determining a chain finger result for the target ambiguous word based on the matching result.
8. The method of claim 7, wherein the matching the short text to be processed with the disambiguated text set of each word sense item respectively comprises:
acquiring a compressed average vector of the short text to be processed and a disambiguation vector of each word sense item;
and matching the compressed average vector of the short text to be processed with the disambiguation vector of each word sense item respectively.
9. The method of claim 8, further comprising:
acquiring disambiguation vectors of the word-sense-free text set of the target ambiguous word;
matching the compressed average vector of the short text to be processed with the disambiguation vector of the word sense-free text set;
the determining a chain finger result for the target ambiguous word based on the matching result comprises:
judging whether the matching degree of the word-sense-free text set is better than that of all the disambiguation text sets;
if not, determining the word sense item corresponding to the disambiguation text set with the highest matching degree as a chain finger result of the target ambiguous word;
if yes, selecting one word sense item from the plurality of word sense items according to a preset rule, and taking the word sense item as a chain finger result of the target ambiguous word.
10. A text processing apparatus, the apparatus comprising:
a construction module configured to construct a text data set for each word sense item of an ambiguous word, wherein the ambiguous word corresponds to a plurality of word sense items;
a parsing module configured to perform semantic parsing on the short text containing the ambiguous word to obtain a plurality of candidate disambiguation words, wherein a disambiguation word is used for pointing the ambiguous word to a specific word sense item;
a matching module configured to match each candidate disambiguation word with the text data set of each word sense item respectively;
a determination module configured to, in response to a candidate disambiguation word successfully matching only the text data set of one of the word sense items, determine that candidate disambiguation word as a target disambiguation word for the matched word sense item;
and a storage module configured to store the short text containing the target disambiguation word into the disambiguated text set of the matched word sense item.
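The matching and determination modules of claim 10 amount to keeping only those candidate disambiguation words that pin down exactly one word sense item. A minimal sketch, assuming each sense's text data set is represented as a set of words (the patent does not fix a representation):

```python
def find_target_disambiguation_words(candidates, sense_datasets):
    """Keep each candidate disambiguation word only if it matches the text
    data set of exactly one word sense item.

    sense_datasets: dict mapping word sense item -> set of words in its
                    text data set (an assumed representation)
    Returns a dict mapping target disambiguation word -> matched sense item.
    """
    targets = {}
    for word in candidates:
        matched = [sense for sense, words in sense_datasets.items() if word in words]
        if len(matched) == 1:
            # unambiguous match: the word points to a single sense,
            # so it becomes a target disambiguation word for that sense
            targets[word] = matched[0]
    return targets
```

A candidate that matches several senses (or none) is discarded, since it cannot disambiguate the ambiguous word on its own.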
11. The apparatus of claim 10, further comprising:
a first compression module configured to respectively perform a vector compression operation on the disambiguated text set of each word sense item to obtain a compressed average vector of the disambiguated text set as the disambiguation vector of the corresponding word sense item;
wherein the vector compression operation includes:
performing vector conversion on each short text in the text set to obtain a plurality of text vectors;
and performing weighted fitting on the plurality of text vectors to obtain the compressed average vector.
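The vector compression operation above can be sketched as follows. The claim does not fix a concrete "weighted fitting" formula, so this sketch assumes a weighted arithmetic mean of the text vectors; the weights themselves (e.g. uniform, or confidence-based) are an open design choice.

```python
def compress_text_vectors(text_vectors, weights=None):
    """Weighted fitting of several text vectors into one compressed average
    vector (assumed here to be a weighted arithmetic mean)."""
    if weights is None:
        weights = [1.0] * len(text_vectors)  # uniform weighting by default
    total = sum(weights)
    dim = len(text_vectors[0])
    # component-wise weighted average over all text vectors
    return [
        sum(w * vec[i] for w, vec in zip(weights, text_vectors)) / total
        for i in range(dim)
    ]
```

Compressing a whole text set to a single vector lets the later matching steps compare one vector per word sense item instead of every short text individually.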
12. The apparatus of claim 11, further comprising:
a configuration module configured to configure a word-sense-free text set for the ambiguous word, wherein the word-sense-free text set comprises a plurality of word-sense-free short texts, and the word-sense-free short texts contain the ambiguous word and do not contain the target disambiguation word;
and a second compression module configured to perform the vector compression operation on the word-sense-free text set and take the obtained compressed average vector as the disambiguation vector of the word-sense-free text set.
13. The apparatus of claim 10, further comprising:
an information acquisition module configured to acquire key text information from the text data set of a word sense item;
the splicing module is configured to extract an expanded disambiguation word from the key text information, splice the expanded disambiguation word with the disambiguation word, and obtain a spliced short text;
and the text saving module is configured to save the spliced short text to the disambiguated text set of the word sense item.
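The splicing module of claim 13 can be illustrated by a simple text-assembly step. The space-joined "ambiguous word + disambiguation word + expanded word" format is an assumption for illustration; the claim only requires that the expanded disambiguation word and the disambiguation word be spliced into a short text.

```python
def splice_short_texts(ambiguous_word, expanded_words, disambiguation_word):
    """Splice each expanded disambiguation word (extracted from key text
    information) with the disambiguation word into a short text that can be
    saved into the sense's disambiguated text set."""
    return [f"{ambiguous_word} {disambiguation_word} {w}" for w in expanded_words]
```

These synthetic short texts enlarge the disambiguated text set, which in turn sharpens the compressed average vector computed from it.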
14. The apparatus of any of claims 10-13, wherein the build module comprises:
a generation unit configured to generate an initial text resource set;
and the acquisition unit is configured to acquire text resources matched with each word sense item of the ambiguous word from the initial text resource set respectively to obtain a text data set of each word sense item.
15. The apparatus of claim 14, wherein the parsing module comprises:
a screening unit configured to screen short text containing the ambiguous word from the initial text resource set;
and the parsing unit is configured to perform text parsing on the short text containing the ambiguous words to obtain the plurality of candidate disambiguation words.
16. A chain finger device for short text, the device comprising:
a first acquisition module configured to acquire a short text to be processed and determine a target ambiguous word from the short text to be processed;
a second acquisition module configured to acquire a plurality of word sense items of the target ambiguous word, and a disambiguated text set for each word sense item, wherein the disambiguated text set is obtained by the text processing apparatus of any one of claims 10-15;
a first matching module configured to match the short text to be processed with the disambiguated text set of each word sense item, respectively;
and the chain finger module is configured to determine a chain finger result for the target ambiguous word based on the matching result.
17. The apparatus of claim 16, wherein the first matching module comprises:
a vector acquisition unit configured to acquire a compressed average vector of the short text to be processed and a disambiguation vector of each word sense item;
and the text matching unit is configured to match the compressed average vector of the short text to be processed with the disambiguation vector of each word sense item respectively.
18. The apparatus of claim 17, further comprising:
a third acquisition module configured to acquire the disambiguation vector of the word-sense-free text set of the target ambiguous word;
and a second matching module configured to match the compressed average vector of the short text to be processed with the disambiguation vector of the word-sense-free text set;
the chain finger module includes:
a judging unit configured to judge whether the matching degree of the word-sense-free text set is better than the matching degree of each of the disambiguated text sets;
a first chain finger unit configured to, if not, determine the word sense item corresponding to the disambiguated text set with the highest matching degree as the chain finger result of the target ambiguous word;
and a second chain finger unit configured to, if yes, select one word sense item from the plurality of word sense items according to a preset rule as the chain finger result of the target ambiguous word.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-9.
CN202210612667.1A 2022-05-31 2022-05-31 Text processing method, chain finger method and device for short text and storage medium Active CN114818736B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210612667.1A CN114818736B (en) 2022-05-31 2022-05-31 Text processing method, chain finger method and device for short text and storage medium

Publications (2)

Publication Number Publication Date
CN114818736A CN114818736A (en) 2022-07-29
CN114818736B true CN114818736B (en) 2023-06-09

Family

ID=82518524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210612667.1A Active CN114818736B (en) 2022-05-31 2022-05-31 Text processing method, chain finger method and device for short text and storage medium

Country Status (1)

Country Link
CN (1) CN114818736B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116860703B (en) * 2023-07-13 2024-04-16 杭州再启信息科技有限公司 Data processing system, method and storage medium based on artificial intelligence

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text handling method based on ambiguity entity word and device
CN109614620A (en) * 2018-12-10 2019-04-12 齐鲁工业大学 A kind of graph model Word sense disambiguation method and system based on HowNet
CN110390106A (en) * 2019-07-24 2019-10-29 中南民族大学 Semantic disambiguation method, device, equipment and storage medium based on bi-directional association
CN111414763A (en) * 2020-02-28 2020-07-14 长沙千博信息技术有限公司 Semantic disambiguation method, device, equipment and storage device for sign language calculation
CN113204962A (en) * 2021-05-31 2021-08-03 平安科技(深圳)有限公司 Word sense disambiguation method, device, equipment and medium based on graph expansion structure
CN113704416A (en) * 2021-10-26 2021-11-26 深圳市北科瑞声科技股份有限公司 Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN113901836A (en) * 2021-11-16 2022-01-07 东莞理工学院 Word sense disambiguation method and device based on context semantics and related equipment
CN114020876A (en) * 2021-11-16 2022-02-08 网易(杭州)网络有限公司 Method, device and equipment for extracting keywords of text and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825901B (en) * 2019-11-11 2024-08-06 腾讯科技(北京)有限公司 Image-text matching method, device, equipment and storage medium based on artificial intelligence
CN111191466B (en) * 2019-12-25 2022-04-01 中国科学院计算机网络信息中心 Homonymous author disambiguation method based on network characterization and semantic characterization
CN112906397B (en) * 2021-04-06 2021-11-19 南通大学 Short text entity disambiguation method
CN113283236B (en) * 2021-05-31 2022-07-19 北京邮电大学 Entity disambiguation method in complex Chinese text


Also Published As

Publication number Publication date
CN114818736A (en) 2022-07-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant