CN117009488A - Candidate text determination method and device - Google Patents

Candidate text determination method and device Download PDF

Info

Publication number
CN117009488A
Authority
CN
China
Prior art keywords
text
candidate
texts
replied
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311036568.4A
Other languages
Chinese (zh)
Inventor
白金国
李长亮
李小龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd filed Critical Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202311036568.4A priority Critical patent/CN117009488A/en
Publication of CN117009488A publication Critical patent/CN117009488A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a candidate text determination method and apparatus, wherein the candidate text determination method comprises the following steps: determining a semantic vector of a question to be answered based on the acquired question to be answered, and acquiring semantic vectors of a plurality of texts in a text library; determining, from the text library, first candidate texts semantically related to the question to be answered according to the similarity between the semantic vector of the question and the semantic vectors of the texts; performing word segmentation on the question to be answered to obtain a plurality of first word units of the question; determining a similarity score of each text relative to the question based on the weight value of each first word unit and the relevance value of each first word unit to each text in the text library, and determining texts whose similarity score is greater than a second threshold as second candidate texts; and determining candidate texts based on the first candidate texts and the second candidate texts. Because the candidate texts are determined by two complementary routes, the accuracy of the determined candidate texts is improved.

Description

Candidate text determination method and device
This application is a divisional application of the application with filing number 202110484317.7, filed on April 30, 2021, and entitled "Text processing method and apparatus".
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a candidate text determining method and apparatus, a computing device, and a computer readable storage medium.
Background
In a question-answering system, after a question is acquired, information retrieval is first performed to obtain texts related to the question, and the answer to the question is then determined from the obtained texts. If the texts obtained by information retrieval are irrelevant, the accuracy of the determined answer is affected, which in turn degrades the performance of the question-answering system; information retrieval is therefore of great importance.
In the prior art, to improve the recall rate of information retrieval and raise the relevance between the retrieved texts and the question, semantic retrieval is generally adopted to determine texts semantically related to the question. Specifically, the semantic vector of the question to be answered and the semantic vectors of a plurality of texts in the text library can be determined by a retrieval model, and the similarity between each text's semantic vector and the question's semantic vector is then computed; a higher similarity indicates that the text is closer in meaning to the question, so texts whose semantic vectors have high similarity to that of the question can be determined as texts semantically related to the question.
However, in the above manner, the semantic vector of the question to be answered is determined solely by the retrieval model, and the performance of the retrieval model depends on how it was trained. If the semantic vector cannot accurately represent the question, the texts determined from it may be unrelated to the question; that is, semantic retrieval may recall irrelevant texts, the answer determined from those irrelevant texts may be inaccurate, and the performance of the question-answering system is consequently affected.
Disclosure of Invention
In view of the above, embodiments of the present application provide a candidate text determination method and apparatus, a computing device, and a computer-readable storage medium, so as to solve the technical drawbacks in the prior art.
According to a first aspect of the embodiments of the present application, there is provided a candidate text determination method, including:
determining a semantic vector of a question to be answered based on the acquired question to be answered, and acquiring semantic vectors of a plurality of texts in a text library;
determining, from the text library, first candidate texts semantically related to the question to be answered according to the similarity between the semantic vector of the question and the semantic vectors of the texts;
performing word segmentation on the question to be answered to obtain a plurality of first word units of the question;
determining a similarity score of each text relative to the question based on the weight value of each first word unit and the relevance value of each first word unit to each text in the text library, and determining texts whose similarity score is greater than a second threshold as second candidate texts;
and determining candidate texts based on the first candidate texts and the second candidate texts.
According to a second aspect of the embodiments of the present application, there is provided a candidate text determination apparatus, including:
a first determining module configured to determine a semantic vector of a question to be answered based on the acquired question, and to acquire semantic vectors of a plurality of texts in a text library;
a second determining module configured to determine, from the text library, first candidate texts semantically related to the question according to the similarity between the semantic vector of the question and the semantic vectors of the texts;
a word segmentation module configured to segment the question to obtain a plurality of first word units of the question;
a third determining module configured to determine a similarity score of each text relative to the question based on the weight value of each first word unit and the relevance value of each first word unit to each text in the text library, and to determine texts whose similarity score is greater than a second threshold as second candidate texts;
and a fourth determining module configured to determine candidate texts based on the first candidate texts and the second candidate texts.
According to a third aspect of the embodiments of the present application, there is provided a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the candidate text determination method when executing the instructions.
According to a fourth aspect of the embodiments of the present application, there is provided a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the candidate text determination method.
According to a fifth aspect of the embodiments of the present application, there is provided a chip storing computer instructions which, when executed by the chip, implement the steps of the candidate text determination method.
In the embodiments of the present application, a semantic vector of a question to be answered is determined based on the acquired question, and semantic vectors of a plurality of texts in a text library are acquired; first candidate texts semantically related to the question are determined from the text library according to the similarity between the semantic vector of the question and the semantic vectors of the texts; word segmentation is performed on the question to obtain a plurality of first word units; a similarity score of each text relative to the question is determined based on the weight value of each first word unit and the relevance value of each first word unit to each text in the text library, and texts whose similarity score is greater than a second threshold are determined as second candidate texts; finally, candidate texts are determined based on the first and second candidate texts. Since the candidate texts are determined by two complementary retrieval routes, the accuracy of the determined candidate texts is improved, which in turn improves the performance of the question-answering system.
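As a rough illustration only (not the patented implementation; function and variable names are invented for this sketch), combining the candidate texts obtained by the two retrieval routes can be sketched as a deduplicating union:

```python
def merge_candidates(first_candidates, second_candidates):
    """Merge candidate texts from semantic retrieval (first candidates)
    and term-weight scoring (second candidates), keeping order and
    removing duplicates."""
    seen = set()
    merged = []
    for text in list(first_candidates) + list(second_candidates):
        if text not in seen:
            seen.add(text)
            merged.append(text)
    return merged
```

The patent does not specify how the two sets are combined; a union is one plausible reading.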
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a flowchart of a text processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a text processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of determining candidate text according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a graph network provided by an embodiment of the present application;
FIG. 6 is a flowchart of another text processing method provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a text processing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the present application can be embodied in many forms other than those described herein, and those skilled in the art can make similar adaptations without departing from its spirit; the present application is therefore not limited to the specific embodiments disclosed below.
The terminology used in the one or more embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the application. As used in one or more embodiments of the application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of the application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the application. The word "if" as used herein may be interpreted as "responsive to a determination" depending on the context.
First, terms related to one or more embodiments of the present application will be explained.
Information retrieval: a method for querying information.
Semantic retrieval: retrieval performed according to semantics rather than exact keyword matching.
DPR model: a (Dense Passage Retrieval) model that performs semantic retrieval, outputting candidate texts related to an input question.
Recall rate: the ratio of the number of related texts retrieved to the number of related texts actually present in the text library, where a related text is a text truly related to the question to be answered.
Adjacency matrix: a matrix representing the adjacency relations between nodes; the adjacency matrix of an undirected graph is symmetric.
Text screening network: a network that screens input texts so as to determine the texts meeting the requirements.
Graph neural network: a deep learning network that processes graph data.
BM25 algorithm: an extension of the binary independence model; a relevance ranking algorithm usable in retrieval.
Semantic vector: a vector for characterizing the semantic features of a text.
Hidden-layer feature vector: a vector representation obtained by combining context information.
Word embedding: the process of embedding a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, with each word or phrase mapped to a vector over the real numbers.
word2vec: a word embedding method; an efficient word vector training method proposed by Mikolov on the basis of Bengio's neural network language model (NNLM). It can be used to perform word embedding on a text to obtain its word vectors.
Word vector: a representation of a word in a form that a computer can process.
BERT model: a (Bidirectional Encoder Representations from Transformers) model; a bidirectional attention neural network model.
First word unit: a word unit obtained by performing word segmentation on the question to be answered.
Second word unit: a word unit obtained by performing word segmentation on a candidate text.
First feature vector: the vector representation of a first word unit after combining the word vectors of the other first word units in the question to be answered.
Second feature vector: the vector representation of a second word unit after combining the word vectors of the other second word units in the corresponding candidate text.
In the present application, a text processing method and apparatus, a computing device, and a computer-readable storage medium are provided, which are described in detail in the following embodiments.
FIG. 1 illustrates a block diagram of a computing device 100, according to an embodiment of the application. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. Processor 120 is coupled to memory 110 via bus 130 and database 150 is used to store data.
Computing device 100 also includes an access device 140 that enables computing device 100 to communicate via one or more networks 160. Examples of such networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a worldwide interoperability for microwave access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near-field communication (NFC) interface, and so forth.
In one embodiment of the application, the above-described components of computing device 100, as well as other components not shown in FIG. 1, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 1 is for exemplary purposes only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
Wherein the processor 120 may perform the steps of the text processing method shown in fig. 2. Fig. 2 shows a flow chart of a text processing method according to an embodiment of the application, comprising steps 202 to 206.
Step 202: based on the acquired question to be answered, determine a semantic vector of the question, a plurality of candidate texts, and semantic vectors of the candidate texts, where each candidate text is a text in the text library that is semantically related to the question.
In practical applications, after a question to be answered is obtained, the semantic vector of the question and the semantic vectors of the texts in the text library can be determined by a retrieval model. A text whose semantic vector has high similarity to that of the question can be regarded as close in meaning to the question, so such texts can be determined as texts related to the question; in this case, a large number of texts is usually obtained. However, since both the vectorization of the question and the vectorization of the library texts are determined by the retrieval model, and the performance of the retrieval model depends on how it was trained, the resulting semantic vectors are uncontrollable: they may fail to accurately represent the question or the texts. Texts determined from inaccurate semantic vectors may be unrelated to the question, i.e., semantic retrieval may recall irrelevant texts, which can also be viewed as reducing the recall rate of the retrieval. Moreover, answers determined from texts unrelated to the question may be inaccurate, which likewise affects the performance of the question-answering system.
Therefore, the present application provides a text processing method that further screens the candidate texts obtained by the preliminary retrieval: candidate texts irrelevant to the question to be answered are deleted, and target texts highly relevant to the question are obtained. By further screening on top of the large-scale recall of semantic retrieval, irrelevant texts can be filtered out. This improves the recall rate of the retrieval, enhances the reliability of semantic retrieval, and ensures higher accuracy of answers determined based on the target texts, i.e., it improves the performance of the question-answering system.
As an example, the semantic vector of the question to be answered is a feature vector that characterizes the semantics of the question, and the semantic vector of a candidate text is a feature vector that characterizes the semantics of that candidate text.
As one example, a question to be answered is a question that requires a corresponding answer. For example, the question to be answered may be "What is the smallest natural number", or "What is the smallest prime number", or "Which countries are among the four great ancient civilizations", and so on.
In a first possible implementation, determining the semantic vector of the question to be answered, the plurality of candidate texts, and the semantic vectors of the candidate texts based on the acquired question may specifically include: performing feature extraction on the question to determine its semantic vector; acquiring semantic vectors of a plurality of texts in the text library; determining a similarity score of each text relative to the question based on the semantic vector of the question and the semantic vectors of the texts; and determining the plurality of candidate texts based on the similarity score of each text relative to the question, and acquiring the semantic vectors of those candidate texts.
The similarity score may be used to characterize the similarity between the text and the question to be answered, where a higher similarity score indicates that the text is more similar to the question to be answered, and a lower similarity score indicates that the text is less similar to the question to be answered.
That is, feature extraction can be performed on the question to be answered to obtain its semantic vector, feature extraction can be performed on the texts in the text library to obtain the semantic vector of each text, and candidate texts semantically related to the question can then be determined from the text library according to the similarity between the semantic vector of the question and the semantic vectors of the texts.
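The "encode, compare similarity, rank" step above can be sketched as follows. This is an illustrative sketch only, assuming cosine similarity and pre-computed vectors; the patent does not fix a particular similarity function, and all names here are invented:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two semantic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_retrieve(question_vec, text_vecs, top_k=3):
    """Rank library texts by similarity of their semantic vectors to the
    question's semantic vector; return indices of the top_k texts."""
    ranked = sorted(range(len(text_vecs)),
                    key=lambda i: cosine_similarity(question_vec, text_vecs[i]),
                    reverse=True)
    return ranked[:top_k]
```

In practice the vectors would come from the feature extraction module described below rather than being supplied by hand.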
In some embodiments, the question to be answered and the texts in the text library may be input into a semantic retrieval model to determine the plurality of candidate texts. The semantic retrieval model may comprise a feature extraction module and a text retrieval module: the feature extraction module performs feature extraction on the question and on each text in the text library to obtain the semantic vector of the question and the semantic vector of each text, after which the text retrieval module determines candidate texts semantically related to the question from those semantic vectors.
As one example, the feature extraction module may include a word embedding layer and an encoding layer. The word embedding layer performs word embedding on the input text to obtain word vectors, and the encoding layer encodes the input word vectors to obtain semantic vectors.
In a specific implementation, word segmentation can be performed separately on the input question to be answered and on the plurality of texts in the text library, to obtain a plurality of first word units of the question and a plurality of second word units of each text. As an example, the question and the texts may each be segmented according to a pre-built vocabulary. For example, in the pre-built vocabulary, if the text is Chinese, a word or a punctuation mark may serve as a word unit; if the text is in a foreign language, a word or a punctuation mark may serve as a word unit; and if the text includes numbers, a number may serve as a word unit.
For example, if the question to be answered is "the smallest natural number is what", word segmentation yields the first word units [the smallest, natural number, is, what]; if the question is "What is the smallest prime number", word segmentation yields the first word units [What, is, the, smallest, prime, number]. Assuming a text is "0 is the smallest natural number", word segmentation yields the second word units [0, is, the smallest, natural number]; assuming a text is "a natural number is an integer greater than or equal to 0", word segmentation yields the second word units [natural number, is, greater than, or equal to, 0, integer].
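A minimal vocabulary-based segmenter in the spirit of the example above might use greedy forward maximum matching. This is a sketch under assumptions (the patent only says segmentation uses a pre-built vocabulary; the algorithm and names here are illustrative):

```python
def segment(text, vocab, max_len=4):
    """Greedy forward maximum matching against a pre-built vocabulary.
    Characters not covered by the vocabulary fall back to
    single-character word units."""
    units, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                units.append(piece)
                i += length
                break
    return units
```

For space-delimited languages, splitting on whitespace and punctuation would play the same role.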
In a specific implementation, after word segmentation of the question to be answered, word embedding may be performed on each first word unit of the question and on each second word unit of the texts in the text library, mapping each word unit to a low-dimensional vector space to obtain the word vector of each word unit. For convenience of description, first word units and second word units are collectively referred to as word units.
As an example, word embedding may be performed on each first word unit of the question by means of one-hot coding to obtain the word vector of each first word unit, and likewise on each second word unit to obtain the word vector of each second word unit.
As another example, word embedding may be performed on each first word unit of the question by means of word2vec coding to obtain the word vector of each first word unit, and likewise on each second word unit to obtain the word vector of each second word unit.
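The one-hot option mentioned above is simple enough to sketch directly (word2vec would instead require training dense vectors). A minimal sketch with an invented helper name:

```python
def one_hot(word, vocabulary):
    """One-hot encode a word unit over a fixed vocabulary list: a vector
    of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec
```

Note that one-hot vectors have the vocabulary's dimensionality; the word embedding layer described above maps them further into a low-dimensional space.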
In a specific implementation, after the word embedding layer produces the word vectors, the word vectors of each first word unit and each second word unit can be input into the encoding layer for encoding. This yields, for each first word unit, a vector representation that combines the word vectors of the other first word units in the question, i.e., its first feature vector, and for each second word unit, a vector representation that combines the word vectors of the other second word units in the corresponding text, i.e., its second feature vector. The first feature vectors of the question's first word units are then spliced to obtain the semantic vector of the question, and the second feature vectors of a text's second word units are spliced to obtain the semantic vector of that text.
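The splicing step at the end of the paragraph above is plain concatenation of the per-word-unit feature vectors. A minimal sketch (the encoding layer itself, which produces the contextual feature vectors, is not modeled here):

```python
def splice_semantic_vector(feature_vectors):
    """Splice per-word-unit contextual feature vectors into a single
    semantic vector by concatenation."""
    semantic = []
    for vec in feature_vectors:
        semantic.extend(vec)
    return semantic
```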
In some embodiments, after the feature extraction module obtains the semantic vector of the question to be answered and the semantic vectors of the texts in the text library, these vectors may be input into the text retrieval module, which determines the similarity score between the semantic vector of the question and the semantic vector of each text, yielding a plurality of similarity scores; candidate texts are then determined from the texts in the library according to these similarity scores.
As an example, in the text retrieval module, the semantic vector of the question to be answered may be multiplied by the semantic vector of each text, and the products may be normalized, so as to obtain the similarity score of the question against each text, i.e., a plurality of similarity scores.
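One plausible reading of "multiplied and normalized" is a dot product followed by a softmax over all texts. The softmax choice is an assumption (the patent only says the products are normalized), and the names are illustrative:

```python
import math

def similarity_scores(question_vec, text_vecs):
    """Dot product of the question's semantic vector with each text's
    semantic vector, normalized with a softmax over all texts."""
    dots = [sum(q * t for q, t in zip(question_vec, v)) for v in text_vecs]
    exp = [math.exp(d) for d in dots]
    total = sum(exp)
    return [e / total for e in exp]
```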
It should be noted that the above feature extraction module is only one example of the present application. In other embodiments, the feature extraction module may be any structure that includes word segmentation, word embedding, and encoding functions, which is not limited by the embodiments of the present application. For example, the feature extraction module may adopt the structure of the BERT model. In addition, the semantic retrieval model may be a DPR model, from which a plurality of candidate texts semantically related to the question to be answered can be obtained.
In one embodiment, after the plurality of similarity scores is determined, candidate texts are further determined according to those scores. Thus, the specific implementation of determining the plurality of candidate texts based on the similarity score of each text relative to the question may include: taking the texts whose similarity score is greater than a second threshold as the candidate texts.
It should be noted that, the second threshold may be set by the user according to the actual requirement, or may be set by default by the device, which is not limited in the embodiment of the present application. For example, the second threshold may be 0.8.
For example, the greater the similarity score, the greater the semantic relevance of the text to the question to be replied, and the smaller the similarity score, the smaller that relevance. Therefore, if the similarity score of a text is greater than the second threshold, the similarity may be considered sufficiently high, i.e., the text is semantically relevant enough to the question to be replied, and the text may be determined as a candidate text.
For example, referring to fig. 3, fig. 3 is a schematic diagram of a text processing method according to an embodiment of the present application. After the questions to be answered are input into the semantic retrieval model, the semantic vectors of the questions to be answered and the semantic vectors of a plurality of texts can be output through the feature extraction module, and 1000 candidate texts and the semantic vectors of 1000 candidate texts can be obtained through the text retrieval module.
Further, after the multiple candidate texts are determined, they can be preliminarily ranked by the BM25 algorithm: the top-ranked N candidate texts are retained and the lower-ranked ones are deleted. This reduces the number of candidate texts obtained after the preliminary screening and thus the amount of computation in the text screening network.
In the implementation mode, a plurality of candidate texts related to the questions to be answered are determined from a text library through a semantic retrieval method, and a plurality of candidate texts with relatively high relevance to the questions to be answered can be recalled through a semantic retrieval model.
In the embodiment of the application, feature extraction is performed on the question to be replied and the texts, so that semantic vectors representing their semantics are determined, and the candidate texts semantically related to the question to be replied are determined according to the similarity between the semantic vectors. The semantic vector of the question to be replied is not a concatenation of the word vectors of individual first word units, but is obtained from the first feature vectors, each of which combines a first word unit with the full-text semantic information, so the question to be replied can be represented more accurately. Likewise, the semantic vector of a candidate text is not a concatenation of the word vectors of individual second word units, but is obtained from the second feature vectors, each of which combines a second word unit with the full-text semantic information, so the candidate text can be represented more accurately, improving retrieval accuracy and recall.
In a second possible implementation manner, a plurality of candidate texts can be determined from the texts in the text library through a BM25 algorithm, then feature extraction is performed on the questions to be answered and the determined candidate texts, and semantic vectors of the questions to be answered and semantic vectors of the plurality of candidate texts can be determined.
In some embodiments, determining a plurality of candidate texts from a text library by a BM25 algorithm may include: word segmentation processing is carried out on the questions to be replied to obtain a plurality of first word units of the questions to be replied; determining the relevance value of each first word unit and each text, so that a plurality of relevance values of each first word unit can be obtained, and each relevance value corresponds to one text; determining a weight value of each first word unit; based on the weight value of each first word unit and the multiple relevance values of each first word unit, similarity scores of each text relative to the questions to be replied can be determined, and multiple similarity scores are obtained. And comparing the similarity scores with a second threshold value, and determining a plurality of texts with similarity scores larger than the second threshold value as a plurality of candidate texts.
As an example, the questions to be answered may be word-segmented according to a pre-formulated vocabulary. Illustratively, assuming that the question to be answered is "the smallest natural number is a few", word segmentation processing is performed on the question to be answered, and a plurality of first word units may be obtained as [ the smallest, natural number, yes, few ].
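The word segmentation against a pre-formulated vocabulary can be sketched as a greedy forward longest-match pass. This is an illustrative assumption: the patent does not specify the segmentation algorithm, and the function name and maximum word length are hypothetical.

```python
def segment(text, vocab, max_len=4):
    """Greedy forward longest-match segmentation against a fixed vocabulary.
    Characters not covered by any vocabulary word become single-unit tokens."""
    units, i = [], 0
    while i < len(text):
        # Try the longest window first, shrinking until a vocabulary hit
        # (or a single character, which is always emitted as its own unit).
        for length in range(min(max_len, len(text) - i), 0, -1):
            if length == 1 or text[i:i + length] in vocab:
                units.append(text[i:i + length])
                i += length
                break
    return units
```

Applied to the question "the smallest natural number is a few", such a pass would yield first word units like [smallest, natural number, is, a few], matching the example above.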
As an example, taking the first word unit q_i and a text d, determining the relevance value of q_i with respect to d may include: determining the frequency with which q_i occurs in d, determining the average length of all texts in the text library, and determining the length of d; based on this frequency, the average length, and the length of d, the relevance value of q_i with respect to d can be determined.
Illustratively, the relevance value of the first word unit q_i with respect to the text d may be determined by the following formula (1):

R(q_i, d) = f_i · (k_1 + 1) / (f_i + k_1 · (1 − b + b · dl / avg(dl)))    (1)

where R(q_i, d) represents the relevance value of the first word unit q_i with respect to the text d, f_i represents the frequency with which q_i occurs in d, k_1 and b are both adjustment factors, typically set empirically (typically k_1 = 2 and b = 0.75), dl represents the length of the text d, and avg(dl) represents the average length of all texts in the text library.
Through the above formula (1), the relevance value of each first word unit with respect to each text can be determined.
As an example, taking the first word unit q_i, determining the weight value of q_i may include: determining the total number of all texts in the text library, and determining the number of texts in the text library that contain q_i; based on the total number and the number of texts containing q_i, the weight value of q_i can be determined.
Illustratively, the weight value of the first word unit q_i may be determined by the following formula (2):

W_i = log((N − n(q_i) + 0.5) / (n(q_i) + 0.5))    (2)

where W_i represents the weight value of the first word unit q_i, N represents the total number of texts in the text library, and n(q_i) represents the number of texts containing q_i.
By the above formula (2), a weight value of each first word unit can be determined.
As an example, taking a text d, after the relevance value of each first word unit with respect to d has been determined, and the weight value of each first word unit has been determined, the similarity score of d with respect to the question to be replied may be determined by the following formula (3):

Score(Q, d) = Σ_{i=1}^{n} W_i · R(q_i, d)    (3)

where Q represents the question to be replied, Score(Q, d) represents the similarity score of the text d with respect to Q, n represents the number of first word units in the question to be replied, W_i represents the weight value of the first word unit q_i, and R(q_i, d) represents the relevance value of q_i with respect to d.
Through the above formula (3), a similarity score for each text with respect to the question to be answered can be determined.
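Formulas (1) through (3) can be sketched together in Python. This is an illustrative sketch, not the claimed implementation: the function names are hypothetical, texts are assumed to be pre-segmented lists of word units, and the weight uses the common BM25 IDF form with 0.5 smoothing.

```python
import math

K1, B = 2.0, 0.75  # adjustment factors k_1 and b, set empirically

def relevance(freq, doc_len, avg_len, k1=K1, b=B):
    """Formula (1): per-word-unit relevance R(q_i, d)."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_len))

def weight(n_docs, n_containing):
    """Formula (2): IDF-style weight W_i of a word unit."""
    return math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5))

def bm25_score(query_units, doc, corpus):
    """Formula (3): Score(Q, d) = sum over i of W_i * R(q_i, d)."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for q in query_units:
        freq = doc.count(q)                               # f_i
        n_containing = sum(1 for d in corpus if q in d)   # n(q_i)
        if freq == 0 or n_containing == 0:
            continue  # word unit absent: contributes nothing
        score += weight(len(corpus), n_containing) * \
                 relevance(freq, len(doc), avg_len)
    return score
```

Scoring every text in the library this way, then keeping those above the second threshold (or the top-N after sorting, as in the preliminary ranking described earlier), yields the BM25 candidate texts.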
After the similarity score of each text relative to the questions to be replied is determined, the text with the similarity score larger than the second threshold value can be determined as the candidate text, and then the candidate text and the questions to be replied are input into a feature extraction model to perform feature extraction, so that the semantic vector of each candidate text and the semantic vector of the questions to be replied can be obtained.
It should be noted that the above implementation process of determining a plurality of candidate texts by using the BM25 algorithm is only an example, and in practical implementation, the BM25 algorithm may be used after being adaptively adjusted, which is not limited by the embodiment of the present application. In addition, the implementation process of determining the candidate text according to the similarity score and extracting the features of the candidate text and the questions to be answered is the same as that of the previous implementation, and the specific implementation thereof can be referred to the related description in the first implementation, which is not repeated here.
In this implementation, a plurality of candidate texts related to the questions to be answered are determined from the text library by the BM25 retrieval method, and a plurality of candidate texts with relatively high relevance to the questions to be answered can be recalled.
In a third possible implementation manner, the first candidate text may be obtained from the text library through a semantic search model, the second candidate text may be obtained from the text library through a BM25 search algorithm, and the plurality of candidate texts may be determined based on the first candidate text and the second candidate text. And, semantic vectors of questions to be answered and semantic vectors of a plurality of candidate texts are obtained.
It should be noted that, the implementation process of obtaining the first candidate text from the text library through semantic search is the same as the implementation process of determining the candidate text in the first implementation manner, and specific implementation thereof may refer to the related description in the first implementation manner, which is not limited by the embodiment of the present application. The implementation process of obtaining the second candidate text from the text library through the BM25 search algorithm is the same as the implementation process of determining the candidate text in the second implementation, and specific implementation of the implementation may refer to the description related to the second implementation, which is not limited by the embodiment of the present application.
In some embodiments, the intersection of the first candidate text and the second candidate text may be determined as a plurality of candidate texts, i.e., text that repeatedly appears in the first candidate text and the second candidate text is determined as a candidate text. By way of example, assuming that the first candidate text includes text 1, text 2, and text 4 and the second candidate text includes text 1, text 3, and text 4, text 1 and text 4 may be determined as candidate texts. The candidate text thus determined has a higher correlation with the question to be answered than the candidate text determined by only one retrieval method, i.e. the determined candidate text is more accurate.
In other embodiments, a union of the first candidate text and the second candidate text may be determined as the plurality of candidate texts. By way of example, assuming that the first candidate text includes text 1, text 2, and text 4 and the second candidate text includes text 1, text 3, and text 4, text 1, text 2, text 3, and text 4 may be determined as candidate texts. In this way, the text related to the questions to be answered can be acquired as much as possible, and the situation of missing the related text can be avoided.
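The two combination strategies just described (intersection for precision, union for recall) can be sketched as follows; the function name and `mode` parameter are illustrative assumptions.

```python
def merge_candidates(first, second, mode="union"):
    """Combine candidate texts recalled by the semantic retrieval model
    (first) and by the BM25 algorithm (second)."""
    if mode == "intersection":
        # Texts recalled by both methods: more accurate candidates.
        return [t for t in first if t in second]
    # Texts recalled by either method, without duplicates: fewer misses.
    return first + [t for t in second if t not in first]
```

With the example above, first = [text 1, text 2, text 4] and second = [text 1, text 3, text 4] give [text 1, text 4] under intersection and all four texts under union.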
In addition, in implementation, when the first candidate text is determined by the semantic retrieval model, the semantic vector of the question to be replied and the semantic vector of the first candidate text can be obtained, and the semantic vector of the second candidate text can be obtained through feature extraction.
As an example, if the finally determined plurality of candidate texts includes a text that does not belong to the first candidate text, feature extraction may be performed on the text that does not belong to the first candidate text, and a semantic vector of the text that does not belong to the first candidate text may be obtained, and thus a semantic vector of the plurality of candidate texts may be obtained. For example, assuming that the first candidate text includes text 1, text 2 and text 4, and the candidate texts include text 1, text 3 and text 4, the feature extraction module of the semantic search model may obtain the semantic vector of text 1, the semantic vector of text 2 and the semantic vector of text 4, and text 3 is a text not belonging to the first candidate text, feature extraction may be performed on text 3 to determine the semantic vector of text 3, so that the semantic vectors of 3 candidate texts may be determined.
As another example, if the finally determined plurality of candidate texts is an intersection of the first candidate text and the second candidate text, that is, there is no text that does not belong to the first candidate text, the semantic vector of the first candidate text determined by the semantic search model may be determined as the semantic vector of the plurality of candidate texts. For example, assuming that the first candidate text includes text 1, text 2 and text 4, and the candidate text includes text 1 and text 4, the feature extraction module of the semantic search model may obtain the semantic vector of text 1, the semantic vector of text 2 and the semantic vector of text 4, so that the semantic vectors of 2 candidate texts may be directly obtained.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating candidate text determination according to an embodiment of the present application. In fig. 4, N first candidate texts and their semantic vectors can be determined by the semantic retrieval model, M second candidate texts can be determined by the BM25 retrieval algorithm, and feature extraction is performed on the M second candidate texts by the feature extraction module to obtain their semantic vectors. Assuming there is no repeated text between the first candidate texts and the second candidate texts, the M+N texts may be used as the candidate texts, and the M+N semantic vectors as the semantic vectors of the candidate texts.
In this implementation, the accuracy of the recalled candidate text can be improved by determining a plurality of candidate texts related to the question to be answered from the text library in a manner of combining semantic retrieval and BM25 retrieval.
Step 204: and constructing an adjacency matrix based on the association relation between the questions to be replied and the plurality of candidate texts, wherein the adjacency matrix is used for representing the relevance between the questions to be replied and the plurality of candidate texts and the relevance between the plurality of candidate texts.
In the embodiment of the application, after the plurality of candidate texts are determined, they need to be screened. Since screening only according to the association relationships between the candidate texts and the question to be replied may be one-sided, the association relationships among the candidate texts themselves can also be considered, and an adjacency matrix can be used to represent both kinds of association relationships.
Further, before constructing the adjacency matrix based on the association relationship between the question to be answered and the plurality of candidate texts, the method further comprises:
acquiring keywords of the questions to be replied and keywords of each candidate text;
if the corresponding keyword of the question to be replied exists in the first candidate text, determining that the association relationship between the first candidate text and the question to be replied is relevant, wherein the first candidate text is any candidate text in the plurality of candidate texts;
If the corresponding keyword of the second candidate text exists in the first candidate text, determining that the association relationship between the first candidate text and the second candidate text is relevant, wherein the second candidate text is any candidate text except the first candidate text in the plurality of candidate texts;
and determining that the association relationship between the question to be replied and itself is relevant and that the association relationship between each candidate text and itself is relevant; or, determining that the association relationship between the question to be replied and itself is irrelevant and that the association relationship between each candidate text and itself is irrelevant.
The keywords may be words of relatively great importance in the question to be replied, or words of relatively great importance in the candidate text. The number of keywords of the question to be replied may be one, two, or more, and likewise the number of keywords of a candidate text may be one, two, or more.
The corresponding keywords of a keyword may be the keyword itself, or similar words, synonyms, or replacement words of the keyword. For example, assuming that the keyword is "paper towel", the corresponding keywords may be "toilet paper", "roll paper", or "facial tissue". Assuming that the keyword is "natural number", the corresponding keyword may be "non-negative integer". Assuming that the keyword is "Li Bai", the corresponding keywords may be "the poet", "Taibai", or "Qinglian".
That is, before the adjacency matrix is constructed, it is necessary to determine the association relationships between the question to be replied and the plurality of candidate texts, and the association relationships among the plurality of candidate texts. Specifically, the keywords of the question to be replied and of each candidate text may be obtained. If the first candidate text contains a corresponding keyword of the question to be replied, the first candidate text may be considered similar to the central idea expressed by the question to be replied, and the association relationship between them may be determined to be relevant. If the first candidate text contains a corresponding keyword of a keyword of the second candidate text, the first candidate text may be considered similar to the central idea expressed by the second candidate text, and the association relationship between them may be determined to be relevant. In addition, the association relationship between the question to be replied and itself may be determined as either relevant or irrelevant, and likewise for each candidate text and itself.
In some embodiments, keywords may be extracted from questions to be answered and candidate text according to an entity extraction algorithm. For example, assuming that the question to be answered is "the smallest natural number is several", the keywords are "the smallest" and "natural number" can be extracted. Assuming that the candidate text is "natural number is an integer greater than or equal to 0", the keywords that can be extracted are "natural number", "greater than or equal to" and "0".
In some embodiments, if the question to be answered includes a keyword, it may be determined that the association relationship between the question to be answered and the first candidate text is relevant as long as the corresponding keyword of the keyword exists in the first candidate text; if the second candidate text includes a keyword, it may be determined that the association relationship between the second candidate text and the first candidate text is related as long as the first candidate text includes a keyword corresponding to the keyword.
As an example, if the question to be answered includes a plurality of keywords, it may be determined that the association relationship between the question to be answered and the first candidate text is relevant as long as a corresponding keyword of one of the keywords exists in the first candidate text; if the second candidate text includes a plurality of keywords, it may be determined that the association relationship between the second candidate text and the first candidate text is relevant as long as there is a corresponding keyword of one of the keywords in the first candidate text.
For example, assuming that the keyword of the question to be answered includes "minimum" and "natural number", and that the first candidate text is "natural number is a non-negative integer", including the keyword "natural number", it may be determined that the association relationship of the first candidate text and the question to be answered is relevant. Assuming that the keywords of the second candidate text are "0" and "natural number", the first candidate text is "natural number is a non-negative integer", and the keywords are "natural number", it may be determined that the association relationship of the first candidate text and the second candidate text is related.
As another example, if the question to be answered includes a plurality of keywords, the first candidate text needs to have a corresponding keyword of each keyword, so as to determine that the association relationship between the question to be answered and the first candidate text is relevant; if the second candidate text includes a plurality of keywords, the first candidate text needs to have a corresponding keyword of each keyword, so that it can be determined that the association relationship between the second candidate text and the first candidate text is relevant. Thus, the accuracy of determining the association relationship can be improved.
For example, assume that the keywords of the question to be replied include "smallest" and "natural number". If the first candidate text 1 is "a natural number is a non-negative integer", which contains only the keyword "natural number", it may be determined that the association relationship between the first candidate text 1 and the question to be replied is irrelevant. If the first candidate text 2 is "0 is the smallest non-negative integer", which contains the keyword "smallest" and also "non-negative integer", a corresponding keyword of "natural number", i.e., a corresponding keyword of every keyword of the question to be replied, it may be determined that the association relationship between the first candidate text 2 and the question to be replied is relevant. Assuming that the keywords of the second candidate text are "0" and "natural number", and the first candidate text is "natural numbers start from 0", which contains both the keyword "natural number" and the keyword "0", it may be determined that the association relationship between the first candidate text and the second candidate text is relevant.
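The two matching policies above (any corresponding keyword suffices, versus every keyword needing a corresponding keyword) can be sketched as one function. The function name, the `match_all` flag, and the synonym-table representation of "corresponding keywords" are illustrative assumptions.

```python
def related(keywords_a, text_b, match_all=False, synonyms=None):
    """Decide whether text_b is associated with the item whose keywords
    are keywords_a. A keyword matches if it, or any of its corresponding
    keywords (similar/replacement words), appears in text_b."""
    synonyms = synonyms or {}

    def hit(kw):
        return kw in text_b or any(s in text_b for s in synonyms.get(kw, []))

    hits = (hit(kw) for kw in keywords_a)
    # match_all=False: one matching keyword suffices (first policy);
    # match_all=True: every keyword must match (second, stricter policy).
    return all(hits) if match_all else any(hits)
```

With the example above, "a natural number is a non-negative integer" is relevant under the first policy but not the stricter one, while "0 is the smallest non-negative integer" is relevant under both once "non-negative integer" is registered as a corresponding keyword of "natural number".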
It should be noted that, in such a case that a corresponding keyword of a keyword of the second candidate text exists in the first candidate text, the first candidate text may be an interpretation of the keyword of the second candidate text. For example, assuming that the second candidate text includes a keyword B, which may be in the form of a hyperlink in the second candidate text, by clicking on the hyperlink, a jump may be made to the first candidate text, and then the first candidate text may be considered to have a corresponding keyword to the keyword of the second candidate text.
In the embodiment of the application, before the adjacency matrix is constructed, the association relation between the questions to be replied and the candidate texts can be determined according to the keywords, the association relation among a plurality of candidate texts is determined, the adjacency matrix is constructed based on the association relation, and the association relation among the candidate texts is considered on the basis of considering the association relation between the questions to be replied and the candidate texts, so that the accuracy of text screening can be further improved.
In one possible implementation manner, the specific implementation of constructing the adjacency matrix based on the association relationship between the questions to be answered and the multiple candidate texts may include: and taking the questions to be replied and the plurality of candidate texts as nodes, taking the nodes as rows and columns, wherein the arrangement order of the row nodes and the column nodes is the same, and determining the elements of each position based on the association relation of the row nodes and the column nodes corresponding to each position to obtain the adjacency matrix.
That is, in the constructed adjacency matrix, the element of each position is determined according to the association relation of the row node and the column node of the position, the row node and the column node are the questions to be replied and the plurality of candidate texts, and the arrangement order of the row node and the arrangement order of the column node are the same.
As an example, for convenience of description, the question to be replied and the plurality of candidate texts may be referred to as nodes. The plurality of nodes may be numbered randomly and arranged by number as both the rows and the columns; the element in the ith row and jth column of the adjacency matrix is then determined according to the association relationship between the ith row node and the jth column node, where i and j are integers greater than 0.
For example, assuming that the number of questions to be answered is 1, the number of candidate text 1 is 2, and the number of candidate text 2 is 3, the row nodes of the adjacency matrix are arranged in the order of numbers 1 to 3, and the column nodes are also arranged in the order of numbers 1 to 3.
In the embodiment of the application, the association relation between the questions to be replied and the candidate text can be expressed in the form of the adjacency matrix, so that the equipment processing is convenient.
In one embodiment, determining the specific implementation of the element of each location based on the association relationship of the row and the column corresponding to each location may include:
If the association relationship between the row node and the column node corresponding to the target position is relevant, determining that the element of the target position is 1, wherein the target position is any position in the adjacency matrix; if the association relationship between the row node and the column node corresponding to the target position is irrelevant, determining that the element of the target position is 0.
As an example, to facilitate device identification, the correlation may be represented by a value of 1 and the uncorrelation by a value of 0. If the association between the ith row node and the jth column node is relevant, the element of the jth column of the ith row is 1, and if the association between the ith row node and the jth column node is irrelevant, the element of the jth column of the ith row is 0.
Illustratively, assume three candidate texts are included, the number of the question to be replied is 1, the number of candidate text 1 is 2, the number of candidate text 2 is 3, and the number of candidate text 3 is 4. The association relationship between the question to be replied and candidate text 1 is relevant, so the elements in row 1, column 2 and row 2, column 1 are both 1; the association relationship between the question to be replied and candidate text 2 is irrelevant, so the elements in row 1, column 3 and row 3, column 1 are both 0; the association relationship between the question to be replied and candidate text 3 is relevant, so the elements in row 1, column 4 and row 4, column 1 are both 1; the association relationship between candidate text 1 and candidate text 2 is relevant, so the elements in row 2, column 3 and row 3, column 2 are both 1; the association relationship between candidate text 1 and candidate text 3 is relevant, so the elements in row 2, column 4 and row 4, column 2 are both 1; the association relationship between candidate text 2 and candidate text 3 is irrelevant, so the elements in row 3, column 4 and row 4, column 3 are both 0; and the question to be replied and each candidate text are each relevant to themselves, so the elements in row 1, column 1, row 2, column 2, row 3, column 3 and row 4, column 4 are all 1. The adjacency matrix obtained in this way is:

1 1 0 1
1 1 1 1
0 1 1 0
1 1 0 1
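The matrix construction just described can be sketched as follows; the function name and the callable used to supply pairwise association relationships are illustrative assumptions.

```python
def build_adjacency(nodes, is_related, self_related=True):
    """Element (i, j) is 1 when the row node i and column node j have a
    relevant association relationship, 0 otherwise; rows and columns use
    the same node ordering."""
    n = len(nodes)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        # Diagonal: each node's relation to itself, relevant or not.
        matrix[i][i] = 1 if self_related else 0
        for j in range(i + 1, n):
            if is_related(nodes[i], nodes[j]):
                matrix[i][j] = matrix[j][i] = 1  # relation is symmetric
    return matrix
```

Feeding in the question and three candidate texts with the association relationships of the example reproduces the 4x4 matrix described above.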
In another possible implementation manner, the specific implementation of constructing the adjacency matrix based on the association relationship between the questions to be answered and the multiple candidate texts may include: taking the questions to be replied and the plurality of candidate texts as nodes, and connecting different nodes with related association relations to obtain a graph network; the adjacency matrix is constructed based on the graph network.
In this implementation manner, for different nodes, if the association relationship is related, it may be considered that there are edges between the different nodes, the to-be-replied question and the multiple candidate texts are taken as nodes, and the association relationship is taken as an edge, then the graph network may be constructed, and then the adjacency matrix may be constructed based on the graph network.
Illustratively, assuming that the association relationship between the question to be answered and the candidate text 1 is relevant, an edge exists between the node of the question to be answered and the node of the candidate text 1; the association relation between the questions to be replied and the candidate text 2 is irrelevant, and no edge exists between the nodes of the questions to be replied and the candidate text 2; the association relation between the questions to be replied and the candidate text 3 is relevant, and edges exist between the nodes of the questions to be replied and the candidate text 3; the association relation between the candidate text 1 and the candidate text 2 is relevant, and an edge exists between the candidate text 1 node and the candidate text 2 node; the association relation between the candidate text 1 and the candidate text 3 is relevant, and an edge exists between the candidate text 1 node and the candidate text 3 node; the association relation between the candidate text 2 and the candidate text 3 is irrelevant, and no edge exists between the candidate text 2 node and the candidate text 3 node, so that the graph network shown in fig. 5 can be obtained.
In one embodiment, constructing the adjacency matrix based on the graph network may include: and taking the nodes in the graph network as rows and columns, wherein the arrangement sequence of the row nodes is the same as the arrangement sequence of the column nodes, and determining the elements of each position based on whether edges exist in the row nodes and the column nodes corresponding to each position, so as to obtain the adjacency matrix.
That is, in the constructed adjacency matrix, the element of each position is determined according to the association relation of the row node and the column node of the position, the row node and the column node are nodes in the graph network, and the arrangement order of the row node and the arrangement order of the column node are the same.
As an example, the nodes in the graph network may be numbered randomly, the numbered nodes may be arranged both as rows and as columns, and the element in the ith row and jth column of the adjacency matrix is determined according to the association relationship between the ith row node and the jth column node. Where i and j are integers greater than 0.
For example, assuming that the number of questions to be answered is 1, the number of candidate text 1 is 2, and the number of candidate text 2 is 3, the row nodes of the adjacency matrix are arranged in the order of numbers 1 to 3, and the column nodes are also arranged in the order of numbers 1 to 3.
In the embodiment of the application, the association relationships between the question to be replied and the candidate texts can be expressed in the form of an adjacency matrix, which is convenient for device processing.
In one embodiment, determining the element of each position based on whether an edge exists between the row node and the column node corresponding to each position may include: if the row node and the column node corresponding to the target position are not the same node and an edge exists between them in the graph network, determining that the element of the target position is 1, wherein the target position is any position in the adjacency matrix; if the row node and the column node corresponding to the target position are not the same node and no edge exists between them in the graph network, determining that the element of the target position is 0; if the row node and the column node corresponding to the target position are the same node, determining that the element of the target position is 1 or 0.
That is, for ease of device identification, correlation may be represented by the value 1 and non-correlation by the value 0. If the row node and the column node corresponding to the target position are not the same node and are connected by an edge in the graph network, the row node and the column node can be considered relevant, and the element of the target position can be determined to be 1; if the row node and the column node corresponding to the target position are not the same node and there is no edge connecting them in the graph network, the row node and the column node can be considered irrelevant, and the element of the target position can be determined to be 0. If the row node and the column node corresponding to the target position are the same node, no edge exists between them in the graph network, but the element of the target position may nonetheless be determined to be either 1 or 0.
As an example, in the case where i and j are not the same, if there is an edge between the i-th row node and the j-th column node in the graph network, the element of the i-th row and the j-th column is 1; if there is no edge between the ith row node and the jth column node in the graph network, then the element of the ith row and jth column is 0. In the case where i and j are the same, the element of the ith row and jth column may be determined to be 1 or 0.
Illustratively, assuming that three candidate texts are included, the number of the question to be replied is 1, the number of candidate text 1 is 2, the number of candidate text 2 is 3, and the number of candidate text 3 is 4. An edge exists between the question node and the candidate text 1 node, so the elements of row 1, column 2 and row 2, column 1 are both 1; no edge exists between the question node and the candidate text 2 node, so the elements of row 1, column 3 and row 3, column 1 are both 0; an edge exists between the question node and the candidate text 3 node, so the elements of row 1, column 4 and row 4, column 1 are both 1; an edge exists between the candidate text 1 node and the candidate text 2 node, so the elements of row 2, column 3 and row 3, column 2 are both 1; an edge exists between the candidate text 1 node and the candidate text 3 node, so the elements of row 2, column 4 and row 4, column 2 are both 1; no edge exists between the candidate text 2 node and the candidate text 3 node, so the elements of row 3, column 4 and row 4, column 3 are both 0. In addition, the elements at the diagonal positions of the adjacency matrix are determined to be 1. The adjacency matrix can thus be obtained by the above method.
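The construction in this example can be sketched as follows (a minimal sketch: the node numbering and edge list are taken from the example above, numpy is used for convenience, and the diagonal is set to 1 as in the example):

```python
import numpy as np

# Nodes in numbered order: 1 = question to be replied, 2 = candidate text 1,
# 3 = candidate text 2, 4 = candidate text 3 (0-based indices below).
nodes = ["question", "candidate1", "candidate2", "candidate3"]
# Edges from the example: question-candidate1, question-candidate3,
# candidate1-candidate2, candidate1-candidate3.
edges = [(0, 1), (0, 3), (1, 2), (1, 3)]

adjacency = np.zeros((len(nodes), len(nodes)), dtype=int)
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1  # edges are undirected, so the matrix is symmetric
np.fill_diagonal(adjacency, 1)             # diagonal elements determined to be 1

print(adjacency)
```

Row/column 1 corresponds to the question node, so, for instance, the 0 in row 1, column 3 records that no edge exists between the question and candidate text 2.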
Illustratively, referring to FIG. 3, an adjacency matrix is constructed based on the question to be answered and the candidate text.
According to the embodiment of the application, the association relationships among the plurality of candidate texts, and between the question to be replied and each candidate text, can be determined from the keywords of the question to be replied and the keywords of each candidate text, and an adjacency matrix can be constructed from these association relationships; that is, the association relationships are expressed in the form of an adjacency matrix. The relationships among the candidate texts are considered in addition to the question to be replied, so the extracted association relationships are richer. With the adjacency matrix as an input, the text screening network can learn these richer association relationships, which can improve the accuracy of text screening.
Step 206: and inputting the semantic vector of the to-be-replied question, the semantic vectors of the candidate texts and the adjacency matrix into a text screening network to determine a target text.
As an example, the target text may be the screened text with higher relevance to the question to be replied, and no irrelevant text is present in the target text, where irrelevant text is text unrelated to the question.
As one example, the text-screening network may be a graph neural network. Illustratively, the text filtering network may be a graph convolutional neural network, a graph self-coding neural network, and the like, which is not limited by the embodiments of the present application.
In one embodiment, inputting the semantic vectors of the questions to be answered, the semantic vectors of the candidate texts and the adjacency matrix into a text filtering network, and determining the specific implementation of the target text may include: inputting the adjacency matrix, the semantic vector of the to-be-replied question and the semantic vectors of the plurality of candidate texts into a text screening network to obtain a relevance score of each candidate text relative to the to-be-replied question; and determining the candidate text with the relevance score being greater than a first threshold as the target text.
The first threshold may be set by a user according to an actual requirement, or may be set by default by a device, which is not limited in the embodiment of the present application. For example, the first threshold may be 0.8.
Wherein the relevance score is used to represent the relevance of the candidate text to the question to be answered. The higher the relevance score, the higher the relevance of the candidate text to the question to be answered, and the lower the relevance score, the lower the relevance of the candidate text to the question to be answered.
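As a minimal illustration of the thresholding step described above (the text names, scores, and first threshold of 0.8 are hypothetical), selecting the target texts reduces to a comparison against the first threshold:

```python
# Hypothetical relevance scores output by the text screening network.
relevance_scores = {"candidate1": 0.92, "candidate2": 0.31, "candidate3": 0.85}
first_threshold = 0.8

# Candidate texts whose relevance score exceeds the first threshold become target texts.
target_texts = [text for text, score in relevance_scores.items() if score > first_threshold]
print(target_texts)  # ['candidate1', 'candidate3']
```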
As an example, the adjacency matrix, the semantic vector of the question to be replied, and the semantic vectors of the plurality of candidate texts may be input into the text screening network. The text screening network learns the association relationships between the question to be replied and the plurality of candidate texts, updates the semantic vectors of the question and the candidate texts according to these association relationships, and converts the updated semantic vectors into a relevance score for each candidate text relative to the question to be replied. If the relevance score of a candidate text is greater than the first threshold, the relevance of that candidate text to the question to be replied can be considered sufficiently high, and the candidate text can be determined as a target text.
According to the embodiment of the application, the relevance score of each candidate text relative to the question to be replied can be determined through the text screening network, and the target text is determined from the plurality of candidate texts according to the relevance scores, so that the target texts with higher relevance to the question to be replied can be screened out of the plurality of candidate texts, and large-scale candidate texts can be rapidly reordered and screened.
In one embodiment, inputting the adjacency matrix, the semantic vector of the question to be answered and the semantic vectors of the plurality of candidate texts into a text filtering network to obtain a relevance score of each candidate text relative to the question to be answered may include:
Splicing the semantic vector of the question to be replied and the semantic vectors of the candidate texts to obtain spliced semantic vectors;
inputting the spliced semantic vector and the adjacency matrix into a hidden layer of a text screening network to obtain a hidden layer feature vector group, wherein the hidden layer feature vector group comprises hidden layer feature vectors obtained by combining the semantic vectors of the candidate texts with the questions to be replied and hidden layer feature vectors obtained by combining the semantic vectors of other candidate texts and the questions to be replied with each candidate text;
and inputting the hidden layer feature vector group into a full-connection layer to obtain the relevance score of each candidate text relative to the questions to be replied.
In some embodiments, the semantic vector of the question to be replied and the semantic vectors of the plurality of candidate texts can be spliced to obtain a spliced semantic vector, and the spliced semantic vector and the adjacency matrix are input into the hidden layer of the text screening network. Multiple convolution operations can be performed in the hidden layer, combining the semantic vector of the question to be replied and the semantic vectors of the candidate texts in the spliced semantic vector to obtain a hidden layer feature vector group. Inputting the hidden layer feature vector group into the fully connected layer then yields the relevance score of each candidate text relative to the question to be replied.
As an example, assuming that the number of candidate texts is 9, the semantic vector of the question to be replied is a 300-dimensional vector, and the semantic vector of each candidate text is also a 300-dimensional vector, the spliced semantic vector obtained by splicing the semantic vector of the question to be replied and the semantic vectors of the plurality of candidate texts may be a 10×300 matrix, in which each row represents one semantic vector. After the spliced semantic vector is input into the hidden layer, it can be multiplied by its transpose, i.e., the 10×300 matrix is multiplied by the 300×10 matrix, yielding a 10×10 first matrix. The element of the ith row and jth column of the first matrix is the value obtained by combining the semantic vector of the ith node with the semantic vector of the jth node.
As an example, the adjacency matrix is also a 10×10 matrix. The first matrix is then combined with the adjacency matrix by multiplying the element of the ith row and jth column of the first matrix by the element of the ith row and jth column of the adjacency matrix; that is, the elements at the same position in the first matrix and the adjacency matrix are multiplied one by one to obtain a 10×10 second matrix, so that the elements at positions corresponding to unrelated row and column nodes are 0. The second matrix is then normalized by row, so that the elements of each row are on the same scale, giving the weight corresponding to each node.
As an example, multiplying the second matrix by the spliced semantic vector, that is, multiplying the 10×10 matrix by the 10×300 matrix, may yield a 10×300 third matrix, in which the ith row represents the hidden layer feature vector of the ith node after it is combined with the semantic vectors of the other nodes, and the jth element of the ith row represents the value of that hidden layer feature vector in the jth dimension.
As an example, the third matrix may also be referred to as the hidden layer feature vector group, which is input into the fully connected layer. A preset conversion matrix, which may be a 300×1 matrix, exists in the fully connected layer; multiplying the third matrix by the conversion matrix yields a 10×1 target matrix, in which the element of each row represents the relevance score of that row's node. Since the row nodes are the question to be replied and the plurality of candidate texts, the relevance score of each candidate text relative to the question to be replied can be obtained therefrom.
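Putting the steps of this example together, the hidden-layer and fully connected computations can be sketched as follows (a sketch only: the spliced matrix, adjacency matrix, and conversion matrix are random stand-ins for trained values, and the absolute-value row normalization and the sigmoid used to map scores into (0, 1) are assumed choices not specified by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Spliced semantic matrix: row 0 is the question to be replied, rows 1-9 are
# the nine candidate texts, each represented by a 300-dimensional vector.
spliced = rng.normal(size=(10, 300))
adjacency = np.ones((10, 10))              # hypothetical adjacency matrix (all nodes related)

first = spliced @ spliced.T                # 10x10 first matrix: combines pairs of node vectors
second = first * adjacency                 # zero out positions of unrelated node pairs
weights = second / np.abs(second).sum(axis=1, keepdims=True)  # row-wise normalization
hidden = weights @ spliced                 # 10x300 hidden layer feature vector group

conversion = rng.normal(size=(300, 1))     # hypothetical preset conversion matrix
scores = 1.0 / (1.0 + np.exp(-(hidden @ conversion)))  # 10x1 target matrix, scores in (0, 1)

candidate_scores = scores[1:, 0]           # rows 1-9 are the candidate-text scores
print(candidate_scores.shape)              # (9,)
```

With untrained random weights the scores are meaningless; the sketch only shows how the matrix shapes compose from 10×300 input to a 10×1 score column.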
In some embodiments, after determining the relevance score of each candidate text with respect to the question to be answered, the labels of the candidate texts having relevance scores greater than a first threshold may also be determined to be relevant, i.e., the candidate texts are determined to be relevant to the question to be answered.
In the embodiment of the application, the candidate texts are screened through the text screening network to obtain the target texts. Both the association relationships between the question to be replied and the candidate texts and the association relationships among the candidate texts are considered, so the extracted association relationships are richer. By representing the association relationships with the adjacency matrix and combining it with the semantic vectors, the text screening network can learn richer association relationships, which can improve the accuracy of text screening.
For example, referring to fig. 3, a semantic vector of a to-be-answered question and semantic vectors of 1000 candidate texts are spliced to obtain a spliced semantic vector, the spliced semantic vector and an adjacency matrix are input into a text screening network, and a relevance score of each candidate text relative to the to-be-answered question can be output, so that 10 target texts are determined.
In one embodiment, after determining the candidate texts with relevance scores greater than the first threshold as the target texts, the method further includes: if the number of target texts is multiple, sorting the target texts in descending order of relevance score and outputting the sorted target texts in that order.
In a specific implementation, when the number of the target texts is multiple, the target texts can be ranked according to the order of the relevance scores from large to small, and the ranked target texts are sequentially output for viewing by a user. In the case where the number of target text is one, the target text may be output for viewing by the user.
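A minimal sketch of this ordering step (the text names and relevance scores are hypothetical):

```python
# Target texts paired with their hypothetical relevance scores.
target_texts = [("text A", 0.88), ("text B", 0.95), ("text C", 0.91)]

# Sort in descending order of relevance score and output in that order.
ranked = sorted(target_texts, key=lambda item: item[1], reverse=True)
print([name for name, _ in ranked])  # ['text B', 'text C', 'text A']
```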
In the embodiment of the application, unlike the prior art, in which the question to be replied and the candidate texts must first be processed by a semantic retrieval model, then have their features re-extracted by a reordering model, and only then have the plurality of candidate texts reordered according to the re-extracted semantic vectors, the semantic vector of the question to be replied and the semantic vectors of the candidate texts obtained by the semantic retrieval model can be input directly into the text screening network. This eliminates the prior-art step of re-acquiring the semantic vectors of the question and the texts, so the candidate texts can be rapidly reordered to obtain the target texts. Moreover, the recall results of the semantic retrieval model can be constrained through the text screening network, preventing irrelevant texts from being recalled.
Further, text screening can be achieved through the method, target text related to the questions to be answered is obtained, and then target answers can be obtained based on the questions to be answered and the target text. As an example, the questions to be answered and the target text ordered by relevance scores may be input into a reading understanding model, and then the target answers to the questions to be answered may be output.
Further, the training method of the text screening network is as follows:
acquiring a sample question, a plurality of sample texts and a sample mark of each sample text, wherein the sample mark of each sample text is used for representing the relevance of the sample text and the sample question;
determining a semantic vector of the sample question and a semantic vector of each sample text, and constructing an adjacency matrix based on the sample question and a plurality of sample texts;
inputting the semantic vector of the sample question, the semantic vector of each sample text and the adjacency matrix into the text screening network, and processing through a hidden layer of the text screening network to obtain a hidden layer feature vector group, wherein the hidden layer feature vector group comprises the hidden layer feature vector obtained by combining the sample question with the semantic vectors of the plurality of sample texts, and the hidden layer feature vector obtained by combining each sample text with the semantic vectors of the other sample texts and the sample question;
Inputting the hidden layer feature vector group into a full-connection layer to obtain a relevance score of each sample text relative to the sample question;
determining a predictive marker for each sample text based on a relevance score for each sample text relative to the sample question;
training the text screening network based on the loss value between the predictive marker and the sample marker of each sample text, until a training stop condition is reached.
Wherein the sample markers may include correlated and uncorrelated.
In some embodiments, a sample question and a plurality of sample texts may be obtained from a sample library in which each sample text corresponds to a sample marker, so the sample marker of each sample text can be obtained at the same time.
In a specific implementation, a sample question, a plurality of sample texts, and the sample marker of each sample text can first be acquired. Feature extraction is performed on the sample question and the sample texts to determine the semantic vector of the sample question and the semantic vector of each sample text, and an adjacency matrix is constructed according to the association relationships between the sample texts and the sample question. The semantic vector of the sample question, the semantic vector of each sample text, and the adjacency matrix are input into the hidden layer of the text screening network, where multiple convolution operations can be performed and the semantic vector of the sample question and the semantic vectors of the sample texts in the spliced semantic vector are combined to obtain a hidden layer feature vector group. The hidden layer feature vector group is input into the fully connected layer to obtain the relevance score of each sample text relative to the sample question. The predictive marker of a sample text whose relevance score is greater than the first threshold is determined to be relevant, and the predictive marker of a sample text whose relevance score is less than or equal to the first threshold is determined to be irrelevant, so that the predictive marker of each sample text can be determined. A loss value is then computed from the predictive marker and the sample marker of each sample text, and the text screening network is trained based on the loss value until the training stop condition is reached.
It should be noted that the specific implementation of determining the semantic vector of the sample question and the semantic vector of each sample text is the same as that of determining the semantic vector of the question to be replied and the semantic vector of each candidate text in step 202; the implementation process may refer to the related description of step 202 and is not repeated here. The specific implementation of constructing the adjacency matrix based on the sample question and the plurality of sample texts is the same as that of constructing the adjacency matrix based on the association relationships between the question to be replied and the plurality of candidate texts; the implementation process may refer to the related description of step 204. The specific implementation of inputting the semantic vector of the sample question, the semantic vector of each sample text and the adjacency matrix into the text screening network to obtain the predictive marker of each sample text is the same as that of determining the target text in step 206; the implementation process may refer to the related description of step 206 and is not repeated here.
In one possible implementation manner, training the text screening network based on the loss value between the predictive marker and the sample marker of each sample text until reaching the training stop condition may include: if the loss value is less than or equal to a third threshold, stopping training the text screening network; and if the loss value is greater than the third threshold, continuing to train the text screening network.
It should be noted that, the third threshold may be set by the user according to the actual requirement, or may be set by default by the computing device, which is not limited in the embodiment of the present application.
That is, if the loss value is greater than the third threshold, the difference between the predictive markers and the sample markers is relatively large and the performance of the text screening network is not good enough, so training of the text screening network needs to continue. If the loss value is less than or equal to the third threshold, the difference between the predictive markers and the sample markers is small and the performance of the text screening network is relatively good, so the training can be considered finished and can be stopped.
As an example, a loss value may be determined based on the predictive marker and the sample marker of each sample text; for multiple sample texts, multiple loss values are then obtained, which may be weighted and summed to obtain the loss value corresponding to this round of training, and the parameters of the text screening network are adjusted based on this loss value to achieve training of the text screening network.
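As a sketch of one such training-stop check (binary cross-entropy is an assumed choice of loss not named by the text, and the scores, sample markers, and third threshold are hypothetical):

```python
import numpy as np

def binary_cross_entropy(pred, label):
    # Loss between a predicted relevance score in (0, 1) and a 0/1 sample marker.
    eps = 1e-9
    return -(label * np.log(pred + eps) + (1 - label) * np.log(1 - pred + eps))

# Hypothetical relevance scores for four sample texts and their sample markers
# (1 = relevant, 0 = irrelevant).
predicted_scores = np.array([0.9, 0.2, 0.7, 0.4])
sample_markers = np.array([1, 0, 1, 0])

losses = binary_cross_entropy(predicted_scores, sample_markers)
loss_value = losses.mean()        # equal-weight combination over the sample texts

third_threshold = 0.5
continue_training = loss_value > third_threshold
print(continue_training)          # False: the loss is already at or below the threshold
```

In an actual training loop this check would follow each backward pass, with the network parameters adjusted whenever `continue_training` is true.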
According to the embodiment of the specification, the specific training condition of the text screening network is judged according to the loss value, and under the condition that training is unqualified, the parameters of the text screening network are reversely adjusted according to the loss value so as to improve the text screening capacity of the text screening network, and the training speed is high and the training effect is good.
In another possible implementation manner, training the text screening network based on the loss value between the predictive marker and the sample marker of each sample text until reaching the training stop condition may include: training the text screening network once based on the predictive marker of each sample text and the loss value against the sample markers, and incrementing the recorded number of training iterations by one; if the number of training iterations is less than or equal to a fourth threshold, continuing to train the text screening network; and if the number of training iterations is greater than the fourth threshold, stopping training the text screening network.
It should be noted that, the fourth threshold may be set by the user according to the actual requirement, or may be set by default by the computing device, which is not limited in the embodiment of the present application.
That is, each time the text screening network is trained based on the predictive marker and the sample marker of each sample text, one iteration of training can be considered to have been performed. The model is iteratively trained based on the predictive markers and sample markers, and the number of iterations is recorded. If the number of iterations is less than or equal to the fourth threshold, the model has not yet been trained enough and training continues; if the number of iterations is greater than the fourth threshold, a sufficient number of training iterations have been performed, the performance of the model is basically stable, and training can be stopped.
As an example, a loss value may be determined based on the predictive marker and the sample marker of each sample text; for multiple sample texts, multiple loss values are then obtained, which may be weighted and summed to obtain the loss value corresponding to this round of training, and the parameters of the text screening network are adjusted based on this loss value to achieve training of the text screening network.
In the embodiment of the specification, whether the text screening network training is completed is judged according to the iteration times, so that the unnecessary times of the iteration training can be reduced, and the efficiency of the text screening network training is improved.
In the embodiment of the application, the semantic vector of the to-be-answered question, a plurality of candidate texts and the semantic vector of the plurality of candidate texts are determined based on the acquired to-be-answered question, wherein each candidate text is a text related to the to-be-answered question semantically in a text library; constructing an adjacency matrix based on the association relation between the questions to be replied and the plurality of candidate texts, wherein the adjacency matrix is used for representing the correlation between the questions to be replied and the plurality of candidate texts and the correlation between the plurality of candidate texts; and inputting the semantic vector of the to-be-replied question, the semantic vectors of the candidate texts and the adjacency matrix into a text screening network to determine a target text. After the candidate texts are determined, the candidate texts can be further screened through the text screening network, the candidate texts irrelevant to the questions to be answered are deleted, the target texts with higher relevance to the questions to be answered are obtained, recall of irrelevant texts is reduced, recall rate of retrieval is improved, and the accuracy of answers determined based on the target texts is higher due to higher relevance of the target texts to the questions to be answered, namely, the performance of a question-answering system is improved.
Fig. 6 shows a flowchart of another text processing method according to an embodiment of the present application, which is described taking the question to be replied "what is the smallest natural number" as an example, and includes steps 602 to 628.
Step 602: and obtaining the questions to be answered.
In the present embodiment, the question to be replied is taken to be "what is the smallest natural number" as an example.
Step 604: and extracting the characteristics of the questions to be answered, and determining the semantic vectors of the questions to be answered.
Continuing the above example, word segmentation is performed on the question to be replied, yielding a plurality of first word units [smallest, natural number, is, what]. Word embedding processing can then be performed on each first word unit of the question by means of word2vec coding, mapping each first word unit into a low-dimensional vector space to obtain the word vector of each first word unit. The word vector of each first word unit is input into the coding layer for coding, yielding a vector representation of each first word unit combined with the word vectors of the other first word units in the question, i.e., the first feature vector of each first word unit. Splicing the first feature vectors of the plurality of first word units then yields the semantic vector of the question to be replied.
Step 606: semantic vectors of a plurality of texts in a text library are obtained.
For example, assuming that the text is "0 is the smallest natural number", word segmentation is performed on the text, yielding a plurality of second word units [0, is, smallest, natural number]. Word embedding processing can then be performed on each second word unit of the text by means of word2vec coding, mapping each second word unit into a low-dimensional vector space to obtain the word vector of each second word unit. The word vector of each second word unit is input into the coding layer for coding, yielding a vector representation of each second word unit combined with the word vectors of the other second word units in the text, i.e., the second feature vector of each second word unit. Splicing the second feature vectors of the plurality of second word units then yields the semantic vector of the text.
Step 608: and determining the similarity score of each text relative to the questions to be answered based on the semantic vectors of the questions to be answered and the semantic vectors of the texts.
Step 610: and taking a plurality of texts with similarity scores larger than a second threshold value as the plurality of candidate texts, and acquiring semantic vectors of the plurality of candidate texts.
It should be noted that steps 602 to 610 above are a detailed description of step 202; the implementation process is the same as that of step 202, and the specific implementation may refer to the related description of step 202, which is not repeated here. In addition, in this embodiment the process of determining candidate texts from the text library is described taking semantic retrieval as an example; in practical implementations, candidate texts may also be determined by BM25 or another retrieval algorithm, which is not limited by the present application.
Step 612: and acquiring the keywords of the questions to be replied and the keywords of each candidate text.
Continuing the above example, assuming that the question to be answered is "the smallest natural number is several", the keywords "smallest" and "natural number" can be extracted. Assuming that the candidate text is "0 is the smallest natural number", the keywords that can be extracted are "natural number", "0" and "smallest".
Step 614: if the corresponding keyword of the question to be replied exists in the first candidate text, determining that the association relation between the first candidate text and the question to be replied is relevant, wherein the first candidate text is any candidate text in the plurality of candidate texts.
For example, assuming that the keyword of the question to be answered includes "minimum" and "natural number", and that the first candidate text is "natural number is a non-negative integer", including the keyword "natural number", it may be determined that the association relationship of the first candidate text and the question to be answered is relevant.
Step 616: and if the corresponding keyword of the second candidate text exists in the first candidate text, determining that the association relationship between the first candidate text and the second candidate text is relevant, wherein the second candidate text is any candidate text except the first candidate text in the plurality of candidate texts.
For example, assuming that the keywords of the second candidate text are "0" and "natural number", and the first candidate text is "natural number is a non-negative integer", which contains the keyword "natural number", it may be determined that the association relationship between the first candidate text and the second candidate text is relevant.
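The keyword-overlap rule of steps 612-616 can be sketched as a simple substring check (a deliberate simplification; the embodiment does not fix the keyword-matching method):

```python
def related(keywords_a, text_b):
    """A candidate text and the question (or two candidate texts) are
    marked 'relevant' when any keyword of the one occurs in the other."""
    return any(kw in text_b for kw in keywords_a)

question_keywords = ["smallest", "natural number"]
first_candidate = "natural number is a non-negative integer"
print(related(question_keywords, first_candidate))  # True
```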
Step 618: determining that the association relationship between the question to be answered and itself is relevant, and determining that the association relationship between each candidate text and itself is relevant.
Step 620: and taking the questions to be replied and the plurality of candidate texts as nodes, taking the nodes as rows and columns, wherein the arrangement order of the row nodes and the column nodes is the same, and determining the elements of each position based on the association relation of the row nodes and the column nodes corresponding to each position to obtain the adjacency matrix.
For example, assume three candidate texts are included and the nodes are numbered as follows: the question to be answered is node 1, candidate text 1 is node 2, candidate text 2 is node 3, and candidate text 3 is node 4. The association relationship between the question to be answered and candidate text 1 is relevant, so the elements at row 1 column 2 and row 2 column 1 are both 1; the association relationship between the question to be answered and candidate text 2 is irrelevant, so the elements at row 1 column 3 and row 3 column 1 are both 0; the association relationship between the question to be answered and candidate text 3 is relevant, so the elements at row 1 column 4 and row 4 column 1 are both 1; the association relationship between candidate text 1 and candidate text 2 is relevant, so the elements at row 2 column 3 and row 3 column 2 are both 1; the association relationship between candidate text 1 and candidate text 3 is relevant, so the elements at row 2 column 4 and row 4 column 2 are both 1; the association relationship between candidate text 2 and candidate text 3 is irrelevant, so the elements at row 3 column 4 and row 4 column 3 are both 0. Further, the association relationship between the question to be answered and itself is relevant, and each candidate text is relevant to itself, so the elements at row 1 column 1, row 2 column 2, row 3 column 3 and row 4 column 4 are all 1. The adjacency matrix is thereby obtained.
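The 4x4 adjacency matrix described in this example can be written out explicitly; the following sketch (NumPy, with illustrative node labels) constructs it from the pairwise association relations:

```python
import numpy as np

nodes = ["question", "candidate 1", "candidate 2", "candidate 3"]
# Pairs judged relevant in the example; self-relations are all relevant.
relevant = {("question", "candidate 1"),
            ("question", "candidate 3"),
            ("candidate 1", "candidate 2"),
            ("candidate 1", "candidate 3")}

n = len(nodes)
A = np.zeros((n, n), dtype=int)
for i, a in enumerate(nodes):
    for j, b in enumerate(nodes):
        if i == j or (a, b) in relevant or (b, a) in relevant:
            A[i, j] = 1  # relevant -> 1; irrelevant positions stay 0
print(A)
```

Running this yields the symmetric matrix [[1,1,0,1],[1,1,1,1],[0,1,1,0],[1,1,0,1]], matching the row/column elements enumerated in the example.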
It should be noted that steps 612-620 are a detailed elaboration of step 204: the implementation process is the same as that of step 204, and for the specific implementation reference may be made to the related description of step 204, which is not repeated here.
Step 622: and splicing the semantic vector of the question to be replied and the semantic vectors of the candidate texts to obtain a spliced semantic vector.
Step 624: inputting the spliced semantic vector and the adjacency matrix into a hidden layer of a text screening network to obtain a hidden layer feature vector group, wherein the hidden layer feature vector group comprises hidden layer feature vectors obtained by combining the semantic vectors of the candidate texts with the questions to be replied and hidden layer feature vectors obtained by combining the semantic vectors of other candidate texts and the questions to be replied with each candidate text.
Step 626: and inputting the hidden layer feature vector group into a full-connection layer to obtain the relevance score of each candidate text relative to the questions to be replied.
Step 628: and determining the candidate text with the relevance score being greater than a first threshold as the target text.
For example, assuming that the question to be answered is "the smallest natural number is several", the candidate text includes candidate text 1 "the natural number is a non-negative integer", candidate text 2"0 is the smallest natural number", candidate text 3 "the natural number is an integer greater than or equal to 0", assuming that the relevance score of candidate text 1 with respect to the question to be answered is 0.6, the relevance score of candidate text 2 with respect to the question to be answered is 0.9, the relevance score of candidate text 3 with respect to the question to be answered is 0.85, and the first threshold is 0.8, it is possible to determine candidate text 2 and candidate text 3 as target texts.
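The thresholding of step 628 (and the relevance-score ordering described later) reduces to a simple filter-and-sort; the scores and the first threshold 0.8 are taken from the example above:

```python
# Relevance scores of each candidate text relative to the question.
scores = {"candidate 1": 0.6, "candidate 2": 0.9, "candidate 3": 0.85}
FIRST_THRESHOLD = 0.8

# Keep candidates above the first threshold, ordered from high to low score.
target_texts = sorted((t for t, s in scores.items() if s > FIRST_THRESHOLD),
                      key=lambda t: scores[t], reverse=True)
print(target_texts)  # ['candidate 2', 'candidate 3']
```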
It should be noted that steps 622-628 are a detailed elaboration of step 206: the implementation process is the same as that of step 206, and for the specific implementation reference may be made to the related description of step 206, which is not repeated here.
According to the text processing method provided by this embodiment of the application, after the plurality of candidate texts are determined, they can be further screened through the text screening network: candidate texts irrelevant to the question to be answered are deleted, and target texts with higher relevance to the question to be answered are obtained. Recall of irrelevant texts is thereby reduced and the retrieval recall rate is improved; and because the relevance between the target texts and the question to be answered is higher, the accuracy of answers determined based on the target texts is higher, i.e., the performance of the question-answering system is improved.
Corresponding to the above method embodiment, the present application further provides an embodiment of a text processing device, and fig. 7 shows a schematic structural diagram of the text processing device according to an embodiment of the present application. As shown in fig. 7, the apparatus 700 includes:
a first determining module 702 configured to determine, based on the acquired question to be answered, a semantic vector of the question to be answered, a plurality of candidate texts, and semantic vectors of the plurality of candidate texts, wherein each candidate text is a text related to the meaning of the question to be answered in a text library;
A construction module 704 configured to construct an adjacency matrix based on the association relationship between the question to be answered and the plurality of candidate texts, wherein the adjacency matrix is used for characterizing the relevance between the question to be answered and the plurality of candidate texts and the relevance between the plurality of candidate texts;
a second determining module 706 is configured to input the semantic vector of the question to be answered, the semantic vectors of the candidate texts and the adjacency matrix into a text screening network to determine a target text.
Optionally, the building module 704 is further configured to:
acquiring keywords of the questions to be replied and keywords of each candidate text;
if the corresponding keyword of the question to be replied exists in the first candidate text, determining that the association relationship between the first candidate text and the question to be replied is relevant, wherein the first candidate text is any candidate text in the plurality of candidate texts;
if the corresponding keyword of the second candidate text exists in the first candidate text, determining that the association relationship between the first candidate text and the second candidate text is relevant, wherein the second candidate text is any candidate text except the first candidate text in the plurality of candidate texts;
And determining that the association relationship between the question to be answered and itself is relevant and that the association relationship between each candidate text and itself is relevant; or determining that the association relationship between the question to be answered and itself is irrelevant and that the association relationship between each candidate text and itself is irrelevant.
Optionally, the construction module 704 is configured to:
and taking the questions to be replied and the plurality of candidate texts as nodes, taking the nodes as rows and columns, wherein the arrangement order of the row nodes and the column nodes is the same, and determining the elements of each position based on the association relation of the row nodes and the column nodes corresponding to each position to obtain the adjacency matrix.
Optionally, the construction module 704 is configured to:
if the association relationship between the row node and the column node corresponding to the target position is relevant, determining that the element of the target position is 1, wherein the target position is any position in the adjacency matrix;
if the association relationship between the row node and the column node corresponding to the target position is irrelevant, determining that the element of the target position is 0.
Optionally, the construction module 704 is configured to:
taking the questions to be replied and the plurality of candidate texts as nodes, and connecting different nodes with related association relations to obtain a graph network;
The adjacency matrix is constructed based on the graph network.
Optionally, the construction module 704 is configured to:
and taking the nodes in the graph network as rows and columns, wherein the arrangement sequence of the row nodes is the same as the arrangement sequence of the column nodes, and determining the elements of each position based on whether edges exist in the row nodes and the column nodes corresponding to each position, so as to obtain the adjacency matrix.
Optionally, the construction module 704 is configured to:
if the row node and the column node corresponding to the target position are not the same node and an edge exists between them in the graph network, determining that the element of the target position is 1, wherein the target position is any position in the adjacency matrix;
if the row node and the column node corresponding to the target position are not the same node and no edge exists between them in the graph network, determining that the element of the target position is 0;
if the row node and the column node corresponding to the target position are the same node, determining that the element of the target position is 1 or 0.
Optionally, the second determining module 706 is configured to:
inputting the adjacency matrix, the semantic vector of the to-be-replied question and the semantic vectors of the plurality of candidate texts into a text screening network to obtain a relevance score of each candidate text relative to the to-be-replied question;
And determining the candidate text with the relevance score being greater than a first threshold as the target text.
Optionally, the second determining module 706 is configured to:
splicing the semantic vector of the question to be replied and the semantic vectors of the candidate texts to obtain spliced semantic vectors;
inputting the spliced semantic vector and the adjacency matrix into a hidden layer of a text screening network to obtain a hidden layer feature vector group, wherein the hidden layer feature vector group comprises hidden layer feature vectors obtained by combining the semantic vectors of the candidate texts with the questions to be replied and hidden layer feature vectors obtained by combining the semantic vectors of other candidate texts and the questions to be replied with each candidate text;
and inputting the hidden layer feature vector group into a full-connection layer to obtain the relevance score of each candidate text relative to the questions to be replied.
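One common way to realize such a hidden layer is a graph-convolution update, in which the adjacency matrix mixes each node's semantic vector with those of its relevant neighbours before a fully connected layer produces the scores. The embodiment does not fix the exact layer form, so the normalization, random weights, and dimensions below are illustrative assumptions:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution hidden layer: each node's feature vector is
    replaced by a degree-normalized mix of its neighbours' vectors
    (adjacency includes self-loops), then linearly transformed + ReLU."""
    D = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_hat = D @ A @ D
    return np.maximum(A_hat @ X @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[1, 1, 0, 1],       # question + 3 candidates, with self-loops
              [1, 1, 1, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 1]], dtype=float)
X = rng.normal(size=(4, 8))        # spliced/stacked semantic vectors (toy)
W = rng.normal(size=(8, 8))        # hidden-layer weights (untrained, toy)
H = gcn_layer(A, X, W)             # hidden layer feature vector group

# Fully connected layer -> one relevance score per node (sigmoid output).
w_fc = rng.normal(size=(8,))
scores = 1.0 / (1.0 + np.exp(-(H @ w_fc)))
print(scores[1:])  # relevance scores of the three candidate texts
```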
Optionally, the second determining module 706 is further configured to:
and if the number of the target texts is multiple, sequencing the target texts according to the sequence from the high relevance score to the low relevance score, and outputting the sequenced target texts according to the sequence.
Optionally, the first determining module 702 is configured to:
Extracting features of the questions to be answered, and determining semantic vectors of the questions to be answered;
acquiring semantic vectors of a plurality of texts in the text library;
determining a similarity score of each text relative to the question to be answered based on the semantic vector of the question to be answered and the semantic vectors of the plurality of texts;
and determining the candidate texts based on the similarity scores of each text relative to the questions to be answered, and acquiring semantic vectors of the candidate texts.
Optionally, the first determining module 702 is configured to:
and taking a plurality of texts with similarity scores larger than a second threshold value as the candidate texts.
Optionally, the apparatus further comprises a training module configured to:
acquiring a sample question, a plurality of sample texts and a sample mark of each sample text, wherein the sample mark of each sample text is used for representing the relevance of the sample text and the sample question;
determining a semantic vector of the sample question and a semantic vector of each sample text, and constructing an adjacency matrix based on the sample question and a plurality of sample texts;
inputting the semantic vector of the sample problem, the semantic vector of each sample text and an adjacency matrix into the text screening network, and processing through a hidden layer of the text screening network to obtain a hidden layer feature vector group, wherein the hidden layer feature vector group comprises hidden layer feature vectors obtained by combining the sample problem with the semantic vectors of the plurality of sample texts and hidden layer feature vectors obtained by combining each sample text with other sample texts and the semantic vectors of the sample problems;
Inputting the hidden layer feature vector group into a full-connection layer to obtain a relevance score of each sample text relative to the sample problem;
determining a predictive marker for each sample text based on a relevance score for each sample text relative to the sample question;
training the text screening network based on the predictive markers of each sample text and the loss values of the sample markers until a training stop condition is reached.
Optionally, the training module is configured to:
if the loss value is smaller than or equal to a third threshold value, stopping training the text screening network;
and if the loss value is larger than the third threshold value, continuing training the text screening network.
Optionally, the training module is configured to:
training the text screening network once based on the predictive markers of each sample text and the loss values of the sample markers, and recording the number of iterative training plus one;
if the number of iterative training is smaller than or equal to a fourth threshold value, continuing training the text screening network;
and if the number of iterative training is greater than the fourth threshold, stopping training the text screening network.
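Both stopping conditions described above (loss value at or below the third threshold, or iteration count exceeding the fourth threshold) can be combined in one loop. In the sketch below the binary cross-entropy loss, the threshold values, and the toy data are illustrative assumptions, and only a final scoring layer is trained for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 4))          # hidden features of 6 sample texts (toy)
labels = np.array([1, 0, 1, 1, 0, 0], dtype=float)  # sample marks
w = np.zeros(4)                      # scoring-layer weights to train
THIRD_THRESHOLD, FOURTH_THRESHOLD, lr = 0.05, 500, 0.5

for iteration in range(1, FOURTH_THRESHOLD + 1):
    p = 1.0 / (1.0 + np.exp(-(H @ w)))            # predicted relevance scores
    loss = -np.mean(labels * np.log(p + 1e-9)
                    + (1 - labels) * np.log(1 - p + 1e-9))
    if loss <= THIRD_THRESHOLD:                   # loss-value stop condition
        break
    w -= lr * H.T @ (p - labels) / len(labels)    # one gradient step
# Training stops once loss <= third threshold, or after the number of
# iterations exceeds the fourth threshold, whichever comes first.
```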
In the embodiment of the application, the semantic vector of the question to be answered, a plurality of candidate texts and the semantic vectors of the plurality of candidate texts are determined based on the acquired question to be answered, wherein each candidate text is a text in a text library that is semantically related to the question to be answered; an adjacency matrix is constructed based on the association relationships between the question to be answered and the plurality of candidate texts, wherein the adjacency matrix is used for characterizing the relevance between the question to be answered and the plurality of candidate texts and the relevance among the plurality of candidate texts; and the semantic vector of the question to be answered, the semantic vectors of the plurality of candidate texts and the adjacency matrix are input into a text screening network to determine a target text. After the plurality of candidate texts are determined, they can be further screened through the text screening network: candidate texts irrelevant to the question to be answered are deleted and target texts with higher relevance to the question to be answered are obtained, so that recall of irrelevant texts is reduced and the retrieval recall rate is improved; and because the relevance between the target texts and the question to be answered is higher, the accuracy of answers determined based on the target texts is higher, i.e., the performance of the question-answering system is improved.
The above is an exemplary scheme of a text processing apparatus of the present embodiment. It should be noted that, the technical solution of the text processing apparatus and the technical solution of the text processing method belong to the same concept, and details of the technical solution of the text processing apparatus, which are not described in detail, can be referred to the description of the technical solution of the text processing method.
It should be noted that the components in the apparatus claims should be understood as functional modules necessary for implementing the steps of the program flow or of the method, not as actual functional divisions or separate physical limitations. An apparatus claim defined by such a set of functional modules should be understood as a functional-module architecture that implements the solution primarily by means of the computer program described in the specification, rather than as a physical device that implements the solution primarily by means of hardware.
In one embodiment, the application also provides a computing device, including a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the text processing method when executing the instructions.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the text processing method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the text processing method.
An embodiment of the application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the text processing method as described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the text processing method belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the text processing method.
The embodiment of the application discloses a chip which stores computer instructions which, when executed by a processor, implement the steps of the text processing method as described above.
The foregoing describes certain embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code that may be in source code form, object code form, executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the computer readable medium contains content that can be appropriately scaled according to the requirements of jurisdictions in which such content is subject to legislation and patent practice, such as in certain jurisdictions in which such content is subject to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the application disclosed above are intended only to assist in the explanation of the application. Alternative embodiments are not intended to be exhaustive or to limit the application to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and the full scope and equivalents thereof.

Claims (19)

1. A method of candidate text determination, the method comprising:
determining semantic vectors of the questions to be replied based on the acquired questions to be replied, and acquiring semantic vectors of a plurality of texts in a text library;
determining a first text to be selected related to the semantic of the question to be answered from the text library according to the similarity of the semantic vector of the question to be answered and the semantic vectors of the texts;
Word segmentation processing is carried out on the questions to be replied to obtain a plurality of first word units of the questions to be replied;
determining a similarity score of each text relative to the questions to be replied based on the weight value of each first word unit and the relevance value of each first word unit and each text in the text library, and determining the text with the similarity score larger than a second threshold value as a second text to be selected;
and determining candidate texts based on the first candidate text and the second candidate text.
2. The method according to claim 1, wherein the number of candidate texts is a plurality; after determining the candidate text, based on the first candidate text and the second candidate text, the method further comprises:
determining semantic vectors of a plurality of candidate texts;
constructing an adjacency matrix based on the association relation between the questions to be replied and the plurality of candidate texts, wherein the adjacency matrix is used for representing the correlation between the questions to be replied and the plurality of candidate texts and the correlation between the plurality of candidate texts;
and inputting the semantic vector of the to-be-replied question, the semantic vectors of the candidate texts and the adjacency matrix into a text screening network to determine a target text.
3. The method of determining candidate texts according to claim 2, further comprising, before constructing an adjacency matrix based on the association relation between the question to be answered and the plurality of candidate texts:
acquiring keywords of the questions to be replied and keywords of each candidate text;
if the corresponding keyword of the question to be replied exists in the first candidate text, determining that the association relationship between the first candidate text and the question to be replied is relevant, wherein the first candidate text is any candidate text in the plurality of candidate texts;
if the corresponding keyword of the second candidate text exists in the first candidate text, determining that the association relationship between the first candidate text and the second candidate text is relevant, wherein the second candidate text is any candidate text except the first candidate text in the plurality of candidate texts;
and determining that the association relationship between the question to be answered and itself is relevant and that the association relationship between each candidate text and itself is relevant; or determining that the association relationship between the question to be answered and itself is irrelevant and that the association relationship between each candidate text and itself is irrelevant.
4. The method for determining candidate texts as set forth in claim 3, wherein constructing an adjacency matrix based on the association relationship between the question to be answered and the plurality of candidate texts comprises:
and taking the questions to be replied and the plurality of candidate texts as nodes, taking the nodes as rows and columns, wherein the arrangement order of the row nodes and the column nodes is the same, and determining the elements of each position based on the association relation of the row nodes and the column nodes corresponding to each position to obtain the adjacency matrix.
5. The method of claim 4, wherein determining the element for each location based on the association of the row node and the column node corresponding to each location comprises:
if the association relationship between the row node and the column node corresponding to the target position is relevant, determining that the element of the target position is 1, wherein the target position is any position in the adjacency matrix;
if the association relationship between the row node and the column node corresponding to the target position is irrelevant, determining that the element of the target position is 0.
6. The method of claim 2, wherein constructing an adjacency matrix based on the association relationship of the question to be answered and the plurality of candidate texts, comprises:
Taking the questions to be replied and the plurality of candidate texts as nodes, and connecting different nodes with related association relations to obtain a graph network;
and taking the nodes in the graph network as rows and columns, wherein the arrangement sequence of the row nodes is the same as the arrangement sequence of the column nodes, and determining the elements of each position based on whether edges exist in the row nodes and the column nodes corresponding to each position, so as to obtain the adjacency matrix.
7. The method of claim 6, wherein determining the element for each location based on whether an edge exists for the row node and the column node corresponding to each location comprises:
if the row node and the column node corresponding to the target position are not the same nodes and edges exist in the graph network, determining that the element of the target position is 1, wherein the target position is any position in the adjacent matrix;
if the row node and the column node corresponding to the target position are not the same nodes and no edge exists in the graph network, determining that the element of the target position is 0;
if the row node and the column node corresponding to the target position are the same node, determining that the element of the target position is 1 or 0.
8. The method of claim 2, wherein inputting the semantic vector of the question to be answered, the semantic vectors of the plurality of candidate texts, and the adjacency matrix into a text filtering network, determining a target text, comprises:
inputting the adjacency matrix, the semantic vector of the to-be-replied question and the semantic vectors of the plurality of candidate texts into a text screening network to obtain a relevance score of each candidate text relative to the to-be-replied question;
determining candidate texts with relevance scores greater than a first threshold as the target texts;
and if the number of the target texts is multiple, sequencing the target texts according to the sequence from the high relevance score to the low relevance score, and outputting the sequenced target texts according to the sequence.
9. The method of claim 8, wherein inputting the adjacency matrix, the semantic vector of the question to be answered, and the semantic vectors of the plurality of candidate texts into a text filtering network to obtain a relevance score for each candidate text with respect to the question to be answered, comprises:
splicing the semantic vector of the question to be replied and the semantic vectors of the candidate texts to obtain spliced semantic vectors;
Inputting the spliced semantic vector and the adjacency matrix into a hidden layer of a text screening network to obtain a hidden layer feature vector group, wherein the hidden layer feature vector group comprises hidden layer feature vectors obtained by combining the semantic vectors of the candidate texts with the questions to be replied and hidden layer feature vectors obtained by combining the semantic vectors of other candidate texts and the questions to be replied with each candidate text;
and inputting the hidden layer feature vector group into a full-connection layer to obtain the relevance score of each candidate text relative to the questions to be replied.
10. The method of claim 2, wherein the text screening network is trained as follows:
acquiring a sample question, a plurality of sample texts, and a sample marker for each sample text, wherein the sample marker of each sample text represents the relevance of that sample text to the sample question;
determining a semantic vector of the sample question and a semantic vector of each sample text, and constructing an adjacency matrix based on the sample question and the plurality of sample texts;
inputting the semantic vector of the sample question, the semantic vector of each sample text, and the adjacency matrix into the text screening network, and processing them through a hidden layer of the text screening network to obtain a hidden-layer feature vector group, wherein the hidden-layer feature vector group comprises a hidden-layer feature vector obtained by combining the sample question with the semantic vectors of the plurality of sample texts, and a hidden-layer feature vector obtained by combining each sample text with the semantic vectors of the other sample texts and the sample question;
inputting the hidden-layer feature vector group into a fully-connected layer to obtain a relevance score of each sample text relative to the sample question;
determining a predictive marker for each sample text based on its relevance score relative to the sample question;
and training the text screening network based on a loss value between the predictive marker and the sample marker of each sample text, until a training stop condition is reached.
11. The method of claim 10, wherein training the text screening network based on the loss value between the predictive marker and the sample marker of each sample text until a training stop condition is reached comprises:
if the loss value is less than or equal to a third threshold, stopping training the text screening network; if the loss value is greater than the third threshold, continuing to train the text screening network;
or,
training the text screening network once based on the loss value between the predictive marker and the sample marker of each sample text, and incrementing the recorded number of training iterations by one; if the number of training iterations is less than or equal to a fourth threshold, continuing to train the text screening network; and if the number of training iterations is greater than the fourth threshold, stopping training the text screening network.
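Claims 10 and 11 name two alternative stop conditions: the loss falls to the "third threshold" or below, or the iteration count exceeds the "fourth threshold". A sketch of that outer loop, where `step_fn` stands in for one real training pass over the network (its internals are not specified here):

```python
def train_screening_network(step_fn, loss_threshold=1e-3, max_iters=1000):
    """Run training steps until one of the two stop conditions in the
    claims is met: the loss drops to `loss_threshold` or below (the
    'third threshold'), or the iteration count exceeds `max_iters`
    (the 'fourth threshold').  `step_fn()` performs one training pass
    and returns the loss value."""
    iters = 0
    while True:
        loss = step_fn()
        iters += 1                      # number of training iterations + 1
        if loss <= loss_threshold:      # condition A: loss small enough
            return iters, loss
        if iters > max_iters:           # condition B: iteration budget exceeded
            return iters, loss

# toy example: the loss shrinks each step until condition A fires
losses = iter([0.8, 0.4, 0.1, 0.0005])
n, final = train_screening_network(lambda: next(losses), loss_threshold=1e-3)
print(n, final)   # 4 0.0005
```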
12. The method of claim 1, wherein determining a semantic vector of the question to be answered based on the acquired question to be answered, and acquiring semantic vectors of a plurality of texts in a text library, comprises:
performing word segmentation on the question to be answered and on the plurality of texts in the text library, to obtain a plurality of first word units of the question to be answered and a plurality of second word units of each text;
performing word embedding on each first word unit of the question to be answered and on each second word unit of each text, mapping each first word unit and each second word unit into a low-dimensional vector space to obtain a word vector for each first word unit and each second word unit;
inputting the word vector of each first word unit and the word vector of each second word unit into an encoding layer for encoding, to obtain a first feature vector for each first word unit and a second feature vector for each second word unit;
and splicing the first feature vectors of the plurality of first word units of the question to be answered to obtain the semantic vector of the question to be answered, and splicing the second feature vectors of the plurality of second word units of the same text to obtain the semantic vectors of the plurality of texts.
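The segment-embed-encode-splice pipeline of claim 12 can be sketched as below. Everything concrete here is an assumption: whitespace splitting stands in for the word segmenter, a random lookup table for the embedding layer, and an elementwise `tanh` for the encoding layer (a real system would use a trained encoder such as an RNN or Transformer).

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 4                      # size of the low-dimensional space (assumed)
vocab = {}                         # word unit -> embedding, grown lazily

def embed(word):
    """Word embedding: map a word unit into the low-dimensional vector space."""
    if word not in vocab:
        vocab[word] = rng.standard_normal(EMBED_DIM)
    return vocab[word]

def encode(word_vecs):
    """Stand-in for the encoding layer: one feature vector per word unit."""
    return [np.tanh(v) for v in word_vecs]

def semantic_vector(text):
    words = text.split()                       # word segmentation (whitespace here)
    word_vecs = [embed(w) for w in words]      # word vectors
    features = encode(word_vecs)               # per-word feature vectors
    return np.concatenate(features)            # splice into one semantic vector

v = semantic_vector("where is the library")
print(v.shape)   # (16,)
```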
13. The method of claim 1, wherein determining, from the text library, a first candidate text semantically related to the question to be answered according to the similarity between the semantic vector of the question to be answered and the semantic vectors of the plurality of texts, comprises:
multiplying the semantic vector of the question to be answered by the semantic vectors of the plurality of texts, and normalizing the products to obtain the similarity between the question to be answered and each text;
and determining, from the text library, the first candidate text semantically related to the question to be answered according to the similarity between the question to be answered and each text.
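One common reading of "multiplying the vectors and normalizing the products" in claim 13 is cosine similarity; the claim does not fix the normalization, so this sketch assumes L2 normalization:

```python
import numpy as np

def similarities(q_vec, text_vecs):
    """Dot product of the question vector with each text vector after L2
    normalization, i.e. cosine similarity (an assumed reading of
    'normalizing the products')."""
    q = q_vec / np.linalg.norm(q_vec)
    t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    return t @ q

q = np.array([1.0, 0.0])
texts = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(similarities(q, texts))   # ~[1.0, 0.0, 0.707]
```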
14. The method of claim 1, wherein performing word segmentation on the question to be answered to obtain a plurality of first word units of the question to be answered comprises:
performing word segmentation on the question to be answered according to a pre-built word list, to obtain the plurality of first word units of the question to be answered.
15. The method of claim 1, wherein before determining a similarity score of each text relative to the question to be answered based on the weight value of each first word unit and the relevance value between each first word unit and each text in the text library, and determining texts with similarity scores greater than a second threshold as second candidate texts, the method further comprises:
determining, for each first word unit, the frequency of occurrence of that first word unit in any text, the average length of all texts in the text library, and the length of that text;
determining the relevance value between each first word unit and any text based on the frequency, the average length, and the length of that text;
determining the total number of texts in the text library and the number of texts in the text library that include any first word unit;
and determining the weight value of any first word unit based on the total number and the number of texts that include that first word unit.
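The quantities named in claim 15 (term frequency, average text length, text length, total text count, count of texts containing the word) are exactly those of the classic BM25 scheme, so a BM25 sketch is a plausible instantiation. The tuning constants `k1` and `b` and the exact formula are assumptions; the claim itself does not specify them.

```python
import math

def bm25_score(query_words, text_words, corpus, k1=1.5, b=0.75):
    """Score one text against the query words using the quantities named
    in the claim.  The per-word term factor plays the role of the
    'relevance value' and the IDF factor the 'weight value'."""
    n_texts = len(corpus)
    avg_len = sum(len(t) for t in corpus) / n_texts
    score = 0.0
    for w in query_words:
        tf = text_words.count(w)                       # frequency in this text
        df = sum(1 for t in corpus if w in t)          # texts containing the word
        idf = math.log((n_texts - df + 0.5) / (df + 0.5) + 1)   # weight value
        denom = tf + k1 * (1 - b + b * len(text_words) / avg_len)
        score += idf * (k1 + 1) * tf / denom if denom else 0.0
    return score

corpus = [["cat", "sat"], ["dog", "ran", "far"], ["cat", "dog"]]
print(bm25_score(["cat"], corpus[0], corpus) > bm25_score(["cat"], corpus[1], corpus))  # True
```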
16. The method of claim 1, wherein determining a candidate text based on the first candidate text and the second candidate text comprises:
determining the intersection of the first candidate text and the second candidate text as the candidate text, or determining the union of the first candidate text and the second candidate text as the candidate text.
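The two combination options in claim 16 are plain set operations; intersection keeps only texts both retrieval routes agree on (higher precision), while union keeps everything either route found (higher recall). A minimal sketch:

```python
def combine_candidates(first, second, mode="union"):
    """Combine the first and second candidate text sets, as in the claim:
    either their intersection or their union."""
    a, b = set(first), set(second)
    return a & b if mode == "intersection" else a | b

first = {"doc1", "doc2"}
second = {"doc2", "doc3"}
print(sorted(combine_candidates(first, second, "intersection")))  # ['doc2']
print(sorted(combine_candidates(first, second, "union")))         # ['doc1', 'doc2', 'doc3']
```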
17. A candidate text determination device, the device comprising:
a first determining module configured to determine a semantic vector of a question to be answered based on the acquired question to be answered, and to acquire semantic vectors of a plurality of texts in a text library;
a second determining module configured to determine, from the text library, a first candidate text semantically related to the question to be answered according to the similarity between the semantic vector of the question to be answered and the semantic vectors of the plurality of texts;
a word segmentation module configured to perform word segmentation on the question to be answered to obtain a plurality of first word units of the question to be answered;
a third determining module configured to determine a similarity score of each text relative to the question to be answered based on the weight value of each first word unit and the relevance value between each first word unit and each text in the text library, and to determine texts with similarity scores greater than a second threshold as second candidate texts;
and a fourth determining module configured to determine a candidate text based on the first candidate text and the second candidate text.
18. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor, when executing the instructions, performs the steps of the method of any one of claims 1-16.
19. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 16.
CN202311036568.4A 2021-04-30 2021-04-30 Candidate text determination method and device Pending CN117009488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311036568.4A CN117009488A (en) 2021-04-30 2021-04-30 Candidate text determination method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202311036568.4A CN117009488A (en) 2021-04-30 2021-04-30 Candidate text determination method and device
CN202110484317.7A CN113220832B (en) 2021-04-30 2021-04-30 Text processing method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202110484317.7A Division CN113220832B (en) 2021-04-30 2021-04-30 Text processing method and device

Publications (1)

Publication Number Publication Date
CN117009488A true CN117009488A (en) 2023-11-07

Family

ID=77090694

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110484317.7A Active CN113220832B (en) 2021-04-30 2021-04-30 Text processing method and device
CN202311036568.4A Pending CN117009488A (en) 2021-04-30 2021-04-30 Candidate text determination method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202110484317.7A Active CN113220832B (en) 2021-04-30 2021-04-30 Text processing method and device

Country Status (1)

Country Link
CN (2) CN113220832B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067349A (en) * 2022-01-12 2022-02-18 阿里巴巴达摩院(杭州)科技有限公司 Target object processing method and device
CN114840648B (en) * 2022-03-21 2024-08-20 阿里巴巴(中国)有限公司 Answer generation method, device and computer program product
CN116737888B (en) * 2023-01-11 2024-05-17 北京百度网讯科技有限公司 Training method of dialogue generation model and method and device for determining reply text
CN116304748B (en) * 2023-05-17 2023-07-28 成都工业学院 Text similarity calculation method, system, equipment and medium
CN117150026B (en) * 2023-11-01 2024-01-26 智者四海(北京)技术有限公司 Text content multi-label classification method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10169456B2 (en) * 2012-08-14 2019-01-01 International Business Machines Corporation Automatic determination of question in text and determination of candidate responses using data mining
CN108536708A (en) * 2017-03-03 2018-09-14 腾讯科技(深圳)有限公司 A kind of automatic question answering processing method and automatically request-answering system
CN108399163B (en) * 2018-03-21 2021-01-12 北京理工大学 Text similarity measurement method combining word aggregation and word combination semantic features
CN110750630A (en) * 2019-09-25 2020-02-04 北京捷通华声科技股份有限公司 Generating type machine reading understanding method, device, equipment and storage medium
CN110955761A (en) * 2019-10-12 2020-04-03 深圳壹账通智能科技有限公司 Method and device for acquiring question and answer data in document, computer equipment and storage medium
CN111125328B (en) * 2019-12-12 2023-11-07 深圳数联天下智能科技有限公司 Text processing method and related equipment
CN111125335B (en) * 2019-12-27 2021-04-06 北京百度网讯科技有限公司 Question and answer processing method and device, electronic equipment and storage medium
CN111597314B (en) * 2020-04-20 2023-01-17 科大讯飞股份有限公司 Reasoning question-answering method, device and equipment

Also Published As

Publication number Publication date
CN113220832A (en) 2021-08-06
CN113220832B (en) 2023-09-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination