CN116610782B - Text retrieval method, device, electronic equipment and medium

Text retrieval method, device, electronic equipment and medium

Info

Publication number
CN116610782B
CN116610782B (application CN202310479153.8A)
Authority
CN
China
Prior art keywords
entity
text
candidate
feature information
importance
Prior art date
Legal status
Active
Application number
CN202310479153.8A
Other languages
Chinese (zh)
Other versions
CN116610782A (en)
Inventor
陈珺仪
谢奕
陈佳颖
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310479153.8A priority Critical patent/CN116610782B/en
Publication of CN116610782A publication Critical patent/CN116610782A/en
Application granted granted Critical
Publication of CN116610782B publication Critical patent/CN116610782B/en

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems (G06F Electric digital data processing; G06F16/00 Information retrieval; G06F16/30 unstructured textual data; G06F16/33 Querying; G06F16/332 Query formulation)
    • G06F16/3344 Query execution using natural language analysis (G06F16/3331 Query processing; G06F16/334 Query execution)
    • G06F16/338 Presentation of query results
    • G06F16/34 Browsing; Visualisation therefor
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management (Y02D Climate change mitigation technologies in information and communication technologies [ICT])

Abstract

The disclosure provides a text retrieval method, apparatus, electronic device and medium, relates to the field of artificial intelligence, in particular to the technical field of natural language processing, deep learning and pre-training models, and can be applied to scenes such as smart cities and smart government affairs. The specific implementation scheme is as follows: acquiring a plurality of candidate texts associated with the search text according to a plurality of keywords in the search text; analyzing the search text to obtain first feature information, second feature information and third feature information corresponding to the search text; respectively analyzing the plurality of candidate texts to obtain candidate feature information corresponding to each of the plurality of candidate texts; for each candidate text, determining the matching degree between the candidate text and the search text according to the first feature information, the second feature information, the third feature information and the candidate feature information; and sorting the candidate texts according to the matching degree, and obtaining a retrieval result corresponding to the retrieval text based on the sorting result.

Description

Text retrieval method, device, electronic equipment and medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of natural language processing, deep learning and pre-training models, and can be applied to scenes such as smart cities, smart government affairs and the like. The present disclosure relates in particular to a text retrieval method, apparatus, electronic device, storage medium and computer program product.
Background
In the related art, text retrieval is generally performed by adopting a text cut-off mode, that is, when the length of the retrieved content exceeds a certain limit, only the text content within the limit range is taken for text retrieval. However, in text retrieval using long text containing complex information, related key information may be distributed at various locations of the retrieved content. If text is searched by using a text cut-off mode, part of key information is missed, so that a search result is inaccurate.
Disclosure of Invention
The present disclosure provides a text retrieval method, apparatus, electronic device, storage medium and computer program product.
According to an aspect of the present disclosure, there is provided a text retrieval method including: acquiring a plurality of candidate texts associated with the search text according to a plurality of keywords in the search text; analyzing the search text to obtain first feature information, second feature information and third feature information corresponding to the search text; respectively analyzing the plurality of candidate texts to obtain candidate feature information corresponding to each of the plurality of candidate texts; for each candidate text, determining the matching degree between the candidate text and the search text according to the first feature information, the second feature information, the third feature information and the candidate feature information; and sorting the candidate texts according to the matching degree, and obtaining a retrieval result corresponding to the retrieval text based on the sorting result.
According to another aspect of the present disclosure, there is provided a text retrieval apparatus including: the acquisition module is used for acquiring a plurality of candidate texts associated with the search text according to a plurality of keywords in the search text; the first analysis module is used for analyzing the search text to obtain first characteristic information, second characteristic information and third characteristic information corresponding to the search text; the second analysis module is used for respectively carrying out analysis processing on the plurality of candidate texts to obtain candidate feature information corresponding to each of the plurality of candidate texts; the matching module is used for determining the matching degree between the candidate text and the search text according to the first feature information, the second feature information, the third feature information and the candidate feature information for each candidate text; and the sorting module is used for sorting the plurality of candidate texts according to the matching degree and obtaining a retrieval result corresponding to the retrieval text based on the sorting result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided according to the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an exemplary system architecture to which text retrieval methods and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 is a flow chart of a text retrieval method according to an embodiment of the present disclosure;
FIGS. 3A and 3B are schematic diagrams of text retrieval methods according to embodiments of the present disclosure;
FIG. 4 is a block diagram of a text retrieval device according to an embodiment of the present disclosure; and
fig. 5 is a block diagram of an electronic device for implementing a text retrieval method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C, etc." is used, the expression should generally be interpreted in accordance with the meaning as commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
FIG. 1 is a schematic diagram of an exemplary system architecture to which text retrieval methods and apparatus may be applied, according to embodiments of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various client applications can be installed on the terminal devices 101, 102, 103, for example, knowledge reading applications, document processing applications, web browser applications, search applications, instant messaging tools, email clients or social platform software, and the like (merely examples).
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud computing, network service, and middleware service.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
For example, the server 105 may acquire the search text from the terminal devices 101, 102, 103 through the network 104, and acquire a plurality of candidate texts associated with the search text based on a plurality of keywords in the search text. And then, analyzing the search text to obtain first feature information, second feature information and third feature information corresponding to the search text. And then respectively analyzing the plurality of candidate texts to obtain candidate feature information corresponding to each of the plurality of candidate texts. Then, for each candidate text, according to the first feature information, the second feature information, the third feature information and the candidate feature information, the matching degree between the candidate text and the search text is determined, the plurality of candidate texts are ordered according to the matching degree, and the search result corresponding to the search text is obtained based on the ordering result. In some examples, the server 105 may also send the search results corresponding to the search text to the terminal devices 101, 102, 103 so that the user obtains the search results for the search text.
It should be noted that the text retrieval method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the text retrieval device provided by the embodiments of the present disclosure may generally be provided in the server 105.
Alternatively, the text retrieval method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and that is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the text retrieval apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that the sequence numbers of the respective operations in the following methods are merely representative of the operations for the purpose of description, and should not be construed as representing the order of execution of the respective operations. The method need not be performed in the exact order shown unless explicitly stated.
Fig. 2 is a flow chart of a text retrieval method according to an embodiment of the present disclosure.
As shown in fig. 2, the text retrieval method 200 may include operations S210 to S250, for example.
In operation S210, a plurality of candidate texts associated with the search text are acquired according to a plurality of keywords in the search text.
In operation S220, the search text is parsed to obtain first feature information, second feature information, and third feature information corresponding to the search text.
In operation S230, the plurality of candidate texts are respectively parsed, so as to obtain candidate feature information corresponding to each of the plurality of candidate texts.
In operation S240, for each candidate text, a degree of matching between the candidate text and the search text is determined according to the first feature information, the second feature information, the third feature information, and the candidate feature information.
In operation S250, the plurality of candidate texts are ranked according to the degree of matching, and a search result corresponding to the search text is obtained based on the ranking result.
According to the embodiment of the present disclosure, the search text may be, for example, text content input by a user, or text content obtained by converting voice information input by the user, and the method for obtaining the search text is not limited in the present disclosure.
The search text may include, for example, a plurality of keywords. The plurality of keywords may be used to characterize key information in the search text, which is relevant for the search purpose.
In embodiments of the present disclosure, a plurality of candidate texts associated with a search text may be retrieved according to a plurality of keywords in the search text. Thus, the rough search of the search text can be realized based on the plurality of keywords in the search text.
On the basis of the rough search, the search text and the plurality of candidate texts can be respectively subjected to analysis processing so as to extract the characteristic information corresponding to the search text and the candidate characteristic information corresponding to the plurality of candidate texts, so that the plurality of candidate texts are further matched on the basis of the characteristic information corresponding to the search text and the candidate characteristic information corresponding to each candidate text, and the precision and the accuracy of a search result are further improved.
For example, the search text may be parsed to obtain first feature information, second feature information, and third feature information corresponding to the search text. The first characteristic information is used for representing the part of speech of each keyword and the semantic importance degree of each keyword for searching the text. The second feature information is used for representing the confidence level of the intention information corresponding to the search text. The third feature information is used for representing the true entity meaning of each keyword in the search text and the semantic importance degree of the entity to the search text.
For example, the analysis processing may be performed on each candidate text, so as to obtain candidate feature information corresponding to each candidate text. The candidate feature information is used for representing the true entity meaning of each keyword in the candidate text and the semantic importance degree of the entity to the candidate text.
Next, for each candidate text, matching calculation can be performed by using the first feature information, the second feature information, the third feature information and the candidate feature information to obtain the matching degree between the candidate text and the search text. Then, the plurality of candidate texts are ranked according to the matching degree, and a search result corresponding to the search text is obtained based on the ranking result.
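To make the overall flow of operations S210 to S250 concrete, the following minimal Python sketch shows one possible way to organize the feature containers and the ranking step. The class and function names (QueryFeatures, CandidateFeatures, rank_candidates, match_score) are illustrative assumptions and do not correspond to any implementation named in this disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class QueryFeatures:
    keyword_pos: Dict[str, str]        # first feature information: part of speech per keyword
    keyword_weight: Dict[str, float]   # first feature information: semantic importance per keyword
    intent: str                        # second feature information: intent label
    intent_confidence: float           # second feature information: confidence of the intent label
    entity_importance: Dict[str, int]  # third feature information: entity -> importance

@dataclass
class CandidateFeatures:
    entity_importance: Dict[str, int]  # candidate feature information: entity -> importance

def rank_candidates(
    query: QueryFeatures,
    candidates: Dict[str, CandidateFeatures],
    match_score: Callable[[QueryFeatures, CandidateFeatures], float],
) -> List[Tuple[str, float]]:
    """Operations S240 and S250: score every candidate text against the query features
    and sort the candidates by matching degree in descending order."""
    scored = [(text, match_score(query, feats)) for text, feats in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```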
It should be noted that, although the steps of the method are described above in a specific order, embodiments of the present disclosure are not limited thereto, and the steps may be performed in other orders as needed. For example, in some embodiments, step S220 may be performed after step S230 or concurrently with step S230, which is not limiting of the present disclosure.
In the technical scheme of the disclosure, firstly, rough retrieval of a retrieval text is realized based on a plurality of keywords in the retrieval text, and a plurality of candidate texts are obtained. Then, a search result corresponding to the search text is determined by performing a matching calculation with the candidate feature information using the first feature information, the second feature information, and the third feature information. Because the semantic association condition between the search text and the candidate text is considered in the matching calculation process, and the search tendency is determined based on the search text, the matching degree between the search text and the candidate text can be more accurately determined, and the accuracy of text search is improved.
According to the embodiment of the disclosure, a plurality of keywords in the search text can be obtained by performing word segmentation processing on the search text, for example. In some embodiments, the plurality of keywords in the search text may also be obtained in other suitable manners, which are not limited herein.
In addition, when the word segmentation processing is performed on the search text, the word segmentation weight of each keyword can be determined. The word segmentation weight is used for indicating the word segmentation accuracy of each keyword.
In one example, the word segmentation results after word segmentation processing may be matched against a proprietary word stock to determine the word segmentation accuracy of each keyword. For example, if a word corresponding to a keyword can be matched in the proprietary word stock, the similarity between the word and the keyword is determined as the word segmentation accuracy; if no corresponding word is matched, the word segmentation weight of the keyword is set to a preset value.
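A minimal sketch of this word-stock matching rule is given below; the character-overlap similarity measure and the preset default value of 0.5 are illustrative assumptions rather than the disclosed measure.

```python
def character_similarity(a: str, b: str) -> float:
    # Illustrative similarity: Jaccard overlap of the character sets (an assumed measure).
    if not a or not b:
        return 0.0
    return len(set(a) & set(b)) / len(set(a) | set(b))

def segmentation_weight(keyword: str, proprietary_words: set, preset_value: float = 0.5) -> float:
    """Word segmentation weight: similarity between the keyword and the best-matching word
    in the proprietary word stock, or a preset value when no corresponding word is matched."""
    matched = [character_similarity(keyword, word) for word in proprietary_words
               if set(keyword) & set(word)]
    return max(matched) if matched else preset_value

print(segmentation_weight("智能检索", {"智能检索", "智慧城市"}))  # exact match in the stock -> 1.0
print(segmentation_weight("天气", {"智能检索", "智慧城市"}))      # no match -> preset value 0.5
```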
According to an embodiment of the present disclosure, the first, second, and third characteristic information described above may be determined in the following manner.
For example, part-of-speech recognition may be performed on a plurality of keywords in the search text, so as to obtain part-of-speech recognition results and keyword weights of the plurality of keywords, and the part-of-speech recognition results and keyword weights of the plurality of keywords may be determined as the first feature information.
According to embodiments of the present disclosure, part-of-speech recognition results are used to characterize the part of speech of individual keywords, such as nouns, verbs, adjectives, and the like. The keyword weights are used to indicate the semantic importance of each keyword to the retrieved text.
It is understood that nouns, verbs, adjectives, and other parts of speech are typically of different semantic importance to text. In a general scenario, content words such as nouns and verbs are usually keywords, i.e., their semantic importance to the text is relatively high, while words such as modal particles, interrogative words and stop words are often non-keywords, i.e., of less semantic importance to the text. Thus, the keyword weight may be determined based on the part of speech of the keyword.
In one example, a preset word weight correspondence table may be queried according to the part of speech corresponding to each keyword in the part of speech recognition result, so as to determine the word weight corresponding to each keyword, and the word weight is used as the keyword weight.
The word weight correspondence table characterizes the correspondence between parts of speech and word weights. Illustratively, the word weight label "3" may represent the word weight corresponding to nouns, the word weight label "2" may represent the word weight corresponding to verbs and adjectives, and the word weight label "1" may represent the word weight corresponding to other parts of speech. Of course, the correspondence between the part of speech and the word weight may also be expressed in other manners and set according to actual requirements, which is not limited herein.
In another example, since word segmentation processing is generally performed on the search text to obtain the plurality of keywords before part-of-speech recognition is performed, the word segmentation weights may also be combined with the word weights when determining the keyword weights of the respective keywords. For example, the word segmentation weight and the word weight may be multiplied to obtain the keyword weight, thereby improving the accuracy of the keyword weights.
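The combination of the part-of-speech word weight and the word segmentation weight might be implemented as in the following sketch; the table values follow the illustrative labels "3", "2" and "1" above, and the multiplicative combination is the one mentioned in this example.

```python
# Illustrative word weight correspondence table: noun -> "3", verb/adjective -> "2", other -> "1".
POS_WORD_WEIGHT = {"noun": 3.0, "verb": 2.0, "adjective": 2.0}
OTHER_WORD_WEIGHT = 1.0

def keyword_weight(part_of_speech: str, segmentation_weight: float) -> float:
    """Keyword weight: word weight looked up by part of speech, multiplied by the
    word segmentation weight of the keyword."""
    word_weight = POS_WORD_WEIGHT.get(part_of_speech, OTHER_WORD_WEIGHT)
    return word_weight * segmentation_weight

print(keyword_weight("noun", 0.9))      # 2.7
print(keyword_weight("particle", 1.0))  # 1.0
```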
According to the embodiment of the disclosure, the intention classification can be performed on the search text, the intention classification result and the intention confidence corresponding to the search text are obtained, and the intention classification result and the intention confidence are determined to be the second feature information.
The intention classification result is used for characterizing intention information corresponding to the search text, and the intention information is used for indicating the search tendency of a user, such as elements like person names, object names, place names and time, or sub-elements related to these elements, such as a person's height and weight, clothing and the like. The intention confidence is used for indicating the confidence corresponding to the intention information when intention classification is performed on the search text.
In one example, a pre-trained intent classification model may be utilized to classify intent of the retrieved text. The pre-trained intent classification model may include, for example, but is not limited to, a FastText classifier, or other pre-trained model, specifically selected according to the actual application scenario.
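As a sketch of this intent classification step, the open-source fastText library could be used roughly as follows; the training file name, label set and hyperparameters are hypothetical and would depend on the actual application scenario.

```python
import fasttext  # open-source fastText library (pip install fasttext)

# Supervised training file in fastText format, one example per line, e.g.:
#   __label__person_name 查找 张三 的 会议 记录
# The file name, labels and hyperparameters below are hypothetical.
model = fasttext.train_supervised(input="intent_train.txt", epoch=20, lr=0.5)

def classify_intent(search_text: str):
    """Return the intention classification result and the intention confidence."""
    labels, probabilities = model.predict(search_text, k=1)
    return labels[0].replace("__label__", ""), float(probabilities[0])

print(classify_intent("查找 张三 的 会议 记录"))
```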
According to the embodiment of the disclosure, entity recognition can be performed on the search text, a first entity recognition result and a first importance degree recognition result associated with the first entity recognition result are obtained, and the first entity recognition result and the first importance degree recognition result are determined to be third feature information.
In embodiments of the present disclosure, the first entity recognition result may be obtained based on the search text using a pre-trained named entity recognition model. The first entity identification result includes a plurality of first entities. The plurality of first entities respectively correspond to named entity recognition results of the plurality of keywords in the search text. The named entity recognition model can include, but is not limited to, a Bi-LSTM+CRF model, and can be specifically selected according to actual needs.
The first importance identification result includes importance corresponding to each of the plurality of first entities. Wherein the importance of each first entity characterizes the semantic importance of the first entity to the retrieved text.
In one example, the importance of each first entity may be determined by querying a preset entity importance table according to each first entity.
The entity importance table characterizes the correspondence between entities and importance. Illustratively, importance labels 1-9 may be used to label the importance corresponding to respective entities, where a larger value of the importance label indicates a greater importance of the entity marked by that label, and vice versa. If two entities have the same importance, they may be tagged with the same importance label. It should be noted that, in some embodiments, the correspondence between entities and importance may be represented in other manners and set according to actual requirements, which is not limited herein.
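A sketch of the entity importance lookup follows; the entity types, importance values and default label are illustrative assumptions, and the named-entity recognition output is hand-written here in place of a pre-trained model such as a Bi-LSTM+CRF.

```python
# Illustrative entity importance table; in practice the importance labels (e.g. 1-9)
# would be configured for the application, and the entity types here are assumptions.
ENTITY_IMPORTANCE = {"PERSON": 9, "ORGANIZATION": 7, "LOCATION": 6, "TIME": 3}
DEFAULT_IMPORTANCE = 1  # assumed label for entity types not present in the table

def entity_features(ner_result):
    """Map named-entity recognition output, given as (entity text, entity type) pairs,
    to entity -> importance by querying the preset entity importance table."""
    return {text: ENTITY_IMPORTANCE.get(entity_type, DEFAULT_IMPORTANCE)
            for text, entity_type in ner_result}

# The NER result would normally come from a pre-trained named entity recognition model;
# a hand-written result stands in for it here.
print(entity_features([("张三", "PERSON"), ("北京", "LOCATION"), ("昨天", "TIME")]))
```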
According to an embodiment of the present disclosure, the above-described candidate feature information may be determined in the following manner.
For example, for each candidate text, entity recognition is performed on the candidate text, a second entity recognition result and a second importance recognition result associated with the second entity recognition result are obtained, and the second entity recognition result and the second importance recognition result are determined as candidate feature information.
The second entity identification result includes a plurality of second entities. And the plurality of second entities respectively correspond to named entity recognition results of the plurality of keywords in the candidate text.
The second importance identification result includes importance corresponding to each of the plurality of second entities. Wherein the importance of each second entity characterizes the semantic importance of the second entity to the candidate text.
In the embodiment of the present disclosure, the second entity recognition result and the second importance recognition result are obtained in the same or a similar manner to the first entity recognition result and the first importance recognition result described above, which is not repeated here.
After the first feature information, the second feature information, the third feature information, and the candidate feature information are acquired, the degree of matching between the candidate text and the search text may be determined based on these feature information.
Fig. 3A and 3B are schematic diagrams of text retrieval methods according to embodiments of the present disclosure. The process of determining the degree of matching between the candidate text and the search text is exemplified below with reference to fig. 3A and 3B.
As shown in fig. 3A, the search text 31 may be subjected to word segmentation processing, resulting in a plurality of keywords in the search text 31. Then, a plurality of candidate texts associated with the search text 31 are searched for based on the plurality of keywords.
Next, the search text 31 may be subjected to parsing processing, to obtain part-of-speech recognition results and keyword weights 311 (i.e., first feature information), intention classification results and intention confidence 312 (i.e., second feature information), and first entity recognition results and first importance recognition results 313 (i.e., third feature information) of a plurality of keywords corresponding to the search text 31.
In addition, the parsing process may be performed on each candidate text 32 to obtain the second entity recognition result and the second importance recognition result 321 corresponding to the candidate text 32, and the second entity recognition result and the second importance recognition result 321 may be determined as candidate feature information.
Next, a first matching degree may be obtained from the part-of-speech recognition results and keyword weights 311 of the plurality of keywords, the intention classification result and intention confidence 312, the first entity recognition result and first importance recognition result 313, and the second entity recognition result and second importance recognition result 321, and the first matching degree may be determined as the matching degree 3132 between the candidate text 32 and the search text 31.
Next, the plurality of candidate texts are ranked according to the degree of matching 3132 between the candidate text 32 and the search text 31, and a preset number of candidate texts are selected as the search results corresponding to the search text 31 based on the ranking results.
In some embodiments, the attribute feature information 322 may also be determined based on the update time of the candidate text 32 and the number of texts associated with the candidate text 32. Then, the attribute feature information 322, the second entity recognition result, and the second importance degree recognition result 321 are determined as candidate feature information. Wherein the number of texts associated with the candidate text 32 refers to the number of texts in the search database having an association with the candidate text 32. The association here includes, for example, text content having similarity to the candidate text 32, a reference relationship with the candidate text 32, and the like.
Accordingly, when determining the degree of matching 3132 between the candidate text 32 and the retrieval text 31, the attribute matching degree 3221 may also be determined according to the attribute feature information 322. The attribute matching degree 3221 is then superimposed on the first matching degree to obtain a second matching degree, and the second matching degree is determined as the matching degree 3132 between the candidate text 32 and the retrieval text 31.
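Superimposing the attribute matching degree on the first matching degree could, for instance, be a weighted sum, as in the short sketch below; the weighting coefficient is an illustrative assumption.

```python
def final_matching_degree(first_match: float, attribute_match: float, beta: float = 0.3) -> float:
    # Second matching degree: the attribute matching degree superimposed on the first
    # matching degree; the weighting coefficient beta is an illustrative assumption.
    return first_match + beta * attribute_match

print(final_matching_degree(first_match=22.36, attribute_match=0.75))  # 22.585
```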
In the embodiment of the present disclosure, the attribute matching degree 3221 may be determined as follows.
For example, a temporal decay function may be employed to determine a timeliness matching degree based on the update time and the retrieval time of the candidate text 32.
In one example, the timeliness matching degree satisfies the relationship given in formula (1), where p represents the timeliness matching degree, λ represents the time decay coefficient, t_now represents the retrieval time, and t_update represents the update time of the candidate text.
Next, a text relevance matching degree may be determined based on the number of texts associated with the candidate text 32. For example, the text relevance matching degree may be obtained by dividing the number of texts associated with the candidate text 32 by the number of all texts in the search database.
Next, the attribute matching degree 3221 is determined from the timeliness matching degree and the text relevance matching degree.
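One possible implementation of the attribute matching degree is sketched below; the exponential form of the temporal decay function (standing in for formula (1)), the use of days as the time unit, and the equal-weight linear combination of the timeliness and text relevance matching degrees are all assumptions for illustration.

```python
import math

def timeliness_match(retrieval_time: float, update_time: float, decay: float = 0.01) -> float:
    # Assumed form of the temporal decay function: the score decays exponentially
    # with the age of the candidate text (times given in days here).
    return math.exp(-decay * max(0.0, retrieval_time - update_time))

def text_relevance_match(num_associated: int, num_total: int) -> float:
    # Number of texts associated with the candidate text divided by all texts in the database.
    return num_associated / num_total if num_total else 0.0

def attribute_match(retrieval_time, update_time, num_associated, num_total, alpha: float = 0.5) -> float:
    # Attribute matching degree combining the two; the equal-weight combination is assumed.
    return alpha * timeliness_match(retrieval_time, update_time) + \
           (1.0 - alpha) * text_relevance_match(num_associated, num_total)

print(attribute_match(retrieval_time=1000.0, update_time=970.0, num_associated=12, num_total=600))
```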
Fig. 3B schematically illustrates a process of determining the first degree of matching. As shown in fig. 3B, the first entity identification result includes a plurality of first entities, for example, first entity 1, first entity 2, ..., first entity n. The first importance identification result includes the importance corresponding to each of the plurality of first entities, for example, the importance corresponding to each of first entity 1 to first entity n. The importance corresponding to the first entity i (i = 1, 2, ..., n) characterizes the semantic importance of the first entity i to the search text 31.
The second entity identification result includes a plurality of second entities, for example, second entity 1, second entity 2, ..., second entity n. The second importance identification result includes the importance corresponding to each of the plurality of second entities, for example, the importance corresponding to each of second entity 1 to second entity n. The importance corresponding to the second entity i (i = 1, 2, ..., n) characterizes the semantic importance of the second entity i to the candidate text 32.
For each second entity i (e.g., second entity 1), a target keyword of the plurality of keywords that matches the second entity i is determined. Then, the part-of-speech recognition results of the keywords and the keyword weights 311 are queried according to the target keywords, and target keyword weights 3111 corresponding to the target keywords are obtained. The target keyword weight 3111 is used to indicate the semantic importance of the target keyword to the search text 31.
In addition, for each second entity i (for example, the second entity 1), the second entity i is compared with the intention classification result to determine intention information matched with the second entity i in the intention classification result. For example, if the intent information characterizes that the search purpose is a person name and the named entity corresponding to the second entity i is also a person name, then the intent information is indicated to be matched with the second entity i; otherwise, the two are not matched. Then, the target intention confidence 3121 corresponding to the intention information is determined according to the intention information.
Next, according to the second entity i, the first entity i corresponding to the second entity i in the first entity identification result, and the target keyword weight 3111, an initial matching degree between the second entity i and the corresponding first entity i is determined.
Next, the matching degree between the second entity i and the corresponding first entity i is determined according to the initial matching degree between the second entity i and the corresponding first entity i, the target intention confidence 3121, the importance corresponding to the second entity i, and the importance corresponding to the first entity i.
Next, a first degree of matching is obtained from the degree of matching between each second entity and the corresponding first entity, and the first degree of matching is determined as the matching degree 3132 between the candidate text 32 and the search text 31.
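The entity-level matching described with reference to fig. 3B could be organized as in the following sketch. The exact formulas for the initial matching degree and for combining it with the target intention confidence and the two importance values are not spelled out above, so the products and the average used here, the correspondence between first and second entities by equal entity text, and the summation of per-entity matching degrees are all illustrative assumptions.

```python
from typing import Dict

def first_matching_degree(
    second_entities: Dict[str, int],       # second entity -> importance (candidate text)
    first_entities: Dict[str, int],        # first entity -> importance (search text)
    keyword_weights: Dict[str, float],     # target keyword -> target keyword weight
    intent_confidences: Dict[str, float],  # intent information -> target intention confidence
    entity_intents: Dict[str, str],        # second entity -> matched intent information, if any
) -> float:
    """Accumulate the matching degree between each second entity and its corresponding
    first entity into the first matching degree (summation is an assumption)."""
    total = 0.0
    for second_entity, second_importance in second_entities.items():
        if second_entity not in first_entities:
            continue  # no corresponding first entity (correspondence by equal text is assumed)
        first_importance = first_entities[second_entity]
        # Initial matching degree: entity agreement scaled by the target keyword weight (assumed).
        initial = keyword_weights.get(second_entity, 1.0)
        # Target intention confidence of the intent information matched by this second entity.
        confidence = intent_confidences.get(entity_intents.get(second_entity, ""), 1.0)
        # Per-entity matching degree (assumed: product with the mean of the two importances).
        total += initial * confidence * (first_importance + second_importance) / 2.0
    return total

print(first_matching_degree(
    second_entities={"张三": 9, "北京": 6},
    first_entities={"张三": 9, "上海": 6},
    keyword_weights={"张三": 2.7},
    intent_confidences={"person_name": 0.92},
    entity_intents={"张三": "person_name"},
))  # only "张三" matches -> 2.7 * 0.92 * 9.0 ≈ 22.356
```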
In the embodiment of the disclosure, on one hand, by matching the plurality of entity information corresponding to the search text and the candidate text, comparison of multi-dimensional key information in the search text and the candidate text is realized, so that the accuracy of a search result is improved. On the other hand, in the matching process, the matching degree between the search text and the candidate text is adjusted by using the keyword weight, the intention confidence level, the importance degree of the first entity and the importance degree of the second entity, namely, the factors such as semantic association conditions and search purposes between the search text and the candidate text are considered, so that the accuracy of a search result is improved.
Fig. 4 is a block diagram of a text retrieval device according to an embodiment of the present disclosure.
As shown in fig. 4, the text retrieval apparatus 400 includes: the system comprises an acquisition module 410, a first parsing module 420, a second parsing module 430, a matching module 440 and a sorting module 450.
The obtaining module 410 is configured to obtain a plurality of candidate texts associated with the search text according to a plurality of keywords in the search text.
The first parsing module 420 is configured to parse the search text to obtain first feature information, second feature information, and third feature information corresponding to the search text.
The second parsing module 430 is configured to parse the plurality of candidate texts, respectively, to obtain candidate feature information corresponding to each of the plurality of candidate texts.
The matching module 440 is configured to determine, for each candidate text, a matching degree between the candidate text and the search text according to the first feature information, the second feature information, the third feature information, and the candidate feature information.
The ranking module 450 is configured to rank the plurality of candidate texts according to the matching degree, and obtain a search result corresponding to the search text based on the ranking result.
According to an embodiment of the present disclosure, the first parsing module 420 includes: part of speech recognition unit, intention classification unit and first entity identification unit. The part-of-speech recognition unit is used for performing part-of-speech recognition on a plurality of keywords in the search text to obtain part-of-speech recognition results and keyword weights of the keywords, and determining the part-of-speech recognition results and the keyword weights of the keywords as first characteristic information; the intention classifying unit is used for carrying out intention classification on the search text to obtain an intention classifying result and intention confidence corresponding to the search text, and determining the intention classifying result and the intention confidence as second characteristic information; the first entity recognition unit is used for carrying out entity recognition on the search text to obtain a first entity recognition result and a first importance recognition result related to the first entity recognition result, and determining the first entity recognition result and the first importance recognition result as third characteristic information; the first importance recognition result is used for representing the importance of each first entity in the first entity recognition result.
According to an embodiment of the present disclosure, the second parsing module 430 includes: the second entity identification unit and the first determination unit. The second entity recognition unit is used for carrying out entity recognition on the candidate texts aiming at each candidate text to obtain a second entity recognition result and a second importance recognition result related to the second entity recognition result; the second importance recognition result is used for representing the importance of each second entity in the second entity recognition result; and the first determining unit is used for determining the second entity identification result and the second importance identification result as candidate feature information.
According to an embodiment of the present disclosure, the matching module 440 includes: the second determination unit, the third determination unit, the fourth determination unit, and the fifth determination unit. The second determining unit is used for determining, for each second entity, a target keyword weight corresponding to a target keyword matched with the second entity in the plurality of keywords and a target intention confidence corresponding to intention information matched with the second entity in the intention classification result; the third determining unit is used for determining the initial matching degree between the second entity and the corresponding first entity according to the second entity, the first entity corresponding to the second entity and the target keyword weight in the first entity identification result; the fourth determining unit is used for determining the matching degree between the second entity and the corresponding first entity according to the initial matching degree, the target intention confidence degree, the importance corresponding to the second entity and the importance corresponding to the first entity; and a fifth determining unit for determining the matching degree between the candidate text and the retrieval text according to the matching degree between each second entity and the corresponding first entity.
According to an embodiment of the present disclosure, the second parsing module 430 further includes: a sixth determination unit and a seventh determination unit. The sixth determining unit is used for determining attribute characteristic information according to the updating time of the candidate texts and the number of texts associated with the candidate texts for each candidate text; and a seventh determining unit configured to determine the attribute feature information, the second entity recognition result, and the second importance recognition result as candidate feature information.
According to an embodiment of the present disclosure, the matching module 440 further includes: eighth and ninth determination units. The eighth determining unit is used for determining attribute matching degree according to the attribute characteristic information; and a ninth determining unit for determining the matching degree between the candidate text and the search text according to the attribute matching degree and the matching degree between each second entity and the corresponding first entity.
According to an embodiment of the present disclosure, the text retrieval apparatus 400 further includes: and the processing module is used for carrying out word segmentation processing on the search text to obtain a plurality of keywords in the search text.
It should be noted that, in the embodiment of the apparatus portion, the implementation manner, the solved technical problem, the realized function, and the achieved technical effect of each module/unit/subunit and the like are the same as or similar to the implementation manner, the solved technical problem, the realized function, and the achieved technical effect of each corresponding step in the embodiment of the method portion, and are not described herein again.
In the technical scheme of the disclosure, the related data (such as including but not limited to personal information of a user) are collected, stored, used, processed, transmitted, provided, disclosed, applied and the like, and all meet the requirements of related laws and regulations without violating the public welfare.
In the technical scheme of the disclosure, the authorization or consent of the data owner is acquired before the related data is acquired or collected.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as in an embodiment of the present disclosure.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as in an embodiment of the present disclosure.
According to an embodiment of the present disclosure, a computer program product comprising a computer program which, when executed by a processor, implements a method as an embodiment of the present disclosure.
Fig. 5 is a block diagram of an electronic device for implementing a text retrieval method of an embodiment of the present disclosure.
Fig. 5 illustrates a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the apparatus 500 includes a computing unit 501 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as a text retrieval method. For example, in some embodiments, the text retrieval method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of the text retrieval method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the text retrieval method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (15)

1. A text retrieval method comprising:
acquiring a plurality of candidate texts associated with a search text according to a plurality of keywords in the search text;
parsing the search text to obtain first feature information, second feature information and third feature information corresponding to the search text; the first feature information comprises part-of-speech recognition results and keyword weights of the plurality of keywords; the second feature information comprises an intention classification result and an intention confidence corresponding to the search text; the third feature information comprises a first entity recognition result of the search text and a first importance recognition result associated with the first entity recognition result; the first entity recognition result comprises at least one first entity;
parsing the plurality of candidate texts respectively to obtain candidate feature information corresponding to each of the plurality of candidate texts; the candidate feature information comprises a second entity recognition result of the candidate text and a second importance recognition result associated with the second entity recognition result; the second entity recognition result comprises at least one second entity; for each candidate text, determining a matching degree between the candidate text and the search text according to the first feature information, the second feature information, the third feature information and the candidate feature information; and
ranking the plurality of candidate texts according to the matching degrees, and obtaining a retrieval result corresponding to the search text based on the ranking result;
wherein determining the matching degree between the candidate text and the search text comprises:
for each second entity, determining a target keyword weight corresponding to a target keyword, among the plurality of keywords, that matches the second entity, and a target intention confidence corresponding to intention information, in the intention classification result, that matches the second entity;
determining an initial matching degree between the second entity and the corresponding first entity according to the second entity, the first entity corresponding to the second entity in the first entity recognition result, and the target keyword weight;
determining a matching degree between the second entity and the corresponding first entity according to the initial matching degree, the target intention confidence, the importance corresponding to the second entity and the importance corresponding to the first entity; and
determining the matching degree between the candidate text and the search text according to the matching degree between each second entity and the corresponding first entity.
2. The method of claim 1, wherein parsing the search text to obtain the first feature information, the second feature information and the third feature information corresponding to the search text comprises:
performing part-of-speech recognition on the plurality of keywords in the search text to obtain the part-of-speech recognition results and the keyword weights of the plurality of keywords, and determining the part-of-speech recognition results and the keyword weights as the first feature information;
performing intention classification on the search text to obtain the intention classification result and the intention confidence corresponding to the search text, and determining the intention classification result and the intention confidence as the second feature information; and
performing entity recognition on the search text to obtain the first entity recognition result and the first importance recognition result associated with the first entity recognition result, and determining the first entity recognition result and the first importance recognition result as the third feature information; the first importance recognition result represents the importance of each first entity in the first entity recognition result.
3. The method of claim 2, wherein parsing the plurality of candidate texts to obtain the candidate feature information corresponding to each of the plurality of candidate texts comprises:
performing entity recognition on each candidate text to obtain the second entity recognition result and the second importance recognition result associated with the second entity recognition result; the second importance recognition result represents the importance of each second entity in the second entity recognition result; and
determining the second entity recognition result and the second importance recognition result as the candidate feature information.
4. The method of claim 3, wherein parsing the plurality of candidate texts to obtain the candidate feature information corresponding to each of the plurality of candidate texts further comprises:
for each candidate text, determining attribute feature information according to an update time of the candidate text and the number of texts associated with the candidate text; and
determining the attribute feature information, the second entity recognition result and the second importance recognition result as the candidate feature information.
5. The method of claim 4, wherein determining the matching degree between the candidate text and the search text according to the first feature information, the second feature information, the third feature information and the candidate feature information further comprises:
determining an attribute matching degree according to the attribute feature information; and
determining the matching degree between the candidate text and the search text according to the attribute matching degree and the matching degree between each second entity and the corresponding first entity.
6. The method of any one of claims 1 to 5, further comprising:
performing word segmentation processing on the search text to obtain the plurality of keywords in the search text.
7. A text retrieval apparatus comprising:
an acquisition module configured to acquire a plurality of candidate texts associated with a search text according to a plurality of keywords in the search text;
a first parsing module configured to parse the search text to obtain first feature information, second feature information and third feature information corresponding to the search text; the first feature information comprises part-of-speech recognition results and keyword weights of the plurality of keywords; the second feature information comprises an intention classification result and an intention confidence corresponding to the search text; the third feature information comprises a first entity recognition result of the search text and a first importance recognition result associated with the first entity recognition result; the first entity recognition result comprises at least one first entity;
a second parsing module configured to parse the plurality of candidate texts respectively to obtain candidate feature information corresponding to each of the plurality of candidate texts; the candidate feature information comprises a second entity recognition result of the candidate text and a second importance recognition result associated with the second entity recognition result; the second entity recognition result comprises at least one second entity;
a matching module configured to determine, for each candidate text, a matching degree between the candidate text and the search text according to the first feature information, the second feature information, the third feature information and the candidate feature information; and
a ranking module configured to rank the plurality of candidate texts according to the matching degrees and obtain a retrieval result corresponding to the search text based on the ranking result;
wherein the matching module comprises:
a second determining unit configured to determine, for each second entity, a target keyword weight corresponding to a target keyword, among the plurality of keywords, that matches the second entity, and a target intention confidence corresponding to intention information, in the intention classification result, that matches the second entity;
a third determining unit configured to determine an initial matching degree between the second entity and the corresponding first entity according to the second entity, the first entity corresponding to the second entity in the first entity recognition result, and the target keyword weight;
a fourth determining unit configured to determine a matching degree between the second entity and the corresponding first entity according to the initial matching degree, the target intention confidence, the importance corresponding to the second entity, and the importance corresponding to the first entity; and
a fifth determining unit configured to determine the matching degree between the candidate text and the search text according to the matching degree between each second entity and the corresponding first entity.
8. The apparatus of claim 7, wherein the first parsing module comprises:
a part-of-speech recognition unit configured to perform part-of-speech recognition on the plurality of keywords in the search text to obtain the part-of-speech recognition results and the keyword weights of the plurality of keywords, and determine the part-of-speech recognition results and the keyword weights as the first feature information;
an intention classification unit configured to perform intention classification on the search text to obtain the intention classification result and the intention confidence corresponding to the search text, and determine the intention classification result and the intention confidence as the second feature information; and
a first entity recognition unit configured to perform entity recognition on the search text to obtain the first entity recognition result and the first importance recognition result associated with the first entity recognition result, and determine the first entity recognition result and the first importance recognition result as the third feature information; the first importance recognition result represents the importance of each first entity in the first entity recognition result.
9. The apparatus of claim 8, wherein the second parsing module comprises:
a second entity recognition unit configured to perform entity recognition on each candidate text to obtain the second entity recognition result and the second importance recognition result associated with the second entity recognition result; the second importance recognition result represents the importance of each second entity in the second entity recognition result; and
a first determining unit configured to determine the second entity recognition result and the second importance recognition result as the candidate feature information.
10. The apparatus of claim 9, wherein the second parsing module further comprises:
a sixth determining unit configured to determine, for each candidate text, attribute feature information according to an update time of the candidate text and the number of texts associated with the candidate text; and
a seventh determining unit configured to determine the attribute feature information, the second entity recognition result and the second importance recognition result as the candidate feature information.
11. The apparatus of claim 10, wherein the matching module further comprises:
an eighth determining unit configured to determine an attribute matching degree according to the attribute feature information; and
a ninth determining unit configured to determine the matching degree between the candidate text and the search text according to the attribute matching degree and the matching degree between each second entity and the corresponding first entity.
12. The apparatus of any of claims 7 to 11, further comprising:
a processing module configured to perform word segmentation processing on the search text to obtain the plurality of keywords in the search text.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
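For readers who find the claim language dense, the following sketches restate the claimed steps as illustrative Python. The patent does not disclose concrete formulas, models, or data structures, so everything below, including every class, function name, and weighting choice, is an assumption made for illustration only and not the claimed implementation.

This first sketch mirrors the per-entity matching steps of claim 1: look up the target keyword weight and target intention confidence for a second entity, derive an initial matching degree from the entity pair and the keyword weight, fold in the intention confidence and both importances, and aggregate over the second entities of a candidate text. The lexical similarity stand-in and the multiplicative combination are assumed.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class EntityMention:
    text: str          # surface form of the entity
    importance: float  # importance from the importance recognition result, in [0, 1]


def string_similarity(a: str, b: str) -> float:
    """Crude lexical similarity standing in for whatever entity matcher is actually used."""
    return SequenceMatcher(None, a, b).ratio()


def entity_match_degree(second: EntityMention,
                        first: EntityMention,
                        keyword_weights: dict[str, float],
                        intent_confidences: dict[str, float]) -> float:
    """Mirror the four determining steps of claim 1 for one second entity.

    1) target keyword weight: weight of the query keyword matching the second entity;
    2) target intention confidence: confidence of the intent matching the second entity;
    3) initial matching degree from the entity pair and the target keyword weight;
    4) final matching degree folding in the intent confidence and both importances.
    """
    target_kw_weight = max(
        (w for kw, w in keyword_weights.items() if kw in second.text), default=0.0)
    target_intent_conf = intent_confidences.get(second.text, 0.0)

    initial = string_similarity(second.text, first.text) * target_kw_weight
    return initial * (1.0 + target_intent_conf) * second.importance * first.importance


def text_match_degree(second_entities: list[EntityMention],
                      first_entities: list[EntityMention],
                      keyword_weights: dict[str, float],
                      intent_confidences: dict[str, float]) -> float:
    """Aggregate per-entity matching degrees into a candidate-level score (assumed sum)."""
    score = 0.0
    for second in second_entities:
        if not first_entities:
            continue
        # pair each second entity with its best-matching first entity (an assumption)
        first = max(first_entities, key=lambda f: string_similarity(second.text, f.text))
        score += entity_match_degree(second, first, keyword_weights, intent_confidences)
    return score
```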
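Claims 2, 3, and 6 describe deriving three bundles of feature information from the search text (part-of-speech results with keyword weights, an intention classification result with a confidence, and an entity recognition result with importances) plus a corresponding bundle from each candidate text. The sketch below only shows the shape of those bundles; the whitespace segmenter, the catch-all intent, and the capitalization-based entity heuristic are placeholders for whatever segmentation, part-of-speech, intent-classification, and entity-recognition components an actual implementation would use.

```python
from dataclasses import dataclass


@dataclass
class FirstFeatureInfo:        # part-of-speech results and keyword weights
    pos_tags: dict[str, str]
    keyword_weights: dict[str, float]


@dataclass
class SecondFeatureInfo:       # intention classification result and its confidence
    intent: str
    confidence: float


@dataclass
class ThirdFeatureInfo:        # entity recognition result and importance result
    entities: list[str]
    importance: dict[str, float]


def segment(text: str) -> list[str]:
    """Claim 6: word segmentation; a whitespace split stands in for a real segmenter."""
    return text.split()


def parse_search_text(text: str) -> tuple[FirstFeatureInfo, SecondFeatureInfo, ThirdFeatureInfo]:
    keywords = segment(text)
    # placeholder part-of-speech tags and uniform keyword weights
    first = FirstFeatureInfo(
        pos_tags={kw: "n" for kw in keywords},
        keyword_weights={kw: 1.0 / len(keywords) for kw in keywords},
    )
    # placeholder intent classifier: a single catch-all intent with a fixed confidence
    second = SecondFeatureInfo(intent="general_question", confidence=0.5)
    # placeholder entity recognizer: treat capitalised tokens as entities
    entities = [kw for kw in keywords if kw[:1].isupper()]
    third = ThirdFeatureInfo(entities=entities, importance={e: 1.0 for e in entities})
    return first, second, third


def parse_candidate_text(text: str) -> ThirdFeatureInfo:
    """Claim 3: each candidate text only needs entity recognition plus importance."""
    tokens = segment(text)
    entities = [t for t in tokens if t[:1].isupper()]
    return ThirdFeatureInfo(entities=entities, importance={e: 1.0 for e in entities})
```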
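Claims 4 and 5 add an attribute matching degree computed from each candidate's update time and its number of associated texts, then combine it with the entity-level matching degree. The exponential freshness decay, the log-scaled count, and the weighted sum below are illustrative assumptions, not the claimed scoring.

```python
import math
from datetime import datetime, timezone
from typing import Optional


def attribute_match_degree(update_time: datetime, associated_text_count: int,
                           now: Optional[datetime] = None) -> float:
    """Score the attribute feature information (freshness plus popularity).

    update_time is expected to be timezone-aware so the subtraction is well defined.
    """
    now = now or datetime.now(timezone.utc)
    age_days = max((now - update_time).days, 0)
    freshness = math.exp(-age_days / 365.0)         # assumed decay over roughly a year
    popularity = math.log1p(associated_text_count)  # assumed log-scaled association count
    return freshness + 0.1 * popularity


def final_match_degree(entity_degree: float, attribute_degree: float,
                       attribute_weight: float = 0.3) -> float:
    """Claim 5: combine entity-level and attribute-level degrees (assumed weighted sum)."""
    return entity_degree + attribute_weight * attribute_degree
```

In a pipeline such as the one in claim 1, each candidate's score from `final_match_degree` would then drive the ranking step, with the top-ranked candidates returned as the retrieval result.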
CN202310479153.8A 2023-04-28 2023-04-28 Text retrieval method, device, electronic equipment and medium Active CN116610782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310479153.8A CN116610782B (en) 2023-04-28 2023-04-28 Text retrieval method, device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310479153.8A CN116610782B (en) 2023-04-28 2023-04-28 Text retrieval method, device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN116610782A CN116610782A (en) 2023-08-18
CN116610782B true CN116610782B (en) 2024-03-15

Family

ID=87680936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310479153.8A Active CN116610782B (en) 2023-04-28 2023-04-28 Text retrieval method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116610782B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598445B (en) * 2013-11-01 2019-05-10 腾讯科技(深圳)有限公司 Automatically request-answering system and method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016179938A1 (en) * 2015-05-14 2016-11-17 百度在线网络技术(北京)有限公司 Method and device for question recommendation
WO2021051521A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Response information obtaining method and apparatus, computer device, and storage medium
CN111444320A (en) * 2020-06-16 2020-07-24 太平金融科技服务(上海)有限公司 Text retrieval method and device, computer equipment and storage medium
CN113326420A (en) * 2021-06-15 2021-08-31 北京百度网讯科技有限公司 Question retrieval method, device, electronic equipment and medium
CN115470313A (en) * 2022-08-12 2022-12-13 北京百度网讯科技有限公司 Information retrieval and model training method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN116610782A (en) 2023-08-18

Similar Documents

Publication Publication Date Title
WO2019091026A1 (en) Knowledge base document rapid search method, application server, and computer readable storage medium
CN106874441B (en) Intelligent question-answering method and device
CN113204621B (en) Document warehouse-in and document retrieval method, device, equipment and storage medium
CN114861889B (en) Deep learning model training method, target object detection method and device
CN114840671A (en) Dialogue generation method, model training method, device, equipment and medium
CN113326420B (en) Question retrieval method, device, electronic equipment and medium
CN113836314B (en) Knowledge graph construction method, device, equipment and storage medium
CN107885717B (en) Keyword extraction method and device
CN113806588B (en) Method and device for searching video
CN113660541B (en) Method and device for generating abstract of news video
CN113239295A (en) Search method, search device, electronic equipment and storage medium
CN110245357B (en) Main entity identification method and device
CN113806660B (en) Data evaluation method, training device, electronic equipment and storage medium
CN113609847B (en) Information extraction method, device, electronic equipment and storage medium
CN112506864B (en) File retrieval method, device, electronic equipment and readable storage medium
CN114116997A (en) Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium
CN112579729A (en) Training method and device for document quality evaluation model, electronic equipment and medium
CN116597443A (en) Material tag processing method and device, electronic equipment and medium
CN112926297B (en) Method, apparatus, device and storage medium for processing information
CN116610782B (en) Text retrieval method, device, electronic equipment and medium
CN113722593B (en) Event data processing method, device, electronic equipment and medium
CN114048315A (en) Method and device for determining document tag, electronic equipment and storage medium
CN112784600A (en) Information sorting method and device, electronic equipment and storage medium
CN112528644A (en) Entity mounting method, device, equipment and storage medium
CN113377921B (en) Method, device, electronic equipment and medium for matching information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant