CN116340500A - Information retrieval method and device, computing device, storage medium and program product - Google Patents

Information retrieval method and device, computing device, storage medium and program product

Info

Publication number
CN116340500A
CN116340500A
Authority
CN
China
Prior art keywords
information
text information
semantic
text
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310328779.9A
Other languages
Chinese (zh)
Inventor
王寒
石智中
梁霄
雷涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China International Financial Ltd By Share Ltd
Original Assignee
China International Financial Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China International Financial Ltd By Share Ltd filed Critical China International Financial Ltd By Share Ltd
Priority to CN202310328779.9A
Publication of CN116340500A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an information retrieval method, comprising the following steps: acquiring first text information and a plurality of second text information; determining the semantic similarity of the first text information to each of the plurality of second text information; selecting at least one text information to be retrieved from the plurality of second text information according to the semantic similarity of the first text information to each of the plurality of second text information; extracting at least one third text information semantically related to the first text information from the at least one text information to be retrieved, wherein the at least one third text information corresponds one-to-one with the at least one text information to be retrieved; and determining an information retrieval result according to the at least one third text information.

Description

Information retrieval method and device, computing device, storage medium and program product
Technical Field
The present application relates to the field of computer technology, and in particular, to an information retrieval method and apparatus, a computing device, a computer readable storage medium, and a computer program product.
Background
With the development of the Internet and the continuous growth of network information, more and more information can be retrieved from the Internet through search engines, and search results are characterized by massive data, diversified forms, comprehensive coverage, and the like. On the one hand, this increases the likelihood that a user finds relevant results; on the other hand, it makes it difficult for the user to quickly and accurately locate the required information. For example, the user needs to decide which web page to view by combining information such as the page title, text abstract, and page link, and then extract the required answer by himself. Therefore, existing search engine technology lacks a deep question-answering function, cannot directly provide answers to users' questions, and presents search results poorly.
As user expectations for search engines have increased (e.g., a transition from basic related-web-page recall to intelligent question answering), information retrieval techniques based on machine reading comprehension have evolved. How to use such techniques to help users find satisfactory answers has become a classical topic in natural language processing and information retrieval research. However, related-art retrieval methods based on machine reading comprehension have the following problems. First, retrieval based on keyword or character-string matching can only find articles containing the same characters and cannot find information that is worded differently but semantically identical; this easily causes the omission of important information resources highly relevant to the query, so the breadth of the retrieval results is limited and their accuracy is low. Second, retrieval based on keyword extraction and comparison relies on relatively complex preset rules, which leads to a large amount of computation and work and low efficiency; moreover, keywords cannot completely and accurately reflect the characteristics of the whole query, so the accuracy of the retrieval results is also low.
Disclosure of Invention
In view of the above, the present application provides an information retrieval method and apparatus, a computing device, a computer-readable storage medium, and a computer program product, which are intended to mitigate or overcome some or all of the above-identified deficiencies and other possible disadvantages.
According to a first aspect of the present application, there is provided an information retrieval method, comprising: acquiring first text information and a plurality of second text information; determining the semantic similarity of the first text information to each of the plurality of second text information; selecting at least one text information to be retrieved from the plurality of second text information according to the semantic similarity of the first text information to each of the plurality of second text information; extracting at least one third text information semantically related to the first text information from the at least one text information to be retrieved, wherein the at least one third text information corresponds one-to-one with the at least one text information to be retrieved; and determining an information retrieval result according to the at least one third text information.
In an information retrieval method according to some embodiments of the present application, determining a semantic similarity of the first text information and each of the plurality of second text information includes: acquiring a first semantic feature vector corresponding to the first text information and a plurality of second semantic feature vectors corresponding to the plurality of second text information respectively; calculating the similarity between the first semantic feature vector and each of the plurality of second semantic feature vectors; and determining the semantic similarity of the first text information and each second text information in the plurality of second text information according to the similarity of the first semantic feature vector and each second semantic feature vector.
In an information retrieval method according to some embodiments of the present application, calculating the similarity of the first semantic feature vector to each of the plurality of second semantic feature vectors includes: calculating a first similarity of the first semantic feature vector to each of the plurality of second semantic feature vectors based on the distance between each second semantic feature vector and the first semantic feature vector; calculating a second similarity of the first semantic feature vector to each of the plurality of second semantic feature vectors based on the cosine of the included angle between each second semantic feature vector and the first semantic feature vector; and determining the similarity of the first semantic feature vector to each of the plurality of second semantic feature vectors based on at least one of the first similarity and the second similarity.
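For illustration only, the two similarity measures above can be sketched in Python as follows. The mapping of the distance to a similarity score and the equal weighting of the two measures are assumptions for the sketch; the embodiments do not fix a particular formula:

```python
import math

def distance_similarity(u, v):
    # First similarity: based on the distance between the second semantic
    # feature vector and the first semantic feature vector. Euclidean
    # distance mapped into (0, 1] is one illustrative choice.
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + d)

def cosine_similarity(u, v):
    # Second similarity: cosine of the included angle between the vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_similarity(u, v, weight=0.5):
    # Combine the two measures; a weighted average is one way to determine
    # the similarity "based on at least one of" the two similarities.
    return weight * distance_similarity(u, v) + (1 - weight) * cosine_similarity(u, v)
```

Identical vectors yield a similarity of 1 under both measures, and orthogonal vectors yield a cosine similarity of 0.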
In the information retrieval method according to some embodiments of the present application, obtaining a first semantic feature vector corresponding to the first text information and a plurality of second semantic feature vectors respectively corresponding to the plurality of second text information includes: determining the first semantic feature vector corresponding to the first text information by using a semantic understanding model; and obtaining the plurality of second semantic feature vectors corresponding to the plurality of second text information from a preset semantic feature vector index library, wherein the plurality of second semantic feature vectors, determined in advance using the semantic understanding model, are stored in the preset semantic feature vector index library.
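As a minimal illustration of the preset semantic feature vector index library, the second semantic feature vectors can be computed once and stored so that only the query needs encoding at retrieval time. The class and method names below are assumptions for the sketch; a production system would typically use an approximate nearest-neighbour index, which the embodiments do not prescribe:

```python
class VectorIndex:
    """Minimal stand-in for the preset semantic feature vector index library."""

    def __init__(self, encode):
        # encode: stands in for the semantic understanding model, here any
        # callable mapping text to a semantic feature vector.
        self._encode = encode
        self._store = {}

    def add(self, text_id, text):
        # Precompute and store the second semantic feature vector.
        self._store[text_id] = self._encode(text)

    def get(self, text_id):
        return self._store[text_id]

    def all_vectors(self):
        return dict(self._store)
```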
In an information retrieval method according to some embodiments of the present application, extracting at least one third text information semantically related to the first text information from the at least one text information to be retrieved includes, for each text information to be retrieved in the at least one text information to be retrieved, executing the following steps: inputting the first text information and the text information to be retrieved into a reading understanding model; determining, using the reading understanding model, a first probability and a second probability corresponding to each word segment in the text information to be retrieved, wherein the first probability corresponding to each word segment indicates the probability that the word segment is the beginning word segment of the third text information semantically related to the first text information, and the second probability corresponding to each word segment indicates the probability that the word segment is the ending word segment of the third text information; determining a beginning word segment and an ending word segment of the third text information from the word segments of the text information to be retrieved according to the first probability and the second probability corresponding to each word segment; and extracting the third text information from the text information to be retrieved according to the determined beginning word segment and ending word segment.
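The span extraction above can be sketched as follows, with the model outputs mocked as per-token probability lists (the actual reading understanding model is not reproduced here; scoring a span by the product of its start and end probabilities is a common convention, assumed for the sketch):

```python
def extract_span(tokens, start_probs, end_probs):
    # Score every candidate span (i, j) with i <= j by the product of the
    # first probability of its beginning word segment and the second
    # probability of its ending word segment; keep the best-scoring span.
    best_score, best_i, best_j = -1.0, 0, 0
    for i in range(len(tokens)):
        for j in range(i, len(tokens)):
            score = start_probs[i] * end_probs[j]
            if score > best_score:
                best_score, best_i, best_j = score, i, j
    return tokens[best_i:best_j + 1], best_score

# Hypothetical per-token model outputs for a short text to be retrieved.
tokens = ["the", "answer", "is", "42", "."]
start_probs = [0.05, 0.10, 0.05, 0.75, 0.05]
end_probs = [0.05, 0.05, 0.10, 0.70, 0.10]
span, score = extract_span(tokens, start_probs, end_probs)  # span == ["42"]
```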
In an information retrieval method according to some embodiments of the present application, determining an information retrieval result according to the at least one third text information includes: for each third text information in the at least one third text information, determining the retrieval matching degree of the third text information according to at least one of the first probability corresponding to its beginning word segment and the second probability corresponding to its ending word segment; sorting the at least one third text information according to the retrieval matching degree of each third text information; and determining the information retrieval result according to the sorting of the at least one third text information.
In the information retrieval method according to some embodiments of the present application, for each third text information in the at least one third text information, determining the retrieval matching degree of the third text information according to at least one of the first probability corresponding to the beginning word segment and the second probability corresponding to the ending word segment includes determining the retrieval matching degree of the third text information based on at least one of the following values: the arithmetic mean of the first probability corresponding to the beginning word segment and the second probability corresponding to the ending word segment; the geometric mean of the two probabilities; the maximum of the two probabilities; and the minimum of the two probabilities.
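The four candidate values listed above can be computed directly; which one (or which combination) serves as the retrieval matching degree is a design choice the embodiments leave open. A minimal sketch:

```python
import math

def matching_degrees(p_start, p_end):
    # p_start: first probability of the beginning word segment;
    # p_end: second probability of the ending word segment.
    return {
        "arithmetic_mean": (p_start + p_end) / 2,
        "geometric_mean": math.sqrt(p_start * p_end),
        "maximum": max(p_start, p_end),
        "minimum": min(p_start, p_end),
    }
```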
In an information retrieval method according to some embodiments of the present application, a training process of the semantic understanding model includes: acquiring a plurality of sample information pairs for training the semantic understanding model and their corresponding semantic similarity labels, wherein each sample information pair includes a first text information sample and a second text information sample, and the semantic similarity label corresponding to each sample information pair indicates a preset semantic similarity of the first text information sample and the second text information sample in that pair; inputting the plurality of sample information pairs respectively into the semantic understanding model to obtain, for each sample information pair, a third semantic feature vector corresponding to its first text information sample and a fourth semantic feature vector corresponding to its second text information sample; for each of the plurality of sample information pairs, determining a predicted semantic similarity corresponding to the sample information pair based on the similarity of the third semantic feature vector and the fourth semantic feature vector; determining the semantic loss of the semantic understanding model based on the predicted semantic similarity and the semantic similarity label corresponding to each of the plurality of sample information pairs; and iteratively updating parameters of the semantic understanding model based on the semantic loss until the semantic loss meets a preset condition, to obtain a pre-trained semantic understanding model.
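A minimal sketch of the loss computation in this training process, with the semantic understanding model mocked as a callable and mean squared error assumed as the form of the semantic loss (the embodiments do not fix a particular loss function):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_loss(sample_pairs, labels, encode):
    # encode maps a text sample to its semantic feature vector (the third
    # and fourth vectors in the text). The predicted semantic similarity of
    # each pair is compared against its semantic similarity label.
    total = 0.0
    for (first_sample, second_sample), label in zip(sample_pairs, labels):
        predicted = cosine(encode(first_sample), encode(second_sample))
        total += (predicted - label) ** 2
    return total / len(sample_pairs)
```

In training, this loss would be minimized by iteratively updating the model parameters until the preset condition is met.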
In the information retrieval method according to some embodiments of the present application, the training process of the semantic understanding model further includes: inputting the plurality of sample information pairs respectively into the pre-trained semantic understanding model to obtain, for each of the plurality of sample information pairs, a fifth semantic feature vector corresponding to its first text information sample and a sixth semantic feature vector corresponding to its second text information sample; for each of the plurality of sample information pairs, determining the semantic similarity corresponding to the sample information pair based on the similarity of the fifth semantic feature vector and the sixth semantic feature vector; sorting the plurality of sample information pairs in descending order of their corresponding semantic similarities; selecting the first N sample information pairs from the sorted result and inputting them into the reading understanding model, wherein N is a preset positive integer; for each of the N sample information pairs, extracting, using the reading understanding model, predicted third text information semantically related to the first text information sample from the second text information sample; determining, from the N sample information pairs, at least one difficult negative example sample and at least one positive example sample for training the semantic understanding model according to the predicted third text information and the second text information samples; and training the pre-trained semantic understanding model using the at least one difficult negative example sample and the at least one positive example sample.
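This hard-negative mining step can be sketched as follows. The selection rule used here, treating a top-ranked pair whose extracted answer does not match a reference answer as a difficult negative example, is an assumption consistent with, but not dictated by, the embodiments; `predict_answer` stands in for the reading understanding model:

```python
def mine_hard_negatives(pairs, similarities, predict_answer, reference_answers, n):
    # Sort sample information pairs in descending order of predicted
    # semantic similarity and keep the top n candidates.
    ranked = sorted(zip(pairs, similarities), key=lambda x: x[1], reverse=True)
    top = [pair for pair, _ in ranked[:n]]
    positives, hard_negatives = [], []
    for first_sample, second_sample in top:
        predicted = predict_answer(first_sample, second_sample)
        # Assumed rule: if the extracted answer matches the reference answer
        # for this pair, keep it as a positive example; otherwise it is a
        # high-similarity pair with a wrong answer, i.e. a difficult negative.
        if predicted == reference_answers.get((first_sample, second_sample)):
            positives.append((first_sample, second_sample))
        else:
            hard_negatives.append((first_sample, second_sample))
    return positives, hard_negatives
```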
In an information retrieval method according to some embodiments of the present application, selecting at least one text information to be retrieved from the plurality of second text information according to the semantic similarity of the first text information to each of the plurality of second text information includes: sorting the plurality of second text information in descending order of their semantic similarity to the first text information; and selecting the first M second text information from the sorted result as M text information to be retrieved, wherein M is a preset positive integer.
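This top-M selection reduces to a descending sort followed by a slice, e.g.:

```python
def select_candidates(second_texts, similarities, m):
    # Sort the second text information in descending order of semantic
    # similarity to the first text information, then keep the first M items.
    ranked = sorted(zip(second_texts, similarities), key=lambda x: x[1], reverse=True)
    return [text for text, _ in ranked[:m]]
```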
In an information retrieval method according to some embodiments of the present application, the first text information includes question information to be retrieved, and the third text information includes answer information corresponding to the question information to be retrieved.
According to a second aspect of the present application, there is provided an information retrieval apparatus comprising: an acquisition module configured to acquire first text information and a plurality of second text information; a first determining module configured to determine the semantic similarity of the first text information to each of the plurality of second text information; a selecting module configured to select at least one text information to be retrieved from the plurality of second text information according to the semantic similarity of the first text information to each of the plurality of second text information; an extraction module configured to extract at least one third text information semantically related to the first text information from the at least one text information to be retrieved, the at least one third text information corresponding one-to-one with the at least one text information to be retrieved; and a second determining module configured to determine an information retrieval result according to the at least one third text information.
According to a third aspect of the present application there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of an information retrieval method according to some embodiments of the present application.
According to a fourth aspect of the present application, there is provided a computer readable storage medium having stored thereon computer readable instructions which, when executed, implement an information retrieval method according to some embodiments of the present application.
According to a fifth aspect of the present application, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement an information retrieval method according to some embodiments of the present application.
In the information retrieval method and apparatus according to some embodiments of the present application, advanced information retrieval tasks such as intelligent question answering are completed efficiently and accurately by means of a two-stage screening or retrieval process based on semantic similarity and semantic relevance. First, the semantic similarity (rather than mere literal matching) between the first text information (i.e. the retrieval object, such as a question to be answered) and a large amount of second text information (such as articles from which an answer to the question may be obtained) is determined, so as to screen out a relatively small amount of text information to be retrieved (such as several articles with high semantic similarity to the question). This addresses both the loss of important retrieval resources caused by exact keyword matching and the low efficiency caused by complex procedures and huge computation in the related art, and markedly improves overall efficiency while ensuring that important information resources highly relevant to the query are not lost (i.e. ensuring retrieval breadth and accuracy). Second, for each text information to be retrieved obtained in the first stage, third text information corresponding to the first text information (i.e. a candidate answer to the question) is extracted based on semantic relevance (for example, by a machine reading understanding model), and the final retrieval result is determined from the third text information, so that the semantic features of the query are exploited once more for answer extraction, further ensuring higher accuracy of the candidate answers and of the final retrieval result.
These and other advantages of the present application will become apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Embodiments of the present application will now be described in more detail and with reference to the accompanying drawings, in which:
FIG. 1 illustrates an exemplary application scenario of an information retrieval method according to some embodiments of the present application;
FIG. 2 is a flow chart of an information retrieval method according to some embodiments of the present application;
FIG. 3 is a flow chart of determining semantic similarity of a first text message to each of a plurality of second text messages according to some embodiments of the present application;
FIG. 4 is a schematic diagram of determining semantic feature vectors according to some embodiments of the present application;
FIG. 5 is a flow chart of extracting at least one third text message semantically related to the first text message from at least one text message to be retrieved according to some embodiments of the present application;
FIG. 6 is a schematic diagram of extracting third text information using a reading understanding model according to some embodiments of the present application;
FIG. 7 is a flow chart of training a semantic understanding model according to some embodiments of the present application;
FIG. 8 is a schematic diagram of training a semantic understanding model according to some embodiments of the present application;
FIG. 9 is a flow chart of further training of the semantic understanding model according to some embodiments of the present application;
FIG. 10 is a schematic diagram of a complete process of an information retrieval method according to some embodiments of the present application;
FIG. 11 is an exemplary block diagram of an information retrieval device according to further embodiments of the present application;
FIG. 12 illustrates an example system including an example computing device that represents one or more systems and/or devices that can implement the various methods described herein.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present application. One skilled in the relevant art will recognize, however, that the aspects of the application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used only to distinguish one component from another. Thus, a first component discussed below could be termed a second component without departing from the teachings of the present application. As used herein, the term "and/or" and similar terms include any and all combinations of one or more of the associated listed items.
Those skilled in the art will appreciate that the drawings are schematic representations of example embodiments, and that the modules or flows in the drawings are not necessarily required to practice the present application, and therefore, should not be taken to limit the scope of the present application.
Before describing embodiments of the present application in detail, some related concepts will be explained first for clarity.
How to use text retrieval and machine reading comprehension technology to help users find satisfactory answers is a classical topic in natural language processing and information retrieval research. Text retrieval, as a sub-field of information retrieval, enables a machine to retrieve, from massive Internet texts, the related texts a user requires. Question-answer retrieval is an advanced form of information retrieval system that can answer a user's natural-language question accurately and concisely; it is a new generation of search engine integrating natural language processing and information retrieval technology, whose purpose is to provide a powerful information acquisition tool. Machine reading comprehension, as a sub-field of natural language understanding, aims to give machines the abilities of reading comprehension and question answering in natural language; it has long been a hot topic in academia and industry and is currently a core problem of intelligent voice and human-machine interaction. Machine reading comprehension lets a machine read natural language text like a human and then summarize and reason over it, so that questions about the read content can be answered accurately.
A recall neural network is a neural network whose function is to efficiently obtain, for a piece of input information, a set of candidate information related to that input from local information. The performance of the recall neural network plays a key role for the subsequent search stage and directly influences the final retrieval effect. If the relevant information cannot be recalled, i.e. the recall rate is low, an ideal retrieval result cannot be obtained no matter how well the subsequent search stage performs. Common recall neural networks include BERT-based networks such as BERT, RoBERTa, and ALBERT.
Fig. 1 illustrates an exemplary application scenario 100 of an information retrieval method according to some embodiments of the present application. The application scenario 100 may include a client 101, a network 102, a service unit 103, and a storage unit 104, where the service unit 103 is communicatively coupled to the client 101 via the network 102 and may communicate with the storage unit 104. Only one client 101 is shown in this embodiment, but this is not limiting, and multiple clients may be communicatively coupled to the service unit 103.
In this embodiment, the client 101 transmits the first text information to the service unit 103 through the network 102, and the service unit 103 acquires a plurality of second text information from the storage unit 104. The network 102 may be, for example, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a public telephone network, an intranet, and any other type of network known to those skilled in the art.
The service unit 103 then determines a semantic similarity of the first text information with each of the plurality of second text information. The service unit 103 performs preliminary screening of the plurality of second text information in this step.
The service unit 103 selects at least one text information to be retrieved from the plurality of second text information according to the semantic similarity of the first text information to each of the plurality of second text information. In this embodiment, the service unit 103 may, according to its computing capacity, select a preset number of second text information at a time from the plurality of second text information as the text information to be retrieved.
The service unit 103 then extracts at least one third text information semantically related to the first text information from the at least one text information to be retrieved, where the at least one third text information corresponds one-to-one to the at least one text information to be retrieved.
Finally, the service unit 103 determines an information retrieval result according to the at least one third text information. The service unit 103 may transmit the information retrieval result to the client 101.
In the application scenario shown in fig. 1, the information retrieval method according to some embodiments of the present application is implemented on the service unit 103, but this is merely illustrative and not restrictive: the method may also be implemented on other entities having sufficient computing resources and computing capabilities, for example on a client 101 that has them. Of course, it may also be implemented partly on the service unit 103 and partly on the client 101; this is not limiting.
As understood by those of ordinary skill in the art, the service unit 103 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms. The client and the server may be connected directly or indirectly through wired or wireless communication, which is not limited herein.
The client 101 may be any type of mobile computing device, including a mobile computer (e.g., a Personal Digital Assistant (PDA), laptop computer, notebook computer, tablet computer, netbook, etc.), a mobile phone (e.g., a cellular phone, smart phone, etc.), a wearable computing device (e.g., a smart watch or head-mounted device, including smart glasses, etc.), or another type of mobile device. In some embodiments, the client 101 may also be a stationary computing device, such as a desktop computer, a game console, a smart television, or the like. Further, where the application scenario 100 includes a plurality of clients 101, the plurality of clients 101 may be the same or different types of computing devices.
As shown in fig. 1, the client 101 may include a display screen and a terminal application that interacts with an end user via the display screen. The terminal application may be a local application, a Web page (Web) application, or an applet (Lite App, e.g., a cell phone applet or WeChat applet) as a lightweight application. In the case where the terminal application is a local application program that needs to be installed, it may be installed at the client 101. In the case where the terminal application is a Web application, it may be accessed through a browser. In the case where the terminal application is an applet, it may be opened directly on the client 101, without installation, by searching for related information of the terminal application (e.g., its name) or by scanning a graphic code of the terminal application (e.g., a bar code or two-dimensional code).
Fig. 2 is a flow chart of an information retrieval method according to some embodiments of the present application. The illustrated method 200 may be implemented on a service unit (e.g., the service unit 103 illustrated in fig. 1). In some embodiments, where the client 101 has sufficient computing resources and computing power, the information retrieval method according to some embodiments of the present application may be performed directly on the client 101. In other embodiments, it may also be performed by the service unit 103 and the client 101 in combination. As shown in fig. 2, an information retrieval method according to some embodiments of the present application may include steps S201-S205.
In step S201, first text information and a plurality of second text information are acquired. In some embodiments, the first text information may be text information sent by a user through the client in order to obtain semantically related information from the plurality of second text information; thus, the first text information may be obtained by receiving, from a terminal device or client, the question information the user wishes to retrieve. Alternatively, the first text information may be converted from other, non-text forms of information: for example, the content of voice or video information may be converted into text and used as the first text information. Similarly, the final search result may be converted into another form of information for presentation.
In step S202, the semantic similarity between the first text information and each of the plurality of second text information is determined. In this embodiment, the objects compared are the semantics of the first text information as a whole and the semantics of each second text information as a whole. The semantics of the first text information therefore cover the semantic information of every part of it (e.g., each word segment or character) and their interrelations, not just some of its words or keywords; likewise, the semantics of the second text information cover the semantic information of every part of it (e.g., sentences and words) and their interrelations, not just a single sentence or paragraph. In this way, the semantic similarity between the first and second text information represents how similar or close they are at the level of intrinsic meaning, rather than mere surface-level character consistency or matching, which can significantly improve both the breadth and the precision of the search.
Take Chinese as an example. Because Chinese has a large number of synonyms, an article may still be semantically similar or close to a search question even if it contains no words identical to the question's keywords; relying solely on keyword matching would therefore discard a large number of semantically similar articles, losing important search resources and limiting search breadth. On the other hand, because Chinese is rife with ambiguity, words with identical written forms can mean very different things in different contexts, so keyword matching cannot guarantee that the information actually matching the search question is found, making search precision hard to ensure. In contrast, the present application performs information retrieval through semantic comparison (i.e., computation of semantic similarity) between the first and second text information, and is thus not limited to literal consistency; this fundamentally eliminates the low search breadth and precision of the related art and efficiently realizes the preliminary screening of the plurality of second text information.
In step S203, at least one text information to be retrieved is selected from the plurality of second text information according to the semantic similarity between the first text information and each of them. In this embodiment, second text information with higher semantic similarity is more likely to be selected as text information to be retrieved, and the number selected may be preset. Only the selected text information to be retrieved forms the basis of the subsequent retrieval; the unselected second text information needs no further processing, which significantly improves retrieval efficiency, reduces the amount of computation, and shortens retrieval time.
In some embodiments, selecting at least one text information to be retrieved from the plurality of second text information according to the semantic similarity between the first text information and each of them comprises: first, ranking the plurality of second text information in descending order of semantic similarity to the first text information; then, selecting the first M second text information in the ranking as the M text information to be retrieved, where M is a preset positive integer. In this embodiment, only the top M of the massive second text information are selected as text information to be retrieved, so the less semantically similar second text information does not enter the subsequent retrieval process, which significantly reduces the amount of computation in information retrieval and improves retrieval efficiency.
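As an illustration, the top-M screening above can be sketched in a few lines; the function and variable names here are hypothetical, not taken from the patent:

```python
# Illustrative sketch of the top-M preliminary screening (names are
# hypothetical): rank the second texts by semantic similarity to the
# first text and keep only the first M for the subsequent stage.
def select_texts_to_retrieve(similarities, m):
    """similarities: list of (text_id, semantic_similarity) pairs."""
    ranked = sorted(similarities, key=lambda pair: pair[1], reverse=True)
    return [text_id for text_id, _ in ranked[:m]]

top = select_texts_to_retrieve([("doc1", 0.21), ("doc2", 0.93), ("doc3", 0.74)], 2)
```

In practice M trades breadth against cost: a larger M keeps more candidates but increases the number of calls to the reading stage.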
In step S204, at least one third text information semantically related to the first text information is extracted from the at least one text information to be retrieved, the at least one third text information being in one-to-one correspondence with the at least one text information to be retrieved. In some embodiments, step S204 may extract, from each text information to be retrieved, one third text information semantically related to the first text information, thereby forming at least one third text information corresponding to the respective text information to be retrieved. Semantic relevance may be understood as meaning that the third text information extracted from each text information to be retrieved is a candidate retrieval result for the first text information. For example, when the first text information is a question to be retrieved, the third text information may be the candidate answer extracted from the article to be retrieved that is most relevant to, or best matches, the question. In this step, the third text information may be at least a part of the corresponding text information to be retrieved, for example a sentence or a paragraph of the article. As for how the third text information is extracted, this may be implemented with a machine reading comprehension model (e.g., a BERT-based neural network model); for the detailed process see fig. 5 and its corresponding description.
In step S205, the information retrieval result is determined based on the at least one third text information. The result may be determined in various ways, and the method is not limited in this respect: for example, all of the at least one third text information may be presented as the retrieval result, or only part of it. Since the third text information is a portion extracted from the text information to be retrieved, more precise retrieval is achieved. In some application scenarios, the first text information comprises question information to be retrieved and the third text information comprises the corresponding answer information; in such scenarios this embodiment not only provides a plurality of semantically related third text information for the user's reference, but can also give a direct, accurate answer to the question to be retrieved, improving user experience.
In the information retrieval method according to some embodiments of the present application, advanced information retrieval tasks such as intelligent question answering are completed efficiently and accurately by a two-stage screening or retrieval process based on semantic similarity and semantic relevance. In the first stage, the semantic similarity (rather than a mere literal match) between the first text information (e.g., a question to be retrieved) and a large amount of second text information (i.e., the retrieval objects, such as articles from which an answer may be obtained) is computed, and a relatively small amount of text information to be retrieved (e.g., the articles most semantically similar to the question) is screened out. This addresses both the loss of important retrieval resources caused by exact keyword matching and the inefficiency caused by complex pipelines and huge computation in the related art, significantly improving overall efficiency while ensuring that retrieval resources highly relevant to the search question are not lost (i.e., ensuring search breadth and accuracy). In the second stage, for each text information to be retrieved obtained in the first stage, third text information corresponding to the first text information (i.e., a candidate answer to the question) is extracted based on semantic relevance (for example, by a machine reading comprehension model), and the final retrieval result is obtained from the third text information. The semantic features of the search question are thus exploited a second time, further ensuring the accuracy of the candidate answers and of the final retrieval result.
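The two-stage flow summarized above might be sketched as follows, with toy stand-ins for the semantic understanding model (`encode`, `similarity`) and the reading understanding model (`extract_answer`); all names are illustrative assumptions, not the patent's own implementation:

```python
# Hedged end-to-end sketch of the two-stage retrieval: stage 1 screens by
# semantic similarity, stage 2 extracts an answer from each survivor.
def retrieve(first_text, second_texts, encode, similarity, extract_answer, m):
    q_vec = encode(first_text)
    # Stage 1: rank all second texts by semantic similarity, keep top m.
    ranked = sorted(second_texts, key=lambda t: similarity(q_vec, encode(t)), reverse=True)
    to_retrieve = ranked[:m]
    # Stage 2: extract a semantically related third text from each survivor.
    return [extract_answer(first_text, t) for t in to_retrieve]

# Toy stand-ins so the sketch runs end to end.
toy_encode = lambda text: set(text.split())          # bag-of-words "vector"
toy_similarity = lambda u, v: len(u & v)             # overlap count
toy_extract = lambda question, text: text.split()[0] # first token as "answer"

results = retrieve("total revenue", ["total revenue is one million", "unrelated text"],
                   toy_encode, toy_similarity, toy_extract, 1)
```

A real system would replace the toys with the BERT-based recall network and the reading comprehension model described below.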
In some embodiments, step S202 can be elaborated further. Fig. 3 is a flowchart of determining the semantic similarity between the first text information and each of the plurality of second text information according to some embodiments of the present application, comprising steps S301-S303.
In step S301, a first semantic feature vector corresponding to the first text information and a plurality of second semantic feature vectors corresponding to the plurality of second text information are obtained. In this embodiment, the semantic similarity between the first text information and each of the plurality of second text information is determined based on a mathematical vector representation, so the semantic feature vectors of the first text information and of each second text information must be acquired first. In some embodiments, the acquisition or calculation of the semantic feature vectors or encodings (used for calculating semantic similarity) of the large amount of second text information (i.e., retrieval objects such as articles) may be performed offline in advance, since the retrieval objects may be relatively fixed, known data pre-stored in, for example, a server database. This further improves data processing efficiency and significantly optimizes network resource scheduling when the volume of articles to be retrieved is large.
In step S302, a similarity between the first semantic feature vector and each of the plurality of second semantic feature vectors is calculated. The similarity of semantic feature vectors can be expressed as either a similarity in direction between two vectors or a proximity in distance.
Specifically, in some embodiments, a first similarity between the first semantic feature vector and each of the plurality of second semantic feature vectors is calculated based on the distance between each second semantic feature vector and the first semantic feature vector. In addition, a second similarity may be calculated based on the cosine of the angle between each second semantic feature vector and the first semantic feature vector. Finally, the similarity between the first semantic feature vector and each second semantic feature vector is determined based on at least one of the first similarity and the second similarity. In this embodiment, when the distance between semantic feature vectors is calculated, a smaller distance indicates that the two vectors are more similar, and a larger distance that they are less similar. When the cosine of the angle between them is calculated, a smaller angle indicates that the two vectors are more similar, and a larger angle that they are less similar. Thus, following these rules, the similarity between the first semantic feature vector and each second semantic feature vector may also be determined as a weighted sum of the first similarity and the second similarity.
Whether one computes the distance between the plurality of second semantic feature vectors and the first semantic feature vector, or the cosine of the angle between them, both calculations are relatively simple. Since semantic feature vectors accurately represent semantics, these measures satisfy the requirement of accurate retrieval while improving retrieval efficiency and saving retrieval time; the two measures also complement each other, which is very beneficial in scenarios where massive second text information is stored.
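A minimal sketch of the two similarity measures of step S302 and their weighted combination, assuming the distance is mapped to a similarity so that a larger value always means more similar vectors (the helper names are hypothetical):

```python
import math

# Sketch of the two vector-similarity measures described above.
def distance_similarity(u, v):
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + d)  # smaller distance -> larger similarity

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def combined_similarity(u, v, weight=0.5):
    # Weighted sum of the first and second similarities, as one option.
    return weight * distance_similarity(u, v) + (1 - weight) * cosine_similarity(u, v)
```

For identical vectors both measures give 1.0; for orthogonal unit vectors the cosine term vanishes and only the distance term contributes.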
In step S303, the semantic similarity between the first text information and each of the plurality of second text information is determined according to the similarity between the first semantic feature vector and each second semantic feature vector. In this embodiment, that vector similarity may be taken directly as the semantic similarity, or it may be processed first, for example by increasing or decreasing the discrimination between similarity values or by a normalization operation; the present application is not limited in this respect.
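As one hypothetical example of the post-processing mentioned above (the patent does not mandate any particular operation), the vector similarities could be min-max normalized into [0, 1] before being used as semantic similarities:

```python
# Hypothetical post-processing sketch: min-max normalization of vector
# similarities, one possible instance of the "normalization operation"
# mentioned in step S303.
def min_max_normalize(scores):
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```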
In the embodiment shown in fig. 3, the semantics of the first text information and of each second text information are represented as mathematical vectors, so abstract semantics are converted into explicit numbers and semantic similarity into a numerical relationship between vectors. This greatly simplifies determining and comparing semantics; compared with designing a series of hand-crafted rules to judge the semantics of the first and second text information, it markedly improves both the accuracy of semantic determination and the efficiency of semantic comparison in information retrieval.
In an extended embodiment of step S301, the first semantic feature vector and the second semantic feature vector may be acquired under different conditions. Specifically, in some embodiments, obtaining a first semantic feature vector corresponding to the first text information and a plurality of second semantic feature vectors corresponding to the plurality of second text information respectively includes: determining a first semantic feature vector corresponding to the first text information by using a semantic understanding model; and then a plurality of second semantic feature vectors corresponding to the plurality of second text information are obtained from a preset semantic feature vector index library, wherein the preset semantic feature vector index library stores a plurality of second semantic feature vectors which are determined in advance by utilizing a semantic understanding model.
The semantic understanding model is an artificial intelligence technique and may be implemented with a recall neural network (for example, one based on the BERT neural network). Its input is the string sequence obtained by segmenting the text information into words; its specific function is to determine, through the computation of each network layer, the semantic feature vector of the input text, which is its output. In this embodiment, the first semantic feature vector corresponding to the first text information may be determined in real time, while the plurality of second semantic feature vectors corresponding to the plurality of second text information may be determined in advance, offline, using the semantic understanding model and stored in the semantic feature vector index library. After the first text information is obtained in real time, the stored second semantic feature vectors can then simply be fetched from the index library for the subsequent semantic similarity computation, with no need to recompute them each time, saving a large amount of retrieval time and improving retrieval efficiency.
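The offline/online split described above can be sketched as follows; the toy encoder stands in for the semantic understanding model, and the class and method names are assumptions for illustration only:

```python
# Hedged sketch of the offline/online split: second-text vectors are
# encoded once offline and cached in an index; at query time only the
# first text is encoded.
class SemanticVectorIndex:
    def __init__(self, encoder, second_texts):
        self.encoder = encoder
        # Offline pass: precompute and store every second-text vector.
        self._index = {text: encoder(text) for text in second_texts}

    def lookup_all(self):
        return self._index  # fetched at query time, no re-encoding

    def query_vector(self, first_text):
        return self.encoder(first_text)  # the only online encoding

toy_encoder = lambda text: [float(len(text))]  # stand-in for the BERT encoder
index = SemanticVectorIndex(toy_encoder, ["doc one", "document two"])
```

In production the dictionary would be replaced by a vector index suited to nearest-neighbor search over millions of entries.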
Fig. 4 is a schematic diagram of determining semantic feature vectors according to some embodiments of the present application. In this embodiment, a recall neural network is used to determine the semantic feature vector of the input text information. Taking the first text information as an example, it is first segmented into k word segments, where k is a positive integer. A global character [CLS] is added at the head of the sequence formed by the k word segments, and the result is input into the neural network. As an example, if the first text information is "Xiao Ming studies at school", word segmentation yields the string sequence {"Xiao Ming", "at", "school", "studies"}, and adding the global character [CLS] at the head of the sequence gives the input {"CLS", "Xiao Ming", "at", "school", "studies"}. After processing by each network layer of the recall neural network, the vector representation produced by the neural network unit corresponding to the global character [CLS] is the semantic feature vector of the first text information. Similarly, for each of the plurality of second text information, its semantic feature vector may be determined by the recall neural network.
In addition, a more specific embodiment of step S204 will now be described. Fig. 5 is a flow chart of extracting at least one third text information semantically related to the first text information from the at least one text information to be retrieved, according to some embodiments of the present application. In the embodiment shown in fig. 5, this extraction comprises performing steps S501-S504 for each of the at least one text information to be retrieved:
In step S501, the first text information and the text information to be retrieved are input into a reading understanding model. It should be noted that in this embodiment the first text information and the text information to be retrieved are input into the reading understanding model together, whereas in the semantic understanding model embodiments the first text information and each second text information are input into the semantic understanding model separately.
In step S502, a first probability and a second probability corresponding to each word segment in the text information to be retrieved are determined using the reading understanding model: the first probability of a word segment indicates the probability that it is the beginning word segment of third text information semantically related to the first text information, and the second probability indicates the probability that it is the ending word segment of that third text information. The reading understanding model is an artificial intelligence technique, which may be implemented with another BERT-based neural network; its input is the string sequence obtained by segmenting both the first text information and the text information to be retrieved. The output of the neural network is, for each word segment of the text information to be retrieved, its probability of being the beginning word segment and its probability of being the ending word segment. The specific function of this neural network is to determine, through the computation of each network layer, the semantically related third text information within the text information to be retrieved.
In step S503, the beginning and ending word segments of the third text information are determined from the word segments of the text information to be retrieved according to each word segment's first and second probabilities. The greater the first probability, the more likely the word segment is the beginning word segment of the third text information; the greater the second probability, the more likely it is the ending word segment. Therefore, the word segment with the highest first probability may be chosen as the beginning word segment, and among the word segments that follow it, the one with the highest second probability chosen as the ending word segment. Optionally, over all word-segment pairs in which one segment precedes the other, the sum of the first probability of the former and the second probability of the latter may be computed, and the pair with the largest sum selected as the beginning and ending word segments. The beginning and ending word segments may also be determined in other ways.
In step S504, the third text information is extracted from the text information to be retrieved according to the determined beginning and ending word segments: the word segments from the beginning word segment through the ending word segment together constitute the third text information of that text information to be retrieved.
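The optional pair-maximization strategy of step S503, followed by the slicing of step S504, might look like the following sketch (names are hypothetical; a real implementation would typically also cap the span length):

```python
# Sketch of span extraction: over all pairs with begin <= end, pick the
# pair maximizing start_probs[begin] + end_probs[end], then slice out the
# third text information.
def extract_span(tokens, start_probs, end_probs):
    best_pair, best_score = (0, 0), float("-inf")
    for i in range(len(tokens)):
        for j in range(i, len(tokens)):
            score = start_probs[i] + end_probs[j]
            if score > best_score:
                best_score, best_pair = score, (i, j)
    begin, end = best_pair
    return tokens[begin:end + 1]
```

The brute-force double loop is quadratic in the article length; production systems usually restrict j - i to a maximum answer length.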
In the embodiment shown in fig. 5, by analyzing and extracting the text information to be searched one by using the reading understanding model, a candidate answer corresponding to the first text information (such as a question to be searched) can be found from each text information to be searched. This process is also performed based on the semantics of the first text information and the text information to be retrieved, so that a higher accuracy of the retrieved candidate answers and the final retrieval result can be ensured. In addition, the reading and understanding model gives the first probability and the second probability of each word as the beginning word segmentation and the ending word segmentation, which provides a flexible extraction method for determining the third text information, so that different information retrieval scenes can be adapted.
Fig. 6 is a schematic diagram of extracting third textual information using a read understanding model based on a BERT neural network, according to some embodiments of the present application.
In this embodiment, the first text information is a question sent by the client, and the text information to be retrieved is a locally stored article. Each time the reading understanding model runs, the question and the article are treated as a question-answer pair and segmented into words; in fig. 6, q(1)-q(k) denote the k word segments of the question and p(1)-p(m) the m word segments of the article. The word segments of the two texts are joined by the connector [SEP], which marks the boundary between them, and the global character [CLS] is added at the head; the resulting string sequence of the text pair is input into the reading understanding model.
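Constructing the input sequence of fig. 6 amounts to simple concatenation; a minimal sketch with illustrative token strings follows:

```python
# Sketch of the figure-6 input layout: [CLS] + question segments + [SEP]
# + article segments, fed to the reading understanding model as one
# string sequence (token strings here are placeholders).
def build_qa_input(question_tokens, article_tokens):
    return ["CLS"] + question_tokens + ["SEP"] + article_tokens

sequence = build_qa_input(["q(1)", "q(2)"], ["p(1)", "p(2)", "p(3)"])
```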
Through the computation of its multi-layer neural network, the reading understanding model determines the first and second probabilities of each word segment p(1)-p(m) and outputs the finally determined third text information. In this embodiment, as shown in the upper part of fig. 6, among the first and second probabilities determined for p(1)-p(m), p(m1) has the largest first probability and is therefore determined to be the beginning word segment, while p(m2) has the largest second probability and is therefore determined to be the ending word segment; the information from the m1-th word segment through the m2-th word segment is then determined to be the third text information and output.
For example, the question is "What is the total revenue of enterprise A?", which is segmented into 6 word segments q(1)-q(6): "enterprise A", "'s", "total revenue", "is", "how much", "?". The article is, for example, "The total revenue of enterprise A is one million yuan, and enterprise B is located on Chenghua Avenue.", which is segmented into 10 word segments p(1)-p(10): "enterprise A", "'s", "total revenue", "is", "one million yuan", ",", "enterprise B", "is located on", "Chenghua Avenue", ".". Joining the two with the connector "SEP" and adding the global character "CLS" yields the input of the reading understanding model: {"CLS", "enterprise A", "'s", "total revenue", "is", "how much", "?", "SEP", "enterprise A", "'s", "total revenue", "is", "one million yuan", ",", "enterprise B", "is located on", "Chenghua Avenue", "."}.
After the computation of each network layer, each of the article's word segments "enterprise A", "'s", "total revenue", "is", "one million yuan", ",", "enterprise B", "is located on", "Chenghua Avenue", "." has its own first probability and second probability. The first probability of "enterprise A" is 90%, the largest first probability, and the second probability of "one million yuan" is 95%, the largest second probability; "The total revenue of enterprise A is one million yuan" can therefore be determined to be the final third text information and output.
Further, in some embodiments, determining the information retrieval result based on the at least one third text information comprises: for each of the at least one third text information, determining its search matching degree according to at least one of the first probability of its beginning word segment and the second probability of its ending word segment; then ranking the at least one third text information by search matching degree; and finally determining the information retrieval result according to the ranking. The distribution of the first and second probabilities over the word segments reflects well how semantically relevant each word segment is to the first text information, so the search matching degree determined from the beginning word segment's first probability and the ending word segment's second probability reflects, in the semantic dimension, how well the third text information matches the first text information. Retrieval can thus be performed on a semantic basis, improving the accuracy of information retrieval. Moreover, by ranking the at least one third text information, the better-matching results can be placed in a prominent position, improving user experience.
In some embodiments, for each of the at least one third text information, determining the retrieval matching degree of the third text information according to at least one of the first probability corresponding to the beginning word segment and the second probability corresponding to the ending word segment includes determining it based on at least one of the following values: (1) the arithmetic mean of the first probability corresponding to the beginning word segment and the second probability corresponding to the ending word segment; (2) the geometric mean of the two probabilities; (3) the maximum of the two probabilities; (4) the minimum of the two probabilities. These alternatives adapt the retrieval matching degree to different situations, and choosing an appropriate determination mode in each situation further improves the accuracy of information retrieval. For example, when the second text information is an article whose content tends to have a relatively high information density, with individual facts expressed in very short statements such as "the annual income of enterprise A is one million yuan", the arithmetic mean of the first probability of the beginning segment and the second probability of the ending segment tends to be the more accurate choice, so mode (1) can be selected to determine the retrieval matching degree of the third text information.
In addition, all four of the provided determination modes are computationally inexpensive, saving retrieval time while still yielding an accurate retrieval matching degree.
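The four candidate determination modes can be sketched in a few lines; the function name and interface are assumptions made for illustration:

```python
import math

def match_degree(p_start, p_end, mode="arithmetic"):
    """Retrieval matching degree of a piece of third text information, from the
    first probability of its beginning word segment (p_start) and the second
    probability of its ending word segment (p_end)."""
    if mode == "arithmetic":   # mode (1)
        return (p_start + p_end) / 2
    if mode == "geometric":    # mode (2)
        return math.sqrt(p_start * p_end)
    if mode == "max":          # mode (3)
        return max(p_start, p_end)
    if mode == "min":          # mode (4)
        return min(p_start, p_end)
    raise ValueError(f"unknown mode: {mode}")

match_degree(0.90, 0.95)  # arithmetic mean: 0.925
```

For the example probabilities above (90% and 95%), the four modes give 0.925, about 0.925, 0.95, and 0.90 respectively.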
Good retrieval accuracy also depends on the performance of the semantic understanding model. FIG. 7 is a flow chart of training a semantic understanding model according to some embodiments of the present application. The training process of the semantic understanding model comprises the following steps:
In step S701, a plurality of sample information pairs for training the semantic understanding model and their corresponding semantic similarity labels are acquired, each sample information pair including a first text information sample and a second text information sample, and the semantic similarity label corresponding to each sample information pair indicating a preset semantic similarity between the first text information sample and the second text information sample in the pair. The semantic similarity label may indicate that the samples are semantically similar or dissimilar, or may indicate a quantified semantic similarity, e.g., a semantic similarity of 50%.
In step S702, the plurality of sample information pairs are respectively input to the semantic understanding model, so as to obtain a third semantic feature vector corresponding to the first text information sample and a fourth semantic feature vector corresponding to the second text information sample of each of the plurality of sample information pairs. The semantic understanding model may be a recall neural network as shown in fig. 5, and sample information in the sample information pair is respectively input into the recall neural network to obtain the output of the neural unit corresponding to the global character "CLS" as the third semantic feature vector and the fourth semantic feature vector.
In step S703, for each of the plurality of sample information pairs, a predicted semantic similarity corresponding to the sample information pair is determined based on the similarity of the third semantic feature vector and the fourth semantic feature vector. The predicted semantic similarity may be determined in any of the manners described herein above.
In step S704, a semantic loss of the semantic understanding model is determined based on the predicted semantic similarity and the semantic similarity label corresponding to each of the plurality of sample information pairs.
In step S705, based on the semantic loss, the parameters of the semantic understanding model are iteratively updated until the semantic loss satisfies a preset condition, so as to obtain a pre-trained semantic understanding model. Semantic loss is continuously reduced by adjusting parameters, so that the prediction precision of the semantic understanding model is improved.
The parameters of the semantic understanding model are adjusted in a manner of reducing semantic loss, so that the semantic understanding model can more accurately determine semantic feature vectors of input text information, a good foundation is laid for a subsequent information retrieval process, and relatively good retrieval precision is ensured.
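Steps S701 to S705 can be sketched as follows. The toy encoder, the cosine-similarity prediction, and the mean-squared-error loss are illustrative assumptions; the embodiments do not fix a particular similarity measure or loss function, and the parameter update of step S705 itself would be handled by an ordinary gradient-based optimizer.

```python
import math

def cosine(u, v):
    """Similarity of two semantic feature vectors (one of the manners described above)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_loss(predicted, labels):
    """Step S704 sketch: mean squared error between the predicted semantic
    similarities (step S703) and the label similarities."""
    return sum((p - y) ** 2 for p, y in zip(predicted, labels)) / len(labels)

# Toy stand-in for the semantic understanding model's "CLS" output vector.
encode = lambda text: [1.0, 0.0] if "income" in text else [0.0, 1.0]

# Sample information pairs (first sample, second sample) and their labels.
pairs = [("total income of A", "A's income is one million yuan"),
         ("total income of A", "B is located at Chenghua Avenue")]
labels = [1.0, 0.0]  # semantically similar / dissimilar

predicted = [cosine(encode(q), encode(p)) for q, p in pairs]  # steps S702-S703
loss = semantic_loss(predicted, labels)                        # step S704
```

With a perfect toy encoder the loss is zero; during real training the loss is driven down by iteratively updating the model parameters, as in step S705.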
To more clearly explain the training process of the semantic understanding model, fig. 8 is a schematic diagram of training the semantic understanding model according to some embodiments of the present application.
Although FIG. 8 shows two semantic understanding models, they are in fact a single semantic understanding model used twice. The left side of fig. 8 shows the case of inputting the first text information sample of a sample information pair: a global character "CLS" is attached to the sequence obtained by segmenting the first text information sample (i.e., word segments q(1) to q(k), k being a positive integer) and input to the semantic understanding model, obtaining the third semantic feature vector of the first text information sample.
The right side of fig. 8 shows the case of inputting the second text information sample of the sample information pair: a global character "CLS" is attached to the sequence obtained by segmenting the second text information sample (i.e., word segments p(1) to p(m), m being a positive integer) and input to the semantic understanding model, obtaining the fourth semantic feature vector of the second text information sample.
Then, according to step S703, the similarity between the two semantic feature vectors is determined as the predicted semantic similarity of the sample information pair. Next, according to step S704, the semantic loss is determined based on the predicted semantic similarity and the semantic similarity indicated by the label, and finally, in step S705, the parameters of the semantic understanding model are adjusted according to the semantic loss.
In order to further improve the accuracy of information retrieval results, embodiments of a training process for improving a semantic understanding model are also provided. FIG. 9 is a flow chart of a complementary training semantic understanding model according to some embodiments of the present application. The training process of the semantic understanding model further comprises:
In step S901, the plurality of sample information pairs are respectively input to a pre-trained semantic understanding model, so as to obtain a fifth semantic feature vector corresponding to the first text information sample and a sixth semantic feature vector corresponding to the second text information sample of each of the plurality of sample information pairs. In this embodiment, the semantic understanding model is already pre-trained, so that the accuracy is better, and more accurate semantic feature vectors can be determined.
In step S902, for each of the plurality of sample information pairs, a semantic similarity corresponding to the sample information pair is determined based on the similarity of the fifth semantic feature vector and the sixth semantic feature vector. In this embodiment, the semantic similarity may be determined in the manner of determination described in the above embodiment.
In step S903, the plurality of sample information pairs are sorted in descending order of their semantic similarity.
In step S904, the first N sample information pairs in the sorted order are selected and input to the reading understanding model, N being a preset positive integer. In this step, the N sample information pairs with the largest semantic similarity are selected from the plurality of sample information pairs. It should be noted that this semantic similarity is predicted by the aforementioned semantic understanding model; owing to the model's limitations, some of these pairs may in reality not be as semantically similar as predicted.
In step S905, for each of the N sample information pairs, predicted third text information semantically related to the first text information sample is extracted from the second text information sample using the reading understanding model. Likewise, the reading understanding model here is a trained model with good predictive capability.
In step S906, at least one hard negative sample and at least one positive sample for training the semantic understanding model are determined from the N sample information pairs according to the predicted third text information and the second text information samples. In the related art, a hard negative sample (difficult negative example) refers to a sample information pair that is in reality semantically dissimilar but to which the semantic understanding model assigns a high predicted similarity, placing it among the first N pairs; the reading understanding model then either cannot find the correct third text information in the pair's information to be retrieved, or finds wrong third text information. That is, a hard negative is a sample that strongly confuses the semantic understanding model, and precisely the kind of sample that improves it. Hard negatives are usually difficult to find, but through this step the hard negatives of the semantic understanding model can be mined effectively.
In step S907, the pre-trained semantic understanding model is further trained using the at least one hard negative sample and the at least one positive sample.
Through this linked training process of the semantic understanding model and the reading understanding model, hard negative samples can be found easily, and using them to further optimize the semantic understanding model improves retrieval accuracy, yielding semantically more accurate information retrieval results. Moreover, through this linked training, the two models capture more deeply the semantic similarity between the user's question to be retrieved and the retrieved articles, making the final model more accurate.
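The hard-negative mining of steps S901 to S907 can be sketched as follows; the pair fields and the two callables are hypothetical stand-ins for the labeled data, the pre-trained semantic understanding model, and the reading understanding model:

```python
def mine_hard_negatives(pairs, predict_sim, read_answer, top_n):
    """Steps S901-S906 sketch: sort sample pairs by the similarity predicted
    by the (pre-trained) semantic understanding model, run the reading
    understanding model on the top-N pairs, and label a pair a hard negative
    when the extracted third text information does not match the expected
    answer; otherwise label it a positive sample."""
    ranked = sorted(pairs, key=predict_sim, reverse=True)
    hard_negatives, positives = [], []
    for pair in ranked[:top_n]:
        bucket = positives if read_answer(pair) == pair["answer"] else hard_negatives
        bucket.append(pair)
    return hard_negatives, positives

# Hypothetical stand-ins: each pair carries a predicted similarity and a
# reference answer; this toy "reader" only answers the first question correctly.
pairs = [
    {"q": "q1", "doc": "d1", "answer": "a1", "sim": 0.9},
    {"q": "q2", "doc": "d2", "answer": "a2", "sim": 0.8},  # confusing pair
    {"q": "q3", "doc": "d3", "answer": "a3", "sim": 0.1},  # never reaches top-N
]
reader = lambda p: "a1" if p["q"] == "q1" else "no-answer"
hard, positive = mine_hard_negatives(pairs, lambda p: p["sim"], reader, top_n=2)
```

Here the pair for "q2" is ranked highly by the similarity predictor yet fails the reader's check, so it becomes a hard negative for step S907.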
Fig. 10 is a schematic diagram of a complete process of an information retrieval method according to some embodiments of the present application. In the embodiment shown in fig. 10, the first text information is a question to be retrieved, and the second text information is an article. The process above the broken line 1001 is performed online, and the process below the broken line 1001 is performed offline.
In the screening stage for the text information to be retrieved, the semantic feature vectors or encodings (used to compute semantic similarity) of a large amount of second text information (i.e., retrieval objects such as articles) can be acquired or computed offline in advance (because the retrieval objects can be relatively fixed, known data pre-stored, for example, in a server database), which greatly improves data-processing efficiency and markedly optimizes network resource scheduling in scenarios where the articles to be retrieved involve a large amount of data.
As shown in fig. 10, in an offline case, semantic feature vectors of a plurality of articles are determined using a semantic understanding model, and then stored in a local semantic feature vector index library for use in online retrieval.
Then, in the case of online retrieval, as shown in fig. 10, in the first stage the question information to be retrieved is received from the client, and its semantic feature vector is determined using the semantic understanding model. To find semantically related articles, the semantic feature vectors of the articles are looked up in the semantic feature vector index library, the similarity between the semantic feature vector of the question information to be retrieved and that of each article is taken as the semantic similarity, the articles are sorted by semantic similarity, and the first N articles are selected as the articles to be retrieved (i.e., the text information to be retrieved), N being a positive integer.
In the next stage, as shown in fig. 10, the N articles to be retrieved are processed using the reading understanding model. Specifically, an input sequence consisting of the question information to be retrieved and one article to be retrieved is input at a time to determine the third text information of that article. The N articles to be retrieved are input one by one together with the question to be retrieved, yielding N pieces of third text information, and the retrieval matching degree of each piece of third text information is calculated.
Finally, the third text information is sorted by retrieval matching degree, and one or more pieces of third text information are presented as the final information retrieval result.
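The complete online flow of fig. 10 — top-N screening by semantic similarity, answer extraction per article, then ranking by retrieval matching degree — can be sketched as follows, with all models replaced by hypothetical stand-ins and assuming L2-normalised feature vectors so a dot product serves as the similarity:

```python
def retrieve(question, index, encode, read, score, top_n=2):
    """Two-stage retrieval: screen the top_n articles by semantic similarity,
    then extract a candidate answer from each and rank by match degree."""
    q_vec = encode(question)
    # Stage 1: semantic similarity of the question vector to each indexed
    # article vector (dot product, assuming L2-normalised vectors).
    sims = {art: sum(a * b for a, b in zip(q_vec, vec)) for art, vec in index.items()}
    candidates = sorted(sims, key=sims.get, reverse=True)[:top_n]
    # Stage 2: extract a candidate answer from each screened article with the
    # reading understanding model, then rank by retrieval matching degree.
    answers = [read(question, art) for art in candidates]
    answers.sort(key=score, reverse=True)
    return answers

# Hypothetical stand-ins for the offline vector index and the two models.
index = {"art1": [1.0, 0.0], "art2": [0.0, 1.0], "art3": [0.7, 0.7]}
encode = lambda q: [1.0, 0.0]
read = lambda q, art: "ans-" + art
score = lambda ans: {"ans-art1": 0.5, "ans-art3": 0.9}.get(ans, 0.0)
results = retrieve("question", index, encode, read, score, top_n=2)
```

In this sketch "art1" and "art3" survive the first-stage screening, and the answer extracted from "art3" wins the final ranking because its retrieval matching degree is higher.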
Fig. 11 is an exemplary block diagram of an information retrieval apparatus 1100 according to some embodiments of the present application. The apparatus 1100 comprises: an acquisition module 1101, a first determination module 1102, a selection module 1103, an extraction module 1104, a second determination module 1105. The acquisition module 1101 is configured to acquire a first text information and a plurality of second text information. The first determining module 1102 is configured to determine a semantic similarity of the first text information to each of the plurality of second text information. The selecting module 1103 is configured to select at least one text message to be retrieved from the plurality of second text messages according to the semantic similarity between the first text message and each of the plurality of second text messages. The extraction module 1104 is configured to extract at least one third text information semantically related to the first text information from the at least one text information to be retrieved, the at least one third text information being in one-to-one correspondence with the at least one text information to be retrieved. The second determining module 1105 is configured to determine an information retrieval result from the at least one third text information.
It should be noted that the various modules described above may be implemented in software or hardware or a combination of both. The different modules may be implemented in the same software or hardware structure or one module may be implemented by different software or hardware structures.
In an information retrieval apparatus according to some embodiments of the present application, advanced information retrieval tasks such as intelligent question answering are accomplished efficiently and accurately through a two-stage information screening or retrieval approach based on semantic similarity and semantic relevance. First, the semantic similarity (rather than mere literal matching) between the first text information (e.g., the question to be retrieved) and a large amount of second text information (i.e., retrieval objects, such as articles from which an answer to the question may be obtained) is determined, so as to screen out a relatively small amount of text information to be retrieved (e.g., a number of articles with higher semantic similarity to the question). This addresses both the loss of important retrieval resources caused by exact keyword matching and the low efficiency caused by complex procedures and huge computation in the related art, and significantly improves overall working efficiency while ensuring that important information retrieval resources highly relevant to the query are not lost (i.e., ensuring retrieval breadth and accuracy). Second, for each piece of text information to be retrieved obtained in the first-stage screening, third text information corresponding to the first text information (i.e., a candidate answer to the question to be retrieved) is extracted or retrieved again based on semantic relevance (for example, using a machine reading understanding model), and the final retrieval result is obtained from the third text information; the semantic features of the query are thus exploited a second time, further ensuring the accuracy of the retrieved candidate answers and of the final retrieval result.
FIG. 12 illustrates an example system 1200 that includes an example computing device 1210 that represents one or more systems and/or devices that can implement the various methods described herein. Computing device 1210 may be, for example, a server of a service provider, a device associated with a server, a system-on-chip, and/or any other suitable computing device or computing system. The information retrieval apparatus 1100 described above with reference to fig. 11 may take the form of a computing device 1210. Alternatively, information retrieval apparatus 1100 may be implemented as a computer program in the form of application 1216.
The example computing device 1210 as illustrated includes a processing system 1211, one or more computer-readable media 1212, and one or more I/O interfaces 1213 communicatively coupled to each other. Although not shown, computing device 1210 may also include a system bus or other data and command transfer system that couples the various components to one another. The system bus may include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The processing system 1211 represents functionality to perform one or more operations using hardware. Thus, the processing system 1211 is illustrated as including hardware elements 1214 that may be configured as processors, functional blocks, and the like. This may include implementation in hardware as application specific integrated circuits or other logic devices formed using one or more semiconductors. The hardware element 1214 is not limited by the material from which it is formed or the processing mechanism employed therein. For example, the processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, the processor-executable instructions may be electronically-executable instructions.
Computer-readable media 1212 is illustrated as including memory/storage 1215. Memory/storage 1215 represents memory/storage capacity associated with one or more computer-readable media. Memory/storage 1215 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 1215 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) and removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 1212 may be configured in a variety of other ways as described further below.
One or more I/O interfaces 1213 represents functionality that allows a user to enter commands and information to computing device 1210 using various input devices, and optionally also allows information to be presented to the user and/or other components or devices using various output devices. Examples of input devices include keyboards, cursor control devices (e.g., mice), microphones (e.g., for voice input), scanners, touch functions (e.g., capacitive or other sensors configured to detect physical touches), cameras (e.g., motion that does not involve touches may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device, speakers, printer, network card, haptic response device, and the like. Accordingly, computing device 1210 may be configured in a variety of ways to support user interaction as described further below.
Computing device 1210 also includes application 1216. Application 1216 may be, for example, a software instance of information retrieval apparatus 1100 according to some embodiments of the present application, and implements the techniques described herein in combination with other elements in computing device 1210.
The present application provides a computer program product comprising computer instructions stored in a computer readable storage medium. The processor of the computing device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computing device to perform an information retrieval method according to some embodiments of the present application.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can include a variety of media that are accessible by computing device 1210. By way of example, and not limitation, computer readable media may comprise "computer readable storage media" and "computer readable signal media".
"computer-readable storage medium" refers to a medium and/or device that can permanently store information and/or a tangible storage device, as opposed to a mere signal transmission, carrier wave, or signal itself. Thus, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in methods or techniques suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of a computer-readable storage medium may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical storage, hard disk, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or articles of manufacture adapted to store the desired information and which may be accessed by a computer.
"computer-readable signal medium" refers to a signal bearing medium configured to transmit instructions to hardware of computing device 1210, such as via a network. Signal media may typically be embodied in computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, data signal, or other transport mechanism. Signal media also include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As before, the hardware elements 1214 and computer-readable media 1212 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware that, in some embodiments, may be used to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or components of a system on a chip, Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or other hardware devices. In this context, the hardware elements may be implemented as processing devices that perform program tasks defined by instructions, modules, and/or logic embodied by the hardware elements, as well as hardware devices that store instructions for execution, such as the previously described computer-readable storage media.
Combinations of the foregoing may also be used to implement the various techniques and modules herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer readable storage medium and/or by one or more hardware elements 1214. Computing device 1210 may be configured to implement specific instructions and/or functions corresponding to software and/or hardware modules. Thus, for example, by using the computer-readable storage medium of the processing system and/or the hardware elements 1214, a module may be implemented at least in part in hardware as a module executable by the computing device 1210 as software. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 1210 and/or processing systems 1211) to implement the techniques, modules, and examples described herein.
In various implementations, computing device 1210 may take on a variety of different configurations. For example, computing device 1210 may be implemented as a computer-like device including a personal computer, desktop computer, multi-screen computer, laptop computer, netbook, and the like. Computing device 1210 may also be implemented as a mobile appliance-like device including a mobile device such as a mobile phone, portable music player, portable gaming device, tablet computer, multi-screen computer, or the like. Computing device 1210 may also be implemented as a television-like device including devices having or connected to generally larger screens in casual viewing environments. Such devices include televisions, set-top boxes, gaming machines, and the like.
The techniques described herein may be supported by these various configurations of computing device 1210 and are not limited to the specific examples of techniques described herein. Functionality may also be implemented in whole or in part on the "cloud" 1220 through the use of a distributed system, such as through the platform 1222 as described below.
Cloud 1220 includes and/or represents platform 1222 for resources 1224. The platform 1222 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 1220. The resources 1224 may include applications and/or data that may be used when executing computer processes on servers remote from the computing device 1210. Resources 1224 may also include services provided over the internet and/or over subscriber networks such as cellular or Wi-Fi networks.
The platform 1222 may abstract resources and functionality to connect the computing device 1210 with other computing devices. The platform 1222 may also serve to abstract scaling of resources to provide a corresponding level of scale for the demand encountered for the resources 1224 implemented via the platform 1222. Thus, in an interconnected device embodiment, the implementation of the functionality described herein may be distributed throughout the system 1200. For example, functionality may be implemented in part on computing device 1210 and in part by platform 1222 abstracting the functionality of cloud 1220.
It should be understood that for clarity, embodiments of the present application have been described with reference to different functional units. However, it will be apparent that the functionality of each functional unit may be implemented in a single unit, in a plurality of units or as part of other functional units without departing from the present application. For example, functionality illustrated to be performed by a single unit may be performed by multiple different units. Thus, references to specific functional units are only to be seen as references to suitable units for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present application may be implemented in a single unit or may be physically and functionally distributed between different units and circuits.
Although the present application has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present application is limited only by the appended claims. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be worked. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the term "a" or "an" does not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (15)

1. An information retrieval method, comprising:
acquiring first text information and a plurality of second text information;
determining semantic similarity of the first text information and each of the plurality of second text information;
selecting at least one text information to be retrieved from the plurality of second text information according to the semantic similarity of the first text information and each of the plurality of second text information;
extracting at least one third text information semantically related to the first text information from the at least one text information to be retrieved, wherein the at least one third text information corresponds to the at least one text information to be retrieved one by one;
and determining an information retrieval result according to the at least one third text information.
2. The method of claim 1, wherein the determining the semantic similarity of the first text information to each of the plurality of second text information comprises:
acquiring a first semantic feature vector corresponding to the first text information and a plurality of second semantic feature vectors corresponding to the plurality of second text information respectively;
calculating the similarity between the first semantic feature vector and each of the plurality of second semantic feature vectors;
and determining the semantic similarity of the first text information and each second text information in the plurality of second text information according to the similarity of the first semantic feature vector and each second semantic feature vector.
3. The method of claim 2, wherein the calculating the similarity of the first semantic feature vector to each of the plurality of second semantic feature vectors comprises:
calculating a first similarity of the first semantic feature vector and each of the plurality of second semantic feature vectors based on the distances of the plurality of second semantic feature vectors from the first semantic feature vector;
calculating a second similarity of the first semantic feature vector and each of the plurality of second semantic feature vectors based on the cosine of the angle between the plurality of second semantic feature vectors and the first semantic feature vector;
and determining a similarity of the first semantic feature vector to each of the plurality of second semantic feature vectors based on at least one of the first similarity and the second similarity.
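Claim 3's two similarity measures (a distance-based score and the cosine of the angle between the vectors) can be sketched as follows. The claim does not fix a formula for turning a distance into a similarity, so the inverse-distance normalization below is an illustrative assumption:

```python
import math

def l2_similarity(u, v):
    # First similarity: derived from the Euclidean distance between the
    # vectors, mapped into (0, 1] so that larger means more similar.
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + dist)

def cosine_similarity(u, v):
    # Second similarity: cosine of the angle between the vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Per the claim, the final similarity may be either score alone or any combination of the two.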
4. The method of claim 2, wherein the obtaining a first semantic feature vector corresponding to the first text information and a plurality of second semantic feature vectors corresponding to the plurality of second text information, respectively, comprises:
determining a first semantic feature vector corresponding to the first text information by using a semantic understanding model;
and acquiring a plurality of second semantic feature vectors corresponding to the plurality of second text information from a preset semantic feature vector index library, wherein the preset semantic feature vector index library stores the plurality of second semantic feature vectors determined in advance by using the semantic understanding model.
5. The method of claim 4, wherein said extracting at least one third text information semantically related to the first text information from the at least one text information to be retrieved comprises:
for each text message to be retrieved in the at least one text message to be retrieved, executing the following steps:
inputting the first text information and the text information to be retrieved into a reading understanding model;
determining a first probability and a second probability corresponding to each word segmentation in the text information to be retrieved by using the reading understanding model, wherein the first probability corresponding to each word segmentation indicates the probability that the word segmentation is the beginning word segmentation of the third text information semantically related to the first text information, and the second probability corresponding to each word segmentation indicates the probability that the word segmentation is the ending word segmentation of the third text information;
determining the beginning word segmentation and the ending word segmentation of the third text information from the word segmentations of the text information to be retrieved according to the first probability and the second probability corresponding to each word segmentation;
and extracting the third text information from the text information to be retrieved according to the determined beginning word segmentation and ending word segmentation.
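A minimal sketch of the span selection in claim 5, choosing the beginning and ending word segmentations that maximize the product of their first and second probabilities. The product scoring rule is an assumption (the claim only requires selection based on the two probabilities), and the function name is hypothetical:

```python
def extract_span(word_segmentations, start_probs, end_probs):
    # Choose the (begin, end) pair, begin <= end, that maximizes the
    # product of the beginning probability and the ending probability.
    best_i, best_j, best_score = 0, 0, -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, len(word_segmentations)):
            score = p_start * end_probs[j]
            if score > best_score:
                best_i, best_j, best_score = i, j, score
    # The extracted third text information is the span between them.
    return word_segmentations[best_i:best_j + 1]
```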
6. The method of claim 5, wherein said determining an information retrieval result from the at least one third text information comprises:
for each third text information in the at least one third text information, determining a retrieval matching degree of the third text information according to at least one of the first probability corresponding to the beginning word segmentation and the second probability corresponding to the ending word segmentation;
sorting the at least one third text information according to the retrieval matching degree of each third text information;
and determining the information retrieval result according to the ordering of the at least one third text information.
7. The method of claim 6, wherein the determining, for each of the at least one third text information, a search matching degree of the third text information according to at least one of a first probability corresponding to the start word and a second probability corresponding to the end word, comprises:
determining the retrieval matching degree of the third text information based on at least one of the following values:
the arithmetic average of the first probability corresponding to the beginning word segmentation and the second probability corresponding to the ending word segmentation;
the geometric average of the first probability corresponding to the beginning word segmentation and the second probability corresponding to the ending word segmentation;
the maximum value of the first probability corresponding to the beginning word segmentation and the second probability corresponding to the ending word segmentation;
and the minimum value of the first probability corresponding to the beginning word segmentation and the second probability corresponding to the ending word segmentation.
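The four aggregation options of claim 7 can be sketched directly; `matching_degree` and its `mode` parameter are hypothetical names, not terms from the patent:

```python
import math

def matching_degree(p_begin, p_end, mode="arithmetic"):
    # Aggregate the beginning and ending probabilities into a single
    # retrieval matching degree, per one of the four listed options.
    if mode == "arithmetic":
        return (p_begin + p_end) / 2.0
    if mode == "geometric":
        return math.sqrt(p_begin * p_end)
    if mode == "max":
        return max(p_begin, p_end)
    if mode == "min":
        return min(p_begin, p_end)
    raise ValueError("unknown mode: " + mode)
```

Note the geometric average penalizes spans where one of the two probabilities is very low more strongly than the arithmetic average does.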
8. The method of claim 5, wherein the training process of the semantic understanding model comprises:
acquiring a plurality of sample information pairs for training a semantic understanding model and corresponding semantic similarity labels thereof, wherein each sample information pair comprises a first text information sample and a second text information sample, and the corresponding semantic similarity label of each sample information pair indicates preset semantic similarity of the first text information sample and the second text information sample in the sample information pair;
inputting the plurality of sample information pairs into the semantic understanding model to obtain a third semantic feature vector corresponding to the first text information sample and a fourth semantic feature vector corresponding to the second text information sample of each sample information pair;
for each sample information pair of the plurality of sample information pairs, determining a predicted semantic similarity corresponding to the sample information pair based on the similarity of the third semantic feature vector and the fourth semantic feature vector;
determining semantic loss of a semantic understanding model based on the predicted semantic similarity and the semantic similarity label corresponding to each sample information pair in the plurality of sample information pairs;
and based on the semantic loss, iteratively updating parameters of the semantic understanding model until the semantic loss meets a preset condition to obtain a pre-trained semantic understanding model.
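A sketch of the semantic loss of claim 8, assuming (as one plausible choice, not stated by the claim) a mean-squared error between the predicted similarities and the semantic similarity labels; `embed` and `similarity` stand in for the semantic understanding model and the similarity computation of claim 3:

```python
def semantic_loss(sample_pairs, similarity_labels, embed, similarity):
    # Mean squared error between the predicted similarity of each
    # (first sample, second sample) pair and its similarity label.
    total = 0.0
    for (first, second), label in zip(sample_pairs, similarity_labels):
        predicted = similarity(embed(first), embed(second))
        total += (predicted - label) ** 2
    return total / len(sample_pairs)
```

Training then iteratively updates the model parameters to drive this loss below a preset threshold.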
9. The method of claim 8, wherein the training process of the semantic understanding model further comprises:
inputting the plurality of sample information pairs into the pre-trained semantic understanding model to obtain a fifth semantic feature vector corresponding to the first text information sample and a sixth semantic feature vector corresponding to the second text information sample of each of the plurality of sample information pairs;
for each of the plurality of sample information pairs, determining a semantic similarity corresponding to the sample information pair based on the similarity of the fifth semantic feature vector and the sixth semantic feature vector;
sorting the plurality of sample information pairs in descending order of their corresponding semantic similarity;
selecting the first N sample information pairs from the ordering and inputting the first N sample information pairs into a reading understanding model, wherein N is a preset positive integer;
extracting, for each of the N sample information pairs, predicted third text information semantically related to the first text information sample from the second text information sample by using the reading understanding model;
determining at least one difficult negative example sample and at least one positive example sample for training a semantic understanding model from the N sample information pairs according to the predicted third text information and the second text information samples;
training the pre-trained semantic understanding model using the at least one difficult negative example sample and the at least one positive example sample.
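The hard-negative mining of claim 9 can be sketched as follows. The rule used below to split positive examples from difficult negative examples (whether an answer span can be extracted at all) is an illustrative assumption, and all names are hypothetical:

```python
def mine_hard_examples(sample_pairs, similarity_fn, extract_fn, n):
    # Rank all sample pairs by model similarity and look only at the
    # top-n, i.e. the pairs the semantic model is most confident about.
    ranked = sorted(sample_pairs,
                    key=lambda p: similarity_fn(p[0], p[1]),
                    reverse=True)
    hard_negatives, positives = [], []
    for first, second in ranked[:n]:
        span = extract_fn(first, second)
        if span:
            # An answer span was extracted: keep as a positive example.
            positives.append((first, second))
        else:
            # High similarity but no extractable answer: a difficult
            # negative example for further training.
            hard_negatives.append((first, second))
    return hard_negatives, positives
```

Such difficult negatives (pairs the model scores highly yet contain no answer) are typically the most informative examples for refining the pre-trained model.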
10. The method of claim 1, wherein the selecting at least one text message to be retrieved from the plurality of second text messages according to semantic similarity of the first text message to each of the plurality of second text messages comprises:
sorting the plurality of second text information in descending order of the semantic similarity between the first text information and each second text information;
and selecting the first M pieces of second text information from the ordering as M pieces of text information to be retrieved, wherein M is a preset positive integer.
11. The method of claim 1, wherein the first text information includes question information to be retrieved and the third text information includes answer information corresponding to the question information to be retrieved.
12. An information retrieval apparatus comprising:
the acquisition module is configured to acquire the first text information and a plurality of second text information;
a first determining module configured to determine a semantic similarity of the first text information and each of the plurality of second text information;
the selecting module is configured to select at least one text information to be retrieved from the plurality of second text information according to the semantic similarity of the first text information and each of the plurality of second text information;
the extraction module is configured to extract at least one third text information semantically related to the first text information from the at least one text information to be retrieved, wherein the at least one third text information corresponds one-to-one with the at least one text information to be retrieved;
and the second determining module is configured to determine an information retrieval result according to the at least one third text information.
13. A computing device, comprising:
a memory and a processor,
wherein the memory has stored therein a computer program which, when executed by the processor, causes the processor to perform the method of any of claims 1-11.
14. A computer readable storage medium having stored thereon computer readable instructions which, when executed, implement the method of any of claims 1-11.
15. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to any of claims 1-11.
CN202310328779.9A 2023-03-30 2023-03-30 Information retrieval method and device, computing device, storage medium and program product Pending CN116340500A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310328779.9A CN116340500A (en) 2023-03-30 2023-03-30 Information retrieval method and device, computing device, storage medium and program product


Publications (1)

Publication Number Publication Date
CN116340500A true CN116340500A (en) 2023-06-27

Family

ID=86883782




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination