CN116431838B - Document retrieval method, device, system and storage medium

Info

Publication number: CN116431838B
Application number: CN202310705899.6A
Authority: CN
Inventors: 孙鹏飞; 王俊平
Original assignee: Beijing Moqiu Technology Co ltd
Current assignee: Beijing Moqiu Technology Co ltd
Priority date: 2023-06-15
Filing date: 2023-06-15
Publication date: 2024-01-30
Anticipated expiration: 2043-06-15
Also published as: CN116431838A

Abstract

The application discloses a document retrieval method, device, system and storage medium, belonging to the technical field of natural language processing. The embodiments of the application treat keyword selection as a key step in scientific and technical literature retrieval and improve on it: after keywords are extracted from the input text, they are not used directly for retrieval. Instead, the keywords are combined and the effect of each keyword combination is analyzed, so that keywords or keyword combinations can be selected, or the retrieval results ordered, accordingly.

Description

Document retrieval method, device, system and storage medium

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a system, and a storage medium for document retrieval.

Background

In many technical literature service fields, users often need to acquire technical literature in related fields through retrieval, and with the rapid development of artificial intelligence and natural language processing, more and more literature retrieval methods have emerged. Document retrieval refers to the process of acquiring documents according to the needs of study and work.

Currently, in a typical document retrieval method, a user inputs a text, the retrieval system extracts keywords directly from the input text, and the extracted keywords are matched directly against the documents in a database, so that the matched documents are displayed.

Keyword selection is a key step of scientific and technical literature retrieval and directly influences the accuracy and effect of the retrieval result. Traditional keyword selection methods are generally based on manual experience or rules, which are subjective and limited; as a result, the retrieval process is inefficient and the retrieval results are inaccurate.

Disclosure of Invention

The embodiment of the application provides a document retrieval method, a device, a system and a storage medium, which can achieve the effects of ensuring objective retrieval results and improving retrieval efficiency and accuracy. The technical scheme is as follows:

In one aspect, a document retrieval method is provided, the method comprising:

extracting keywords from an input text to obtain a plurality of keywords corresponding to the input text;

combining the plurality of keywords to obtain a plurality of keyword combinations;

invoking an artificial intelligence content generation system based on the plurality of keyword combinations to determine a generated text for each keyword combination;

and calculating the similarity between each generated text and the input text one by one, and determining the retrieval priority.
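The four steps above can be illustrated with a minimal sketch. It assumes a token-overlap (Jaccard) similarity and a caller-supplied `generate` function standing in for the artificial intelligence content generation system; neither the similarity measure nor the names `jaccard`, `rank_combinations` and `fake_generate` come from the application, they are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from typing import Callable, Iterable


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; an assumed stand-in for the similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def rank_combinations(
    input_text: str,
    keywords: Iterable[str],
    generate: Callable[[tuple[str, ...]], str],
    size: int = 2,
) -> list[tuple[tuple[str, ...], float]]:
    """Score each keyword combination by how closely its generated text
    resembles the input text, highest similarity first."""
    scored = []
    for combo in combinations(list(keywords), size):
        generated = generate(combo)  # hypothetical call into the generation system
        scored.append((combo, jaccard(generated, input_text)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


def fake_generate(combo: tuple[str, ...]) -> str:
    # Placeholder generator; a real system would return model-written text.
    return "a method for " + " and ".join(combo)


ranked = rank_combinations(
    "a method for document retrieval and keyword ranking",
    ["document retrieval", "keyword ranking", "storage medium"],
    fake_generate,
)
```

In this sketch the ordering of `ranked` plays the role of the retrieval priority of the keyword combinations: a combination whose generated text is closer to the input text is placed first.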

In some embodiments, the retrieval priority includes a selection priority of keywords or keyword combinations and/or a ranking priority of retrieval results.

In some embodiments, combining the plurality of keywords to obtain a plurality of keyword combinations includes any one of the following:

combining any target number of keywords among the plurality of keywords to obtain a plurality of keyword combinations;

and combining any first number of keywords and any second number of keywords among the plurality of keywords to obtain a plurality of keyword combinations, wherein the first number and the second number are different.
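The two combining manners above can be sketched with the standard library. The function names `fixed_size_combinations` and `mixed_size_combinations` are hypothetical, and the sketch assumes that "combining any N keywords" means enumerating all unordered selections of N keywords.

```python
from itertools import chain, combinations


def fixed_size_combinations(keywords, target_number):
    """All combinations that each contain exactly target_number keywords."""
    return list(combinations(keywords, target_number))


def mixed_size_combinations(keywords, first_number, second_number):
    """Combinations of two different sizes, first-size combinations listed first."""
    if first_number == second_number:
        raise ValueError("the first number and the second number must differ")
    return list(chain(combinations(keywords, first_number),
                      combinations(keywords, second_number)))


keywords = ["retrieval", "keyword", "similarity"]
pairs = fixed_size_combinations(keywords, 2)
mixed = mixed_size_combinations(keywords, 1, 2)
```

The number of combinations grows quickly with the number of keywords, which is one reason a later filtering step on the candidate combinations can be useful.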

In some embodiments, the combining any target number of the plurality of keywords to obtain a plurality of keyword combinations includes:

combining any first number of keywords with any second number of keywords in the plurality of keywords, or combining any target number of keywords in the plurality of keywords to obtain a plurality of candidate keyword combinations;

and filtering the candidate keyword combinations based on the input text to obtain a plurality of keyword combinations.

In some embodiments, filtering the plurality of candidate keyword combinations based on the input text to obtain a plurality of keyword combinations includes:

and matching the candidate keyword combinations with the input text, and taking the candidate keyword combinations meeting the target condition as the keyword combinations.
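A minimal sketch of this filtering step follows. The text above does not define the target condition, so the sketch assumes one purely for illustration: a candidate combination is kept only if all of its keywords occur together in at least one sentence of the input text. The function name `filter_combinations` is hypothetical.

```python
import re


def filter_combinations(candidates, input_text):
    """Keep candidate combinations whose keywords co-occur in one sentence.

    The co-occurrence rule is an assumed target condition, not one stated
    in the source text.
    """
    sentences = [s.lower() for s in re.split(r"[.!?]+", input_text) if s.strip()]
    kept = []
    for combo in candidates:
        if any(all(k.lower() in s for k in combo) for s in sentences):
            kept.append(combo)
    return kept


text = ("Keywords are combined before retrieval. "
        "The generated text is compared with the input.")
candidates = [
    ("keywords", "retrieval"),
    ("keywords", "generated text"),
    ("generated text", "input"),
]
kept = filter_combinations(candidates, text)
```

Other target conditions, such as a maximum distance between keywords, would slot into the same loop.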

In some embodiments, calculating the similarity between each generated text and the input text one by one and determining the retrieval priority includes any one of:

calculating the similarity between the generated text and the input text one by one, and determining the retrieval priority of the keyword combinations according to the similarity;

and calculating the similarity between the generated text and the input text one by one, sorting the plurality of keyword combinations based on the similarity, and determining the retrieval priority of the plurality of keywords according to the sorting of the plurality of keyword combinations and the difference between the keyword combinations.

In some embodiments, determining the retrieval priority of the plurality of keywords according to the ordering of the plurality of keyword combinations and the differences between the keyword combinations includes any one of:

in response to each keyword combination containing a target number of keywords, for any two adjacent keyword combinations after sorting, obtaining the difference keywords between the two adjacent keyword combinations, and determining the ranking of the difference keywords among the plurality of difference keywords according to the current ordering of the plurality of keyword combinations, to obtain the retrieval priority of the plurality of keywords;

in response to the plurality of keyword combinations including first keyword combinations and second keyword combinations, for any first keyword combination and any second keyword combination, obtaining the difference keywords between the first keyword combination and the second keyword combination, and determining the ranking of the difference keywords among the plurality of difference keywords based on the similarity between the generated text of the first keyword combination and the generated text of the second keyword combination, to obtain the retrieval priority of the plurality of keywords, wherein each first keyword combination contains a first number of keywords, and each second keyword combination contains a second number of keywords.
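The first manner above (equal-sized combinations, adjacent pairs in the sorted list) can be sketched as follows. The text does not say how the pairwise observations are aggregated into a single keyword ranking, so the net-win tally used here is an assumption made for illustration, and the function name `keyword_priority` is hypothetical.

```python
def keyword_priority(ranked_combos):
    """Derive a keyword order from combinations sorted best first.

    For each adjacent pair, a keyword found only in the higher-ranked
    combination gains a point and a keyword found only in the lower-ranked
    combination loses one. The tally is an assumed aggregation rule.
    """
    score = {}
    for combo in ranked_combos:
        for keyword in combo:
            score.setdefault(keyword, 0)
    for better, worse in zip(ranked_combos, ranked_combos[1:]):
        for keyword in set(better) - set(worse):
            score[keyword] += 1
        for keyword in set(worse) - set(better):
            score[keyword] -= 1
    # sorted() is stable, so ties keep the order of first appearance.
    return sorted(score, key=lambda k: score[k], reverse=True)


ranked_combos = [("B", "C"), ("A", "C"), ("A", "D")]
priority = keyword_priority(ranked_combos)
```

Here `("B", "C")` outranks `("A", "C")`, and the two differ only in `B` versus `A`, so `B` is placed ahead of `A`; the same reasoning places `C` ahead of `D`.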

In another aspect, there is provided a document retrieval apparatus, the apparatus comprising:

an extraction module, configured to extract keywords from the input text to obtain a plurality of keywords corresponding to the input text;

a combination module, configured to combine the plurality of keywords to obtain a plurality of keyword combinations;

a determining module, configured to invoke an artificial intelligence content generation system based on the plurality of keyword combinations to determine a generated text for each keyword combination;

and a retrieval module, configured to calculate the similarity between each generated text and the input text one by one and determine the retrieval priority.

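In some embodiments, the combination module is configured to perform any one of the keyword combining manners described above for the method.

In some embodiments, the combination module is configured to obtain a plurality of candidate keyword combinations and to filter the plurality of candidate keyword combinations based on the input text to obtain a plurality of keyword combinations.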

In some embodiments, the combination module is configured to match the plurality of candidate keyword combinations with the input text, and use the plurality of candidate keyword combinations that meet the target condition as a plurality of keyword combinations.

In some embodiments, the retrieval module is configured to perform any one of the retrieval priority determination manners described above for the method.

In another aspect, an electronic device is provided that includes one or more processors and one or more memories having at least one computer program stored therein, the at least one computer program being loaded and executed by the one or more processors to implement various alternative implementations of the above-described document retrieval method.

In another aspect, a document retrieval system is provided that includes at least one electronic device for performing various alternative implementations of the above document retrieval method.

In another aspect, a computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to implement various alternative implementations of the above-described document retrieval method is provided.

In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising one or more program codes, the one or more program codes being stored in a computer readable storage medium. The one or more processors of the electronic device are capable of reading the one or more program codes from the computer-readable storage medium, the one or more processors executing the one or more program codes such that the electronic device is capable of performing the document retrieval method of any one of the possible embodiments described above.

In the embodiments of the application, keyword selection is regarded as a key step in scientific and technical literature retrieval, and the keyword selection is improved: after keywords are extracted from the input text, they are not used directly for retrieval. Instead, the keywords are combined and the effect of each keyword combination is analyzed, so that keywords or keyword combinations can be selected, or the retrieval results ordered, accordingly.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of an implementation environment of a document retrieval method provided in an embodiment of the present application;

FIG. 2 is a flow chart of a document retrieval method provided by an embodiment of the present application;

FIG. 3 is a flow chart of a document retrieval method provided by an embodiment of the present application;

FIG. 4 is a flow chart of a document retrieval method provided by an embodiment of the present application;

FIG. 5 is a flow chart of a document retrieval method provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a document retrieval device according to an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 8 is a block diagram of a terminal according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

The terms "first," "second," and the like in this application are used to distinguish between identical or similar items that have substantially the same function and effect. It should be understood that there is no logical or chronological dependency among "first," "second," and "nth," and that these terms do not limit the number or the order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another element. For example, a first image can be referred to as a second image, and similarly, a second image can be referred to as a first image, without departing from the scope of the various examples. The first image and the second image can both be images, and in some cases, can be separate and distinct images.

The term "at least one" in this application means one or more, the term "plurality" in this application means two or more, for example, a plurality of data packets means two or more.

It is to be understood that the terminology used in the description of the various examples herein is for the purpose of describing particular examples only and is not intended to be limiting. As used in the description of various such examples and in the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It will also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. The term "and/or" describes an association relationship between associated objects and means that three relationships can exist. For example, A and/or B can mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" in the present application generally indicates that the associated objects before and after it are in an "or" relationship.

It should also be understood that, in the embodiments of the present application, the sequence number of each process does not imply an execution order. The execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.

It should also be understood that determining B from A does not mean determining B from A alone; B can also be determined from A and/or other information.

It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It should also be understood that the term "if" may be interpreted to mean "when" or "upon", or "in response to a determination" or "in response to detection". Similarly, the phrase "if a [stated condition or event] is detected" may be interpreted to mean "upon a determination", "in response to a determination", "upon detection of [the stated condition or event]" or "in response to detection of [the stated condition or event]", depending on the context.

The terms referred to in this application are described below.

The large language model is a natural language processing technology based on deep learning, and has strong text generation and language understanding capability. Through a large amount of training data and a complex neural network structure, a large language model can generate high-quality texts, including articles, dialogs, poems and the like, and even can simulate texts of specific styles and language styles.

Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning and decision-making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields and includes both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.

Machine learning (ML) is a multi-domain interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specifically studies how a computer simulates or implements human learning behavior in order to acquire new knowledge or skills, and how it reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, learning from demonstration, and the like.

In recent years, with the rapid development of artificial intelligence and natural language processing, a number of popular artificial intelligence content generation systems have been developed. Through training and learning, these systems can grasp the semantic information of text and have a powerful text generation function. A large language model is one kind of artificial intelligence content generation system. The following is an introduction to some recently popular large language models:

GPT-3 (Generative Pre-trained Transformer 3): a large-scale language model based on the Transformer architecture, developed by OpenAI. GPT-3 is one of the largest language models at present, with 175 billion parameters, and can generate high-quality text content such as articles and conversations. GPT-3 has achieved remarkable results in tasks such as natural language generation, question answering, and text classification.

BERT (Bidirectional Encoder Representations from Transformers): a pre-trained language model based on the Transformer architecture, developed by Google. By performing unsupervised training on large-scale text data, BERT can generate context-dependent word vector representations for various natural language processing tasks, such as text classification and named entity recognition. BERT has achieved excellent performance in multiple natural language processing competitions.

T5 (Text-to-Text Transfer Transformer): a general text-to-text conversion model proposed by the Google Research team. T5 uses the Transformer architecture, and by performing supervised and unsupervised training on large-scale datasets, it supports end-to-end training and inference on various text processing tasks, such as text summarization, text translation, and text classification.

XLNet: an autoregressive language model based on the Transformer architecture, developed jointly by CMU (Carnegie Mellon University) and the Google Brain team. It builds on BERT while also drawing on Transformer-XL, hence the name XLNet (XL, as in the clothing size, suggesting that the model is wider). XLNet is pre-trained in a way that combines autoregressive and autoencoding approaches, which addresses limitations of BERT's pre-training, handles context information better, and improves the performance of the model.

The following describes the environment in which the present application is implemented.

Fig. 1 is a schematic diagram of a document retrieval system according to an embodiment of the present application. The document retrieval system comprises at least one electronic device, which may be a terminal or a server. For example, the document retrieval system includes a terminal 101, or the document retrieval system includes a terminal 101 and a document retrieval platform 102. The terminal 101 is connected to the document retrieval platform 102 via a wireless network or a wired network.

The terminal 101 can be at least one of a smart phone, a game console, a desktop computer, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, or a laptop computer. The terminal 101 installs and runs an application program supporting document retrieval, which can be, for example, a browser application or a retrieval application.

The terminal 101 can have a document retrieval function; for example, it can obtain an input text, retrieve documents based on the input text, and display the retrieved documents. The terminal 101 can complete this work independently, or the document retrieval platform 102 can provide data services for it. The embodiments of the present application are not limited in this regard.

Document retrieval platform 102 includes at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. The document retrieval platform 102 is used to provide background services for document retrieval applications. Optionally, document retrieval platform 102 takes on primary processing work and terminal 101 takes on secondary processing work; alternatively, the document retrieval platform 102 takes on secondary processing work and the terminal 101 takes on primary processing work; alternatively, the document retrieval platform 102 or the terminal 101 can each independently undertake processing work. Alternatively, the document retrieval platform 102 and the terminal 101 are cooperatively computed using a distributed computing architecture.

Optionally, the document retrieval platform 102 includes at least one server 1021 and a database 1022, where the database 1022 is configured to store data, and in this embodiment of the present application, the database 1022 can store documents or artificial intelligence content generating systems in various fields to provide data services for the at least one server 1021.

The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal can be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc.

Those skilled in the art will appreciate that the number of terminals 101 and servers 1021 can be greater or fewer. For example, the number of the terminals 101 and the servers 1021 can be only one, or the number of the terminals 101 and the servers 1021 can be tens or hundreds, or more, and the number and the device type of the terminals or the servers are not limited in the embodiment of the present application.

Fig. 2 is a flowchart of a document retrieval method provided in an embodiment of the present application. The method is applied to a document retrieval system, which may include at least one electronic device; the at least one electronic device may be a terminal or a server. Referring to fig. 2, the method comprises the following steps.

201. The document retrieval system extracts keywords from an input text, and obtains a plurality of keywords corresponding to the input text.

In the embodiment of the present application, the document retrieval system may have a document retrieval function. The user can input text on an electronic device in the document retrieval system and retrieve related technical documents according to the input text. The document retrieval method provided by the embodiment of the application can be executed by one electronic device or cooperatively by a plurality of electronic devices; that is, the document retrieval system can comprise one electronic device, or a plurality of electronic devices that complete document retrieval cooperatively.

In some embodiments, an electronic device in the document retrieval system may obtain an input text, perform keyword extraction on the input text, obtain a plurality of keywords corresponding to the input text, and perform a subsequent retrieval step based on the plurality of keywords.

In other embodiments, a terminal in the document retrieval system may obtain an input text, perform keyword extraction on the input text to obtain a plurality of keywords corresponding to the input text, and then send the plurality of keywords to a server in the document retrieval system; the server then performs the subsequent retrieval steps based on the plurality of keywords.

In other embodiments, the terminal in the document retrieval system may obtain the input text, then send the input text to a server in the document retrieval system, and the server performs keyword extraction on the input text to obtain a plurality of keywords corresponding to the input text, and then perform subsequent retrieval steps based on the plurality of keywords.

For keyword extraction, the keyword extraction process may be implemented in a variety of manners, and the embodiment of the present application does not limit what manner is specifically adopted.

In a first mode, the keyword extraction process may be implemented based on an artificial intelligence content generation system, and specifically, the document retrieval system may input an input text into the artificial intelligence content generation system, and the artificial intelligence content generation system performs keyword extraction on the input text and outputs a plurality of keywords corresponding to the input text. In some embodiments, the artificial intelligence content generating system may be a large language model, and accordingly, the keyword extraction process may be implemented based on the large language model, and in particular, the document retrieval system may input the input text into the large language model, perform keyword extraction on the input text by the large language model, and output a plurality of keywords corresponding to the input text.

In a specific possible embodiment, the document retrieval system may input the input text into an artificial intelligence content generation system (e.g., a large language model), and the artificial intelligence content generation system extracts keywords from the input text based on at least one of word frequency, word vector similarity, and co-occurrence relationship, and outputs a plurality of keywords corresponding to the input text.

Specifically, the keyword extraction process may also filter keywords by setting a threshold or using a probabilistic statistical approach. In combination with understanding of the vocabulary in the literature by the artificial intelligence content generation system, keywords with high relevance and representativeness can be selected.
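Word frequency combined with a threshold, as mentioned above, can be illustrated with a minimal sketch. It uses plain counting rather than a language model, and the stop-word list, the `min_count` threshold and the function name `extract_keywords` are all assumptions introduced for illustration.

```python
import re
from collections import Counter

# A tiny assumed stop-word list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "to", "in",
             "is", "are", "for", "with", "on", "by"}


def extract_keywords(text, min_count=2):
    """Return words whose frequency reaches min_count, most frequent first."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    counts = Counter(tokens)
    return [word for word, count in counts.most_common() if count >= min_count]


sample = ("Retrieval of documents uses keywords. "
          "Keywords guide retrieval, and retrieval ranks documents.")
keywords = extract_keywords(sample)
```

Word vector similarity or co-occurrence relationships could replace or supplement the frequency count while keeping the same threshold-based selection.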

An artificial intelligence content generation system and a large language model are described in detail below.

The large language model is one of artificial intelligence content generation systems, the capabilities of which are not limited to text generation, but can also be used for a variety of tasks such as machine translation, emotion analysis, text classification, question answering, etc. The artificial intelligence content generating system can understand and process complex language information by learning a large number of language rules and semantic associations, and realize natural language processing tasks.

Artificial intelligence content generation systems have a certain application potential in scientific literature retrieval, but also face some technical and practical problems. First, the expertise and complexity of the scientific literature field requires that artificial intelligence content generation systems possess a high degree of domain knowledge and understanding capabilities of specialized vocabulary, which is a challenge for artificial intelligence content generation systems. Secondly, the number of documents in the scientific and technological document database is huge, and documents in various disciplines and fields are included, so that the artificial intelligence content generation system is required to have understanding and processing capabilities in multiple fields. In addition, language expressions in scientific literature are often rigorous and complex, and artificial intelligence content generation systems need to have processing power for complex sentence patterns and terms of art. Finally, scientific and technological literature retrieval involves privacy and security of users, and there is a need to enhance privacy protection and information security considerations in artificial intelligence content generation system applications.

In the technical literature retrieval process, keyword selection is a key step and directly affects the accuracy and effect of the retrieval result. Traditional keyword selection methods are generally based on manual experience or rules, and are subjective and limited. An artificial intelligence content generation system that is pre-trained and fine-tuned on a large number of patent or paper documents can learn rich semantic and context information from them and thus better capture potential keywords in scientific literature. However, current work on artificial intelligence content generation systems focuses on using larger parameter scales, and there is little related research on this application.

In summary, current artificial intelligence content generation systems cannot perform scientific literature retrieval directly. However, their language understanding capability, and the reasoning ability that comes with "emergent" and "chain-of-thought" capabilities, cannot be matched by conventional schemes. In the embodiment of the present application, in order to meet the requirements of professionalism and high efficiency in scientific and technical literature retrieval, a method for performing literature retrieval by using an artificial intelligence content generation system is provided. The artificial intelligence content generation system can be used for extracting keywords in step 201 and for determining the generated text in the subsequent step 203. How the artificial intelligence content generation system determines the generated text is not described in detail here; see the description of step 203 below.

To overcome the challenges and practical problems mentioned above, the artificial intelligence content generation system can be obtained by training an initial artificial intelligence content generation system on sample documents, where the initial system has been pre-trained on sample text.

That is, the initial artificial intelligence content generation system can first be pre-trained on large-scale text content so that it learns the semantic and contextual information of text. The initial system is then trained on sample documents so that the resulting artificial intelligence content generation system learns the knowledge and technical terms of scientific and technical documents in various fields, improving its understanding of that knowledge and giving it the ability to process highly specialized and complex text. Combining this with the system's language understanding capability and its "emergent" and "chain-of-thought" capabilities for literature retrieval can effectively improve both the accuracy of the retrieval results and the retrieval efficiency.

In a specific possible embodiment, the training process of the artificial intelligence content generating system may be implemented through the following steps 1 to 4.

Step 1, a document retrieval system can acquire a large number of sample texts, and the large number of sample texts carry labeling data.

Step 2, the document retrieval system pre-trains the initial artificial intelligence content generation system based on the large number of sample texts. The pre-training is performed based on the predicted labels obtained when the initial artificial intelligence content generation system processes the sample texts and the labeling data carried by the sample texts.

Step 3, the document retrieval system collects and organizes a document data set.

The document data set may include patent data, which may include information such as titles, summaries, patent descriptions, etc. The collection process may be implemented by extracting literature data from various literature databases.

Organizing the document data set may include at least one preprocessing step such as cleaning, noise removal, removal of non-critical information, lexical analysis, and stem extraction. Training the initial artificial intelligence content generation system after the document data set has been preprocessed helps improve the accuracy and efficiency of training, eliminates interfering information, and improves the language understanding and processing capabilities of the artificial intelligence content generation system.

Step 4, the document retrieval system trains the pre-trained initial artificial intelligence content generation system based on the document data set to obtain the artificial intelligence content generation system.

It should be noted that the training process starts from a model pre-trained on large-scale text data, for example a pre-trained model such as GPT-3 or BERT; if resources are limited, a smaller model may be selected for pre-training. Some recent open-source small models achieve good results, for example Meta's open-source LLaMA model. The pre-trained initial artificial intelligence content generation system is then fine-tuned on the patent document data set so that it better fits the semantics and context of the patent field. As a result, the artificial intelligence content generation system has good understanding and processing capabilities for both general and specialized texts.

In one possible implementation, the artificial intelligence content generation system may be trained on the document retrieval system, and the trained artificial intelligence content generation system is invoked when the document retrieval system is required to perform a document retrieval. In another possible implementation, the artificial intelligence content generation system may be trained on or stored on other systems, and when the document retrieval system requires document retrieval, the artificial intelligence content generation system of the other systems is invoked. The training position and the storage position of the artificial intelligent content generating system are not limited in the embodiment of the application.

Through the above training process, the artificial intelligence content generation system has excellent language processing capability and can therefore be used to extract keywords from an input text to obtain a plurality of keywords. Owing to the inherent strengths of the artificial intelligence content generation system and the specialized knowledge acquired through training, the keywords of the input text can be extracted quickly, which improves extraction efficiency. Because the extraction relies on the strong language processing capability of the system, the extracted keywords are highly relevant and representative, so the retrieval results obtained after subsequent processing are naturally more accurate.

In the second mode, the keyword extraction process may not adopt an artificial intelligence content generation system, but instead perform word segmentation based on a bag of words, with all word segmentation results used as keywords.

The above only provides two possible keyword extraction manners, and this step 201 may also be implemented in other manners, for example, based on statistical, semantic, and other methods, which are not limited in this embodiment of the present application.

202. The document retrieval system combines the plurality of keywords to obtain a plurality of keyword combinations.

After the document retrieval system obtains the plurality of keywords, the keywords can be combined, and the effect of different keyword combinations can then be analyzed to assess the representativeness of the keyword combinations or of individual keywords.

If the representativeness of keyword combinations is analyzed, then by analyzing the effect of different keyword combinations, a higher retrieval priority can be determined for keyword combinations that perform well, and a lower retrieval priority for keyword combinations that perform poorly.

In some embodiments, the retrieval priority may include the selection priority of keywords or keyword combinations and/or the ranking priority of search results. The selection priority is the priority with which a keyword or keyword combination is selected as the basis for retrieval; the ranking priority of search results is the priority with which the results retrieved on that basis are displayed.

If the representativeness of a single keyword is analyzed, suppose one keyword combination performs well and another performs poorly. The keyword that differs between the two combinations then has a large influence on the effect, so it is more representative and better suited to represent the input text. Searching with this keyword yields a better search effect, and the search results retrieved with it may better meet the user's search requirement and are displayed preferentially. That is, a higher retrieval priority may be determined for keywords that are representative, and a lower retrieval priority for keywords that are poorly representative. Of course, the opposite retrieval priority may be set according to the search requirement, which is not limited in the embodiment of the present application.

A keyword combination represents a technical feature of a technical solution, and a technical feature is often a specific step formed by linking several keywords. That is, keywords are combined to form a specific technical solution. For example, the input text is "compress channel estimation using neural network, the neural network is DNN", and the keyword selection in step 201 may yield "neural network", "channel estimation", and "DNN" as keywords. However, the current retrieval task is to determine whether the solution of "compressing channel estimation using neural networks" has already been disclosed. The keywords can therefore be combined in step 202, so that the technical solutions corresponding to the various keyword combinations can be analyzed and the expected scientific and technical documents can be better retrieved.

The keywords may be combined in a variety of ways. Two possible embodiments are provided below; of course, the keywords may also be combined in other ways, which is not limited in this embodiment.

Embodiment one: the document retrieval system combines any target number of keywords among the plurality of keywords to obtain a plurality of keyword combinations.

In the first embodiment, the document retrieval system may construct keyword combinations, and the number of keywords in each keyword combination is the target number, that is, the number of keywords in each keyword combination is the same.

The target number may be set by a person skilled in the relevant art according to the requirement, which is not limited in the embodiment of the present application.

Let the target number be N, where N is a positive integer. Each keyword combination contains N keywords, and in the first embodiment the plurality of keyword combinations are constructed such that any two keyword combinations differ by M keywords, where M is a positive integer. For example, the input text is "compressing channel estimation using neural network, the neural network is DNN", and three keywords, "neural network", "channel estimation", and "DNN", are obtained in the keyword selection step in step 201. When N = 2 and M = 1, three keyword combinations can be constructed:

Keyword combination 1: "neural network" and "channel estimation".

Keyword combination 2: "neural network" and "DNN".

Keyword combination 3: "DNN" and "channel estimation".
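The construction in the first embodiment can be sketched as follows. This is only an illustration under the assumption that every combination of exactly N distinct keywords is taken; the function name `build_combinations` is hypothetical and does not come from the application.

```python
from itertools import combinations


def build_combinations(keywords, target_number):
    """Return every combination of `target_number` distinct keywords.

    Hypothetical helper: the application only states that any target
    number of keywords are combined, not how this is implemented.
    """
    return [tuple(c) for c in combinations(keywords, target_number)]


keywords = ["neural network", "channel estimation", "DNN"]
combos = build_combinations(keywords, target_number=2)

for index, combo in enumerate(combos, start=1):
    print(f"keyword combination {index}: {combo}")
```

With three keywords and N = 2, this yields the same three pairs as the example above (the order of keywords inside a pair is not significant), and any two pairs differ by exactly one keyword, which corresponds to M = 1.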

Embodiment two: the document retrieval system respectively combines any first number of keywords and any second number of keywords in the plurality of keywords to obtain a plurality of keyword combinations, wherein the first number and the second number are different.

In the embodiment of the present application, a keyword combination obtained by combining any first number of keywords may be referred to as a first keyword combination, and a keyword combination obtained by combining any second number of keywords may be referred to as a second keyword combination. That is, each of the first keyword combinations includes a first number of keywords, and each of the second keyword combinations includes a second number of keywords.

Accordingly, the second embodiment may be: the document retrieval system combines any first number of keywords in the plurality of keywords to obtain a first keyword combination, and combines any second number of keywords in the plurality of keywords to obtain a second keyword combination, wherein the first number and the second number are different.

In the second embodiment, the document retrieval system may construct keyword combinations in which the number of keywords in the first keyword combination is the first number and the number of keywords in the second keyword combination is the second number. Because the first number and the second number are different, the keyword combinations do not all contain the same number of keywords. In this way, the effect of keyword combinations containing different numbers of keywords can be analyzed, which makes the analysis more comprehensive and accurate; alternatively, the difference between a first keyword combination and a second keyword combination can be analyzed to compare more clearly the representativeness of one or more keywords.

The first number and the second number may be set by a related technician according to requirements, which is not limited in the embodiment of the present application.

In some embodiments, the first number and the second number may differ by one. Assume that the first number is N and the second number is M, where N and M are positive integers. For example, the input text is "compressing channel estimation using neural network, the neural network is DNN", and three keywords, "neural network", "channel estimation", and "DNN", are obtained in the keyword selection step in step 201. Two keyword combinations are generated each time, with N = 2 and M = N + 1 = 3, for example:

Keyword combination 1: "channel estimation", "neural network", "DNN" (M = 3).

Keyword combination 2: "neural network", "DNN" (N = 2).
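As an illustrative sketch of the second embodiment, the first and second keyword combinations can be enumerated separately. The function name `build_mixed_combinations` is hypothetical, and enumerating all combinations of each size is an assumption; the application only requires that the two sizes differ.

```python
from itertools import combinations


def build_mixed_combinations(keywords, first_number, second_number):
    """Return (first combinations, second combinations) of different sizes."""
    if first_number == second_number:
        raise ValueError("the first number and the second number must differ")
    first = [tuple(c) for c in combinations(keywords, first_number)]
    second = [tuple(c) for c in combinations(keywords, second_number)]
    return first, second


keywords = ["channel estimation", "neural network", "DNN"]
# N = 2 for the first combinations, M = N + 1 = 3 for the second ones.
first_combos, second_combos = build_mixed_combinations(keywords, 2, 3)

print(first_combos)
print(second_combos)
```

The pair shown in the example above, ("channel estimation", "neural network", "DNN") and ("neural network", "DNN"), is one of the pairs that can be formed from `second_combos` and `first_combos`.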

In some examples, in the first and second embodiments, the combinations of the same or different numbers of keywords may first be treated as candidate keyword combinations, and the candidate keyword combinations are then filtered based on the input text to select more suitable keyword combinations for subsequent analysis.

Specifically, the document retrieval system combines any first number of keywords with any second number of keywords in the plurality of keywords, or combines any target number of keywords in the plurality of keywords to obtain a plurality of candidate keyword combinations, and filters the plurality of candidate keyword combinations based on the input text to obtain a plurality of keyword combinations.

Similarly, the document retrieval system may construct keyword combinations, and the number of keywords in each keyword combination may be the target number, or the first number and the second number. That is, the number of keywords in each keyword combination may be the same or different.

Similarly, the target number, the first number, and the second number may be set by a person skilled in the relevant arts according to the need, which is not limited in the embodiment of the present application.

Different filtering approaches may be employed for this filtering process. In some embodiments, the filtering may be implemented by matching against the input text. Because the keyword combinations in the first and second embodiments are constructed arbitrarily, some combinations of keywords may be meaningless. Thus, in this embodiment, after the candidate keyword combinations are constructed, they are filtered according to the input text, and the meaningless keyword combinations are excluded to improve the accuracy of the subsequent determination of the retrieval priority. That is, each keyword combination is matched with the input text. If the keyword combination meets a predefined condition (i.e., a target condition), it is retained; otherwise, it is discarded.

Specifically, the document retrieval system may match the plurality of candidate keyword combinations with the input text, and use the plurality of candidate keyword combinations that meet the target condition as a plurality of keyword combinations. The target conditions may be set by a relevant technician according to requirements, which is not limited in the embodiment of the present application.
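The filtering can be sketched as follows. The target condition is not specified in the application; purely as an assumption for illustration, the sketch retains a candidate combination only if all of its keywords occur together in one clause of the input text. The names `meets_target_condition` and `filter_combinations` are hypothetical.

```python
import re


def meets_target_condition(combo, input_text):
    """Assumed target condition: all keywords co-occur in a single clause."""
    clauses = re.split(r"[,.;]", input_text.lower())
    return any(
        all(keyword.lower() in clause for keyword in combo)
        for clause in clauses
    )


def filter_combinations(candidates, input_text):
    """Keep only the candidate combinations that meet the target condition."""
    return [c for c in candidates if meets_target_condition(c, input_text)]


input_text = (
    "compress channel estimation using neural network, "
    "the neural network is DNN"
)
candidates = [
    ("neural network", "channel estimation"),
    ("neural network", "DNN"),
    ("DNN", "channel estimation"),
]
kept = filter_combinations(candidates, input_text)
print(kept)
```

Under this assumed condition, the combination of "DNN" and "channel estimation" is discarded because the two keywords never appear in the same clause of the input text; a different target condition would retain a different set.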

203. The document retrieval system invokes the artificial intelligence content generation system to determine the generated text based on the plurality of keyword combinations, respectively.

Through training, the artificial intelligence content generation system has excellent language understanding and processing capabilities and can generate accurate language descriptions. The generated text determined by the artificial intelligence content generation system can be understood as description information that describes well the technical solution corresponding to a keyword combination. The similarity between the generated text and the input text can then be obtained, so that whether the keyword combination is representative can be analyzed accurately and the retrieval priority of the keyword combination or of the keywords can be determined.

In one possible implementation, the artificial intelligence content generation system may be trained on the document retrieval system, and the trained artificial intelligence content generation system is invoked when the document retrieval system needs to perform document retrieval. In another possible implementation, the artificial intelligence content generation system may be trained or stored on other systems, and when the document retrieval system needs to perform document retrieval, the artificial intelligence content generation system on those other systems is invoked. The training location and the storage location of the artificial intelligence content generation system are not limited in the embodiment of the application.

A keyword combination corresponds to a technical solution, which can be described by the text generation function of the artificial intelligence content generation system; the resulting description information states what the technical solution is. The description information can therefore be compared with the input text to obtain a similarity, namely a first similarity, which represents the semantic similarity between the description information and the input text. It can then be determined whether the technical solution corresponding to the keyword combination matches the technical solution corresponding to the input text, and thus whether the keyword combination is representative.

Specifically, the document retrieval system may input the plurality of keyword combinations obtained in step 202 into the artificial intelligence content generation system, and the artificial intelligence content generation system may process the plurality of keyword combinations and output a generated text corresponding to each keyword combination.

In one embodiment, the plurality of keywords are converted before being input into the artificial intelligence content generation system. That is, the keywords are not input directly. For example, the keywords may be joined by prompt words and used as input; as another example, the input text with a mask applied for the keywords may be used as input; as yet another example, a rewrite of the input text based on the keywords may be used as input. In summary, the keywords may be used as important features, either directly or after processing, as input to instruct the artificial intelligence content generation system to generate content.
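Two of these conversions can be sketched as follows. The prompt wording, the `[MASK]` token, and the function names `build_prompt` and `mask_keywords` are assumptions made for illustration; in particular, masking the keywords themselves is only one possible reading of the masked-input variant, and no call to an actual content generation system is shown.

```python
import re


def build_prompt(combo):
    """Join the keywords of one combination with prompt words (assumed wording)."""
    joined = ", ".join(combo)
    return f"Describe a technical solution that involves: {joined}."


def mask_keywords(input_text, combo, mask_token="[MASK]"):
    """Replace each occurrence of the keywords in the input text with a mask."""
    masked = input_text
    for keyword in combo:
        masked = re.sub(re.escape(keyword), mask_token, masked,
                        flags=re.IGNORECASE)
    return masked


input_text = (
    "compress channel estimation using neural network, "
    "the neural network is DNN"
)
combo = ("neural network", "DNN")

prompt = build_prompt(combo)
masked_text = mask_keywords(input_text, combo)
print(prompt)
print(masked_text)
```

Either string could then be passed to the artificial intelligence content generation system as the input for one keyword combination.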

204. The document retrieval system calculates the similarity between the generated text and the input text one by one, and determines the retrieval priority.

After the document retrieval system obtains the generated text corresponding to each keyword combination in step 203, step 204 may be performed to calculate the similarity between each generated text and the input text one by one, so as to determine the retrieval priority based on the similarity.

In some embodiments, for a keyword combination, the greater the similarity between the generated text and the input text determined by the keyword combination, the better the representativeness of the keyword combination, for which a higher search priority may be determined. That is, the search priority is positively correlated with the similarity. The greater the similarity, the higher the retrieval priority. The smaller the similarity, the lower the retrieval priority.

In other embodiments, the retrieval priority of the keyword combination is inversely related to the similarity. The greater the similarity, the lower the retrieval priority; the smaller the similarity, the higher the retrieval priority. The less similar the generated text and the input text are, the lower the probability that such documents exist in the prior art; these documents are searched for preferentially and presented to the user first, so as to meet the search requirement.

The similarity may be obtained by any similarity calculation method, for example, cosine similarity, which is not limited in the embodiment of the present application.
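As a minimal sketch of one such calculation, cosine similarity can be computed over word-count vectors of the two texts. This is a simplification chosen for illustration: a practical system would more likely compare semantic embeddings, and the function name `cosine_similarity` and the sample generated text are hypothetical.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


input_text = "compress channel estimation using neural network"
generated_text = "a neural network is used to compress channel estimation"
print(cosine_similarity(generated_text, input_text))
```

The returned value lies between 0 and 1 for word-count vectors, with larger values indicating that the generated text and the input text share more vocabulary.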

In addition, the similarity may be reflected by another index, in any dimension, that is associated with the generated content and the input content. For example, the length of the generated content may be used as the index: the longer the generated content, the less similar the two are considered to be. In this case, the similarity between the generated content and the input content is not calculated directly; only a single index of the generated content is calculated. The rationale is that, when predicting autoregressively, the artificial intelligence content generation system needs more iterations to generate content that contains all the keywords. In this application, the similarity should therefore not be understood simply as a calculation that must be performed on both the input and the output content.

Similarly, in some embodiments, the retrieval priority includes the selection priority of keywords or keyword combinations and/or the ranking priority of search results.

In some embodiments, the search priority of the keyword combinations and the manner in which the search priorities of the keywords are determined may be different. The search priority determination method for both will be described below.

If the retrieval priority of keyword combinations is to be determined, this step 204 may be: the document retrieval system calculates the similarity between the generated text and the input text one by one, and determines the retrieval priority of the plurality of keyword combinations according to the similarity.

In some embodiments, the retrieval priority of a keyword combination is positively correlated with the similarity between the generated text of the keyword combination and the input text. The more similar the generated text and the input text are, the more similar the technical solution corresponding to the generated text is to the technical solution corresponding to the input text, and the higher the priority the keyword combination should have when it is selected for retrieval or when the search results are ranked, so that the documents the user wants are retrieved more easily or are displayed first.

In other embodiments, the retrieval priority of a keyword combination is inversely related to the similarity between the generated text of the keyword combination and the input text. The more similar the generated text and the input text are, the more similar the two corresponding technical solutions are, and the more matching solutions are likely to be found in the document library. By preferentially searching with the dissimilar technical solutions, documents that are less likely to exist in the prior art can be retrieved first and presented to the user first, so as to meet the search requirement.

In some embodiments, the search priority of the keyword combination may include a selection priority of the keyword combination and/or a search result ranking priority.

Specifically, the document retrieval system may calculate the similarity between the generated text and the input text one by one, rank the plurality of keyword combinations according to the similarity, and use the ranking of the plurality of keyword combinations as the retrieval priority of the plurality of keyword combinations.

As for the ranking based on similarity, the ranking may be in ascending or descending order, as set by the relevant technician according to requirements, which is not limited in the embodiment of the present application.

For example, assuming that the similarity between the generated text and the input text is denoted by Sc, the similarity corresponding to keyword combination 1 ("neural network", "channel estimation") may be denoted by Sc1, the similarity corresponding to keyword combination 2 ("neural network", "DNN") may be denoted by Sc2, and the similarity corresponding to keyword combination 3 ("DNN", "channel estimation") may be denoted by Sc3. Suppose the similarity calculation yields Sc1 = 20, Sc2 = 10, and Sc3 = 15. Sorting in reverse order of similarity, that is, from smallest to largest, gives Sc2, Sc3, Sc1, so the keyword combinations are ranked as follows: keyword combination 2, keyword combination 3, keyword combination 1. This ranking may be used as the retrieval priority in the embodiments of the present application.
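The ranking in this example can be reproduced with a short sketch. The function name `rank_by_similarity` and its `descending` flag are hypothetical; the similarity values are those given in the example.

```python
def rank_by_similarity(scores, descending=False):
    """Return combination names ordered by similarity.

    `descending=False` orders from smallest to largest similarity, which
    matches the ordering used in the example; `descending=True` gives the
    opposite order for embodiments that rank from largest to smallest.
    """
    return sorted(scores, key=scores.get, reverse=descending)


scores = {
    "keyword combination 1": 20,  # Sc1
    "keyword combination 2": 10,  # Sc2
    "keyword combination 3": 15,  # Sc3
}
ranking = rank_by_similarity(scores)
print(ranking)
```

Whether the head or the tail of this list receives the higher retrieval priority depends on whether the embodiment treats priority as positively or inversely related to the similarity.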

If the retrieval priority of keywords is to be determined, this step 204 may be: the document retrieval system calculates the similarity between the generated text and the input text one by one, sorts the plurality of keyword combinations based on the similarity, and determines the selection priority and/or the retrieval result sorting priority of the plurality of keywords according to the sorting of the plurality of keyword combinations and the difference between the keyword combinations.

In some embodiments, the retrieval priority of the keyword may include a selection priority of the keyword and/or a ranking priority of the retrieval results.

As for the retrieval priority of keywords, since steps 203 and 204 both operate on keyword combinations, the retrieval priority of a keyword needs to be determined by combining the comparison result of the keyword combinations with the differences between the keyword combinations.

The ranking process of the keyword combinations is the same as the ranking process when determining the retrieval priority of the keyword combinations, and will not be described in detail here.

In some embodiments, the manner of determining the difference keywords may differ depending on whether the obtained keyword combinations contain the same number of keywords. When single keywords are further analyzed after the keyword combinations are ranked, either of the following two possible steps may be used; of course, the process may be implemented in other manners, which is not limited in this embodiment of the present application.

Step A: in response to each keyword combination containing the target number of keywords, for any two adjacent keyword combinations after ranking, the document retrieval system acquires the difference keyword between the two adjacent keyword combinations, and determines the rank of the difference keyword among the plurality of difference keywords according to the current ranking of the plurality of keyword combinations, to obtain the retrieval priority of the plurality of keywords.

By comparing two keyword combinations, the keyword that differs between them, that is, the difference keyword, can be obtained. In the above example of ranking keyword combinations, the adjacent keyword combinations are keyword combination 2 and keyword combination 3, and keyword combination 3 and keyword combination 1. In this step A, it may be determined that the difference keyword between keyword combination 2 and keyword combination 3 is "neural network", and the difference keyword between keyword combination 3 and keyword combination 1 is "DNN". Combined with the ranking of the keyword combinations (keyword combination 2, keyword combination 3, keyword combination 1), the difference keywords are ranked as "neural network", "DNN", "channel estimation". That is, the three keywords "neural network", "channel estimation", and "DNN" determined in step 201 are ranked as "neural network", "DNN", "channel estimation". The ranking can be expressed as: P1 is "neural network", P2 is "DNN", and P3 is "channel estimation". Note that the current ranking is in reverse order, that is, the earlier a keyword is ranked, the lower its priority, and the later it is ranked, the higher its priority.
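The derivation of the keyword ranking from the ranked keyword combinations can be sketched as follows. It assumes, consistently with the example, that the difference keyword of two adjacent combinations is the keyword present in the earlier combination but absent from the later one, and that any keyword still unranked at the end is taken from the last combination. The function name `rank_difference_keywords` is hypothetical.

```python
def rank_difference_keywords(ranked_combos):
    """Order keywords by the difference between adjacent ranked combinations."""
    ranked = []
    for current, following in zip(ranked_combos, ranked_combos[1:]):
        for keyword in current:
            # A difference keyword is in the earlier combination only.
            if keyword not in following and keyword not in ranked:
                ranked.append(keyword)
    # Keywords that never appeared as a difference come last.
    for keyword in ranked_combos[-1]:
        if keyword not in ranked:
            ranked.append(keyword)
    return ranked


# Keyword combinations 2, 3, 1 in the order obtained from the similarity ranking.
ranked_combos = [
    ("neural network", "DNN"),
    ("DNN", "channel estimation"),
    ("neural network", "channel estimation"),
]
keyword_ranking = rank_difference_keywords(ranked_combos)
print(keyword_ranking)
```

For the combinations in the example this reproduces P1 "neural network", P2 "DNN", P3 "channel estimation".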

By combining the above embodiment of step 202 with the above manner of determining the retrieval priority of keyword combinations, a specific embodiment can be obtained; for details, see the embodiment shown in fig. 3.

It should be noted that current artificial intelligence content generation systems are all based on the Transformer architecture with a multi-head attention mechanism, and their system parameters are determined first from pre-training data and then from training data. That is, the pre-training data "becomes" parameters of the system, and these parameters also accumulate the knowledge in the training data. Thus, when generating text based on the keywords in a keyword combination, the system generates with higher probability the dependencies between keywords that are already present in the training data. In scientific and technical literature, a higher probability means that the content has been mentioned repeatedly in existing research. For example, in the above example, DNN is a kind of neural network and there is a large body of literature on the related content, so the description information (generated text) generated based on DNN may closely match the content input by the user (i.e., the input text). In contrast, if there is less research on channel estimation using DNN, the description information generated by the system (generated text) is less consistent with the user input (i.e., the input text). The reasoning here assumes that the user, through reading a large amount of literature, has the same background knowledge as the model. Therefore, a keyword combination that obtains a higher priority by this method corresponds to a lower probability that existing documents cover it, and such documents should be found preferentially and presented to the user during retrieval, so as to meet the search requirement.

Step B: in response to the plurality of keyword combinations including first keyword combinations and second keyword combinations, for any first keyword combination and any second keyword combination, the document retrieval system obtains the difference keyword between the first keyword combination and the second keyword combination, determines the rank of the difference keyword among the plurality of difference keywords based on the similarity between the generated text of the first keyword combination and the generated text of the second keyword combination, and obtains the retrieval priority of the plurality of keywords.

The number of keywords in the first keyword combination and the second keyword combination is different, and when the obtained keyword combination is analyzed, different keywords between any one of the first keyword combination and any one of the second keyword combination can be compared, and then the different keywords are ranked according to the similarity between the first keyword combination and the second keyword combination.

Wherein the similarity between the first keyword combination and the second keyword combination is measured by the similarity between their respective generated texts. The first keyword combination and the second keyword combination are respectively different technological schemes, and can be described by a text generation function of the artificial intelligent content generation system, and the obtained similarity between generated texts is the similarity between the two technological schemes.

Similarly, when sorting by the similarity between the generated texts, the order may be ascending or descending, as set by a person skilled in the art according to requirements, which is not limited in the embodiment of the present application.

Similarly, the similarity between the generated texts may be obtained by any similarity calculation method, for example, cosine similarity, which is not limited in the embodiment of the present application.
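As an illustration of the cosine similarity mentioned above, the following is a minimal sketch that compares two texts by their word-count vectors. The bag-of-words representation and the function names `bag_of_words` and `cosine_similarity` are assumptions made for this example; the embodiment does not prescribe how texts are vectorized, and a semantic embedding could be substituted.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two term-count vectors."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Texts that share no tokens score 0.0, and identical texts score 1.0 up to floating-point rounding.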

For example, assume that the similarity between the generated texts is also denoted by Sc. Two keyword combinations are generated each time, containing N = 2 and M = N + 1 = 3 keywords respectively, for example:

Keyword combination 1: channel estimation, neural network, DNN (N + 1 = 3).

Keyword combination 2: neural network, DNN (N = 2).

The artificial intelligence content generation system is used to generate a description (generated text) for each of the two keyword combinations. The semantic similarity between the two descriptions (generated texts) is then calculated to give a score Sc1, which corresponds to the difference keyword "channel estimation". Sorting the scores in reverse order yields a ranking of the individual keywords, for example: P1: channel estimation; P2: neural network; P3: DNN.
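The ranking of difference keywords in step B can be sketched as follows. This is only an illustration under stated assumptions: the similarity scores are made-up values standing in for the similarity between generated texts, the function name `rank_difference_keywords` is hypothetical, scores for a keyword that arises from several pairs are averaged, and the sort direction is left as a parameter because the embodiment allows either order.

```python
from collections import defaultdict
from typing import Callable, FrozenSet, Iterable, List, Tuple

Combo = FrozenSet[str]


def rank_difference_keywords(
    larger: Iterable[Combo],
    smaller: Iterable[Combo],
    similarity: Callable[[Combo, Combo], float],
    descending: bool = False,
) -> List[Tuple[str, float]]:
    """Score each keyword that separates an (N+1)-combination from an N-combination."""
    scores = defaultdict(list)
    smaller = list(smaller)
    for big in larger:
        for small in smaller:
            diff = big - small
            # Only pairs that differ by exactly one keyword yield a difference keyword.
            if small < big and len(diff) == 1:
                (keyword,) = diff
                scores[keyword].append(similarity(big, small))
    averaged = {k: sum(v) / len(v) for k, v in scores.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=descending)


# Illustrative scores only; in practice these come from comparing generated texts.
full = frozenset({"channel estimation", "neural network", "DNN"})
pair_scores = {
    frozenset({"neural network", "DNN"}): 0.40,
    frozenset({"channel estimation", "DNN"}): 0.70,
    frozenset({"channel estimation", "neural network"}): 0.90,
}
ranking = rank_difference_keywords(
    [full], pair_scores.keys(), lambda big, small: pair_scores[small]
)
```

With these assumed scores, removing "channel estimation" changes the generated text the most (lowest similarity), so it is ranked first, followed by "neural network" and "DNN".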

Combining the first embodiment of step 202 with the above manner of determining the search priority of the keywords and with step A yields a specific embodiment, which is described with reference to the embodiment shown in fig. 4.

Combining the second embodiment of step 202 with the above manner of determining the search priority of the keywords and with step B yields a specific embodiment, which is described with reference to the embodiment shown in fig. 5.

It will be appreciated that the retrieval priority includes a selection priority and/or a search result ranking priority. If the retrieval priority is a selection priority, the keywords or keyword combinations with a high priority are preferentially selected for retrieval, or are given a higher weight during retrieval.

In some embodiments, a threshold may be set, keywords with priorities higher than the threshold may be used as a search criterion, and keywords with priorities lower than the threshold may be filtered out, and search is not performed based on the filtered keywords. The threshold may be set by a person of ordinary skill in the art according to needs or experience, and the embodiment of the present application is not limited thereto.
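A minimal sketch of the threshold filtering described above is given below. The numeric priority values and the name `filter_by_priority` are assumptions for illustration; the text does not say how a keyword exactly at the threshold is handled, so this sketch keeps only keywords strictly above it.

```python
from typing import Dict, List


def filter_by_priority(priorities: Dict[str, float], threshold: float) -> List[str]:
    """Keep keywords whose priority exceeds the threshold, highest priority first."""
    kept = [k for k, p in priorities.items() if p > threshold]
    return sorted(kept, key=lambda k: priorities[k], reverse=True)


# Hypothetical priority scores for the running example.
priorities = {"DNN": 0.3, "channel estimation": 0.9, "neural network": 0.6}
search_keywords = filter_by_priority(priorities, threshold=0.5)
```

Here "DNN" falls below the threshold and is filtered out, so retrieval would proceed with "channel estimation" and "neural network" only.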

In some embodiments, the retrieval process may be as follows: the document retrieval system generates a search formula based on the plurality of keywords or keyword combinations, their selection priorities, and a search formula generation method, and then searches a document library using the search formula to obtain the documents retrieved for the input text.

The search formula generation method may be any method, and when generating the search formula, the selection priority of the plurality of keywords or the keyword combinations may be used to determine the weight of each keyword or each keyword combination, which is not limited in the embodiment of the present application.

For example, in one possible way, only the keywords or keyword combinations ranked before a certain threshold are used for retrieval, e.g. S = f(p1, p2, ..., pthr), where thr is a predefined threshold and f is a method of constructing the search formula. In another possible way, keywords with different priorities are given different weights in the search engine, or a dedicated search formula is constructed, so that the priority ranking is reflected in the search.
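The two possibilities above can be sketched as follows. The boolean `AND` syntax with quoted phrases is a generic illustration and not the query language of any particular search engine, and the 1/rank weighting is merely one assumed way of giving higher-priority keywords larger weights; both function names are hypothetical.

```python
from typing import List, Tuple


def build_search_formula(ranked_keywords: List[str], thr: int) -> str:
    """One possible f(p1, ..., pthr): AND together the thr highest-priority keywords."""
    if thr < 1:
        raise ValueError("thr must be at least 1")
    return " AND ".join(f'"{k}"' for k in ranked_keywords[:thr])


def rank_weights(ranked_keywords: List[str]) -> List[Tuple[str, float]]:
    """Alternative: keep every keyword and weight it by 1 / rank (assumed scheme)."""
    return [(k, 1.0 / rank) for rank, k in enumerate(ranked_keywords, start=1)]


ranked = ["channel estimation", "neural network", "DNN"]
formula = build_search_formula(ranked, thr=2)
weights = rank_weights(ranked)
```

With thr = 2, the formula contains only the two highest-priority keywords, while the weighted variant retains all three keywords with decreasing weights.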

If the retrieval priority is a search result ranking priority, the documents retrieved by the keywords or keyword combinations with a high search result ranking priority may be displayed preferentially, or may be given a higher weight when the documents are displayed.

For example, assume that the keyword combinations include "neural network, DNN", "neural network, channel estimation" and "channel estimation, DNN", and that comparison shows the generated text obtained by the artificial intelligence content generation system for "neural network, DNN" to have the highest similarity to the input text. The selection priority or search result ranking priority of the keyword combination "neural network, DNN" may then be set to the highest, that is, the combination is used preferentially as a search condition. More generally, if the generated text based on keywords A and B is more similar to the original text, A & B may be prioritized as a search condition, for example by being placed first in the search formula, or the search results for "neural network, DNN" may be placed first when the search results of the three keyword combinations are presented.
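Presenting results in order of search result ranking priority could look like the sketch below. The document identifiers, the `results` mapping and the function name `merge_results` are hypothetical; the sketch simply lists the documents of higher-priority keyword combinations first and removes duplicates.

```python
from typing import Dict, List, Tuple

Combo = Tuple[str, ...]


def merge_results(ranked_combos: List[Combo], results: Dict[Combo, List[str]]) -> List[str]:
    """Concatenate per-combination results in priority order, skipping duplicates."""
    seen = set()
    merged = []
    for combo in ranked_combos:
        for doc_id in results.get(combo, []):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Combinations already sorted by priority; document IDs are made up.
ranked_combos = [
    ("neural network", "DNN"),
    ("neural network", "channel estimation"),
    ("channel estimation", "DNN"),
]
results = {
    ("neural network", "DNN"): ["doc3", "doc1"],
    ("neural network", "channel estimation"): ["doc1", "doc7"],
    ("channel estimation", "DNN"): ["doc9"],
}
display_order = merge_results(ranked_combos, results)
```

A weighted variant, in which each document's score is multiplied by the priority of the combination that retrieved it, would be an equally valid reading of the text.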

In the embodiment shown in fig. 2, if the above step 202 is implemented in the first embodiment, a specific embodiment may be obtained in combination with the above manner of determining the search priority of the keyword combination. This will be described in detail below by way of the embodiment shown in fig. 3. Fig. 3 is a flowchart of a document retrieval method provided in an embodiment of the present application, and referring to fig. 3, the method includes the following steps.

301. The document retrieval system extracts keywords from an input text, and obtains a plurality of keywords corresponding to the input text.

The step 301 is the same as the step 201, and will not be described again.

302. The document retrieval system combines any target number of the plurality of keywords to obtain a plurality of keyword combinations.

This step 302 is similar to the embodiment in step 202, and will not be described in detail herein.

303. The document retrieval system invokes the artificial intelligence content generation system to determine the generated text based on the plurality of keyword combinations, respectively.

The step 303 is similar to the step 203 described above, and will not be described again.

304. The document retrieval system calculates the similarity between the generated text and the input text one by one, and determines the retrieval priority of the plurality of keyword combinations according to the similarity.

The step 304 is the same as the manner of determining the search priority of the keyword combinations in the step 204 described above, and will not be described in detail herein.
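Steps 302 to 304 can be sketched end to end as follows. The artificial intelligence content generation system and the similarity measure are passed in as callables; `toy_generate` and `toy_similarity` (a word-overlap Jaccard score) are stand-ins invented for this example so that it runs without any external service, and sorting from most to least similar is only one of the orders the embodiment permits.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Combo = Tuple[str, ...]


def rank_combinations(
    keywords: Sequence[str],
    target_number: int,
    input_text: str,
    generate: Callable[[Combo], str],
    similarity: Callable[[str, str], float],
) -> List[Tuple[Combo, float]]:
    combos = list(combinations(keywords, target_number))            # step 302
    scored = [(c, similarity(generate(c), input_text)) for c in combos]  # steps 303-304
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_generate(combo: Combo) -> str:
    """Placeholder for the content generation system (assumption)."""
    return "a study of " + " and ".join(combo)


def toy_similarity(a: str, b: str) -> float:
    """Placeholder similarity: Jaccard overlap of lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


ranked = rank_combinations(
    ["channel estimation", "neural network", "DNN"],
    2,
    "channel estimation with a DNN",
    toy_generate,
    toy_similarity,
)
```

With these toy stand-ins, the combination ("channel estimation", "DNN") scores highest against the input text.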

In the embodiment of the application, keyword selection is regarded as a key step in scientific and technical literature retrieval, and keyword selection is therefore improved: after the keywords are extracted from the input text, they are not used directly for retrieval; instead, the keywords are combined and the effect of the different keyword combinations is analyzed, so that keyword combinations are selected or the retrieval results are ranked accordingly.

In the embodiment shown in fig. 2, if the above step 202 is implemented in the first embodiment, a specific embodiment may be obtained in combination with the above manner of determining the retrieval priority of the keyword. This will be described in detail below with the embodiment shown in fig. 4. Fig. 4 is a flowchart of a document retrieval method provided in an embodiment of the present application, and referring to fig. 4, the method includes the following steps.

401. The document retrieval system extracts keywords from an input text, and obtains a plurality of keywords corresponding to the input text.

The step 401 is the same as the step 201, and will not be described again.

402. The document retrieval system combines any target number of the plurality of keywords to obtain a plurality of candidate keyword combinations.

This step 402 is similar to the embodiment in step 202, and will not be described in detail herein.

403. The document retrieval system invokes the artificial intelligence content generation system to determine the generated text based on the plurality of keyword combinations, respectively.

The step 403 is similar to the step 203, and will not be described in detail herein.

Similarly, the filtering process may be: the document retrieval system matches the plurality of candidate keyword combinations with the input text, and uses the plurality of candidate keyword combinations meeting the target condition as a plurality of keyword combinations.
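The filtering of candidate keyword combinations can be sketched as below. The target condition is not specified here, so this example assumes one purely for illustration: a candidate is kept only if all of its keywords occur together in at least one sentence of the input text. The function name `filter_candidates` and the sentence splitter are likewise assumptions.

```python
import re
from typing import Iterable, List, Tuple

Combo = Tuple[str, ...]


def filter_candidates(candidates: Iterable[Combo], input_text: str) -> List[Combo]:
    """Keep candidates whose keywords co-occur in a single sentence (assumed condition)."""
    sentences = [s.lower() for s in re.split(r"[.!?]+", input_text) if s.strip()]
    return [
        combo
        for combo in candidates
        if any(all(k.lower() in s for k in combo) for s in sentences)
    ]


text = "We use a DNN for channel estimation. A neural network is trained offline."
candidates = [
    ("channel estimation", "DNN"),
    ("channel estimation", "neural network"),
    ("neural network", "DNN"),
]
kept = filter_candidates(candidates, text)
```

Only the first candidate satisfies the assumed condition for this text; other target conditions, such as a minimum co-occurrence count, would slot into the same structure.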

404. The document retrieval system calculates the similarity between the generated text and the input text one by one, and ranks the plurality of keyword combinations based on the similarity.

405. For any two adjacent keyword combinations after sorting, the document retrieval system obtains a difference keyword between the two adjacent keyword combinations.

406. The document retrieval system determines the rank of each difference keyword among the plurality of difference keywords according to the current ranking of the plurality of keyword combinations, and obtains the retrieval priorities of the plurality of keywords.

The steps 404 to 406 are similar to the above-mentioned manner of determining the search priority of the keywords in the step 204 and the step A, and are not repeated here.
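Steps 405 and 406 can be sketched as follows, under assumptions that go beyond the text: the difference keyword of two adjacent combinations is taken to be any keyword present in the higher-ranked combination but absent from the next one, a keyword's priority is set by the first adjacent pair in which it appears as a difference keyword, and keywords that never appear as a difference keyword are appended at the end. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Combo = Tuple[str, ...]


def rank_keywords_from_adjacent(sorted_combos: Sequence[Combo]) -> List[str]:
    """Derive a keyword order from combinations already sorted in step 404."""
    ranking: List[str] = []
    # Step 405: compare each combination with its successor in the sorted list.
    for higher, lower in zip(sorted_combos, sorted_combos[1:]):
        lower_set = set(lower)
        for keyword in higher:
            if keyword not in lower_set and keyword not in ranking:
                ranking.append(keyword)  # step 406: earlier pairs rank first
    # Keywords that never showed up as a difference keyword go last (assumption).
    for combo in sorted_combos:
        for keyword in combo:
            if keyword not in ranking:
                ranking.append(keyword)
    return ranking


sorted_combos = [
    ("channel estimation", "DNN"),
    ("channel estimation", "neural network"),
    ("neural network", "DNN"),
]
keyword_order = rank_keywords_from_adjacent(sorted_combos)
```

For this assumed ordering of combinations, the first adjacent pair differs by "DNN" and the second by "channel estimation", which gives the keyword order DNN, channel estimation, neural network.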

In the embodiment of the application, keyword selection is regarded as a key step in scientific and technical literature retrieval, and keyword selection is therefore improved: after the keywords are extracted from the input text, they are not used directly for retrieval; instead, the keywords are combined and the effect of the different keyword combinations is analyzed, so that keywords are selected or the retrieval results are ranked accordingly.

In the embodiment shown in fig. 2, if the step 202 is implemented in the second embodiment, a specific embodiment may be obtained by combining the above-mentioned keyword search priority determining manner and the step B. This will be described in detail below with the embodiment shown in fig. 5. Fig. 5 is a flowchart of a document retrieval method provided in an embodiment of the present application, and referring to fig. 5, the method includes the following steps.

501. The document retrieval system extracts keywords from an input text, and obtains a plurality of keywords corresponding to the input text.

The step 501 is the same as the step 201 described above, and will not be described in detail here.

502. The document retrieval system combines any first number of the plurality of keywords to obtain a first keyword combination.

503. The document retrieval system combines any second number of keywords in the plurality of keywords to obtain a second keyword combination, wherein the first number and the second number are different.

The steps 502 and 503 are the same as the second embodiment of the step 202 described above, and will not be described again.
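Steps 502 and 503 amount to enumerating keyword subsets of two different sizes. A minimal sketch using the standard library follows; the function name `build_combinations` is hypothetical, and enumerating every subset of each size is an assumption, since an implementation could equally sample only some of them.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Combo = Tuple[str, ...]


def build_combinations(
    keywords: Sequence[str], first_number: int, second_number: int
) -> Tuple[List[Combo], List[Combo]]:
    """Return all first-number and all second-number keyword combinations."""
    if first_number == second_number:
        raise ValueError("the first number and the second number must differ")
    first = list(combinations(keywords, first_number))    # step 502
    second = list(combinations(keywords, second_number))  # step 503
    return first, second


first_combos, second_combos = build_combinations(
    ["channel estimation", "neural network", "DNN"], 3, 2
)
```

For three keywords this yields one first keyword combination containing all three keywords and three second keyword combinations of two keywords each.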

504. The document retrieval system invokes the artificial intelligence content generation system to determine the generated text based on the plurality of keyword combinations, respectively.

The step 504 is similar to the step 203, and will not be described in detail herein.

505. The document retrieval system acquires, for any one of the first keyword combination and the second keyword combination, a difference keyword between the first keyword combination and the second keyword combination.

506. The document retrieval system determines the order of the difference keywords in a plurality of difference keywords based on the similarity between the generated text of the first keyword combination and the generated text of the second keyword combination, and obtains the retrieval priority of the keywords.

The steps 505 and 506 are the same as the step B in the step 204, and will not be described in detail herein.

All the above optional solutions can be combined to form an optional embodiment of the present application, which is not described in detail herein.

Fig. 6 is a schematic structural diagram of a document retrieval device provided in an embodiment of the present application, referring to fig. 6, the device includes:

the extracting module 601 is configured to extract keywords from an input text, so as to obtain a plurality of keywords corresponding to the input text;

a combination module 602, configured to combine the plurality of keywords to obtain a plurality of keyword combinations;

a determining module 603, configured to invoke the artificial intelligence content generation system to determine generated texts based on the plurality of keyword combinations, respectively;

and the retrieval module 604 is used for calculating the similarity between the generated text and the input text one by one and determining the retrieval priority.

In some embodiments, the search priority includes a keyword or keyword combination selection priority and/or a search result ranking priority.

In some embodiments, the combining module 602 is configured to perform any one of:

combining any target number of keywords in the plurality of keywords to obtain a plurality of keyword combinations;

and combining any first number of keywords and any second number of keywords in the plurality of keywords respectively to obtain a plurality of keyword combinations, wherein the first number of keywords and the second number of keywords are different.

In some embodiments, the combining module 602 is configured to:

filter a plurality of candidate keyword combinations based on the input text to obtain the plurality of keyword combinations.

In some embodiments, the combining module 602 is configured to match the plurality of candidate keyword combinations with the input text, and use the plurality of candidate keyword combinations that meet the target condition as a plurality of keyword combinations.

In some embodiments, the retrieval module 604 is configured to perform any one of:

in response to each keyword combination containing a target number of keywords, for any two adjacent keyword combinations after sorting, obtaining a difference keyword between the two adjacent keyword combinations, and determining the rank of the difference keyword among a plurality of difference keywords according to the current ranking of the plurality of keyword combinations, to obtain the retrieval priority of the plurality of keywords;

in response to the plurality of keyword combinations including a first keyword combination and a second keyword combination, for any first keyword combination and any second keyword combination, obtaining a difference keyword between the first keyword combination and the second keyword combination, determining the rank of the difference keyword among the plurality of difference keywords based on the similarity between the generated text of the first keyword combination and the generated text of the second keyword combination, and obtaining the retrieval priority of the plurality of keywords, wherein each first keyword combination contains a first number of keywords, and each second keyword combination contains a second number of keywords.

The device provided by the embodiment of the application regards keyword selection as a key step in scientific and technical literature retrieval and improves it: after the keywords are extracted from the input text, they are not used directly for retrieval; instead, the keywords are combined and the effect of the different keyword combinations is analyzed, so that keywords or keyword combinations are selected or the retrieval results are ranked accordingly.

It should be noted that when the document retrieval device provided in the above embodiment performs document retrieval, the division into the above functional modules is used only as an example. In practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the document retrieval device can be divided into different functional modules to perform all or part of the functions described above. In addition, the document retrieval device and the document retrieval method provided in the foregoing embodiments belong to the same concept; for the specific implementation process of the device, see the method embodiments, which are not repeated here.

Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 700 may vary considerably in configuration or performance, and may include one or more processors (Central Processing Units, CPU) 701 and one or more memories 702, where at least one computer program is stored in the memories 702, and the at least one computer program is loaded and executed by the processors 701 to implement the document retrieval method provided in the above method embodiments. The electronic device can also include other components for implementing device functions; for example, it can also have wired or wireless network interfaces, input-output interfaces, and the like for input and output. These are not described in detail in the embodiments of the present application.

The electronic device in the method embodiment described above can be implemented as a terminal. For example, fig. 8 is a block diagram of a terminal according to an embodiment of the present application. The terminal 800 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 800 may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.

In general, the terminal 800 includes: a processor 801 and a memory 802.

Processor 801 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 801 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 801 may integrate a GPU (Graphics Processing Unit) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 801 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.

Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 802 is used to store at least one instruction for execution by processor 801 to implement the document retrieval method provided by the method embodiments in the present application.

In some embodiments, the terminal 800 may further optionally include a peripheral interface 803 and at least one peripheral device. The processor 801, the memory 802, and the peripheral interface 803 may be connected by a bus or signal line. Each peripheral device may be connected to the peripheral interface 803 by a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 804, a display 805, a camera assembly 806, an audio circuit 807, a positioning component 808, and a power supply 809.

Peripheral interface 803 may be used to connect at least one Input/Output (I/O) related peripheral device to processor 801 and memory 802. In some embodiments, processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 804 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 804 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 804 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the World Wide Web, metropolitan area networks, intranets, mobile communication networks of each generation (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may also include NFC (Near Field Communication) related circuitry, which is not limited in this application.

The display 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to collect touch signals at or above its surface. The touch signal may be input as a control signal to the processor 801 for processing. In this case, the display 805 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, there may be one display 805, disposed on the front panel of the terminal 800; in other embodiments, there may be at least two displays 805, disposed on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display 805 may be a flexible display disposed on a curved surface or a folded surface of the terminal 800. Furthermore, the display 805 may be arranged in a non-rectangular irregular shape, i.e., a shaped screen. The display 805 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 806 is used to capture images or video. Optionally, the camera assembly 806 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and Virtual Reality (VR) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.

Audio circuitry 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and the environment, converting the sound waves into electric signals, inputting the electric signals to the processor 801 for processing, or inputting the electric signals to the radio frequency circuit 804 for voice communication. For stereo acquisition or noise reduction purposes, a plurality of microphones may be respectively disposed at different portions of the terminal 800. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuit 807 may also include a headphone jack.

The positioning component 808 is used to locate the current geographic location of the terminal 800 to enable navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the Beidou system of China, or the Galileo system of the European Union.

A power supply 809 is used to power the various components in the terminal 800. The power supply 809 may be an alternating current, direct current, disposable battery, or rechargeable battery. When the power supply 809 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, the terminal 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyroscope sensor 812, pressure sensor 813, fingerprint sensor 814, optical sensor 815, and proximity sensor 816.

The acceleration sensor 811 can detect the magnitudes of accelerations on the three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 801 may control the display 805 to display the user interface in a landscape view or a portrait view based on the gravitational acceleration signal acquired by the acceleration sensor 811. The acceleration sensor 811 may also be used to acquire motion data of a game or of the user.

The gyro sensor 812 may detect a body direction and a rotation angle of the terminal 800, and the gyro sensor 812 may collect a 3D motion of the user to the terminal 800 in cooperation with the acceleration sensor 811. The processor 801 may implement the following functions based on the data collected by the gyro sensor 812: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.

The pressure sensor 813 may be disposed at a side frame of the terminal 800 and/or at a lower layer of the display 805. When the pressure sensor 813 is disposed on a side frame of the terminal 800, a grip signal of the terminal 800 by a user may be detected, and the processor 801 performs left-right hand recognition or shortcut operation according to the grip signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at the lower layer of the display screen 805, the processor 801 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 805. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

The fingerprint sensor 814 is used to collect a fingerprint of a user, and the processor 801 identifies the identity of the user based on the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the identity of the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 814 may be disposed on the front, back, or side of the terminal 800. When a physical key or vendor logo is provided on the terminal 800, the fingerprint sensor 814 may be integrated with the physical key or vendor logo.

The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the display 805 based on the ambient light intensity collected by the optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the display 805 is turned up; when the ambient light intensity is low, the display brightness of the display 805 is turned down. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 815.

A proximity sensor 816, also referred to as a distance sensor, is typically provided on the front panel of the terminal 800. The proximity sensor 816 is used to collect the distance between the user and the front of the terminal 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually decreases, the processor 801 controls the display 805 to switch from the screen-on state to the screen-off state; when the proximity sensor 816 detects that the distance between the user and the front of the terminal 800 gradually increases, the processor 801 controls the display 805 to switch from the screen-off state to the screen-on state.

Those skilled in the art will appreciate that the structure shown in fig. 8 is not limiting and that more or fewer components than shown may be included or certain components may be combined or a different arrangement of components may be employed.

The electronic device in the above-described method embodiment can be implemented as a server. For example, fig. 9 is a schematic structural diagram of a server provided in the embodiments of the present application. The server 900 may vary considerably in configuration or performance, and can include one or more processors (Central Processing Units, CPU) 901 and one or more memories 902, where at least one computer program is stored in the memories 902, and the at least one computer program is loaded and executed by the processor 901 to implement the document retrieval method provided in each of the method embodiments described above. Of course, the server can also have components such as a wired or wireless network interface and an input/output interface for input and output, and can also include other components for implementing the functions of the device, which are not described herein.

In an exemplary embodiment, a computer-readable storage medium is also provided, for example a memory comprising at least one computer program executable by a processor to perform the document retrieval method of the above-described embodiments. For example, the computer-readable storage medium can be a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, etc.

In an exemplary embodiment, a computer program product or a computer program is also provided, the computer program product or the computer program comprising one or more pieces of program code stored in a computer-readable storage medium. One or more processors of the electronic device are capable of reading the one or more pieces of program code from the computer-readable storage medium and executing them, so that the electronic device is capable of performing the above-described document retrieval method.

It should be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.

It should be understood that determining B from A does not mean determining B from A alone; B can also be determined from A and/or other information.

Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the above-described embodiments can be implemented by hardware, or can be implemented by a program instructing the relevant hardware, and the program can be stored in a computer readable storage medium, and the above-mentioned storage medium can be a read-only memory, a magnetic disk or an optical disk, etc.

The foregoing description covers only alternative embodiments of the present application and is not intended to limit the present application. Any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims

1. A method of document retrieval, the method comprising:

combining the keywords to obtain a plurality of keyword combinations;

based on the plurality of keyword combinations, invoking an artificial intelligence content generation system to determine a generated text for each keyword combination, wherein each generated text describes the technical solution that results from the corresponding keyword combination;

calculating the similarity between each generated text and the input text one by one, and determining the retrieval priority of the plurality of keyword combinations according to the similarity; or,

calculating the similarity between each generated text and the input text one by one, sorting the plurality of keyword combinations based on the similarity, in response to each keyword combination containing a target number of keywords, obtaining the difference keywords between any two adjacent keyword combinations after sorting, and determining the ranking of each difference keyword among the plurality of difference keywords according to the current ranking of the plurality of keyword combinations, to obtain the retrieval priority of the plurality of keywords; or,

calculating the similarity between each generated text and the input text one by one, sorting the plurality of keyword combinations based on the similarity, in response to the plurality of keyword combinations including first keyword combinations and second keyword combinations, obtaining, for any first keyword combination and second keyword combination, the difference keywords between the first keyword combination and the second keyword combination, and determining the ranking of the difference keywords among the plurality of difference keywords based on the similarity of the generated text of the first keyword combination and the similarity of the generated text of the second keyword combination, to obtain the retrieval priority of the plurality of keywords, wherein each first keyword combination comprises a first number of keywords, and each second keyword combination comprises a second number of keywords.
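As an illustration only, the first two alternatives of claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: `generate_text` is a hypothetical stand-in for the artificial intelligence content generation system, the bag-of-words cosine similarity is one possible similarity measure (the claim does not fix one), and the way difference keywords are ordered within an adjacent pair is an assumed reading of the claim. All function names are introduced here for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def generate_text(keywords):
    """Hypothetical stand-in for the AI content generation system (assumption)."""
    return "a scheme combining " + " and ".join(keywords)


def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity; an assumed choice of similarity measure."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm = math.sqrt(sum(c * c for c in vec_a.values())) * math.sqrt(
        sum(c * c for c in vec_b.values())
    )
    return dot / norm if norm else 0.0


def rank_combinations(keywords, input_text, target_number, generate=generate_text):
    """First alternative: rank keyword combinations by the similarity between
    their generated text and the input text (highest similarity first)."""
    combos = list(combinations(keywords, target_number))
    scored = [(cosine_similarity(generate(c), input_text), c) for c in combos]
    # Python's sort is stable, so ties keep their original combination order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [combo for _, combo in scored]


def rank_difference_keywords(ranked_combos):
    """Second alternative (assumed reading): walk adjacent pairs of the ranked
    combinations and order the keywords by which they differ, taking the
    keyword from the higher-ranked combination first."""
    ordered = []
    for higher, lower in zip(ranked_combos, ranked_combos[1:]):
        for kw in higher:
            if kw not in lower and kw not in ordered:
                ordered.append(kw)
        for kw in lower:
            if kw not in higher and kw not in ordered:
                ordered.append(kw)
    return ordered


if __name__ == "__main__":
    ranked = rank_combinations(
        ["battery", "graphene", "coating"], "graphene coating on an electrode", 2
    )
    print(ranked)
    print(rank_difference_keywords(ranked))
```

In practice the `generate` argument would be replaced by a call to whatever content generation system is available, and the similarity function by a semantic measure. The ordering logic stays the same.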

2. The method of claim 1, wherein the retrieval priority comprises a selection priority of keywords or keyword combinations and/or a ranking priority of retrieval results.

3. The method of claim 1, wherein combining the plurality of keywords to obtain a plurality of keyword combinations comprises any one of:

4. The method of claim 3, wherein combining any target number of the plurality of keywords to obtain a plurality of keyword combinations comprises:

5. The method of claim 4, wherein filtering the plurality of candidate keyword combinations based on the input text to obtain a plurality of keyword combinations comprises:

6. A document retrieval device, the device comprising:

the determining module is used for invoking the artificial intelligence content generation system based on the plurality of keyword combinations to determine a generated text for each keyword combination, wherein each generated text describes the technical solution that results from the corresponding keyword combination;

the retrieval module is used for calculating the similarity between each generated text and the input text one by one, and determining the retrieval priority of the plurality of keyword combinations according to the similarity; or,

the retrieval module is used for calculating the similarity between each generated text and the input text one by one, sorting the plurality of keyword combinations based on the similarity, in response to each keyword combination containing a target number of keywords, obtaining the difference keywords between any two adjacent keyword combinations after sorting, and determining the ranking of each difference keyword among the plurality of difference keywords according to the current ranking of the plurality of keyword combinations, to obtain the retrieval priority of the plurality of keywords; or,

the retrieval module is used for calculating the similarity between each generated text and the input text one by one, sorting the plurality of keyword combinations based on the similarity, in response to the plurality of keyword combinations including first keyword combinations and second keyword combinations, obtaining, for any first keyword combination and second keyword combination, the difference keywords between the first keyword combination and the second keyword combination, and determining the ranking of the difference keywords among the plurality of difference keywords based on the similarity of the generated text of the first keyword combination and the similarity of the generated text of the second keyword combination, to obtain the retrieval priority of the plurality of keywords, wherein each first keyword combination comprises a first number of keywords, and each second keyword combination comprises a second number of keywords.

7. A document retrieval system, characterized in that it comprises at least one electronic device for performing the document retrieval method according to any one of claims 1 to 5.

8. A computer-readable storage medium, wherein at least one computer program is stored in the storage medium, the at least one computer program being loaded and executed by a processor to implement the document retrieval method of any one of claims 1 to 5.