
CN114416920A - Text search method and device, electronic equipment and storage medium

Info

Publication number: CN114416920A
Application number: CN202111622131.XA
Authority: CN
Inventors: 陈君
Original assignee: Beijing Pixel Software Technology Co Ltd
Current assignee: Beijing Pixel Software Technology Co Ltd
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2022-04-29

Abstract

The embodiment of the invention provides a text search method and device, electronic equipment and a storage medium, and relates to the technical field of retrieval. First, all texts containing preset search information are screened out according to a plurality of word data, an index library and the preset search information to obtain at least one first text; then, the at least one first text is sorted to obtain at least one second text; next, label information corresponding to each second text is generated according to the category and the content of the second text; finally, the identifier and the label information corresponding to each second text are taken as a search result and displayed. Because the method generates label information according to the category and content of each screened text and displays the identifier and the label information as the search result, storage space is saved and text search can be realized without large-scale equipment.

Description

Text search method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of retrieval, in particular to a text search method and device, electronic equipment and a storage medium.

Background

At present, various industries need to process large amounts of information, and many projects need to store, classify and search that information, so a search engine is particularly important when dealing with large-scale information.

Because the amount of information is huge, a search often returns tens of thousands of results. When the search results are displayed, not only is the number of displayed texts large, but the contents of the texts also occupy storage space. Existing search engines therefore place high demands on equipment and are suitable only for large enterprises or projects; individuals, or users without large-scale equipment, need a more lightweight search technology.

Disclosure of Invention

The invention aims to provide a text searching method, a text searching device, electronic equipment and a storage medium, which can generate corresponding label information according to the category and content of a screened text and display the text and the label information as a searching result, so that the storage space is saved and the user experience is improved.

In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:

in a first aspect, an embodiment of the present invention provides a text search method, which is applied to an electronic device, where the electronic device stores a plurality of texts in advance, and each text has a corresponding identifier; the method comprises the following steps:

obtaining at least one first text according to a plurality of word data, an index database and preset search information, wherein the word data are obtained by dividing the plurality of texts, the index database is used for representing the mapping relation between each word data and at least one text corresponding to each word data, and each first text comprises the preset search information;

sequencing the at least one first text according to a preset scoring mechanism to obtain at least one second text;

generating label information corresponding to each second text, wherein the label information is used for representing the category and the content of the second text;

and taking the identifier corresponding to each second text and the label information corresponding to each second text as search results and displaying the search results.

In a possible embodiment, before the step of obtaining at least one first text according to the plurality of word data, the index database and the preset search information, the method further comprises:

dividing the plurality of texts by using a preset word segmentation tool to obtain a plurality of word data, wherein each word data has at least one corresponding text;

and establishing the index library for the word data.

In a possible implementation manner, the step of sorting the at least one first text according to a preset scoring mechanism to obtain at least one second text includes:

according to the preset scoring mechanism, scoring is respectively carried out on each first text to obtain a score corresponding to each first text, wherein the score represents the frequency of the preset search information appearing in the first text;

and sequencing the at least one first text according to the sequence of the scores from large to small to obtain the at least one second text.

In a possible implementation manner, the step of generating tag information corresponding to each second text includes:

classifying the at least one second text according to a preset classifier to generate classification information corresponding to each second text, wherein the classification information is used for representing the category of the second text;

and generating abstract information corresponding to each second text to obtain the label information corresponding to each second text, wherein the label information comprises the classification information and the abstract information, and the abstract information is used for representing the content of the second text.

In a possible implementation manner, the step of generating the abstract information corresponding to each second text according to a preset algorithm includes:

taking any one of the at least one second text as a target second text;

extracting the target second text according to a text sorting algorithm to obtain a plurality of characteristic sentences, wherein the characteristic sentences are used for representing the core content of the second text;

taking the characteristic sentences as abstract information corresponding to the target second text;

and traversing the at least one second text to obtain abstract information corresponding to each second text.

In a second aspect, an embodiment of the present invention further provides a text search apparatus, which is applied to an electronic device, where a plurality of texts are stored in advance in the electronic device, and the text search apparatus includes:

the acquisition module is used for acquiring at least one first text according to a plurality of word data, an index database and preset search information, wherein the word data is obtained by dividing the plurality of texts, the index database is used for representing the mapping relation between each word data and at least one text corresponding to each word data, and each first text comprises the preset search information;

the sorting module is used for sorting the at least one first text according to a preset scoring mechanism to obtain at least one second text;

the generating module is used for generating label information corresponding to each second text, wherein the label information is used for representing the category and the content of the second text;

and the display module is used for obtaining a search result according to the at least one second text and the label information corresponding to each second text.

In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:

one or more processors;

a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the text search method described above.

In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the text search method described above.

Compared with the prior art, the text search method, the text search device, the electronic equipment and the storage medium provided by the embodiment of the invention work as follows. First, all texts containing preset search information are screened out according to a plurality of word data, an index library and the preset search information to obtain at least one first text; then, the at least one first text is sorted to obtain at least one second text; next, label information corresponding to each second text is generated according to the category and the content of the second text; finally, the identifier and the label information corresponding to each second text are taken as a search result and displayed. Because the method generates label information according to the category and content of each screened text and displays the identifier and the label information as the search result, storage space is saved and text search can be realized without large-scale equipment.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

Fig. 1 is a block diagram of an electronic device according to an embodiment of the present invention.

Fig. 2 is a schematic flowchart of a text search method according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of an index library according to an embodiment of the present invention.

Fig. 4 is an exemplary diagram of a search result page provided in an embodiment of the present invention.

Fig. 5 is an exemplary diagram of a detail page of search results provided by an embodiment of the present invention.

Fig. 6 is another schematic flow chart of the text search method according to the embodiment of the present invention.

Fig. 7 is a flowchart illustrating step S120 in the text search method illustrated in fig. 2.

Fig. 8 is a flowchart illustrating step S130 in the text search method illustrated in fig. 2.

Fig. 9 is a flowchart illustrating step S1302 in the text search method illustrated in fig. 8.

Fig. 10 is a block diagram of a text search apparatus according to an embodiment of the present invention.

Reference numerals: 100-electronic device; 101-memory; 102-processor; 103-bus; 200-text search apparatus; 201-acquisition module; 202-sorting module; 203-generation module; 204-display module.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.

In the conventional technology, search engines are mostly applied to Internet web search. The number of web pages is large and the amount of stored information is huge, so the information a user wants has to be screened out of massive data, generally through the following process:

first, web page information is found and collected from the Internet and stored in a local database; then, the information is extracted and organized to establish an index library; finally, a retriever quickly looks up documents in the index library according to the query keywords input by the user, evaluates the relevance of each document to the query, sorts the results to be output, and returns the query results to the user.

Because the amount of web page information is huge and is updated in real time, a search often returns tens of thousands of documents. After sorting the retrieved results, the traditional search method displays them directly on the page, and the displayed content includes the title of each document and its first paragraph or first several paragraphs, which occupies a large storage space. The traditional search method is therefore usually implemented on large-scale equipment and is suitable only for large enterprises or projects; individuals, or situations without large-scale equipment, need a more lightweight search technology.

To solve this problem, this embodiment provides a text search method, which generates corresponding tag information according to the content of the screened text, and displays the screened text and the tag information as a search result, so that the storage space occupied by the search result is smaller, and therefore, the text search can be implemented without a large-scale device.

The method is described in detail below.

Referring to fig. 1, fig. 1 is a block diagram illustrating an electronic device 100 according to the present embodiment, where the electronic device 100 may be, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a server, or other electronic devices with processing capability. Electronic device 100 includes memory 101, processor 102, and bus 103. The memory 101 and the processor 102 are connected by a bus 103.

The memory 101 is used for storing programs, such as the text search apparatus 200, and it should be noted that the text search apparatus 200 in the present embodiment is partially improved from the conventional search engine.

The text search apparatus 200 includes at least one software functional module that can be stored in the memory 101 in the form of software or firmware, and the processor 102, after receiving an execution instruction, executes the program to implement the text search method in the present embodiment.

The memory 101 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The processor 102 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the text search method in this embodiment may be implemented by integrated logic circuits of hardware in the processor 102 or instructions in the form of software.

The processor 102 may be a general-purpose processor, including a Central Processing Unit (CPU), a Microcontroller Unit (MCU), a Complex Programmable Logic Device (CPLD), a Field Programmable Gate Array (FPGA), or an embedded ARM processor.

The text search method provided by the present embodiment is described on the basis of the electronic device 100 shown in fig. 1. Referring to fig. 2, fig. 2 shows a flowchart of a text search method provided in this embodiment, where the method is applied to an electronic device 100, and the electronic device 100 stores a plurality of texts in advance, and each text has a corresponding identifier. The method comprises the following steps:

S110, obtaining at least one first text according to the plurality of word data, an index database and preset search information, wherein the word data is obtained by dividing the plurality of texts, the index database is used for representing the mapping relation between each word data and at least one text corresponding to each word data, and each first text comprises the preset search information.

In this embodiment, the text may be article data stored in a local database of the electronic device 100, or article data acquired from a network.

Text can be viewed as a set of words consisting of a number of word data, where word data refers to the smallest word element that makes up the text.

For example, for English text, one English word is one word data, such as "we"; for Chinese text, word data is obtained by word segmentation using a word segmenter, and the obtained word data may consist of a single Chinese character, such as the character meaning "I", or of several Chinese characters, such as the word meaning "us".

The index database is established in an inverted index mode according to the word data and the texts corresponding to the word data, and comprises a dictionary formed by the word data and an inverted file, wherein the inverted file is used for storing an inverted list, the inverted list is a corresponding relation list of the word data and at least one text corresponding to the word data, and the dictionary is used for storing the word data and the positions of the inverted list corresponding to the word data in the inverted file.

It should be noted that, since one word data may appear in a plurality of texts, the correspondence stored in an inverted list is usually that of one word data to a plurality of texts. Here, each text is generally represented by an identifier, which may be the name of the text, the number of the text, or the like.

For example, fig. 3 is a schematic diagram of an index library, which includes a dictionary and an inverted file, each word data in the dictionary has a corresponding inverted list in the inverted file, and at least one text corresponding to the word data can be obtained according to the inverted list.
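The dictionary and inverted file layout described above can be sketched as follows. This is a minimal illustration only, not the implementation of the embodiment: the function name `build_index`, the sample identifiers and the choice to store each dictionary entry as an (offset, length) pair are assumptions made for the example.

```python
from collections import defaultdict


def build_index(texts):
    """Build a dictionary and an inverted file from {identifier: [word, ...]}.

    Assumed layout: the inverted file is one flat list holding every inverted
    list back to back, and the dictionary maps each word data to the
    (offset, length) of its inverted list inside that file.
    """
    postings = defaultdict(list)
    for text_id, words in texts.items():
        for word in set(words):          # record each text once per word data
            postings[word].append(text_id)

    inverted_file = []                   # all inverted lists, stored contiguously
    dictionary = {}                      # word data -> (offset, length)
    for word in sorted(postings):
        ids = sorted(postings[word])
        dictionary[word] = (len(inverted_file), len(ids))
        inverted_file.extend(ids)
    return dictionary, inverted_file


# Hypothetical sample texts, already divided into word data.
texts = {
    "text1": ["universe", "calendar", "universe"],
    "text2": ["universe", "star"],
    "text3": ["weather", "clear"],
}
dictionary, inverted_file = build_index(texts)
offset, length = dictionary["universe"]
universe_texts = inverted_file[offset:offset + length]
```

Keeping only a position in the dictionary means the dictionary stays small, while the inverted lists themselves can live in a separate file.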

The following describes a process of obtaining at least one first text by taking preset search information as word data 1 as an example:

first, word data 1 is located in the dictionary according to the preset search information; then, inverted list 1 is obtained from the inverted file according to the position, recorded in the dictionary, of the inverted list corresponding to word data 1. Inverted list 1 stores all texts corresponding to word data 1, and all the texts in inverted list 1 are obtained as the at least one first text.

The preset search information is search information input by a user through an interactive interface of the electronic device 100. It may be a single word or a combination of multiple words; when it is a combination of multiple words, the words need to be separated by separators.

The first text refers to all texts containing preset search information, and the process of obtaining the first text is specifically explained by taking the preset search information as a single word as an example:

firstly, inquiring target word data matched with preset search information in a dictionary in an index database; then, according to the storage position of the inverted list of the target word data in the inverted file, acquiring the inverted list of the target word data from the inverted file; and finally, acquiring a corresponding text as a first text according to the identifier in the inverted list.
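The lookup can be sketched as below. The index contents, the function names `lookup` and `search`, and the use of a space as the separator are hypothetical. The text does not say how a multi-word query is combined, so the sketch assumes that a first text must contain every word (an intersection of the inverted lists).

```python
# Hypothetical, hand-written index in the dictionary + inverted file layout:
# each dictionary entry is the (offset, length) of an inverted list.
inverted_file = ["text1", "text3", "text2", "text1", "text2", "text3"]
dictionary = {
    "calendar": (0, 1),
    "clear": (1, 1),
    "star": (2, 1),
    "universe": (3, 2),
    "weather": (5, 1),
}


def lookup(word):
    """Return the identifiers in the inverted list of one word data."""
    entry = dictionary.get(word)
    if entry is None:                    # word data not in the dictionary
        return []
    offset, length = entry
    return inverted_file[offset:offset + length]


def search(query, separator=" "):
    """Return identifiers of texts containing every word in the query.

    Assumption: multiple words are combined with AND semantics.
    """
    words = [w for w in query.split(separator) if w]
    if not words:
        return []
    result = set(lookup(words[0]))
    for word in words[1:]:
        result &= set(lookup(word))
    return sorted(result)


single = search("universe")
combined = search("universe calendar")
```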

S120, sequencing the at least one first text according to a preset scoring mechanism to obtain at least one second text.

In this embodiment, the second text is obtained by sorting the first text, and it should be noted that the second text is not substantially different from the first text, and is only distinguished for easy understanding, and has no special meaning.

Among the at least one second text, a text ranked nearer the front is more relevant to the preset search information.

S130, generating label information corresponding to each second text, wherein the label information is used for representing the category and the content of the second text.

S140, taking the identifier corresponding to each second text and the label information corresponding to each second text as search results and displaying the search results.

In this embodiment, the search results are displayed on a web interface as separate entries, in the order in which the second texts are ranked. The displayed content includes the identifier of each second text and the label information of that second text. The identifier of a second text may be its title or its number, and the user can obtain the full content of the second text by clicking its identifier.
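As a rough sketch of what one displayed entry carries, the following assembles search results from ranked identifiers and their label information. The class name `SearchResult`, the helper `build_results`, the category value and the placeholder sentences are all hypothetical; the point is only that an entry holds the identifier and the label information, not the full text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SearchResult:
    identifier: str      # title or number of the second text
    category: str        # classification information
    summary: List[str]   # abstract information (characteristic sentences)


def build_results(ranked_ids: List[str], labels: Dict[str, dict]) -> List[SearchResult]:
    """Pair each ranked identifier with its label information, keeping the order."""
    return [
        SearchResult(text_id, labels[text_id]["category"], labels[text_id]["summary"])
        for text_id in ranked_ids
    ]


# Hypothetical label information for two second texts.
labels = {
    "universe calendar": {
        "category": "science",
        "summary": ["Sentence A.", "Sentence B.", "Sentence C."],
    },
    "text2": {"category": "literature", "summary": ["Sentence D."]},
}
results = build_results(["universe calendar", "text2"], labels)
```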

The web interface is implemented based on the Flask web application framework.

For example, when the preset search information "universe" is input into the search box of the search interface, the search result is as shown in fig. 4. Two search results are displayed on the current page, and their display order is determined by the scores of the second texts; that is, the word data "universe" appears more frequently in the first search result, "universe calendar", than in the second search result.

The first search result includes the identifier of the second text, "universe calendar", and the label information of the second text. The label information includes a classification and an abstract, where the abstract consists of the three most central sentences extracted from the full text of "universe calendar".

Clicking the "universe calendar" identifier in the first search result jumps to a page that displays the text details; this web interface is shown in fig. 5.

Compared with the prior art, the text search method provided by the embodiment generates corresponding tag information according to the category and content of the screened text, and displays the identifier and the tag information corresponding to the screened text as the search result, so that the storage space occupied by the search result is smaller, and therefore, the text search can be realized without large-scale equipment.

Referring to fig. 6, on the basis of fig. 2, the method further includes the following steps before step S110:

S108, dividing the plurality of texts by using a preset word segmentation tool to obtain a plurality of word data, wherein each word data has at least one corresponding text.

In this embodiment, the preset word segmentation tool may be the jieba Chinese word segmentation tool. During word segmentation, words without practical meaning are usually filtered out; for example, the sentence "today's weather is very clear" is segmented into the word data "today", "weather" and "clear".
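The filtering behaviour in the example above can be illustrated with a small stand-in. This sketch does not use jieba: it splits English text with a regular expression so that it runs without third-party packages, and the stop-word list `STOP_WORDS` and the function name `segment` are assumptions made for the example.

```python
import re

# Hypothetical stop-word list: words treated as having no practical meaning.
STOP_WORDS = {"s", "is", "very", "the", "a"}


def segment(text):
    """Split text into word data and drop stop words.

    A regular-expression tokenizer stands in for the word segmentation tool;
    a Chinese text would need a real segmenter instead.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


words = segment("Today's weather is very clear")
```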

S109, establishing the index library for the plurality of word data.

Compared with the prior art, this embodiment establishes the index library for the word data in an inverted index manner, so that the texts matching the preset search information can be determined quickly from a massive amount of word data, which improves the search speed.

Step S120 is described in detail below. Referring to fig. 7, on the basis of fig. 2, step S120 may include the following detailed steps:

S1201, scoring each first text according to the preset scoring mechanism to obtain a score corresponding to each first text, wherein the score represents the frequency with which the preset search information appears in the first text.

In this embodiment, the preset scoring mechanism may be a TF-IDF (term frequency-inverse document frequency) scoring function. TF-IDF is a weighting technique commonly used in information retrieval and data mining; it is a statistical method for evaluating how important a word is to one document in a corpus or document set.

For example, if the word data "we" appears 10 times in one first text and 5 times in another first text, the score of the former is larger than that of the latter.

S1202, sequencing the at least one first text according to the sequence of scores from large to small to obtain at least one second text.

In this embodiment, among the at least one second text obtained, a second text ranked nearer the front is one in which the preset search information appears more frequently, so that the user can more easily find the desired text, which improves the user experience.
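Steps S1201 and S1202 can be sketched as follows. The embodiment only names TF-IDF as a possible scoring mechanism, so the exact formula here is an assumption: term frequency is the share of a text's words equal to the search word, and the inverse document frequency uses one common smoothed variant, log((1 + N) / (1 + df)) + 1. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tf_idf(term, words, all_texts):
    """Score one text for one search word (assumed smoothed TF-IDF variant)."""
    if not words:
        return 0.0
    tf = Counter(words)[term] / len(words)
    df = sum(1 for other in all_texts.values() if term in other)
    idf = math.log((1 + len(all_texts)) / (1 + df)) + 1
    return tf * idf


def rank(term, first_ids, all_texts):
    """Sort the first texts by score, largest first, to get the second texts."""
    scored = [(tf_idf(term, all_texts[text_id], all_texts), text_id)
              for text_id in first_ids]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text_id for _, text_id in scored]


# Hypothetical texts of equal length: "we" appears 10 times in a, 5 times in b.
all_texts = {
    "a": ["we"] * 10 + ["go"] * 10,
    "b": ["we"] * 5 + ["go"] * 15,
    "c": ["stay"] * 20,
}
second_texts = rank("we", ["b", "a"], all_texts)
```

Because the inverse document frequency of a single search word is the same for every first text, the ordering in this sketch is driven by the term frequency, which matches the statement that the score represents how often the search information appears.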

Step S130 is described in detail below. Referring to fig. 8, on the basis of fig. 2, step S130 may include the following detailed steps:

S1301, classifying the at least one second text according to a preset classifier, and generating classification information corresponding to each second text, wherein the classification information is used for representing the category of the second text.

In this embodiment, the preset classifier may be a naive Bayes classifier. Before the naive Bayes classifier is used to classify the at least one second text, it needs to be trained: all texts stored in the electronic device 100 are divided into a training set and a test set at a ratio of 7:3 by the train_test_split function.

All the second texts are then classified by the trained classifier. The classification criterion may be the field to which the content of a second text belongs, and the generated classification information may be the name of that field, such as military, economy or literature.
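The training and classification described above can be sketched without third-party packages. This is an illustrative stand-in, not the embodiment's code: a hand-written `split_7_3` helper takes the place of the train_test_split function, and the `NaiveBayes` class is a small multinomial naive Bayes with add-one smoothing. The sample data and all names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def split_7_3(samples, seed=0):
    """Shuffle the samples and split them into training and test sets at 7:3."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]


class NaiveBayes:
    """Multinomial naive Bayes over word data, with add-one smoothing."""

    def fit(self, samples):
        self.word_counts = defaultdict(Counter)   # category -> word counts
        self.class_counts = Counter()             # category -> number of texts
        self.vocab = set()
        for words, label in samples:
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)       # log prior of the category
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled texts, already divided into word data.
samples = ([(["army", "tank", "navy"], "military")] * 5
           + [(["market", "bank", "trade"], "economy")] * 5)
train, test = split_7_3(samples)
model = NaiveBayes().fit(train)
accuracy = sum(model.predict(words) == label for words, label in test) / len(test)
category = model.predict(["army", "tank"])
```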

S1302, generating abstract information corresponding to each second text to obtain label information corresponding to each second text, wherein the label information comprises classification information and abstract information, and the abstract information is used for representing the content of the second text.

In this embodiment, the summary information is information extracted from the content of the second text, and is usually a short sentence, and can embody the central content of the second text.

Compared with the prior art, the text searching method in the embodiment generates corresponding tag information according to the content of each second text after obtaining at least one second text, and the user can judge whether the second text is the content of interest of the user through the classification information and the abstract information and then decide whether to further acquire the full text, so that the user experience is improved.

Step S1302 is described in detail below. Referring to fig. 9, on the basis of fig. 8, step S1302 may include the following detailed steps:

S13021, taking any one of the at least one second text as the target second text.

S13022, extracting the target second text according to a text sorting algorithm to obtain a plurality of characteristic sentences, wherein the plurality of characteristic sentences are used for representing core contents of the second text.

In this embodiment, the text sorting algorithm refers to the TextRank algorithm, an unsupervised extractive text summarization method, which can extract keywords and key phrases from a given text and extract the key sentences of the text by extractive automatic summarization.

Generally, the extracted feature sentences are the three most central sentences in the article.

S13023, taking the plurality of characteristic sentences as the abstract information corresponding to the target second text.

S13024, traversing the at least one second text to obtain the abstract information corresponding to each second text.
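Steps S13021 to S13024 can be sketched with a compact TextRank-style sentence ranker. This is a simplified illustration under stated assumptions, not the embodiment's implementation: sentences are split on end punctuation, similarity is the number of shared words divided by the sum of the logarithms of the two sentences' distinct word counts, and scores come from a fixed number of damped iterations. The function names, the damping value 0.85 and the sample text are assumptions.

```python
import math
import re


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Word overlap normalized by sentence lengths (TextRank-style)."""
    set_a, set_b = set(words_a), set(words_b)
    denom = math.log(len(set_a) or 1) + math.log(len(set_b) or 1)
    if denom == 0:
        return 0.0
    return len(set_a & set_b) / denom


def summarize(text, top_n=3, damping=0.85, iterations=50):
    """Return the top_n highest-ranked sentences, in their original order."""
    sentences = split_sentences(text)
    words = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)
    weights = [[similarity(words[i], words[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]


# Hypothetical target second text; the last sentence is unrelated to the rest.
text = ("The universe is vast and the universe is old. "
        "Stars in the universe form galaxies. "
        "Galaxies contain many stars and planets. "
        "Planets orbit stars in the universe. "
        "Lunch was tasty.")
abstract = summarize(text)

# S13024: traverse all second texts (here a hypothetical one-item collection).
abstracts = {text_id: summarize(body) for text_id, body in {"text1": text}.items()}
```

Returning the selected sentences in their original order is a design choice of this sketch, so that the abstract reads in the same sequence as the text.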

Compared with the prior art, the text searching method provided by the embodiment extracts the most central characteristic sentences in the text as the abstract information, so that the user can know the approximate content of the text by browsing the abstract information, the time of the user is saved, and the user experience is improved.

Compared with the prior art, the embodiment has the following beneficial effects:

First, the method provided by this embodiment generates corresponding label information according to the content of each screened text, and displays the identifier and the label information of the text as a search result, thereby saving storage space and realizing text search without large-scale equipment.

Second, classification information of each second text is generated by using a preset classifier, and abstract information is generated according to a text sorting algorithm; the user can judge from the classification information and the abstract information whether a second text is of interest and then decide whether to acquire its full text, which saves the user's time and improves the user experience.

Referring to fig. 10, fig. 10 is a block diagram illustrating a text search apparatus 200 according to the present embodiment. The text search apparatus 200 is applied to the electronic device 100 and comprises an acquisition module 201, a sorting module 202, a generation module 203 and a display module 204.

The acquisition module 201 is configured to obtain at least one first text according to a plurality of word data, an index database and preset search information, where the word data is obtained by dividing a plurality of texts, the index database is used to represent a mapping relationship between each word data and at least one text corresponding to each word data, and each first text includes the preset search information.

The sorting module 202 is configured to sort the at least one first text according to a preset scoring mechanism to obtain at least one second text.

The generation module 203 is configured to generate label information corresponding to each second text, where the label information is used to represent the category and content of the second text.

The display module 204 is configured to take the identifier corresponding to each second text and the label information corresponding to each second text as search results, and display the search results.

Optionally, the acquisition module 201 is further configured to:

establish an index library for the plurality of word data.

Optionally, the sorting module 202 is specifically configured to:

according to a preset scoring mechanism, scoring is respectively carried out on each first text to obtain a score corresponding to each first text, wherein the score represents the frequency of the preset search information appearing in the first text;

and sequencing the at least one first text according to the sequence of scores from large to small to obtain at least one second text.

Optionally, the generating module 203 is specifically configured to:

classifying at least one second text according to a preset classifier to generate classification information corresponding to each second text, wherein the classification information is used for representing the category of the second text;

and generating abstract information corresponding to each second text to obtain label information corresponding to each second text, wherein the label information comprises classification information and abstract information, and the abstract information is used for representing the content of the second text.
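A minimal sketch of label information that combines the two parts might look like the following. The keyword-overlap classifier is only a stand-in for the preset classifier, which this section does not specify, and the leading excerpt is a placeholder for the abstract information; `LabelInfo`, `CATEGORY_KEYWORDS`, `classify` and `make_label` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LabelInfo:
    category: str  # classification information
    summary: str   # abstract information


# Hypothetical keyword lists standing in for the preset classifier.
CATEGORY_KEYWORDS = {
    "technology": {"search", "index", "device"},
    "food": {"recipe", "cooking", "flavor"},
}


def classify(text):
    """Pick the category whose keywords overlap most with the text."""
    words = set(text.lower().replace(".", " ").split())
    best, best_hits = "other", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


def make_label(text, max_chars=40):
    """Combine a category with a short excerpt used as placeholder summary."""
    return LabelInfo(category=classify(text), summary=text[:max_chars].strip())


label = make_label("The index maps each word to a text. Search uses the index.")
```

Keeping the label as a small record of category plus summary is what allows the search result to carry a compact description of each text instead of its full content.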

Optionally, the generating module 203 is specifically configured to:

taking any one of the at least one second text as a target second text;

extracting a plurality of characteristic sentences from the target second text according to a text ranking algorithm, wherein the plurality of characteristic sentences are used for representing the core content of the target second text;

taking a plurality of characteristic sentences as abstract information corresponding to the target second text;

and traversing the at least one second text to obtain the abstract information corresponding to each second text.
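The extraction and traversal steps above can be sketched as follows, assuming a TextRank-style sentence ranking. This section does not confirm which ranking algorithm is used, so the similarity measure, the damping factor, the iteration count and all function names here are assumptions.

```python
import math
import re


def split_sentences(text):
    """Split after ., ! or ? followed by whitespace; drop empty pieces."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word-overlap similarity, normalized by the sentence lengths."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def extract_summary(text, top_n=2, damping=0.85, iterations=30):
    """Rank sentences with a PageRank-style iteration; keep the top ones."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= top_n:
        return sentences
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]  # keep original sentence order


def summarize_all(second_texts, top_n=2):
    """Traverse every second text and build its summary sentences."""
    return {
        text_id: extract_summary(text, top_n)
        for text_id, text in second_texts.items()
    }
```

A sentence that shares no words with the others receives only the base score and is therefore ranked last, which matches the intent of keeping sentences that represent the core content.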

It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the text search apparatus 200 described above may be understood by reference to the corresponding process in the foregoing method embodiment, and is not described herein again.

The present embodiment also provides a computer-readable storage medium on which a computer program is stored; when executed by the processor 102, the computer program implements the text search method disclosed in the above embodiments.

In summary, according to the text search method, the text search apparatus, the electronic device, and the storage medium provided by the embodiments of the present invention, all texts containing the preset search information are first screened out according to a plurality of word data, an index library, and the preset search information, so as to obtain at least one first text; the at least one first text is then sorted to obtain at least one second text; label information corresponding to each second text is generated according to the category and the content of the second text; and finally, the identifier and the label information corresponding to each second text are taken as a search result and displayed. Because the method generates label information according to the category and content of the screened texts and displays the identifiers together with the label information as the search result, it saves storage space and thereby allows text search to be carried out without large-scale equipment.

The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A text search method, applied to an electronic device, wherein the electronic device stores a plurality of texts in advance, and each text has a corresponding identifier; the method comprises the following steps:

2. The method of claim 1, wherein before the step of obtaining at least one first text according to the plurality of word data, the index library and the preset search information, the method further comprises:

and establishing the index library for the word data.

3. The method according to claim 1, wherein the step of sorting the at least one first text according to a preset scoring mechanism to obtain at least one second text comprises:

4. The method of claim 1, wherein the step of generating label information corresponding to each of the second texts comprises:

5. The method according to claim 4, wherein the step of generating the summary information corresponding to each second text comprises:

taking any one of the at least one second text as a target second text;

6. A text search apparatus, applied to an electronic device, wherein a plurality of texts are stored in the electronic device in advance, and each text has a corresponding identifier; the text search apparatus comprises:

and a display module, configured to take the identifier corresponding to each second text and the label information corresponding to each second text as search results and to display the search results.

7. The apparatus of claim 6, wherein the obtaining module is further configured to:

and establishing the index library for the word data.

8. The apparatus of claim 6, wherein the sorting module is configured to:

9. An electronic device, characterized in that the electronic device comprises:

one or more processors;

a memory storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the text search method of any of claims 1-5.

10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the text search method according to any one of claims 1 to 5.