CN116521729A

CN116521729A - Information classification search method and device based on Elasticsearch

Info

Publication number: CN116521729A
Application number: CN202310198566.9A
Authority: CN
Inventors: 杨超; 高文飞; 朱宝; 李群; 张�荣
Original assignee: Beijing Wucoded Technology Co ltd
Current assignee: Beijing Wucoded Technology Co ltd
Priority date: 2023-03-03
Filing date: 2023-03-03
Publication date: 2023-08-01

Abstract

The application discloses an information classification search method and device based on Elasticsearch. The method first collects data to be classified and preprocesses it; it then classifies the preprocessed data through a preset classification model, extracts, for the data of each category, the attribute values of each entity in that category, and stores the extracted entities and their attribute values in an Elasticsearch database; finally, it searches for a target entity and a target attribute value, according to a target search request, through an API interface corresponding to the Elasticsearch database, and obtains an information classification search result from the target entity and target attribute value found. By using Elasticsearch, the invention can improve the storage efficiency of information resources and thereby the efficiency with which users search data.

Description

Information classification search method and device based on Elasticsearch

Technical Field

The invention relates to the field of data classification and search, and in particular to an information classification search method and device based on Elasticsearch.

Background

With the rapid development of information technologies such as the mobile internet, the Internet of Things and cloud computing, society has quietly entered the big data era. A personalized, structured information service model requires existing data systems to cope with dynamic information needs and to solve the problem of scattered information release.

In conventional databases, data is organized around transactions and stored dispersed across separate databases, so it is not effectively integrated. Most existing classification systems lack a unified specification and a comprehensive classification scheme, and their classification methods are simple and cannot meet the needs of overall management.

Disclosure of Invention

In view of the above, the embodiments of the application provide an information classification search method and device based on Elasticsearch, which can improve the storage efficiency of information resources by means of Elasticsearch and thereby improve the efficiency with which users search data.

In a first aspect, an information classification search method based on Elasticsearch is provided, the method comprising:

collecting data to be classified, and preprocessing the data to be classified;

classifying the preprocessed data to be classified through a preset classification model, extracting, for the data of each category, the attribute values of each entity in the current category, and storing the extracted entities and their attribute values for each category in an Elasticsearch database;

searching for a target entity and a target attribute value, according to a target search request, through an API interface corresponding to the Elasticsearch database;

and obtaining an information classification search result according to the target entity and target attribute value found.

Optionally, preprocessing the data to be classified includes:

performing data deduplication, low-quality data filtering, unification of heterogeneous data, fuzzy data conversion and noise data cleaning on the data to be classified, and converting the data to be classified into a unified format; the data to be classified may comprise data of multiple modalities such as text, voice and pictures.

Optionally, classifying the preprocessed data to be classified by a preset classification model includes:

constructing a classification model; wherein, the classification model comprises different entity models, and each entity model comprises a plurality of key fields;

and extracting keywords from the preprocessed data to be classified, and matching the extracted keywords by utilizing the key fields in each entity model to finish the classification of the data.

Optionally, the API interface includes at least:

an AI general service interface, used to provide service functions such as semantic search, intelligent recommendation and expert suggestion according to a general search request;

an AI development service interface, used to provide content labeling, preference analysis, data hosting and model hosting functions;

and an AI customization service interface, used to provide customized service functions such as semantic search, intelligent recommendation and expert suggestion according to a customized search request.

Optionally, the obtaining the information classification search result according to the searched target entity and the target attribute value includes:

finding relevant information through voice search, and returning intelligent recommendation or expert suggestion results according to the search settings.

Optionally, after obtaining the information classification search result according to the searched target entity and the target attribute value, the method further includes:

displaying the information classification search results.

Optionally, displaying the information classification search results specifically includes:

acquiring display requirement information;

marking the information classification search results according to the display requirement information to obtain marked information classification search results;

and displaying the marked information classification search results.

In a second aspect, an information classification search device based on Elasticsearch is provided, the device comprising:

a data acquisition module, configured to collect data to be classified and preprocess the data to be classified;

a basic processing module, configured to classify the preprocessed data to be classified through a preset classification model, extract, for the data of each category, the attribute values of each entity in the current category, and store the extracted entities and their attribute values in an Elasticsearch database;

an intelligent service module, configured to search for a target entity and a target attribute value, according to a target search request, through an API interface corresponding to the Elasticsearch database;

and an intelligent application module, configured to obtain an information classification search result according to the target entity and target attribute value found.

In a third aspect, there is provided an electronic device comprising a memory storing a computer program and a processor implementing the information classification search method according to any of the first aspects above when the processor executes the computer program.

In a fourth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the information classification search method of any of the first aspects described above.

In the technical solution provided by the embodiments of the application, data to be classified is first collected and preprocessed; the preprocessed data is then classified through a preset classification model, the attribute values of each entity in the current category are extracted for the data of each category, and the extracted entities and their attribute values are stored in an Elasticsearch database; finally, a target entity and a target attribute value are searched for, according to a target search request, through an API interface corresponding to the Elasticsearch database, and an information classification search result is obtained from the target entity and target attribute value found.

The beneficial effect of the invention is therefore that Elasticsearch can improve the storage efficiency of information resources and thereby the efficiency with which users search data.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those skilled in the art from this disclosure that the drawings described below are merely exemplary and that other embodiments may be derived from the drawings provided without undue effort.

Fig. 1 is a flowchart of an information classification search method based on Elasticsearch according to an embodiment of the present application;

Fig. 2 is a schematic diagram of an Elasticsearch architecture provided in an embodiment of the present application;

Fig. 3 is a schematic diagram of an information classification search process according to an embodiment of the present application;

Fig. 4 is a schematic diagram of performing an information classification search according to an embodiment of the present application;

Fig. 5 is a schematic diagram of a result obtained by performing an information classification search according to an embodiment of the present application;

Fig. 6 is a block diagram of an information classification search device based on Elasticsearch according to an embodiment of the present application;

Fig. 7 is a schematic diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

In the description of the present invention, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements but may include other steps or elements not expressly listed but inherent to such process, method, article, or apparatus or steps or elements added based on further optimization of the inventive concept.

For the convenience of understanding the present embodiment, a detailed description will be first given of an information classification searching method disclosed in the embodiments of the present application.

Referring to Fig. 1, a flowchart of an information classification search method based on Elasticsearch according to an embodiment of the present application is shown. The method may include the following steps:

Step 101, collecting data to be classified, and preprocessing the data to be classified.

The data to be classified in the application may be government information data. In this embodiment of the present application, step 101 specifically includes performing data deduplication, low-quality data filtering, unification of heterogeneous data, fuzzy data conversion and noise data cleaning on the data to be classified, and converting the data into a unified format.
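By way of illustration only, the deduplication and format unification of step 101 might be sketched as follows in Python. The record layout (`source`, `text`), the normalization rule and the use of a SHA-256 digest as the duplicate key are assumptions made for this sketch and are not specified in the application.

```python
import hashlib
import unicodedata


def normalize(text):
    """Unify Unicode forms, collapse whitespace and lowercase the text."""
    return " ".join(unicodedata.normalize("NFKC", text).split()).lower()


def deduplicate(records):
    """Drop records whose normalized text was already seen and emit a unified format.

    `records` is a list of dicts with a required "text" key and an optional
    "source" key (hypothetical layout chosen for this sketch).
    """
    seen = set()
    unique = []
    for record in records:
        text = normalize(record["text"])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        unique.append({"source": record.get("source", "unknown"), "text": text})
    return unique
```

Hashing the normalized text only catches exact duplicates; near-duplicate detection would need a different technique and is outside this sketch.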

Specifically, the data to be classified may be obtained by taking the entities collected by an upstream crawler and querying a web encyclopedia API for the attributes of each entity, which yields the structured information of each entity for subsequent processing. Because entities differ in format and in content attributes, cleaning and conversion preprocessing is performed separately for entities of different categories and attributes.

The data preprocessing comprises four stages: data segmentation, data cleaning, word segmentation and word frequency calculation:

data segmentation: splitting the tweet text according to a specified time window (day or hour);

data cleaning: filtering out low-quality tweets (garbled or meaningless text);

word segmentation: splitting the tweet text into phrases, and retaining the phrases that appear in Wikipedia titles as candidate phrases;

word frequency calculation: calculating the global occurrence frequency of the candidate phrases, which provides the basis for the subsequent keyword extraction.
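The four stages above can be sketched roughly as follows in Python. This is a minimal sketch under stated assumptions: the quality filter (share of alphanumeric characters), the n-gram length limit and the `known_titles` set standing in for Wikipedia titles are all hypothetical choices, and the regular-expression tokenizer is a stand-in for a real word segmenter.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime


def window_key(ts, granularity="hour"):
    """Bucket a timestamp into a day or hour window."""
    return ts.strftime("%Y-%m-%d" if granularity == "day" else "%Y-%m-%d %H")


def is_low_quality(text):
    """Hypothetical filter: empty text or text that is mostly symbols."""
    stripped = text.strip()
    if not stripped:
        return True
    meaningful = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return meaningful / len(stripped) < 0.5


def candidate_phrases(text, known_titles, max_len=3):
    """Keep word n-grams (n <= max_len) that appear in the title set."""
    words = re.findall(r"\w+", text.lower())
    found = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in known_titles:
                found.append(phrase)
    return found


def preprocess(tweets, known_titles, granularity="hour"):
    """Return (phrases per time window, global phrase frequencies).

    `tweets` is an iterable of (datetime, text) pairs.
    """
    windows = defaultdict(list)
    global_freq = Counter()
    for ts, text in tweets:
        if is_low_quality(text):
            continue  # data cleaning
        phrases = candidate_phrases(text, known_titles)  # word segmentation
        windows[window_key(ts, granularity)].append(phrases)  # data segmentation
        global_freq.update(phrases)  # word frequency calculation
    return dict(windows), global_freq
```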

In alternative embodiments of the present application, the RESTful API of Elasticsearch may be used to obtain the data to be classified from external data sources, such as a website, the document structure of a policy file, or a database.

Elasticsearch works by splitting the content of a text into keywords and building an index from those keywords; at query time the index is looked up by keyword to find the articles that contain it. Data and index are kept separate, and the index is stored in a distributed manner across nodes; a cluster can be scaled to hundreds of nodes and can search and process PB-scale structured or unstructured data in real time. Shards can also be replicated for backup, which ensures data reliability; shards and replicas working together greatly improve retrieval performance, and a simple RESTful API makes full-text search efficient and simple.
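The keyword-to-document lookup described above is an inverted index. The following toy Python sketch illustrates the idea only; it is not how Elasticsearch is implemented internally, and the tokenizer and the AND semantics of `search` are assumptions of the sketch.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

In Elasticsearch itself, the number of shards and replicas is configured per index through the `number_of_shards` and `number_of_replicas` index settings.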

Massive data of different kinds usually comes in multiple modalities such as text, voice and pictures. To represent multi-source knowledge effectively and uniformly, the application studies knowledge representation techniques for heterogeneous data such as image data and text data. It explores techniques, such as embedding, for representing and fusing knowledge of different forms (including logic rules, text, media data and knowledge graphs), and explores image features such as brightness contrast and thresholds in order to realize a multi-modal image representation technique based on convolutional neural networks.

Structured data from Wikipedia: Wikipedia data is objective, open and structured, which makes it suitable for modeling persons and general knowledge graphs. To extract data from Wikipedia effectively for knowledge graph modeling, the structured knowledge contained in each entity's infobox is first obtained on the basis of the structural characteristics of Wikipedia data; a general schema for each entity class is then built through techniques such as entity cleaning, and the class to which each entity belongs is modeled.

Step 102, classifying the preprocessed data to be classified through a preset classification model, extracting, for the data of each category, the attribute values of each entity in the current category, and storing the extracted entities and their attribute values for each category in an Elasticsearch database.

In the embodiment of the application, based on the structured entity information extraction method described above, the classification model comprises 12 categories: Military, People, Industry, Security, Politics, Meteorology, Geography, Religion, Organization, Culture, Government and Transportation.

The data included for each category comprises entities, field names, data types, meanings and attribute values.

Entity attribute names are unified through the preprocessing operation, and the structured information of each entity is then extracted to obtain each entity's attributes and attribute values. The structured entity information is stored in the Elasticsearch database for the subsequent construction of the knowledge graph.
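One possible way to store extracted entities is the Elasticsearch bulk API, which accepts newline-delimited JSON in which each document is preceded by an action line. The sketch below only builds such a payload; the index name `entities` and the field names `name`, `category` and `attributes` are assumptions of this sketch, not names given in the application.

```python
import json


def to_bulk_ndjson(entities, index_name="entities"):
    """Build a newline-delimited JSON body for the Elasticsearch _bulk endpoint.

    Each entity is a dict with "id", "name", "category" and "attributes" keys
    (hypothetical layout chosen for this sketch).
    """
    lines = []
    for entity in entities:
        # Action line: index this document under the given id.
        action = {"index": {"_index": index_name, "_id": entity["id"]}}
        # Source line: the entity and its extracted attribute values.
        source = {
            "name": entity["name"],
            "category": entity["category"],
            "attributes": entity["attributes"],
        }
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string would be sent as the body of a `POST /_bulk` request with the `application/x-ndjson` content type.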

The application crawls data and constructs a knowledge graph for categories such as military, organization, people, transportation, geography, government, religion, culture, meteorology, industry and security, finally obtaining 2 million entries; the main attributes of an entry include the category name, parent category name, subclass list and entities.

For each entry, in an alternative embodiment of the present application, a classification model is constructed; the classification model comprises different entity models, and each entity model comprises a plurality of key fields. Keywords are extracted from the preprocessed data to be classified, and the extracted keywords are matched against the key fields of each entity model to complete the classification of the data.
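The key-field matching described above might look like the following Python sketch. The entity models and their key fields are invented for illustration, and scoring by the number of overlapping keywords is an assumption; the application does not state how matches are scored.

```python
import re

# Hypothetical entity models: category name -> key fields.
ENTITY_MODELS = {
    "Military": {"army", "navy", "weapon", "troops"},
    "Transportation": {"railway", "highway", "airport", "traffic"},
    "Meteorology": {"rainfall", "typhoon", "temperature", "forecast"},
}


def extract_keywords(text):
    """Very simple keyword extraction: the set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify(text, models=ENTITY_MODELS):
    """Return the category whose key fields overlap most with the text, or None."""
    keywords = extract_keywords(text)
    scores = {category: len(keywords & fields) for category, fields in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For example, `classify("A typhoon forecast predicts heavy rainfall")` matches three key fields of the hypothetical Meteorology model and none of the others.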

The scale of the data (knowledge graph) after the data classification of the present application is given in table 1.

TABLE 1 Scale of knowledge graph

This step also includes preprocessing the captured data with the preprocessing functions of Elasticsearch: converting the raw data into an index allows it to be searched and filtered faster. Preprocessing converts the original text into a searchable form through operations such as word segmentation, stemming, stop-word filtering, analysis and tagging, together with cleaning operations such as removing noise, removing HTML tags, splitting text and extracting vocabulary. These steps help the search engine find relevant results faster and improve search performance. The preprocessing results are stored in Elasticsearch for subsequent use.
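To make the analysis steps concrete, the following Python sketch mimics tag removal, tokenization and stop-word filtering outside Elasticsearch. It is an illustration only: the stop-word list is a small hypothetical sample, and stemming is omitted.

```python
import re
from html import unescape

# Small illustrative stop-word list (assumption, not from the application).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}


def analyze(raw_html):
    """Strip HTML tags, decode entities, tokenize and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_html)  # remove HTML tags
    text = unescape(text)                      # decode entities such as &amp;
    tokens = re.findall(r"\w+", text.lower())  # split text into lowercase tokens
    return [token for token in tokens if token not in STOP_WORDS]
```

Within Elasticsearch, comparable behaviour is normally configured through an analyzer, for example by combining the `html_strip` character filter with the `stop` and `stemmer` token filters.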

Step 103, searching for a target entity and a target attribute value, according to a target search request, through an API interface corresponding to the Elasticsearch database.

Fig. 2 is a schematic diagram of the Elasticsearch architecture provided in an embodiment of the present application.

In the embodiment of the application, the AI general service interface is used to provide service functions such as semantic search, intelligent recommendation and expert suggestion according to a general search request; the AI development service interface is used to provide content labeling, preference analysis, data hosting and model hosting functions; and the AI customization service interface is used to provide customized service functions such as semantic search, intelligent recommendation and expert suggestion according to a customized search request.

In an alternative embodiment of the present application, in the knowledge-graph-based intelligent retrieval of open source data information, the knowledge graph is used as an aid to event detection, with a focus on querying and displaying the relationships between the entities mentioned in certain events. The application uses a MongoDB distributed database to store its graph data. The entities stored fall mainly into 12 major categories: Military, People, Industry, Security, Politics, Meteorology, Geography, Religion, Organization, Culture, Government and Transportation.

For the search of large-scale graph data, the application investigates related techniques. Existing work builds clusters from ordinary inexpensive PCs on distributed platforms such as Hadoop and Spark to meet the computing and memory requirements of graph data mining. In addition, systems such as Pregel and Blogel have been proposed in succession to solve graph data mining problems. In practice, however, these platforms are limited by network bottlenecks and are not well suited to intelligent retrieval over a knowledge graph. The application also explores and designs a frequent subgraph mining framework for large-scale graph data, which can discover substructures that occur frequently in the graph data. The framework encapsulates frequent subgraph mining into tasks and accelerates the mining with a task-oriented parallel processing flow.

The research on knowledge-graph-based intelligent retrieval of open source data information ultimately yields a knowledge-graph-based open source data retrieval system. The system supports multiple query types, and its main functions include: (1) fuzzy search with a query latency within 1 s; for example, when the user enters "Zhang San", the system retrieves all entity nodes named "Zhang San"; (2) generating an organization association graph for a target node (for example, for the "industry" relationship, the user can find all people who belong to the same industry as "Zhang San"); and (3) visual display of more than 8 search results.
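If entity names are indexed in Elasticsearch, a fuzzy search of the kind described in (1) could be expressed with a `match` query that sets `fuzziness`. The Python sketch below only builds the request body; the field names `name` and `category` are assumptions, and the `term` filter presumes that `category` is mapped as a keyword field.

```python
def build_fuzzy_entity_query(name, category=None, size=10):
    """Build a search request body for a fuzzy entity-name lookup.

    `name` and `category` are hypothetical field names for this sketch.
    """
    must = [{"match": {"name": {"query": name, "fuzziness": "AUTO"}}}]
    query = {"bool": {"must": must}}
    if category is not None:
        # Restrict results to one of the entity categories.
        query["bool"]["filter"] = [{"term": {"category": category}}]
    return {"size": size, "query": query}
```

The returned dictionary would be serialized to JSON and posted to the `_search` endpoint of the index that holds the entities.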

The query modes supported by the knowledge-graph-based open source data information intelligent retrieval system are as follows:

1. Query by subject: the subject, such as a person's name, is queried directly, and fuzzy queries are supported. For example, when the user enters the keyword "Zhang San", all objects related to Zhang San are found and presented in the form of an association graph.

2. Query by predicate: the user can directly query a predicate of a certain type. For example, a user who queries "industry" can find each industry together with the information of the members it contains. This query mode does not support fuzzy queries.

3. Query by subject and object: querying a subject and an object outputs the associations between them. For example, a user can find all associations between Zhang San and a given industry.

4. Query by subject and predicate: querying a subject and a predicate outputs the corresponding objects. For example, when the user enters "Zhang San" and "tenure", the system outputs Zhang San's term of office.

5. Query by predicate and object: querying a predicate and an object outputs the relevant subjects. For example, the user can query all people associated with a target organization.
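The five query modes above amount to pattern matching over (subject, predicate, object) triples with some positions left open. The following Python sketch illustrates this over an in-memory list; the `Triple` type, the substring-based fuzzy match and the sample data are assumptions of the sketch, and the application stores its graph data in MongoDB rather than in memory.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def query(triples, subject=None, predicate=None, obj=None, fuzzy_subject=False):
    """Return triples matching every position that is not None.

    With fuzzy_subject=True the subject matches by case-insensitive substring,
    a simple stand-in for the fuzzy subject query.
    """
    results = []
    for triple in triples:
        if subject is not None:
            if fuzzy_subject:
                if subject.lower() not in triple.subject.lower():
                    continue
            elif triple.subject != subject:
                continue
        if predicate is not None and triple.predicate != predicate:
            continue
        if obj is not None and triple.obj != obj:
            continue
        results.append(triple)
    return results


# Made-up sample data for illustration only.
SAMPLE = [
    Triple("Zhang San", "industry", "Software"),
    Triple("Li Si", "industry", "Software"),
    Triple("Zhang San", "tenure", "2019-2023"),
]
```

With this sample, `query(SAMPLE, subject="Zhang San", predicate="tenure")` corresponds to mode 4 and `query(SAMPLE, predicate="industry", obj="Software")` to mode 5.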

Step 104, obtaining an information classification search result according to the target entity and target attribute value found.

Specifically, the application can find relevant information through voice search, and feed back intelligent recommendation or expert suggestion results according to search settings. Fig. 3 is a schematic diagram of an overall information classification search process according to an embodiment of the present application.

In an optional embodiment of the present application, the method further includes performing a result display on the information classification search result after the information classification search result is obtained. Specifically, the process comprises the following steps:

S1, acquiring display requirement information.

In this embodiment, the display requirement information is determined according to the user search requirement. Specifically, the display requirement information comprises keyword colors and preset attribute extraction information; this is only schematically illustrated in the present embodiment, but not limited to, and may be reasonably set according to needs in practical applications.

The keywords are the search keywords entered by the user. The preset attributes are key attributes, which belong to the service characteristics of a service system, such as the industry a piece of public opinion concerns, the information release time, the author profile and the information forwarding chain; the service system processes information according to these service characteristics.

Keyword (burst word) extraction: the occurrence frequency of each candidate phrase in the current time window is calculated and compared with the global occurrence frequency obtained in the word frequency calculation step, which gives the burstiness of the phrase. Within the current time period, the burstiness in each sub-time window is combined with the number of tweets, the number of retweets and the number of hashtags to obtain the keyword weight in that sub-time window; the sub-window weights are then summed, weighted by the number of tweets in each sub-time window, to obtain the keyword weight in the time window. Finally, the candidate phrases are ranked by keyword weight, and the top √N of them, where N is the number of candidate phrases in the current time window, are taken as the keywords of the current window.
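The application does not give explicit formulas for burstiness or for the keyword weight, so the following Python sketch fills them in with assumptions: burstiness is taken as the ratio of a phrase's relative frequency in the sub-window to its global relative frequency, the sub-window weight multiplies burstiness by a log-damped sum of tweet, retweet and hashtag counts, and the final selection is read as keeping the top √N phrases by weight. The input layout is likewise hypothetical.

```python
import math


def burstiness(window_count, window_total, global_count, global_total):
    """Relative frequency in the window divided by the global relative frequency.

    Assumes global_count > 0, which holds when global counts include the window.
    """
    return (window_count / window_total) / (global_count / global_total)


def sub_window_weight(burst, tweets, retweets, hashtags):
    """Hypothetical combination of burstiness with engagement counts."""
    return burst * math.log1p(tweets + retweets + hashtags)


def window_keywords(sub_windows, global_freq, global_total):
    """Rank phrases by tweet-count-weighted sub-window weights, keep the top sqrt(N).

    Each sub-window is a dict: {"tweet_count": int, "phrases": {phrase: stats}},
    where stats has "count", "tweets", "retweets" and "hashtags" keys.
    """
    total_tweets = sum(sw["tweet_count"] for sw in sub_windows)
    if total_tweets == 0:
        return []
    weights = {}
    for sw in sub_windows:
        share = sw["tweet_count"] / total_tweets  # weighting by tweet count
        window_total = sum(s["count"] for s in sw["phrases"].values())
        for phrase, s in sw["phrases"].items():
            b = burstiness(s["count"], window_total, global_freq[phrase], global_total)
            w = sub_window_weight(b, s["tweets"], s["retweets"], s["hashtags"])
            weights[phrase] = weights.get(phrase, 0.0) + share * w
    if not weights:
        return []
    k = max(1, math.isqrt(len(weights)))  # top sqrt(N) candidate phrases
    return sorted(weights, key=weights.get, reverse=True)[:k]
```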

S2, marking the information classification search results according to the display requirement information to obtain marked information classification search results.

Specifically, the information classification search results are marked according to the display requirement information; for example, if the keyword color in the display requirement information is set to red, the keywords in the information classification search results are marked in red.

S3, displaying the marked information classification search results.

Specifically, the marked information classification search results are displayed to the user, who can then view the results more intuitively.

Through the above steps, the information classification search results are marked according to the display requirement information and the marked results are displayed, which makes the information classification search results more intuitive.
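As a simple illustration of marking keywords in a configured color, the Python sketch below wraps each occurrence of the keyword in an HTML `span`. Producing HTML, matching case-insensitively and escaping the text first are assumptions of the sketch; the application does not say how the marking is rendered.

```python
import html
import re


def highlight(text, keyword, color="red"):
    """Escape the text for HTML and wrap each keyword occurrence in a colored span."""
    escaped = html.escape(text)
    if not keyword:
        return escaped
    pattern = re.compile(re.escape(html.escape(keyword)), re.IGNORECASE)
    return pattern.sub(
        lambda match: f'<span style="color:{color}">{match.group(0)}</span>',
        escaped,
    )
```

Elasticsearch search requests also accept a `highlight` section with `pre_tags` and `post_tags`, which could be used instead of post-processing the results.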

The front-end data visualization module based on JavaScript can provide more concise and clear event information and map display for a user. The visualization module adopts a front-end and back-end separation architecture based on flash and flash. The chart portion is visualized using ECharts to provide better interactive functionality and user experience. The visualization system provides three pages: the main page, the event analysis sub-page and the map display sub-page.

When information is presented, the method also includes authenticating the user, specifically:

providing a dedicated login control module to identify and authenticate a login user;

authenticating the same user with a combination of two or more authentication techniques;

providing functions for checking the uniqueness of the user identity and the complexity of the authentication information, ensuring that no duplicate user identity exists in the application system and that authentication information is not easily forged;

providing a login failure handling function that can take measures such as ending the session, limiting the number of illegal login attempts and automatically logging out;

and enabling the identity authentication, user identity uniqueness checking, authentication information complexity checking and login failure handling functions, and configuring the related parameters according to a security policy.
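The login failure handling described above might be sketched as follows in Python. The class name `LoginGuard`, the thresholds and the temporary lock are assumptions made for illustration; the application only states that measures such as limiting illegal login attempts are taken and that parameters follow a security policy.

```python
import time


class LoginGuard:
    """Track consecutive failed logins per user and lock the account temporarily."""

    def __init__(self, max_failures=5, lock_seconds=300.0, clock=time.monotonic):
        self.max_failures = max_failures  # configurable per security policy
        self.lock_seconds = lock_seconds
        self.clock = clock
        self._failures = {}      # user -> consecutive failed attempts
        self._locked_until = {}  # user -> time at which the lock expires

    def is_locked(self, user):
        until = self._locked_until.get(user)
        if until is None:
            return False
        if self.clock() >= until:
            # Lock expired: clear state so the user can try again.
            del self._locked_until[user]
            self._failures.pop(user, None)
            return False
        return True

    def record_failure(self, user):
        count = self._failures.get(user, 0) + 1
        self._failures[user] = count
        if count >= self.max_failures:
            self._locked_until[user] = self.clock() + self.lock_seconds

    def record_success(self, user):
        self._failures.pop(user, None)
        self._locked_until.pop(user, None)
```

The injectable `clock` parameter exists only to make the behaviour easy to test without waiting for real time to pass.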

Fig. 4 is a schematic diagram of performing an information classification search according to an embodiment of the present application. The occurrence frequency of each candidate phrase in the current time window is calculated and compared with the global occurrence frequency obtained in the word frequency calculation step, which gives the burstiness of the keyword. Within the current time period, the burstiness in each sub-time window is combined with the number of tweets, the number of retweets and the number of hashtags to obtain the keyword weight in that sub-time window; the sub-window weights are then summed, weighted by the number of tweets in each sub-time window, to obtain the keyword weight in the time window. Finally, the candidate phrases are ranked by keyword weight, and the top √N of them, where N is the number of candidate phrases in the current time window, are taken as the keywords of the current window.

For example, when the user enters "talent", the system retrieves all entity nodes named "talent", generates an organization association graph for the target node, and supports visual display of the search results. Fig. 5 is a schematic diagram of a result obtained by performing an information classification search according to an embodiment of the present application; it shows the corresponding content of the policy repository obtained after the user enters "talent".

Referring to fig. 6, a block diagram of an information classification search apparatus according to an embodiment of the present application is shown. As shown in fig. 6, the apparatus may include:

the intelligent service module, configured to search for a target entity and a target attribute value, according to a target search request, through an API interface corresponding to the Elasticsearch database;

For specific limitations of the information classification search device, reference may be made to the limitations of the information classification search method above, which are not repeated here. The modules in the above information classification search device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.

In one embodiment, an electronic device is provided, which may be a computer and whose internal structure may be as shown in Fig. 7. The electronic device includes a processor, a memory and a network interface connected by a system bus. The processor of the device provides computing and control capabilities. The memory of the device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store information classification search data. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements an information classification search method.

It will be appreciated by those skilled in the art that the structure shown in fig. 7 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In one embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the above-described information classification search method.

The computer readable storage medium provided in this embodiment has similar principles and technical effects to those of the above method embodiment, and will not be described herein.

Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may carry out the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to be within the scope of this description.

The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the claims. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims

1. An information classification search method based on Elasticsearch, characterized by comprising the following steps:

collecting data to be classified, and preprocessing the data to be classified;

2. The information classification search method according to claim 1, wherein preprocessing the data to be classified includes:

3. The information classification search method according to claim 1, wherein classifying the preprocessed data to be classified by a preset classification model comprises:

4. The information classification search method according to claim 1, wherein the API interface comprises at least:

5. The information classification search method according to claim 1, wherein obtaining the information classification search result according to the found target entity and target attribute value comprises:

6. The information classification search method according to claim 1, wherein after obtaining the information classification search result based on the found target entity and target attribute value, the method further comprises:

7. The information classification search method according to claim 6, wherein displaying the information classification search results comprises:

acquiring display requirement information;

and displaying the marked information classification search results.

8. An information classification search device based on Elasticsearch, the device comprising:

9. An electronic device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the information classification search method of any of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the information classification search method of any of claims 1 to 7.