CN111339391B

CN111339391B - Novel searching method based on public data

Info

Publication number: CN111339391B
Application number: CN202010169574.7A
Authority: CN
Inventors: 李林亮; 张盛泽; 王娟
Original assignee: Nanjing Andlinks Data Technology Co ltd
Current assignee: Nanjing Andlinks Data Technology Co ltd
Priority date: 2020-03-12
Filing date: 2020-03-12
Publication date: 2024-03-19
Anticipated expiration: 2040-03-12
Also published as: CN111339391A

Abstract

The invention provides a novel search method based on public data, comprising the following steps: according to pre-arranged entry information, crawling from the Internet the content corresponding to the entry information and the multimedia data associated with the entities, and parsing them to obtain an attribute list for the entry information and a multimedia list associated with the entities; performing data modeling with the Jena tool to merge the attributes of entries of the same category, and deriving the relationships among the entries from the content of the merged attributes; initializing the search service data; and performing semantic analysis on the query sentence with an NLP tool, performing word segmentation and recognition on fuzzy sentences, performing keyword retrieval on a WEB search service page according to the word segmentation result, and returning the retrieval result. The invention supports not only traditional keyword content retrieval but also picture retrieval, semantic retrieval and condition retrieval, and can display entities and their relationship diagrams in detail through a graph, so that the data retrieval function is more powerful.

Description

Novel searching method based on public data

Technical Field

The invention relates to the technical field of electronic information, in particular to a novel searching method based on public data.

Background

With the development of computer and information technology, paper encyclopedias have gradually faded from public view, and online encyclopedias have gradually become familiar to the public. Baidu released the official version of Baidu Encyclopedia on April 21, 2008; to date it has recorded 16 million entries, 7 million people have taken part in editing them, and it covers almost all known fields of knowledge. Baidu Encyclopedia aims to create a Chinese-language information collection platform covering knowledge in every field, and because it is open to the public, its content generally consists of objective facts.

Baidu Encyclopedia is both an information platform and a search platform: on the one hand it records information, and on the other hand the recorded information can be searched. However, its search platform relies only on simple word segmentation when searching, so its search function is not powerful enough.

Disclosure of Invention

The object of the present invention is to solve at least one of the above technical drawbacks.

To this end, the object of the invention is to propose a novel search method based on public data.

In order to achieve the above object, an embodiment of the present invention provides a novel search method based on public data, including the steps of:

step S1, according to pre-arranged entry information, crawling from the Internet the content corresponding to the entry information and the multimedia data associated with the entities, parsing them to obtain an attribute list for the entry information and a multimedia list associated with the entities, and storing the attribute list and the multimedia list in a MYSQL database;

step S2, performing data modeling on the entry data stored in the MYSQL database with the Jena tool to merge the attributes of entries of the same category, and deriving the relationships among the entries from the content of the merged attributes;

step S3, initializing the search service data, including: initializing data in the MYSQL database, an ES database and a graph database;

step S4, performing semantic analysis on the query sentence with an NLP tool, performing word segmentation and recognition on fuzzy sentences, performing keyword retrieval on a WEB search service page according to the word segmentation result, and returning the retrieval result.

Further, in step S1, the abstract and attribute fields of the obtained content corresponding to the entry information, and of the multimedia data associated with the entities, are parsed to obtain the attribute list corresponding to each entry.

Further, in step S2, the Jena tool is used for data modeling with each entry treated as an entity. Model relationships are extracted from the attribute fields of the entities: after the attribute fields and attribute values of an entity are obtained, it is determined whether the content of an attribute value is another entity, and from this whether a corresponding relationship exists between the two entities, so as to obtain the relationship types and the relationship list among the entries.

Further, in step S3, initializing the MYSQL database includes: initializing the mapping relationship between entities and multimedia, which is used to look up entities in reverse from multimedia.

Further, in step S3, initializing the ES database includes:

storing the entity content in the MYSQL database, the entry relationship list sorted out with Jena, and the multimedia information into the ES database for keyword retrieval.

Further, in step S3, the characteristic values of pictures and the cover information of videos are added to the multimedia information, and the association between multimedia and entities is stored in MYSQL.

Further, in step S3, initializing the graph database includes: initializing the entity and relationship data in the model, and rendering the graph on a WEB page by using the graph database.

Further, in step S4, performing keyword retrieval includes: searching the entities, relationships, graph relationships and multimedia corresponding to the keywords; analyzing and segmenting the query sentence with the NLP tool and querying according to the returned word segmentation result; and performing condition queries according to entity attributes and search-by-picture according to picture characteristic values.

According to the novel search method based on public data of the embodiments of the invention, public data are analyzed and modeled, query sentences are analyzed with an NLP tool, a characteristic value algorithm is used to compute the characteristic values of pictures, and the picture data are associated with entity relationships. The search function is thereby enhanced: within the scope of the data, searches can be made by keyword, by semantic analysis, by conditions on entity attribute fields, and finally by picture, so that an entity can be found from a picture. In addition, graph technology is used to display the graph relationships on the WEB page, making them vivid and intuitive, and the data management platform provides management functions for the model and the data, making maintenance more convenient.

Based on the content of public data (taking Baidu Encyclopedia data as an example), the invention analyzes that content and, combined with a purpose-built search platform, provides keyword retrieval, semantic search, condition query and search-by-picture functions over the data. With the help of machine learning, artificial intelligence, graph database and other technologies, it can analyze data relationships, retrieve data by semantic conditions, and match related pictures by picture characteristic value. Under this more detailed modeling approach, the invention supports not only traditional keyword content retrieval but also picture retrieval, semantic retrieval and condition retrieval, and can display entities and their relationship diagrams in detail through the graph, so that the data retrieval function is more powerful.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:

FIG. 1 is a flow chart of a novel search method based on public data according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a novel search method based on public data according to an embodiment of the present invention;

FIG. 3 is a flow chart of Jena model creation according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.

As shown in FIG. 1 and FIG. 2, the novel search method based on public data according to the embodiment of the invention includes the following steps:

Step S1, according to pre-arranged entry information, crawling from the Internet the content corresponding to the entry information and the multimedia data associated with the entities, parsing them to obtain an attribute list for the entry information and a multimedia list associated with the entities, and storing the attribute list and the multimedia list in a MYSQL database.

Specifically, the required entry information for each field is acquired first, for example by collecting the required entries from Baidu Encyclopedia. For example, to sort out military-related entries, the entry categories are determined first, such as key domestic figures, fighter aircraft and tanks, and then the list of entries under each category is sorted out. The entry information may include the entry category, the specific entry, and so on.

Then, the abstract and attribute fields of the obtained content corresponding to the entry information, and of the multimedia data associated with the entities, are parsed to obtain the attribute list corresponding to each entry.

Specifically, crawler software crawls the corresponding entry data on the Internet (for example, the data of the corresponding entries in Baidu Encyclopedia) and downloads all multimedia data associated with those entries. For the crawled website data, the abstract and attribute fields are parsed by an algorithm to obtain all attribute fields of each entry, and a relatively complete attribute list is sorted out and stored in the MYSQL database. In addition, the invention sorts out the multimedia list associated with each entity, and the sorted data are persisted in the MYSQL database.
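The parsing part of step S1 can be sketched as follows. This is a minimal illustration only: the description does not disclose the parsing algorithm or the page layout, so the `<dt>`/`<dd>` structure of `sample_page`, the class `AttributeListParser` and the function `parse_attribute_list` are all assumptions made for the example, and the crawling and MYSQL storage are omitted.

```python
from html.parser import HTMLParser


class AttributeListParser(HTMLParser):
    """Collect (field, value) pairs from <dt>/<dd> elements (assumed layout)."""

    def __init__(self):
        super().__init__()
        self._current_tag = None   # "dt" or "dd" while inside one, else None
        self._pending_name = None  # last attribute field read from a <dt>
        self._buffer = []          # text fragments of the element being read
        self.attributes = []       # resulting attribute list

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. links) is kept as part of the value.
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "dt":
            self._pending_name = text
        elif self._pending_name is not None:
            self.attributes.append((self._pending_name, text))
            self._pending_name = None
        self._current_tag = None


def parse_attribute_list(html):
    """Return the attribute list of one crawled entry page."""
    parser = AttributeListParser()
    parser.feed(html)
    parser.close()
    return parser.attributes


# Hypothetical fragment of a crawled entry page.
sample_page = """
<dl>
  <dt>Place of origin</dt><dd>China</dd>
  <dt>Type</dt><dd>Fighter <a href="#">aircraft</a></dd>
</dl>
"""
attribute_list = parse_attribute_list(sample_page)
```

In a full pipeline each `(field, value)` pair would then be written to the MYSQL database together with the entry it belongs to.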

Step S2, performing data modeling on the entry data stored in the MYSQL database with the Jena tool to merge the attributes of entries of the same category, and deriving the relationships among the entries from the content of the merged attributes. That is, the entry data in MYSQL are analyzed: entries of the same category are first merged, and the relationships among the entries are then sorted out according to the attribute lists of the entries.

Specifically, as shown in FIG. 3, the Jena tool is used for data modeling. The model created with Jena mainly comprises entities and relationships. Entities are divided into entity types and specific entities: each entry is a specific entity, and the entry category arranged at the outset is the entity type. Model relationships are extracted from the attribute fields of the entities: after the attribute fields and attribute values of an entity are obtained, it is determined whether the content of an attribute value is itself another entity, and from this whether a corresponding relationship exists between the two entities. For example, if the place-of-origin attribute of the J-10 fighter is China, and China is another entry, then China and the J-10 have a place-of-origin relationship. By this rule, the relationship types and the relationship list among the entries are sorted out. Finally, the sorted entities and relationships are modeled with Jena, a corresponding OWL model file is generated, and the model is subsequently used in the graph database.
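The relationship rule of step S2 can be illustrated with a short sketch. It is written in plain Python rather than with Jena, and it does not generate an OWL file; the function names `merge_category_attributes` and `extract_relations`, the dictionary layout and the sample entries are assumptions introduced for the example.

```python
def merge_category_attributes(entities):
    """Merge the attribute fields of all entries that share a category."""
    merged = {}
    for record in entities.values():
        fields = merged.setdefault(record["category"], set())
        fields.update(record["attributes"])
    return {category: sorted(fields) for category, fields in merged.items()}


def extract_relations(entities):
    """Emit (entity, relation type, other entity) when a value is another entry."""
    relations = []
    for name, record in entities.items():
        for field, value in record["attributes"].items():
            # The rule from the description: an attribute value that is itself
            # an entry implies a relationship named after the attribute field.
            if value in entities and value != name:
                relations.append((name, field, value))
    return relations


# Hypothetical entry data as it might be read back from the MYSQL database.
entities = {
    "J-10": {"category": "fighter",
             "attributes": {"place of origin": "China", "crew": "1"}},
    "J-20": {"category": "fighter",
             "attributes": {"place of origin": "China", "engines": "2"}},
    "China": {"category": "country",
              "attributes": {"capital": "Beijing"}},
}

category_attributes = merge_category_attributes(entities)
relation_list = extract_relations(entities)
relation_types = sorted({relation for _, relation, _ in relation_list})
```

Here "Beijing" produces no relationship because it is not itself an entry, which is exactly the check the description applies to each attribute value.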

Step S3, initializing the search service data, including: initializing data in a MYSQL database, an ES database and a graph database.

In this step, the data in three stores are initialized: MYSQL, ES, and the graph database Neo4j.

1. Initializing the MYSQL database comprises: initializing the mapping relationship between entities and multimedia, which is used to look up entities in reverse from multimedia.

That is, the mapping between each entity and its multimedia must be initialized in MYSQL, so that the entity to which a given picture or video belongs can later be queried in reverse.
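The entity-to-multimedia mapping and its reverse lookup can be sketched as below. The standard-library `sqlite3` module is used as a stand-in so that the example runs without a MYSQL server; the table names, column names and the helper `entities_for_multimedia` are assumptions, since the description does not give a schema.

```python
import sqlite3

# In-memory SQLite database standing in for the MYSQL database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE multimedia (
    id   INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,          -- 'picture' or 'video'
    uri  TEXT NOT NULL
);
CREATE TABLE entity_multimedia (  -- the mapping relationship
    entity_id     INTEGER NOT NULL REFERENCES entity(id),
    multimedia_id INTEGER NOT NULL REFERENCES multimedia(id),
    PRIMARY KEY (entity_id, multimedia_id)
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO entity (id, name) VALUES (1, 'J-10')")
conn.executemany(
    "INSERT INTO multimedia (id, kind, uri) VALUES (?, ?, ?)",
    [(10, "picture", "j10_front.jpg"), (11, "video", "j10_demo.mp4")],
)
conn.executemany(
    "INSERT INTO entity_multimedia (entity_id, multimedia_id) VALUES (?, ?)",
    [(1, 10), (1, 11)],
)


def entities_for_multimedia(connection, multimedia_id):
    """Reverse query: which entities is this multimedia item attached to?"""
    rows = connection.execute(
        "SELECT e.name FROM entity e "
        "JOIN entity_multimedia em ON em.entity_id = e.id "
        "WHERE em.multimedia_id = ? ORDER BY e.name",
        (multimedia_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

A separate mapping table is one possible design; it lets one picture belong to several entities, which a single foreign-key column on the multimedia table would not.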

2. Initializing the data of the ES database comprises:

The ES search middleware stores the specific entity and relationship content: the entity content in the MYSQL database and the entry relationship list sorted out with Jena are stored into the ES database for keyword retrieval.
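What "storing content for keyword retrieval" amounts to can be shown with a toy inverted index. This is only a stand-in for the ES database and uses no Elasticsearch API; the function names, the document identifiers and the ASCII-only tokenizer are assumptions (Chinese text would need a proper word segmenter instead).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into ASCII letter/digit tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


# Hypothetical entity and relationship content to be indexed.
documents = {
    "entity:J-10": "J-10 single-engine fighter aircraft",
    "entity:China": "China country in East Asia",
    "relation:J-10/place of origin/China": "J-10 place of origin China",
}
index = build_index(documents)
```

Searching for `"china"` returns both the entity document and the relationship document, which mirrors the description's point that entities and relationships are retrievable through the same keyword search.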

In addition, the multimedia information is stored in the ES database for keyword retrieval. The invention processes the multimedia information in the database, adding the characteristic value of each picture and the BASE64 cover information of each video. The characteristic value of a picture is computed by an algorithm and lays the foundation for later search-by-picture; the video cover is a frame containing an image, extracted from the video and used for display. The processed multimedia information is stored in ES, and the association between multimedia and entities is stored in the MYSQL database.
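The description does not name the characteristic value algorithm. As one possible illustration, the sketch below uses a simple average hash over a grayscale pixel grid, compares two pictures by Hamming distance, and encodes a cover frame as BASE64. The choice of average hash, the function names and the tiny 2x2 sample images are all assumptions, and decoding real image or video files is left out.

```python
import base64


def average_hash(pixels):
    """Characteristic value of a grayscale image given as rows of 0..255 ints.

    Each pixel contributes one bit: 1 if it is brighter than the image mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits; a small distance suggests similar pictures."""
    return bin(hash_a ^ hash_b).count("1")


def encode_cover(frame_bytes):
    """BASE64 text for an extracted video frame, ready to store as cover info."""
    return base64.b64encode(frame_bytes).decode("ascii")


# Hypothetical 2x2 grayscale images: b is a slightly altered copy of a,
# c has the bright and dark pixels swapped.
image_a = [[10, 200], [220, 30]]
image_b = [[12, 190], [210, 40]]
image_c = [[200, 10], [30, 220]]

hash_a = average_hash(image_a)
hash_b = average_hash(image_b)
hash_c = average_hash(image_c)
```

Search-by-picture would then compute the hash of the query picture and return the stored pictures with the smallest Hamming distance, after which the entity is found through the MYSQL mapping.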

3. Initializing the graph database comprises: initializing the entity and relationship data in the model, and rendering the graph on the WEB page by using the Neo4j graph database. The structure of a graph database matches the content of graph relationships and supports fast lookup, which is exactly what the graph requires.

Specifically, the specific entity and relationship content is stored in the graph database: through the model sorted out with Jena, the invention writes the complete entity and relationship lists into the graph database, which facilitates the subsequent rendering of the graph.
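Writing the relationship list into the graph database and preparing data for the WEB page can be sketched as follows. The code only builds parameterized Cypher `MERGE` statements and a node/edge payload; it does not connect to Neo4j. The `Entity` label, the `RELATED` relationship with a `type` property, and both function names are assumptions, since the description does not specify the graph schema.

```python
def to_cypher(relations):
    """Build one parameterized Cypher statement per (source, relation, target)."""
    query = (
        "MERGE (a:Entity {name: $source}) "
        "MERGE (b:Entity {name: $target}) "
        "MERGE (a)-[:RELATED {type: $relation}]->(b)"
    )
    return [
        (query, {"source": source, "relation": relation, "target": target})
        for source, relation, target in relations
    ]


def graph_payload(relations):
    """Node and edge lists that a WEB page could use to render the graph."""
    nodes = sorted({name for source, _, target in relations
                    for name in (source, target)})
    edges = [
        {"source": source, "target": target, "label": relation}
        for source, relation, target in relations
    ]
    return {"nodes": nodes, "edges": edges}


# Hypothetical relationship list, as produced in step S2.
relation_list = [
    ("J-10", "place of origin", "China"),
    ("J-20", "place of origin", "China"),
]
statements = to_cypher(relation_list)
payload = graph_payload(relation_list)
```

The relationship type is stored as a property because Cypher does not allow a relationship type to be passed as a query parameter; using `MERGE` rather than `CREATE` keeps repeated initialization runs from duplicating nodes.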

Specifically, when the NLP service is initialized, the contents of the corpus and the word segmentation library are loaded, and the entity categories, relationship categories and specific entity information are added to the corpus. The NLP service provides natural language semantic analysis: fuzzy sentences are segmented and recognized by methods of inference, probability and statistics, and are finally parsed into information that is convenient to search and query. For example, natural language analysis of "mortar with the firing rate higher than 200km/h" yields the entity mortar, the attribute firing rate, the attribute value 200km/h and the condition "higher than"; the invention then searches the mortar entities for those whose firing rate attribute is higher than 200km/h.
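The mortar example can be turned into a small condition query sketch. Note the substitution: the description says the NLP tool works by inference, probability and statistics, whereas the code below uses plain dictionary lookup and a regular expression, purely to show the shape of the parsed result. The vocabularies, function names and sample entities are assumptions, and attribute values are assumed to be stored as numbers in the same unit as the query.

```python
import re

# Hypothetical vocabularies loaded when the NLP service is initialized.
ENTITY_CATEGORIES = {"mortar", "tank"}
ATTRIBUTES = {"firing rate", "weight"}
COMPARATORS = {"higher than": ">", "lower than": "<"}


def parse_condition(sentence):
    """Split a query sentence into category, attribute, operator and value."""
    text = sentence.lower()
    category = next((c for c in sorted(ENTITY_CATEGORIES) if c in text), None)
    attribute = next((a for a in sorted(ATTRIBUTES) if a in text), None)
    operator = next(
        (symbol for phrase, symbol in COMPARATORS.items() if phrase in text),
        None,
    )
    number = re.search(r"(\d+(?:\.\d+)?)\s*([a-z/]+)?", text)
    if None in (category, attribute, operator, number):
        return None  # the sentence is not a recognizable condition query
    return {
        "category": category,
        "attribute": attribute,
        "operator": operator,
        "value": float(number.group(1)),
        "unit": number.group(2) or "",
    }


def condition_query(entities, condition):
    """Names of entities of the requested category that satisfy the condition."""
    results = []
    for entity in entities:
        value = entity["attributes"].get(condition["attribute"])
        if entity["category"] != condition["category"] or value is None:
            continue
        if condition["operator"] == ">":
            satisfied = value > condition["value"]
        else:
            satisfied = value < condition["value"]
        if satisfied:
            results.append(entity["name"])
    return results


# Hypothetical entities; numeric values are in the same unit as the query.
sample_entities = [
    {"name": "mortar-a", "category": "mortar", "attributes": {"firing rate": 250}},
    {"name": "mortar-b", "category": "mortar", "attributes": {"firing rate": 150}},
    {"name": "tank-c", "category": "tank", "attributes": {"firing rate": 300}},
]
condition = parse_condition("mortar with the firing rate higher than 200km/h")
matches = condition_query(sample_entities, condition)
```

Only `mortar-a` is returned: `tank-c` exceeds the threshold but is filtered out by category, matching the description's restriction of the search to mortar entities.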

On the WEB search service page, the invention can perform keyword retrieval on the analyzed and processed public data, the scope of the keywords comprising entities, relationships, graph relationships and multimedia; it can perform semantic retrieval, in which the query sentence is analyzed and segmented by the NLP tool and the query is made according to the returned word segmentation result; and it can perform condition queries according to entity attributes and search-by-picture according to picture characteristic values. Finally, the invention also supports management of the model structure and updating of the data content on the data management platform.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that those skilled in the art may make changes, modifications, substitutions and variations to the above embodiments without departing from the spirit and principles of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. The novel searching method based on the public data is characterized by comprising the following steps:

in step S3, initializing the data of the MYSQL database includes: initializing the mapping relationship between entities and multimedia, which is used to look up entities in reverse from multimedia;

in step S3, initializing the ES database includes: storing the specific entity and relationship content in the ES search middleware, and storing the entity content in the MYSQL database, the entry relationship list sorted out with Jena, and the multimedia information into the ES database for keyword retrieval;

in step S3, the characteristic values of pictures and the cover information of videos are added to the multimedia information, and the association between multimedia and entities is stored in MYSQL;

in step S3, initializing the graph database includes: initializing the entity and relationship data in the model, and rendering the graph on a WEB page by using the graph database;

specifically, the specific entity and relationship content is stored in the graph database;

step S4, performing semantic analysis on the query sentence with an NLP tool, performing word segmentation and recognition on fuzzy sentences, and performing keyword retrieval on a WEB search service page according to the word segmentation result, wherein the scope of the keywords comprises entities, relationships, graph relationships and multimedia, and returning the retrieval result;

in step S4, the keyword retrieval includes: searching the entities, relationships, graph relationships and multimedia corresponding to the keywords; analyzing and segmenting the query sentence with the NLP tool and querying according to the returned word segmentation result; and performing condition queries according to entity attributes and search-by-picture according to picture characteristic values.

2. The method according to claim 1, wherein in step S1, the abstract and attribute fields of the obtained content corresponding to the entry information, and of the multimedia data associated with the entities, are parsed to obtain the attribute list corresponding to each entry.

3. The method according to claim 1, wherein in step S2, the Jena tool is used for data modeling with each entry treated as an entity; model relationships are extracted from the attribute fields of the entities; and after the attribute fields and attribute values of an entity are obtained, it is determined whether the content of an attribute value is another entity, and from this whether a corresponding relationship exists between the two entities, so as to obtain the relationship types and the relationship list among the entries.