CN110377558B - Document query method, device, computer equipment and storage medium

Info

Publication number: CN110377558B
Application number: CN201910514985.2A
Authority: CN
Inventors: 叶素兰; 窦文伟; 潘诗韵; 李弘�; 何麒; 徐国强
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-06-14
Filing date: 2019-06-14
Publication date: 2023-06-20
Anticipated expiration: 2039-06-14
Also published as: CN110377558A

Abstract

The present invention relates to the field of data processing, and in particular to data querying, namely a document query method, apparatus, computer device, and storage medium. The method comprises the following steps: acquiring a document to be queried and query information, and screening the document to be queried according to the query information to obtain an initial document; extracting initial page data contained in the initial document, and calculating similarity indexes between the query information and the initial page data; acquiring a first index weight corresponding to each similarity index, and calculating a target similarity between the query information and the initial page data according to the first index weight and the similarity index; and judging whether the target similarity exceeds a threshold value and, when it does, selecting the initial page data whose target similarity exceeds the threshold value as target page data, and outputting the target page data together with a first document identifier corresponding to the target page data. This method can improve the efficiency of document queries.

Description

Document query method, device, computer equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for querying documents, a computer device, and a storage medium.

Background

With the development of computer technology, more and more user tasks can be carried out on a computer. For example, legal staff can search for evidence files stored on a computer.

Conventionally, evidence files are searched manually: different evidence files are browsed one by one until the required evidence file is found, so query efficiency is low.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a document query method, apparatus, computer device, and storage medium that can improve query efficiency.

A document query method, the method comprising:

acquiring a document to be queried and query information, and screening the document to be queried according to the query information to obtain an initial document;

extracting initial page data contained in the initial document, and calculating a similarity index of the query information and the initial page data;

acquiring a first index weight corresponding to the similarity index, and calculating target similarity between the query information and the initial page data according to the first index weight and the similarity index;

judging whether the target similarity exceeds a threshold value, when the target similarity exceeds the threshold value, selecting initial page data corresponding to the target similarity exceeding the threshold value as target page data, and outputting a first document identification corresponding to the target page data and the target page data.

In one embodiment, the step of screening the initial document from the documents to be queried according to the query information includes:

word segmentation is carried out on the query information to obtain query keywords;

carrying out standardization processing on the query keywords to obtain standardized query keywords;

and acquiring a mapping relation associated with the document to be queried, and screening an initial document from the document to be queried according to the standardized query keywords and the mapping relation.

In one embodiment, the calculating the similarity index of the query information and the initial page data includes:

extracting an index type corresponding to the similarity index, and acquiring word segmentation logic associated with the initial page data when the index type is set similarity;

the initial page data is segmented according to the segmentation logic to obtain a first keyword set, and the query keywords are combined to obtain a second keyword set;

and calculating the set similarity according to the first keyword set and the second keyword set.

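In one embodiment, the calculating the similarity index of the query information and the initial page data includes:

when the index type is a text matching index, acquiring a first word frequency of the query keyword contained in the query information, and acquiring a second word frequency of the query keyword contained in the initial page data;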

counting the number of the initial documents containing the query keywords, and calculating an evaluation weight according to the number of the documents and the query keywords;

and acquiring an adjustment factor, acquiring the total number of the initial documents, and calculating a document matching index according to the adjustment factor, the total number, the first word frequency, the second word frequency and the evaluation weight.

In one embodiment, the calculating the similarity index of the query information and the initial page data includes:

when the index type is an inclusion index, acquiring a term score corresponding to the query keyword;

and when the query keyword is contained in the initial document, calculating an inclusion index according to the term score.

In one embodiment, said judging whether said target similarity exceeds a threshold value comprises:

when the target similarity does not exceed the threshold value, judging whether the query information has replacement information;

when the replacement information exists, calculating a replacement index corresponding to the initial page data according to the replacement information;

acquiring a second index weight corresponding to the similarity index, and calculating the replacement similarity of the initial page data and the replacement information according to the second index weight and the replacement index;

judging whether the replacement similarity exceeds a threshold value, when the replacement similarity exceeds the threshold value, selecting initial page data corresponding to the replacement similarity exceeding the threshold value as associated page data, and outputting a second document identifier corresponding to the associated page data, the associated page data and the replacement information.

A document querying device, the device comprising:

the acquisition module is used for acquiring the document to be queried and query information, and screening the document to be queried according to the query information to obtain an initial document;

the extraction module is used for extracting initial page data contained in the initial document and calculating a similarity index of the query information and the initial page data;

the calculation module is used for acquiring a first index weight corresponding to the similarity index and calculating the target similarity between the query information and the initial page data according to the first index weight and the similarity index;

the judging module is used for judging whether the target similarity exceeds a threshold value, selecting the initial page data corresponding to the target similarity exceeding the threshold value as target page data when the target similarity exceeds the threshold value, and outputting a first document identifier corresponding to the target page data and the target page data.

In one embodiment, the acquisition module includes:

the word segmentation unit is used for segmenting the query information to obtain query keywords;

the processing unit is used for carrying out standardized processing on the query keywords to obtain standardized query keywords;

and the screening unit is used for acquiring the mapping relation associated with the documents to be queried, and screening the initial documents from the documents to be queried according to the standardized query keywords and the mapping relation.

A computer device comprising a memory storing a computer program and a processor implementing the steps of the method described above when the processor executes the computer program.

A computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method described above.

According to the document query method, apparatus, computer device and storage medium described above, the required documents no longer have to be found by manually browsing documents one by one. The documents to be queried and the query information are acquired, and the initial documents are screened from the documents to be queried according to the query information. The initial page data contained in the initial documents is extracted, and the similarity indexes between the query information and the initial page data are calculated. The first index weights corresponding to the similarity indexes are then acquired, and the target similarity between the query information and the initial page data is calculated according to the first index weights and the similarity indexes. It is judged whether the target similarity exceeds a threshold value and, when it does, the initial page data whose target similarity exceeds the threshold value is output as target page data together with the corresponding first document identifier. The query efficiency of the document can thereby be improved.

Drawings

FIG. 1 is an application scenario diagram of a document query method in one embodiment;

FIG. 2 is a flow diagram of a document query method in one embodiment;

FIG. 3 is a schematic flow diagram of a screening step in one embodiment;

FIG. 4 is a block diagram of a document querying device in one embodiment;

FIG. 5 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

The document query method provided by the present application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The server 104 obtains the documents to be queried and the query information from the terminal 102, screens the initial documents from the documents to be queried according to the query information, extracts the initial page data contained in the screened initial documents, and calculates the similarity indexes between the query information and the initial page data. The server 104 then obtains the first index weight corresponding to each similarity index, calculates the target similarity between the query information and the initial page data according to the first index weights and the similarity indexes, and judges whether the target similarity exceeds a threshold value. When it does, the server 104 selects the initial page data whose target similarity exceeds the threshold value as the target page data and outputs the target page data together with the corresponding first document identifier. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device, and the server 104 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.

In one embodiment, as shown in FIG. 2, a document query method is provided. The method is described as applied to the server in FIG. 1 and includes the following steps:

S202: acquiring the document to be queried and query information, and screening the document to be queried according to the query information to obtain an initial document.

Specifically, the document to be queried refers to a document stored in the server, for example a case-related document such as an evidence document or a case introduction document. The query information refers to key information about the related document being sought, and may be sentence information, word information, and the like. The initial document refers to a related document selected from the documents to be queried according to the query information.

Specifically, the server scans an original document, converts the data content of the original document into a document picture, recognizes the document picture to obtain the corresponding document data, and fills the document data into an electronic template to form a document to be queried, which it then stores. At query time, the server acquires the prestored documents to be queried and the query information, and screens the initial document from the documents to be queried according to the query information. The screening may be performed by matching the query information directly against the documents to be queried and selecting the successfully matched documents as initial documents. Alternatively, the query information may be matched against the index keywords in an established index relation; when the matching succeeds, the documents associated with the successfully matched index keywords are taken as initial documents.

For example, the terminal provides a selection box or an input box on its display interface, and the user enters key information in the input box or selects key information in the selection box. This key information serves as the query information. The server obtains the query information from the terminal, obtains the prestored documents to be queried, and screens the initial document from them according to the query information. In one approach, the server segments the documents to be queried using word segmentation logic, matches the query information against the segmented documents, and takes the successfully matched documents as initial documents. In another approach, the server acquires an already established index relation, matches the query information against the keywords to be matched in the index relation and, when the matching succeeds, takes the documents to be queried that are associated with the matched keywords as initial documents.

When the initial document is screened, related documents can also be queried according to semantics. That is, the server acquires the query information and inputs the query information and the documents to be queried into a semantic recognition model, which calculates the matching degree between the query information and each document to be queried. Documents to be queried whose matching degree exceeds a matching degree threshold are selected as supplementary documents. The supplementary documents are either used as the initial documents on their own or combined with the documents selected according to the index relation to form the initial documents, which makes the queried initial documents more accurate.

S204: extracting initial page data contained in the initial document, and calculating a similarity index of the query information and the initial page data.

Specifically, the initial page data refers to the data contained in an initial document, and may be text data or the like. The similarity indexes are different indexes for evaluating the degree of similarity between the query information and the initial page data, each obtained by a different similarity calculation method. For example, a similarity index may be obtained by calculating similarity with a deep neural network model, by calculating with parameters such as word frequency, by calculating with the keywords in different initial documents as set elements, by calculating cosine similarity, or by checking whether the queried text contains the keywords. All of these similarity indexes may be calculated, or only one or several of them.

Specifically, the server extracts the initial page data from the initial document: it scans the prestored initial document line by line and extracts all detected content as the initial page data. The server then queries the index types of the similarity indexes to be calculated and, for each index type, calculates the similarity index between the query information and each item of initial page data one by one.

For example, suppose the similarity indexes to be calculated are a cosine similarity index and a model similarity index, that is, one similarity index obtained by cosine similarity calculation and one obtained by similarity calculation with a deep neural network model. For the model similarity index, the server inputs the query information and the initial page data into the trained calculation model, and the similarity output by the model is used as the similarity index.
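The text does not specify how the cosine similarity index is computed. As a minimal sketch, assuming the query information and the initial page data have already been segmented into tokens and are represented as term-count vectors (an assumption made here for illustration, as is the function name `cosine_similarity`), it could look like this:

```python
import math
from collections import Counter
from typing import Iterable


def cosine_similarity(query_tokens: Iterable[str], page_tokens: Iterable[str]) -> float:
    """Cosine similarity between two token lists viewed as term-count vectors."""
    query_counts = Counter(query_tokens)
    page_counts = Counter(page_tokens)
    # Only terms present in both vectors contribute to the dot product.
    shared_terms = query_counts.keys() & page_counts.keys()
    dot = sum(query_counts[t] * page_counts[t] for t in shared_terms)
    query_norm = math.sqrt(sum(c * c for c in query_counts.values()))
    page_norm = math.sqrt(sum(c * c for c in page_counts.values()))
    if query_norm == 0 or page_norm == 0:
        return 0.0  # an empty side has no direction, treat as no similarity
    return dot / (query_norm * page_norm)
```

Other vector representations (for example weighted terms or learned embeddings) would fit the same description equally well.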

It should be noted that different methods of calculating the similarity between the query information and the initial page data may be selected to obtain the similarity indexes. The candidate methods are calculation with a deep neural network model, calculation with parameters such as word frequency, calculation with the keywords in different initial documents as set elements, cosine similarity, and checking whether the queried text contains the keywords. One or more of the similarities obtained by these methods are selected and used together as the similarity indexes; when all five methods are used, five similarity indexes are obtained.

S206: acquiring a first index weight corresponding to the similarity index, and calculating the target similarity between the query information and the initial page data according to the first index weight and the similarity index.

Specifically, the first index weight refers to the proportion that each similarity index contributes when the target similarity is calculated from several different similarity indexes. The target similarity refers to the final degree of similarity between the query information and the initial page data, calculated from the similarity indexes. Specifically, once the server has calculated the similarity indexes, it queries the prestored first index weights and calculates the target similarity between the query information and the initial page data according to the first index weights and the similarity indexes. When the server has calculated only one similarity index, the first index weight is 1, and the target similarity is the product of that similarity index and the first index weight. When the server has calculated several similarity indexes, it obtains the first index weight corresponding to each one, multiplies each similarity index by its first index weight, and adds all the products to obtain the target similarity.
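The weighted combination described above can be sketched as follows. This is an illustrative reading, not the patented implementation: the function name `target_similarity` and the use of dictionaries keyed by index type are assumptions.

```python
from typing import Mapping


def target_similarity(indexes: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Combine similarity indexes into one target similarity.

    `indexes` maps an index type (hypothetical names such as "set" or
    "cosine") to its value; `weights` maps the same types to the
    prestored first index weights.
    """
    if len(indexes) == 1:
        # A single similarity index carries a first index weight of 1.
        (value,) = indexes.values()
        return value * 1.0
    # Several indexes: multiply each by its weight and add the products.
    return sum(value * weights[name] for name, value in indexes.items())
```

With indexes `{"set": 0.5, "cosine": 0.8}` and weights `{"set": 0.4, "cosine": 0.6}`, the products are 0.2 and 0.48, giving a target similarity of 0.68.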

S208: judging whether the target similarity exceeds a threshold value, when the target similarity exceeds the threshold value, selecting initial page data corresponding to the target similarity exceeding the threshold value as target page data, and outputting a first document identifier corresponding to the target page data and the target page data.

Specifically, the target page data refers to the data of a page selected from the initial page data. The first document identifier refers to an identifier of the document corresponding to the target page data, and may be the name of the document associated with the target page data, a page number of the document, or the like. Specifically, once the server has calculated the target similarity, it obtains the threshold value and compares the target similarity with it. When the target similarity exceeds the threshold value, the server takes the initial page data whose target similarity exceeds the threshold value as the target page data, queries the first document identifier corresponding to the target page data, and outputs the document name, the document page number and the target page data together.
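A short sketch of this thresholding step, assuming each item of initial page data is held in a small record with its document name and page number (the `Page` type and `select_target_pages` are hypothetical names introduced for illustration):

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Page:
    document_name: str  # part of the first document identifier
    page_number: int    # part of the first document identifier
    text: str           # the page data itself


def select_target_pages(
    scored_pages: Iterable[Tuple[Page, float]], threshold: float
) -> List[Tuple[str, int, str]]:
    """Keep pages whose target similarity exceeds the threshold.

    Each result carries the document name, page number and page data,
    which are output together.
    """
    return [
        (page.document_name, page.page_number, page.text)
        for page, similarity in scored_pages
        if similarity > threshold  # strictly "exceeds", as in the text
    ]
```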

In this embodiment, the server obtains the documents to be queried and the query information, screens the initial document from the documents to be queried according to the query information, extracts the initial page data contained in the initial document, and calculates the similarity indexes between the query information and the initial page data. It then obtains the first index weights corresponding to the similarity indexes, calculates the target similarity between the query information and the initial page data according to the first index weights and the similarity indexes, and judges whether the target similarity exceeds a threshold value. When it does, the server selects the initial page data whose target similarity exceeds the threshold value as the target page data and outputs the target page data together with the corresponding first document identifier, which improves document query efficiency. Because the target similarity is calculated from several different similarity indexes, the inaccuracy of a target similarity based on a single similarity index is avoided, which improves the accuracy of the queried documents.

In one embodiment, referring to FIG. 3, a flowchart of a screening step is provided. The screening step, that is, screening the initial document from the documents to be queried according to the query information, includes: performing word segmentation on the query information to obtain query keywords; standardizing the query keywords to obtain standardized query keywords; and obtaining a mapping relation associated with the documents to be queried, and screening the initial document from the documents to be queried according to the standardized query keywords and the mapping relation.

Specifically, the mapping relation refers to an association between query keywords and the corresponding documents. A related document can be found from a query keyword through the mapping relation, and the query keywords contained in an initial document can be found from that document through the mapping relation. For example, the mapping relation may associate each keyword to be matched with the related first document identifier, so that the names and page numbers of the documents containing a keyword to be matched can be looked up from that keyword. Specifically, the server segments the acquired query information using word segmentation logic to obtain query keywords, acquires standardization logic, and processes the query keywords according to the standardization logic to obtain standardized query keywords. It then acquires the prestored mapping relation, matches the standardized query keywords against the keywords to be matched contained in the mapping relation and, when the matching succeeds, selects the documents associated with the successfully matched keywords as initial documents.
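The lookup through the mapping relation behaves like an inverted index. The sketch below assumes the mapping is a dictionary from each keyword to be matched to a set of (document name, page number) identifiers, and that a document qualifies if any standardized query keyword matches; both choices, and the name `screen_initial_documents`, are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

DocumentId = Tuple[str, int]  # (document name, page number)


def screen_initial_documents(
    standardized_keywords: Iterable[str],
    mapping: Dict[str, Set[DocumentId]],
) -> List[DocumentId]:
    """Return identifiers of documents associated with any matched keyword."""
    matched: Set[DocumentId] = set()
    for keyword in standardized_keywords:
        # A keyword absent from the mapping is simply an unsuccessful match.
        matched |= mapping.get(keyword, set())
    return sorted(matched)  # sorted only to make the output deterministic
```

Requiring every keyword to match (an intersection instead of a union) would be an equally plausible reading of the text and only changes the set operation.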

The server obtains the query information and the word segmentation logic corresponding to it, splits the query information into different candidate word segmentation sequences according to the word segmentation logic, and calculates the splitting accuracy of each sequence. To calculate the splitting accuracy, the server may look up the word probability of each segmented word in a sequence and take the product of all the word probabilities as the splitting accuracy of that sequence. The words contained in the sequence with the highest splitting accuracy are used as the query keywords. The server then obtains the corresponding standardization logic and standardizes the query keywords to obtain standardized query keywords. The standardization may adjust the format of the query keywords according to the preset standardization logic, for example by deleting special characters from the query keywords, converting traditional characters into simplified characters, or replacing certain characters with standard characters. Finally, the server obtains the mapping relation, matches the standardized query keywords against the keywords to be matched in the mapping relation and, when the matching succeeds, takes the documents associated with the matched keywords as initial documents.
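As a rough sketch of the segmentation choice and the standardization step, assuming the candidate sequences and a word probability table are already available: the names `split_accuracy`, `best_segmentation` and `standardize`, the fallback probability for unknown words, and the regular expression used for special characters are all illustrative assumptions. Traditional-to-simplified conversion would need a character table and is left out.

```python
import math
import re
from typing import Dict, List, Sequence


def split_accuracy(
    sequence: Sequence[str], word_prob: Dict[str, float], unknown_prob: float = 1e-8
) -> float:
    """Splitting accuracy: the product of the word probabilities in a sequence."""
    return math.prod(word_prob.get(word, unknown_prob) for word in sequence)


def best_segmentation(
    candidates: List[List[str]], word_prob: Dict[str, float]
) -> List[str]:
    """Pick the candidate sequence with the highest splitting accuracy."""
    return max(candidates, key=lambda seq: split_accuracy(seq, word_prob))


def standardize(keyword: str) -> str:
    """Delete special characters, keeping letters, digits and underscores."""
    return re.sub(r"[^\w]", "", keyword)
```

For long sequences, summing log-probabilities instead of multiplying raw probabilities avoids floating-point underflow while selecting the same sequence.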

In this embodiment, the server performs word segmentation on the query information to obtain a query keyword, performs standardization processing on the query keyword to obtain a standardized query keyword, obtains a mapping relation associated with a document to be queried, and screens an initial document from the document to be queried according to the standardized query keyword and the mapping relation, so that the step of screening the initial document is simple and easy to implement, and the efficiency of querying a corresponding target document is improved.

In one embodiment, calculating a similarity index of the query information and the initial page data includes: extracting an index type corresponding to the similarity index, and acquiring word segmentation logic associated with the initial page data when the index type is set similarity; segmenting the initial page data according to the word segmentation logic to obtain a first keyword set, and combining the query keywords to obtain a second keyword set; and calculating the set similarity according to the first keyword set and the second keyword set.

Specifically, the index type refers to the type of similarity, that is, the type of similarity calculation method employed. The set similarity refers to the similarity obtained by forming sets from the corresponding keywords and applying the related calculation logic to those sets. Specifically, when calculating the similarity indexes between the query information and the initial page data, the server extracts the index type corresponding to each similarity index. When the extracted index type is set similarity, the server builds a set of keywords for the initial page data and a set of query keywords: it acquires the corresponding word segmentation logic, segments the initial page data with it, and forms the first keyword set from the segmented initial page data; it then uses the query keywords obtained by segmenting the query information as the second keyword set. The server calculates the intersection and the union of the first keyword set and the second keyword set, and calculates the set similarity from the intersection and the union. The set similarity is then used as the similarity index.

The server extracts the index type corresponding to a similarity index. When the index type is set similarity, it obtains the word segmentation logic associated with the initial page data and uses it to split the initial page data into different candidate page word segmentation sequences. It calculates the phrase probability of each segmented phrase in each sequence, calculates the splitting accuracy of each sequence from the phrase probabilities, and takes the sequence with the highest splitting accuracy as the segmented initial page data. The words of the segmented initial page data form the first keyword set, and the query keywords form the second keyword set. The server then calculates the intersection of the first keyword set and the second keyword set and counts the distinct elements in it as a first quantity, calculates the union of the two sets and counts the distinct elements in it as a second quantity, and calculates the ratio of the first quantity to the second quantity. This ratio is the set similarity, which is used as the similarity index.
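The ratio of intersection size to union size described here is the Jaccard coefficient. A minimal sketch, assuming both inputs are already segmented keyword lists (the function name `set_similarity` and the handling of two empty sets are assumptions):

```python
from typing import Iterable


def set_similarity(page_keywords: Iterable[str], query_keywords: Iterable[str]) -> float:
    """Set similarity: distinct shared keywords divided by distinct keywords overall."""
    first_set = set(page_keywords)    # first keyword set, from the initial page data
    second_set = set(query_keywords)  # second keyword set, from the query keywords
    union = first_set | second_set
    if not union:
        return 0.0  # both sets empty: treated here as no similarity
    return len(first_set & second_set) / len(union)
```

For page keywords `["loan", "contract", "date"]` and query keywords `["loan", "contract", "amount"]`, the intersection has 2 elements and the union has 4, so the set similarity is 0.5.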

In this embodiment, the server extracts the index type corresponding to the similarity index. When the index type is set similarity, it obtains the word segmentation logic associated with the initial page data, segments the initial page data according to that logic to obtain the first keyword set, combines the query keywords to obtain the second keyword set, and calculates the set similarity from the two sets. The calculation is simple, and it yields one of several different similarity indexes. Taking these different similarity indexes into account when the target similarity is later calculated improves the accuracy of the calculated target similarity.

In one embodiment, calculating a similarity index of the query information and the initial page data includes: when the index type is a text matching index, acquiring a first word frequency of a query keyword contained in the query information, and acquiring a second word frequency of the query keyword contained in the initial page data; counting the number of initial documents containing the query keyword, and calculating an evaluation weight according to the number of documents and the query keyword; and acquiring an adjustment factor, acquiring the total number of the initial documents, and calculating a document matching index according to the adjustment factor, the total number, the first word frequency, the second word frequency and the evaluation weight.

Specifically, the text matching index refers to a similarity index obtained by calculating a correlation similarity according to the word frequency of the query keyword contained in the initial page data and the index of the importance degree of the query keyword in the initial page data. The first term frequency is the number of times a query keyword appears in the query information. The second term frequency refers to the number of times the query keyword appears in the initial page data, and the evaluation weight refers to an index that evaluates the importance degree of the query keyword in different initial page data. The adjustment factor refers to a preset adjustment parameter, which can be obtained by training according to a preset parameter training model, wherein when the adjustment parameter is trained by adopting the parameter training model, a sample ordering index and a sample relativity are input into the parameter training model for training, so that an adjustment factor is obtained.

Specifically, when the server determines that the index type is the text matching index, the server matches the query keyword against the words contained in the query information one by one and counts the number of successful matches as the first word frequency, that is, the first word frequency of the query keyword in the query information. The server then matches the query keyword against the words contained in the initial page data one by one and counts the number of successful matches as the second word frequency. Next, the server matches the query keyword against each item of initial page data to count the number of items of initial page data that contain the query keyword, acquires the total data amount of the initial page data, and calculates the word evaluation weight from these two values. The server acquires the adjustment factors obtained by training and, using the correlation calculation logic, calculates the similarity between the query information and the initial page data from the adjustment factors, the total data amount of the initial page data, the first word frequency, the second word frequency and the word evaluation weight; this similarity serves as the text matching index. It should be noted that, if the server obtains several different query keywords, a sub-similarity may be calculated for each query keyword and each item of initial page data, and the sub-similarities may be summed to obtain the text matching index. The evaluation weight can be calculated by the following formula:

wherein N represents the total data amount of the initial page data, n(q_i) represents the data amount of the initial page data containing the query keyword, and q_i represents the different query keywords. The text matching index can be calculated by the following formula:

wherein,

dl is the length of the initial page data, that is, the number of words contained in the initial page data; avgdl is the average length of all the initial page data, that is, the average number of words contained in the initial page data; k1, k2 and b represent the adjustment factors; TF1 represents the first word frequency; and TF2 represents the second word frequency. N represents the total data amount of the initial page data, n(q_i) represents the data amount of the initial page data containing the query keyword, and q_i represents the different query keywords.
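The symbols defined above (N, n(q_i), k1, k2, b, dl, avgdl and the two word frequencies) coincide with those of the widely used Okapi BM25 ranking function. The formulas themselves do not appear in this text, so the following is only an assumed reading based on that standard form, in which Q denotes the query information and d an item of initial page data:

```latex
% Assumed evaluation weight (standard BM25 inverse document frequency)
\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}

% Assumed text matching index: sum of weighted sub-similarities
\mathrm{Score}(Q, d) = \sum_{i} \mathrm{IDF}(q_i) \cdot R(q_i, d)

% Assumed relevance term and length normalisation
R(q_i, d) = \frac{TF2_i \,(k_1 + 1)}{TF2_i + K} \cdot \frac{TF1_i \,(k_2 + 1)}{TF1_i + k_2},
\qquad
K = k_1 \left( 1 - b + b \cdot \frac{dl}{avgdl} \right)
```

Under this reading, the second word frequency TF2 (occurrences in the initial page data) is damped by k1 and the length term K, while the first word frequency TF1 (occurrences in the query information) is damped by k2.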

In this embodiment, when the index type is a text matching index, the server obtains the first word frequency of the query keyword in the query information and the second word frequency of the query keyword in the initial page data, counts the number of initial documents containing the query keyword, and calculates the evaluation weight from that number. The server then obtains the adjustment factors and the total number of initial documents, and calculates the document matching index from the adjustment factors, the total number, the first word frequency, the second word frequency and the evaluation weight. The calculation is simple and easy to implement, and it yields one of several different similarity indexes; taking different similarity indexes into account when the target similarity is later calculated improves the accuracy of the calculated target similarity.
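The text matching index described above can be sketched in code as follows. This is a hedged illustration that assumes the calculation follows the standard Okapi BM25 form with a query word frequency term; the function names and the default values of `k1`, `k2` and `b` are assumptions, since in the text the adjustment factors are obtained by training.

```python
import math

def evaluation_weight(total_pages, pages_with_keyword):
    """Assumed evaluation weight: the standard BM25 IDF term."""
    return math.log((total_pages - pages_with_keyword + 0.5)
                    / (pages_with_keyword + 0.5))

def text_matching_index(query_terms, page_terms, all_pages,
                        k1=1.2, k2=100.0, b=0.75):
    """Sum the sub-similarities of each distinct query keyword for one page."""
    total = len(all_pages)                           # N
    avgdl = sum(len(p) for p in all_pages) / total   # average page length
    dl = len(page_terms)                             # length of this page
    big_k = k1 * (1 - b + b * dl / avgdl)            # length normalisation K
    score = 0.0
    for keyword in set(query_terms):
        tf1 = query_terms.count(keyword)             # first word frequency
        tf2 = page_terms.count(keyword)              # second word frequency
        if tf2 == 0:
            continue                                 # keyword absent from page
        n_q = sum(1 for p in all_pages if keyword in p)  # n(q_i)
        idf = evaluation_weight(total, n_q)
        score += (idf
                  * (tf2 * (k1 + 1)) / (tf2 + big_k)
                  * (tf1 * (k2 + 1)) / (tf1 + k2))
    return score

pages = [
    ["loan", "contract", "loan"],
    ["receipt", "payment"],
    ["weather", "report"],
    ["travel", "notes"],
]
matching = text_matching_index(["loan"], pages[0], pages)
missing = text_matching_index(["loan"], pages[1], pages)
```

Note that with this form of the evaluation weight, a keyword that occurs in more than half of the pages receives a negative weight; some implementations clamp or shift the value to avoid this.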

In one embodiment, calculating a similarity index of the query information and the initial page data includes: when the index type is an inclusion index, acquiring a word score corresponding to the query keyword; and when the query keyword is contained in the initial document, calculating the inclusion index according to the word score.

Specifically, the inclusion index is an index for measuring the similarity between the initial page data and the query keyword by using whether the initial page data contains the query keyword. The term score refers to the score of different preset query keywords, so that when the query keywords are contained in the initial page data, a corresponding inclusion index can be calculated according to the term score.

Specifically, when the server extracts that the index type is the inclusion index, the server extracts word scores of preset different query keywords, the query keywords are respectively matched with the initial page data, and when the matching is successful, word scores corresponding to the query keywords which are successfully matched are added, so that the sum obtained by the addition is used as the inclusion index. It should be noted that, the server may also obtain weights corresponding to different query keywords, and when the server matches to obtain that the query keywords are included in the initial page data, calculate the product of the term score and the weight of the query keywords included in the initial page data, and add the different products to obtain the inclusion index.

In this embodiment, the server calculates the inclusion index simply by looking up the word score corresponding to the query keyword and determining whether the query keyword is contained in the initial document. This yields one of several different similarity indexes; taking different similarity indexes into account when the target similarity is later calculated improves the accuracy of the calculated target similarity.
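The inclusion index can be sketched as a sum over the query keywords found in the page, optionally multiplied by a per-keyword weight as in the weighted variant noted above. The function name and the example scores are hypothetical.

```python
def inclusion_index(page_terms, word_scores, weights=None):
    """Add up the word scores of the query keywords contained in the page.

    word_scores: preset score for each query keyword.
    weights: optional per-keyword weights; a missing weight counts as 1.0.
    """
    page = set(page_terms)
    weights = weights or {}
    return sum(score * weights.get(keyword, 1.0)
               for keyword, score in word_scores.items()
               if keyword in page)

plain = inclusion_index(["loan", "contract"], {"loan": 2.0, "receipt": 3.0})
weighted = inclusion_index(["loan", "contract"],
                           {"loan": 2.0, "receipt": 3.0},
                           weights={"loan": 0.5})
```

Only "loan" is contained in the example page, so the unweighted index equals its word score and the weighted index is half of it.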

In one embodiment, after querying whether the target similarity exceeds a threshold, comprising: when the target similarity does not exceed the threshold value, acquiring whether the query information has replacement information or not; when the replacement information exists, calculating a replacement index corresponding to the initial page data and the replacement information according to the replacement information; acquiring a second index weight corresponding to the similarity index, and calculating the replacement similarity of the initial page data and the replacement information according to the second index weight and the replacement index; inquiring whether the replacement similarity exceeds a threshold value, when the replacement similarity exceeds the threshold value, selecting initial page data corresponding to the replacement similarity exceeding the threshold value as associated page data, and outputting a second document identification corresponding to the associated page data, the associated page data and replacement information.

Specifically, the replacement information refers to information associated with the query information, for example, may be information obtained by replacing different characters or terms with synonyms or homophones in the query information. The replacement index refers to different indexes for evaluating the similarity degree between the replacement information and the initial page data, and the replacement index can be a similarity index obtained by adopting different similarity calculation methods, for example, the replacement index can be a replacement index obtained by adopting a deep neural network model to calculate the similarity, can be a replacement index obtained by adopting parameters such as word frequency and the like to calculate, can be a replacement index obtained by adopting keywords in different initial documents as elements to calculate, can be a replacement index obtained by adopting cosine similarity to calculate, and can be a replacement index obtained by inquiring whether the text contains keywords or not. And the replacement index can be calculated by selecting all the replacement indexes, or one or different ones of the replacement indexes can be calculated. The replacement similarity refers to the final degree of similarity between the replacement information calculated from the replacement index and the initial page data. The associated page data refers to data of a page selected from the initial page data according to the replacement information. The second document identifier refers to a relevant flag of the document corresponding to the associated page data, and may be a name of the document corresponding to the associated page data, a page of the document, or the like.

Specifically, after the server calculates the target similarity, the current query information may contain errors, so the server checks whether a replacement keyword exists, calculates the relevant replacement indexes with the replacement keyword, and then calculates the corresponding replacement similarity. That is, after calculating the target similarity, the server acquires the relevant threshold and compares the target similarity with it. When the target similarity does not exceed the threshold, the server judges whether the query keyword has a replacement keyword; when it does, the replacement keyword replaces the query keyword to obtain the replacement information. The server then looks up the index types of the replacement indexes to be calculated and, according to those index types, calculates the replacement indexes between the replacement information and each item of initial page data one by one. Once the replacement indexes are calculated, the server acquires the second index weights corresponding to the different replacement indexes and calculates the replacement similarity between the initial page data and the replacement information from the second index weights and the replacement indexes. The server then compares the replacement similarity with the threshold; when the replacement similarity exceeds the threshold, the initial page data corresponding to that replacement similarity is selected as the associated page data, and the associated page data can be output.
It should be noted that the specific steps of calculating the replacement index may refer to those of calculating the similarity index, and the steps of calculating the replacement similarity may refer to those of calculating the target similarity. It should also be noted that, when the server finds a corresponding replacement keyword, it may first output the replacement keyword and send it to the terminal for display. The user then selects an option according to the displayed replacement keyword, for example an option agreeing to the replacement; the terminal generates a confirmation instruction from the selected option and sends it to the server; and on receiving the confirmation instruction, the server performs the step of calculating the replacement index corresponding to the initial page data and the replacement information according to the replacement information.

In this embodiment, when the server finds that the target similarity does not exceed the threshold, it judges whether replacement information exists for the query information. When the replacement information exists, the server calculates the replacement index corresponding to the initial page data according to the replacement information, obtains the second index weight corresponding to the similarity index, and calculates the replacement similarity between the initial page data and the replacement information according to the second index weight and the replacement index. The server then checks whether the replacement similarity exceeds the threshold; when it does, the initial page data corresponding to that replacement similarity is selected as the associated page data, and the second document identifier corresponding to the associated page data, the associated page data and the replacement information are output. This helps prevent the queried documents from being irrelevant documents, enhances applicability, and improves screening efficiency.
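The fallback to replacement information can be sketched as below. The sketch makes several simplifying assumptions that are not in the text: replacement information is built from a hypothetical synonym table, the similarity calculations are passed in as callables, and the same toy `overlap` function stands in for both the target similarity and the replacement similarity.

```python
def query_pages(pages, query, synonyms, target_score, replacement_score, threshold):
    """Split pages into target pages and associated pages.

    pages: mapping of document identifier to the page's keyword list.
    synonyms: hypothetical table mapping a query keyword to its replacement.
    """
    replaced = [synonyms.get(term, term) for term in query]
    has_replacement = replaced != query
    target, associated = {}, {}
    for doc_id, page in pages.items():
        if target_score(query, page) > threshold:
            target[doc_id] = page          # target page data
        elif has_replacement and replacement_score(replaced, page) > threshold:
            associated[doc_id] = page      # associated page data
    # The replacement information is only output when it produced a hit.
    return target, associated, (replaced if associated else None)

def overlap(query, page):
    """Toy similarity: share of query keywords found in the page."""
    return len(set(query) & set(page)) / len(set(query))

pages = {
    "doc1#p3": ["loan", "contract"],
    "doc2#p1": ["borrowing", "contract"],
}
target, associated, replacement = query_pages(
    pages, ["loan"], {"loan": "borrowing"}, overlap, overlap, threshold=0.5
)
```

In this example the first page exceeds the threshold directly, while the second page is reached only through the replacement keyword and is therefore returned as associated page data together with the replacement information.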

It should be understood that, although the steps in the flowcharts of fig. 2-3 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited in their order of execution and may be executed in other orders. Moreover, at least some of the steps in fig. 2-3 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed sequentially, as they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 4, there is provided a document querying device 400 comprising: an acquisition module 410, an extraction module 420, a calculation module 430, and a determination module 440, wherein:

The acquisition module 410 is configured to obtain a document to be queried and query information, and filter an initial document from the document to be queried according to the query information.

The extraction module 420 is configured to extract initial page data included in the initial document, and calculate a similarity index between the query information and the initial page data.

The calculation module 430 is configured to obtain a first index weight corresponding to the similarity index, and calculate the target similarity between the query information and the initial page data according to the first index weight and the similarity index.

The determination module 440 is configured to judge whether the target similarity exceeds a threshold, and when the target similarity exceeds the threshold, select initial page data corresponding to the target similarity exceeding the threshold as target page data, and output a first document identifier corresponding to the target page data and the target page data.
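The work of the calculation module and the determination module can be sketched as a weighted combination followed by a threshold test. Treating the first index weight as a coefficient in a weighted sum is an assumption, and the index names, weights and threshold below are illustrative values only.

```python
def target_similarity(indexes, weights):
    """Assumed combination: weighted sum of the similarity indexes."""
    return sum(weights[name] * value for name, value in indexes.items())

def select_target_pages(page_indexes, weights, threshold):
    """Return (document identifier, target similarity) for pages above the threshold."""
    results = []
    for doc_id, indexes in page_indexes.items():
        similarity = target_similarity(indexes, weights)
        if similarity > threshold:
            results.append((doc_id, similarity))
    return results

weights = {"set": 0.5, "text": 0.5}   # hypothetical first index weights
page_indexes = {
    "doc1#p2": {"set": 0.8, "text": 0.6},
    "doc2#p7": {"set": 0.1, "text": 0.2},
}
selected = select_target_pages(page_indexes, weights, threshold=0.5)
```

Only the first page is selected here, since its combined similarity is about 0.7 while the second page's is about 0.15.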

In one embodiment, the acquisition module 410 includes:

a word segmentation unit, configured to segment the query information to obtain query keywords.

A processing unit, configured to perform standardization processing on the query keywords to obtain standardized query keywords.

In one embodiment, the extraction module 420 includes:

an extraction unit, configured to extract the index type corresponding to the similarity index and, when the index type is the set similarity, obtain the word segmentation logic associated with the initial document.

A combination unit, configured to segment the initial document according to the word segmentation logic to obtain a first keyword set, and combine the query keywords to obtain a second keyword set.

A first calculation unit, configured to calculate the set similarity according to the first keyword set and the second keyword set.

In one embodiment, the extraction module 420 may further include:

a word frequency acquisition unit, configured to, when the index type is the text matching index, acquire a first word frequency of the query keyword contained in the query information and a second word frequency of the query keyword contained in the initial page data.

A statistics unit, configured to count the number of the initial documents containing the query keywords, and calculate the evaluation weight according to the number of the documents and the query keywords.

A second calculation unit, configured to acquire the adjustment factors and the total number of the initial documents, and calculate the document matching index according to the adjustment factors, the total number, the first word frequency, the second word frequency and the evaluation weight.

In one embodiment, the extraction module 420 may further include:

a score acquisition unit, configured to acquire the word score corresponding to the query keyword when the index type is the inclusion index.

A third calculation unit, configured to calculate the inclusion index according to the word score when the query keyword is contained in the initial document.

In one embodiment, the document query apparatus 400 may further include:

a replacement information acquisition module, configured to judge whether the query information has replacement information when the target similarity does not exceed the threshold.

A replacement index calculation module, configured to calculate a replacement index corresponding to the initial page data according to the replacement information when the replacement information exists.

A replacement similarity calculation module, configured to acquire second index weights corresponding to the replacement indexes, and calculate the replacement similarity between the initial page data and the replacement information according to the second index weights and the replacement indexes.

An output module, configured to check whether the replacement similarity exceeds a threshold, select, when the replacement similarity exceeds the threshold, the initial page data corresponding to that replacement similarity as associated page data, and output a second document identifier corresponding to the associated page data, the associated page data and the replacement information.

For specific limitations of the document querying device, reference may be made to the above limitations of the document query method, which are not repeated here. The modules in the document querying device described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing document query data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a document query method.

It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory storing a computer program and a processor that, when executing the computer program, performs the following steps: acquiring a document to be queried and query information, and screening the document to be queried according to the query information to obtain an initial document; extracting initial page data contained in the initial document, and calculating a similarity index of the query information and the initial page data; acquiring a first index weight corresponding to the similarity index, and calculating the target similarity between the query information and the initial page data according to the first index weight and the similarity index; and judging whether the target similarity exceeds a threshold, and when the target similarity exceeds the threshold, selecting the initial page data corresponding to the target similarity exceeding the threshold as target page data, and outputting a first document identifier corresponding to the target page data and the target page data.

In one embodiment, when the processor executes the computer program, screening the document to be queried according to the query information to obtain an initial document includes: segmenting the query information to obtain query keywords; performing standardization processing on the query keywords to obtain standardized query keywords; and acquiring a mapping relation associated with the document to be queried, and screening an initial document from the document to be queried according to the standardized query keywords and the mapping relation.
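The screening step can be sketched as follows. This section does not say what the standardization processing or the mapping relation consist of, so the sketch assumes, purely for illustration, that standardization means lower-casing and stop-word removal, that segmentation is whitespace splitting, and that the mapping relation maps each keyword to the identifiers of the documents containing it.

```python
def standardize(keywords, stop_words=frozenset({"the", "a", "of"})):
    """Assumed standardization: trim, lower-case, and drop stop words."""
    cleaned = (keyword.strip().lower() for keyword in keywords)
    return [keyword for keyword in cleaned if keyword and keyword not in stop_words]

def screen_documents(query_information, mapping):
    """Return the identifiers of initial documents for the query.

    mapping: hypothetical keyword-to-document-identifiers mapping relation.
    Whitespace splitting stands in for the real word segmentation.
    """
    keywords = standardize(query_information.split())
    initial = set()
    for keyword in keywords:
        initial |= mapping.get(keyword, set())
    return initial

mapping = {"loan": {"doc1", "doc2"}, "receipt": {"doc3"}}
initial_documents = screen_documents("The Loan", mapping)
```

A document is kept as soon as any standardized query keyword maps to it; requiring all keywords to match would be an equally plausible design.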

In one embodiment, when the processor executes the computer program, calculating a similarity index of the query information and the initial page data includes: extracting the index type corresponding to the similarity index, and acquiring the word segmentation logic associated with the initial document when the index type is the set similarity; segmenting the initial document according to the word segmentation logic to obtain a first keyword set, and combining the query keywords to obtain a second keyword set; and calculating the set similarity according to the first keyword set and the second keyword set.

In one embodiment, when the processor executes the computer program, calculating a similarity index of the query information and the initial page data includes: when the index type is a text matching index, acquiring a first word frequency of the query keyword contained in the query information, and acquiring a second word frequency of the query keyword contained in the initial page data; counting the number of the initial documents containing the query keywords, and calculating the evaluation weight according to the number of the documents and the query keywords; and acquiring the adjustment factors and the total number of the initial documents, and calculating the document matching index according to the adjustment factors, the total number, the first word frequency, the second word frequency and the evaluation weight.

In one embodiment, when the processor executes the computer program, calculating a similarity index of the query information and the initial page data includes: when the index type is an inclusion index, acquiring a word score corresponding to the query keyword; and when the query keyword is contained in the initial document, calculating the inclusion index according to the word score.

In one embodiment, when the processor executes the computer program, after judging whether the target similarity exceeds the threshold, the steps further include: when the target similarity does not exceed the threshold, judging whether the query information has replacement information; when the replacement information exists, calculating a replacement index corresponding to the initial page data and the replacement information according to the replacement information; obtaining a second index weight corresponding to the similarity index, and calculating the replacement similarity between the initial page data and the replacement information according to the second index weight and the replacement index; and checking whether the replacement similarity exceeds a threshold, and when it does, selecting the initial page data corresponding to the replacement similarity exceeding the threshold as associated page data, and outputting a second document identifier corresponding to the associated page data, the associated page data and the replacement information.

In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, performs the following steps: acquiring a document to be queried and query information, and screening the document to be queried according to the query information to obtain an initial document; extracting initial page data contained in the initial document, and calculating a similarity index of the query information and the initial page data; acquiring a first index weight corresponding to the similarity index, and calculating the target similarity between the query information and the initial page data according to the first index weight and the similarity index; and judging whether the target similarity exceeds a threshold, and when the target similarity exceeds the threshold, selecting the initial page data corresponding to the target similarity exceeding the threshold as target page data, and outputting a first document identifier corresponding to the target page data and the target page data.

In one embodiment, when the computer program is executed by the processor, screening the document to be queried according to the query information to obtain an initial document includes: segmenting the query information to obtain query keywords; performing standardization processing on the query keywords to obtain standardized query keywords; and acquiring a mapping relation associated with the document to be queried, and screening an initial document from the document to be queried according to the standardized query keywords and the mapping relation.

In one embodiment, when the computer program is executed by the processor, calculating a similarity index of the query information and the initial page data includes: extracting the index type corresponding to the similarity index, and acquiring the word segmentation logic associated with the initial document when the index type is the set similarity; segmenting the initial document according to the word segmentation logic to obtain a first keyword set, and combining the query keywords to obtain a second keyword set; and calculating the set similarity according to the first keyword set and the second keyword set.

In one embodiment, when the computer program is executed by the processor, calculating a similarity index of the query information and the initial page data includes: when the index type is a text matching index, acquiring a first word frequency of the query keyword contained in the query information, and acquiring a second word frequency of the query keyword contained in the initial page data; counting the number of the initial documents containing the query keywords, and calculating the evaluation weight according to the number of the documents and the query keywords; and acquiring the adjustment factors and the total number of the initial documents, and calculating the document matching index according to the adjustment factors, the total number, the first word frequency, the second word frequency and the evaluation weight.

In one embodiment, when the computer program is executed by the processor, calculating a similarity index of the query information and the initial page data includes: when the index type is an inclusion index, acquiring a word score corresponding to the query keyword; and when the query keyword is contained in the initial document, calculating the inclusion index according to the word score.

In one embodiment, when the computer program is executed by the processor, after judging whether the target similarity exceeds the threshold, the steps further include: when the target similarity does not exceed the threshold, judging whether the query information has replacement information; when the replacement information exists, calculating a replacement index corresponding to the initial page data according to the replacement information; obtaining a second index weight corresponding to the similarity index, and calculating the replacement similarity between the initial page data and the replacement information according to the second index weight and the replacement index; and checking whether the replacement similarity exceeds a threshold, and when it does, selecting the initial page data corresponding to the replacement similarity exceeding the threshold as associated page data, and outputting a second document identifier corresponding to the associated page data, the associated page data and the replacement information.

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.

The technical features of the above embodiments may be arbitrarily combined. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combined technical features, the combinations should be considered to be within the scope of this description.

The above examples represent only a few embodiments of the present application. They are described specifically and in detail, but are not thereby to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which would fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A document query method, the method comprising:

acquiring a document to be queried and query information, and segmenting the query information to obtain query keywords;

acquiring a mapping relation associated with the document to be queried, and screening an initial document from the document to be queried according to the standardized query keyword and the mapping relation;

extracting initial page data contained in the initial document, and calculating a similarity index of the query information and the initial page data; the similarity index comprises different indexes for evaluating the degree of similarity between the query information and the initial page data; the similarity index is obtained by using different similarity calculation methods;

acquiring a first index weight corresponding to the similarity index, and calculating a target similarity between the query information and the initial page data according to the first index weight and the similarity index; the first index weight is the proportion of each similarity index when the target similarity is calculated from the different similarity indexes; the target similarity is the final degree of similarity between the query information and the initial page data, calculated from the similarity index;

judging whether the target similarity exceeds a threshold value, and when the target similarity exceeds the threshold value, selecting the initial page data whose target similarity exceeds the threshold value as target page data, and outputting the target page data and a first document identifier corresponding to the target page data;

wherein the calculating a similarity index of the query information and the initial page data comprises:

when the index type is a text matching index, acquiring a first word frequency of the query keyword contained in the query information, and acquiring a second word frequency of the query keyword contained in the initial page data; the index type is a type of similarity;

counting the number of the initial documents containing the query keywords, and calculating an evaluation weight according to the number of the initial documents and the query keywords; the evaluation weight calculation formula is as follows:

wherein N represents the total amount of data of the initial page data, n(qi) represents the number of data of the initial page data containing the query keyword, and qi represents different query keywords;

acquiring an adjustment factor, acquiring the total number of the initial documents, and calculating a text matching index according to the adjustment factor, the total number, the first word frequency, the second word frequency and the evaluation weight;

the text matching index calculation formula is as follows:

wherein dl is the number of words contained in the initial page data, avgdl is the average number of words contained in the initial page data, k1, k2 and b represent adjustment factors, TF1 represents the first word frequency, TF2 represents the second word frequency, N represents the total amount of data of the initial page data, n(qi) represents the number of data of the initial page data containing the query keyword, and qi represents different query keywords.
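
The formula images of this claim are not reproduced in the text. The symbols that survive (k1, k2, b, dl, avgdl, TF1, TF2, N, n(qi)) match the widely used Okapi BM25 ranking function with a query-term-frequency factor, so the Python sketch below assumes that standard form. It is an illustration under that assumption, not the confirmed formula of the claim. The function names, the default values of k1, k2 and b, and the representation of each page as a list of words are hypothetical.

```python
import math
from collections import Counter


def evaluation_weight(total_pages, pages_with_term):
    """Assumed IDF-style evaluation weight: log((N - n(qi) + 0.5) / (n(qi) + 0.5))."""
    return math.log((total_pages - pages_with_term + 0.5) / (pages_with_term + 0.5))


def text_matching_index(query_keywords, page_words, all_pages,
                        k1=1.2, k2=100.0, b=0.75):
    """Score one page against the query keywords (assumed BM25 form).

    query_keywords: list of keywords segmented from the query information
    page_words:     list of words of the page being scored
    all_pages:      list of word lists, one per page of initial page data
                    (assumed non-empty, with at least one non-empty page)
    """
    total_pages = len(all_pages)                            # N
    avgdl = sum(len(p) for p in all_pages) / total_pages    # average page length
    dl = len(page_words)                                    # length of this page
    k = k1 * (1 - b + b * dl / avgdl)                       # length normalisation
    tf_page = Counter(page_words)
    score = 0.0
    for term, tf1 in Counter(query_keywords).items():       # tf1: frequency in query
        n_qi = sum(1 for p in all_pages if term in p)       # pages containing term
        tf2 = tf_page[term]                                 # tf2: frequency in page
        weight = evaluation_weight(total_pages, n_qi)
        score += weight * (tf2 * (k1 + 1) / (tf2 + k)) * (tf1 * (k2 + 1) / (tf1 + k2))
    return score
```

With this assumed weight, a keyword that occurs in more than half of the pages receives a negative weight; practical implementations often clamp or shift it, which the sketch leaves out.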

2. The method of claim 1, wherein the calculating a similarity index of the query information and the initial page data comprises:

extracting an index type corresponding to a similarity index, and acquiring word segmentation logic associated with the initial page data when the index type is set similarity;

3. The method of claim 1, wherein extracting and calculating a similarity index of the query information and the initial page data comprises:

4. A method according to any one of claims 1 to 3, wherein said querying whether said target similarity exceeds a threshold value comprises:

5. A document querying device, the device comprising:

the extraction module is used for extracting initial page data contained in the initial document and calculating a similarity index of the query information and the initial page data; the similarity index comprises different indexes for evaluating the degree of similarity between the query information and the initial page data; the similarity index is obtained by using different similarity calculation methods;

the calculation module is used for acquiring a first index weight corresponding to the similarity index and calculating the target similarity between the query information and the initial page data according to the first index weight and the similarity index; the first index weight is the proportion of each similarity index when the target similarity is calculated from the different similarity indexes; the target similarity is the final degree of similarity between the query information and the initial page data, calculated from the similarity index;

the judging module is used for judging whether the target similarity exceeds a threshold value, selecting initial page data corresponding to the target similarity exceeding the threshold value as target page data when the target similarity exceeds the threshold value, and outputting a first document identifier corresponding to the target page data and the target page data;

the acquisition module comprises:

the screening unit is used for acquiring a mapping relation associated with the document to be queried, and screening an initial document from the document to be queried according to the standardized query keyword and the mapping relation;

the extraction module comprises:

a word frequency obtaining unit, configured to, when the index type is a text matching index, obtain a first word frequency of the query keyword contained in the query information and obtain a second word frequency of the query keyword contained in the initial page data; the index type is a type of similarity;

a statistics unit, configured to count the number of the initial documents including the query keyword, and calculate an evaluation weight according to the number of the documents including the query keyword and the query keyword; the evaluation weight calculation formula is as follows:

wherein N represents the total amount of data of the initial page data, n(qi) represents the number of data of the initial page data containing the query keyword, and qi represents different query keywords;

the second calculation unit is used for acquiring the adjustment factors, acquiring the total number of the initial documents and calculating text matching indexes according to the adjustment factors, the total number, the first word frequency, the second word frequency and the evaluation weight; the text matching index calculation formula is as follows:

wherein,
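
The calculation module and the judging module of this claim combine several similarity indexes with index weights and then keep only the pages whose target similarity exceeds a threshold. A minimal Python sketch follows, assuming the target similarity is a plain weighted sum (the claim only states that each weight is the proportion of its index). The function names, index names, weights, threshold and sample data are hypothetical.

```python
def target_similarity(indexes, weights):
    """Weighted sum of similarity indexes; both arguments map index name to value."""
    return sum(weights[name] * value for name, value in indexes.items())


def select_target_pages(pages, weights, threshold):
    """Return (document_id, page_data) for pages whose target similarity exceeds threshold.

    pages: iterable of (document_id, page_data, indexes) tuples, where indexes
           maps each index name to the value already computed for that page.
    """
    selected = []
    for document_id, page_data, indexes in pages:
        if target_similarity(indexes, weights) > threshold:
            selected.append((document_id, page_data))
    return selected


# Hypothetical usage with three index types named in the claims.
weights = {"text_matching": 0.5, "set_similarity": 0.3, "inclusion": 0.2}
pages = [
    ("doc-1", "page A", {"text_matching": 0.8, "set_similarity": 0.5, "inclusion": 1.0}),
    ("doc-2", "page B", {"text_matching": 0.2, "set_similarity": 0.1, "inclusion": 0.0}),
]
hits = select_target_pages(pages, weights, threshold=0.6)
```

Keeping the weights in a mapping keyed by index name makes it straightforward to add or drop an index type without changing the combining code.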

6. The apparatus of claim 5, wherein the extraction module comprises:

the extraction unit is used for extracting an index type corresponding to the similarity index, and acquiring word segmentation logic associated with the initial document when the index type is the set similarity;

the combination unit is used for segmenting the initial document according to the segmentation logic to obtain a first keyword set, and combining the query keywords to obtain a second keyword set;
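
The claim text ends before stating how the first and second keyword sets are compared. One common set similarity is the Jaccard coefficient (size of the intersection divided by size of the union); the Python sketch below assumes that measure purely for illustration. The whitespace-based `segment` function stands in for the word segmentation logic and is hypothetical, as are the other names.

```python
def segment(text):
    """Hypothetical word segmentation logic: lowercase and split on whitespace."""
    return text.lower().split()


def set_similarity(document_text, query_keywords):
    """Jaccard coefficient between document keywords and query keywords (assumed measure)."""
    first_keyword_set = set(segment(document_text))
    second_keyword_set = {keyword.lower() for keyword in query_keywords}
    union = first_keyword_set | second_keyword_set
    if not union:
        return 0.0  # both sets empty: treat as no similarity
    return len(first_keyword_set & second_keyword_set) / len(union)
```

For text in a language without whitespace word boundaries, `segment` would need to be replaced by a proper tokenizer; the set comparison itself would stay the same.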

7. The apparatus of claim 5, wherein the extraction module comprises:

the score acquisition unit is used for acquiring the word score corresponding to the query keyword when the index type is the inclusion index;

8. The apparatus according to any one of claims 5-7, further comprising:

the replacement information acquisition module is used for judging whether the query information has replacement information or not when the target similarity does not exceed a threshold value;

the replacement similarity index calculation module is used for calculating a replacement index corresponding to the initial page data according to the replacement information when the replacement information exists;

the replacement similarity calculation module is used for acquiring second index weights corresponding to the replacement indexes and calculating the replacement similarity of the initial page data and the replacement information according to the second index weights and the replacement indexes;
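
The three modules above describe a fallback path: when the target similarity does not exceed the threshold, the device checks for replacement information, recomputes the indexes with it, and combines them using second index weights. The Python sketch below illustrates that flow under several assumptions: replacement information is looked up in a plain dictionary, both similarities are weighted sums, and the function simply returns the replacement similarity, since the claim text ends before stating what is done with it. All names and sample values are hypothetical.

```python
def weighted(indexes, weights):
    """Weighted sum of index values; both arguments map index name to value."""
    return sum(weights[name] * value for name, value in indexes.items())


def similarity_with_replacement(query, page, index_fn, first_weights,
                                second_weights, replacements, threshold):
    """Return (similarity, used_replacement) for one page."""
    target = weighted(index_fn(query, page), first_weights)
    if target > threshold:
        return target, False
    replacement = replacements.get(query)   # does replacement information exist?
    if replacement is None:
        return target, False
    replacement_indexes = index_fn(replacement, page)
    return weighted(replacement_indexes, second_weights), True


def toy_index(query, page):
    """Hypothetical single index: 1.0 if the query text occurs in the page."""
    return {"inclusion": 1.0 if query in page else 0.0}


result = similarity_with_replacement(
    "IOU", "signed promissory note", toy_index,
    {"inclusion": 1.0}, {"inclusion": 0.9},
    {"IOU": "promissory note"}, threshold=0.5,
)
```

Using a lower second index weight, as in the sample values, is one way to rank replacement matches below direct matches; the source does not say how the second index weights are chosen.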

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 4.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 4.