CN110377558A - Document searching method, device, computer equipment and storage medium

Info

Publication number: CN110377558A
Application number: CN201910514985.2A
Authority: CN
Inventors: 叶素兰; 窦文伟; 潘诗韵; 李弘�; 何麒; 徐国强
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-06-14
Filing date: 2019-06-14
Publication date: 2019-10-25
Anticipated expiration: 2039-06-14
Also published as: CN110377558B

Abstract

This application relates to the field of data processing, and in particular to data query, namely a document query method, apparatus, computer device and storage medium. The method includes: acquiring a document to be queried and query information, and screening the document to be queried according to the query information to obtain an initial document; extracting initial page data contained in the initial document, and calculating a similarity index between the query information and the initial page data; acquiring a first index weight corresponding to the similarity index, and calculating a target similarity between the query information and the initial page data according to the first index weight and the similarity index; and judging whether the target similarity exceeds a threshold, and when the target similarity exceeds the threshold, selecting the initial page data corresponding to the target similarity exceeding the threshold as target page data, and outputting a first document identifier corresponding to the target page data together with the target page data. The method can improve the efficiency of document queries.

Description

Document query method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a document query method, an apparatus, a computer device, and a storage medium.

Background

With the development of computer technology, more and more actions of users can be realized by computers, for example, legal staff can search evidence files stored on the computers.

Conventionally, the evidence file searching process usually involves manually browsing different evidence files one by one to search for a required evidence file, which results in low query efficiency.

Disclosure of Invention

In view of the above, it is necessary to provide a document query method, apparatus, computer device and storage medium capable of improving query efficiency.

A method of document querying, the method comprising:

acquiring a document to be queried and query information, and screening the document to be queried according to the query information to obtain an initial document;

extracting initial page data contained in the initial document, and calculating a similarity index between the query information and the initial page data;

acquiring a first index weight corresponding to the similarity index, and calculating the target similarity of the query information and the initial page data according to the first index weight and the similarity index;

and judging whether the target similarity exceeds a threshold value, selecting the initial page data corresponding to the target similarity exceeding the threshold value as target page data when the target similarity exceeds the threshold value, and outputting a first document identifier corresponding to the target page data and the target page data.

In one embodiment, the screening the documents to be queried according to the query information to obtain an initial document includes:

performing word segmentation on the query information to obtain query keywords;

standardizing the query keywords to obtain standardized query keywords;

and acquiring a mapping relation associated with the document to be queried, and screening the document to be queried to obtain an initial document according to the standardized query keyword and the mapping relation.

In one embodiment, the calculating the similarity index between the query information and the initial page data includes:

extracting an index type corresponding to the similarity index, and when the index type is set similarity, acquiring word segmentation logic associated with the initial page data;

performing word segmentation on the initial page data according to the word segmentation logic to obtain a first keyword set, and combining the query keywords to obtain a second keyword set;

and calculating the set similarity according to the first keyword set and the second keyword set.

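In one embodiment, the calculating the similarity index between the query information and the initial page data includes:

when the index type is a text matching index, acquiring a first word frequency of the query keyword contained in the query information, and acquiring a second word frequency of the query keyword contained in the initial page data;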

counting the containing quantity of the initial documents containing the query keywords, and calculating evaluation weight according to the containing quantity of the documents and the query keywords;

and acquiring an adjustment factor, acquiring the total amount of the initial documents, and calculating the text matching index according to the adjustment factor, the total amount, the first word frequency, the second word frequency and the evaluation weight.

In one embodiment, the calculating the similarity index between the query information and the initial page data includes:

when the index type is an inclusion degree index, acquiring a term score corresponding to the query keyword;

and when the query keyword is contained in the initial document, calculating the inclusion degree index according to the term score.

In one embodiment, the judging whether the target similarity exceeds a threshold comprises:

when the target similarity does not exceed a threshold value, judging whether the query information has replacement information or not;

when the replacement information exists, calculating a replacement index corresponding to the initial page data according to the replacement information;

acquiring a second index weight corresponding to the similarity index, and calculating the replacement similarity of the initial page data and the replacement information according to the second index weight and the replacement index;

and inquiring whether the replacement similarity exceeds a threshold value, if so, selecting initial page data corresponding to the replacement similarity exceeding the threshold value as associated page data, and outputting a second document identifier corresponding to the associated page data, the associated page data and the replacement information.

A document querying device, the device comprising:

the acquisition module is used for acquiring a document to be queried and query information, and screening the document to be queried according to the query information to obtain an initial document;

the extraction module is used for extracting initial page data contained in the initial document and calculating a similarity index between the query information and the initial page data;

the calculation module is used for acquiring a first index weight corresponding to the similarity index, and calculating the target similarity of the query information and the initial page data according to the first index weight and the similarity index;

and the judging module is used for judging whether the target similarity exceeds a threshold value, selecting the initial page data corresponding to the target similarity exceeding the threshold value as target page data when the target similarity exceeds the threshold value, and outputting the first document identifier corresponding to the target page data and the target page data.

In one embodiment, the acquisition module includes:

the word segmentation unit is used for segmenting the query information to obtain query keywords;

the processing unit is used for carrying out standardization processing on the query keywords to obtain standardized query keywords;

and the screening unit is used for acquiring a mapping relation associated with the document to be queried and screening the document to be queried to obtain an initial document according to the standardized query keyword and the mapping relation.

A computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.

According to the document query method, apparatus, computer device and storage medium described above, the required document no longer has to be found by manually browsing documents one by one. The document to be queried and the query information are acquired, and the document to be queried is screened according to the query information to obtain the initial document. The initial page data contained in the initial document is extracted, and the similarity index between the query information and the initial page data is calculated. The first index weight corresponding to the similarity index is then acquired, and the target similarity between the query information and the initial page data is calculated according to the first index weight and the similarity index. Whether the target similarity exceeds the threshold is judged, and when it does, the initial page data whose target similarity exceeds the threshold is output as target page data together with its first document identifier. The query efficiency of documents can thereby be improved.

Drawings

FIG. 1 is a diagram showing an application scenario of a document query method in one embodiment;

FIG. 2 is a flowchart illustrating a document query method according to an embodiment;

FIG. 3 is a schematic flow chart of the screening step in one embodiment;

FIG. 4 is a block diagram showing the structure of a document query apparatus in one embodiment;

FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The document query method provided by the present application can be applied to the application environment shown in FIG. 1, in which the terminal 102 and the server 104 communicate via a network. The server 104 obtains the document to be queried and the query information from the terminal 102, and screens the document to be queried according to the query information to obtain an initial document. The server 104 extracts the initial page data contained in the initial document and calculates a similarity index between the query information and the initial page data. The server 104 then obtains a first index weight corresponding to the similarity index and calculates a target similarity between the query information and the initial page data according to the first index weight and the similarity index. The server 104 judges whether the target similarity exceeds a threshold; when it does, the server 104 selects the initial page data corresponding to the target similarity exceeding the threshold as target page data, and outputs a first document identifier corresponding to the target page data together with the target page data. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.

In one embodiment, as shown in FIG. 2, a document query method is provided. The method is described by way of example as applied to the server in FIG. 1, and includes the following steps:

S202: Acquiring the document to be queried and the query information, and screening the document to be queried according to the query information to obtain an initial document.

Specifically, the document to be queried refers to a document stored in the server, for example a case-related document such as an evidence document or a case introduction document. The query information refers to key information about the relevant document being sought, and may be statement information, term information, and the like. The initial document refers to a relevant document selected from the documents to be queried according to the query information.

Specifically, the server scans an original document, converts the data content contained in the original document into a document picture, identifies the corresponding document data from the document picture, fills the document data into an electronic template to form a document to be queried, and stores the document to be queried. The server then acquires the pre-stored document to be queried and the query information, and screens the document to be queried according to the query information to obtain an initial document. The screening may consist of matching the query information against the document to be queried and selecting the successfully matched document as the initial document. Alternatively, the screening may consist of matching the query information against the index keywords in an established index relationship; when the matching succeeds, the document associated with the successfully matched index keyword is taken as the initial document. For example, the terminal provides an option box or an input box on its display interface, and the user either enters key information in the input box or selects key information in the option box. The key information serves as the query information. The server acquires the query information from the terminal, acquires the pre-stored document to be queried, and screens the document to be queried according to the query information to obtain an initial document. In doing so, the server may segment the document to be queried into words using word segmentation logic, match the query information against the segmented document to be queried, and, when the matching succeeds, take the successfully matched document to be queried as the initial document.

It should be noted that, when screening for the initial document, related documents may also be queried according to semantics. That is, after obtaining the query information, the server inputs the query information and the documents to be queried into a semantic recognition model, which calculates the matching degree between the query information and each document to be queried. The documents to be queried whose matching degree exceeds a matching degree threshold are selected as supplementary documents. The supplementary documents are used as the initial documents, or the supplementary documents and the documents to be queried that were selected according to the index relationship jointly form the initial documents, so that the initial documents obtained are more accurate.

S204: Extracting the initial page data contained in the initial document, and calculating a similarity index between the query information and the initial page data.

Specifically, the initial page data refers to the data contained in the initial document, and may be text data or the like. The similarity index refers to one of several indexes for evaluating the similarity between the query information and the initial page data, and different similarity indexes may be obtained with different similarity calculation methods. For example, the similarity index may be obtained by calculating similarity with a deep neural network model, by calculating from parameters such as word frequency, by calculating with the keywords in different initial documents as set elements, by calculating cosine similarity, or by checking whether the queried text contains the keywords. All of these similarity indexes may be calculated, or only one or several of them may be selected.

Specifically, the server extracts the initial page data from the initial document: it scans the pre-stored initial document line by line and extracts all detected content as the initial page data. The server then queries the index types of the similarity indexes it is required to compute and, according to those index types and the query information, computes the similarity indexes for the different initial page data one by one.

For example, suppose the server determines that the similarity indexes to be calculated are a cosine similarity index and a model similarity index, that is, a similarity index obtained by calculating cosine similarity and a similarity index obtained by calculating similarity with a deep neural network model. For the model similarity index, the server inputs the query information and the initial page data into a trained calculation model, and the similarity output by the model is used as the similarity index. For the cosine similarity index, the server segments the query information into words, converts the segmented query information into word vectors, and computes a first average vector from those vectors. The server likewise segments the initial page data into words, converts the segmented initial page data into word vectors, and computes a second average vector from those vectors. The cosine similarity is then calculated from the first average vector and the second average vector to obtain the similarity index.
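
The cosine similarity index described above can be illustrated with a minimal Python sketch. The word vectors in `EMBEDDINGS` are hypothetical toy values, and the function names are assumptions made for illustration; the application does not specify how words are converted into vectors.

```python
import math

# Hypothetical toy word vectors; a real system would load trained embeddings.
EMBEDDINGS = {
    "contract": [1.0, 0.0],
    "loan": [0.0, 1.0],
    "breach": [1.0, 1.0],
}

def average_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cosine_index(query_words, page_words, embeddings=EMBEDDINGS):
    """Cosine similarity between the first and second average vectors."""
    first_average = average_vector([embeddings[w] for w in query_words])
    second_average = average_vector([embeddings[w] for w in page_words])
    return cosine_similarity(first_average, second_average)
```

In this sketch both inputs are assumed to be already segmented into words, and every word is assumed to have an entry in the embedding table.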

It should be noted that different methods of calculating the similarity between the query information and the initial page data may be selected to obtain the similarity index. One or more of the similarities calculated by the methods listed above (deep neural network model, word frequency parameters, keyword sets, cosine similarity, and keyword inclusion) may be used as similarity indexes; when the similarities calculated by all five methods are selected, five similarity indexes are obtained.

S206: Acquiring a first index weight corresponding to the similarity index, and calculating the target similarity between the query information and the initial page data according to the first index weight and the similarity index.

Specifically, the first index weight is the proportion that each similarity index contributes when the target similarity is calculated from different similarity indexes. The target similarity refers to the final similarity between the query information and the initial page data, calculated from the similarity indexes. Specifically, after calculating the similarity index, the server queries a pre-stored first index weight and calculates the target similarity between the query information and the initial page data according to the first index weight and the similarity index. When the server has calculated a single similarity index, the first index weight is 1 and the target similarity is the product of the similarity index and the first index weight. When the server has calculated multiple similarity indexes, it obtains the first index weight corresponding to each similarity index, multiplies each similarity index by its corresponding first index weight, and adds all the products to obtain the target similarity.
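
As a minimal sketch of this weighted sum, assuming the similarity indexes and the first index weights are held in dictionaries keyed by an index name (the names and the dictionary layout are illustrative assumptions, not part of the application):

```python
def target_similarity(indexes, weights=None):
    """Weighted sum of similarity indexes.

    indexes: dict mapping an index name to its similarity index value.
    weights: dict mapping the same names to first index weights.
             May be omitted when there is a single index (weight 1).
    """
    if weights is None:
        if len(indexes) != 1:
            raise ValueError("weights are required when several indexes are used")
        weights = {name: 1.0 for name in indexes}
    return sum(value * weights[name] for name, value in indexes.items())

# Single index: the target similarity equals the index itself.
single = target_similarity({"cosine": 0.8})

# Two indexes with hypothetical weights 0.5 and 0.5.
combined = target_similarity(
    {"cosine": 0.8, "set": 0.5},
    {"cosine": 0.5, "set": 0.5},
)
```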

S208: Judging whether the target similarity exceeds a threshold, selecting the initial page data corresponding to the target similarity exceeding the threshold as target page data when the target similarity exceeds the threshold, and outputting a first document identifier corresponding to the target page data together with the target page data.

Specifically, the target page data refers to the data of a page selected from the initial page data. The first document identifier is an identifier of the document corresponding to the target page data, and may be the name of the document associated with the target page data, a page number of the document, or the like. Specifically, after the server calculates the target similarity, it obtains the threshold and compares the target similarity with the threshold. When the target similarity is higher than the threshold, the server queries the initial page data corresponding to that target similarity, uses it as the target page data, queries the first document identifier corresponding to the target page data, and outputs the first document identifier together with the target page data.
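
The threshold comparison can be sketched as follows. The `Page` record, the sample scores and the threshold value of 0.6 are hypothetical; the sketch assumes the first document identifier is a pair of document name and page number, which is one of the forms the description allows.

```python
from typing import NamedTuple

class Page(NamedTuple):
    document_name: str  # part of the first document identifier
    page_number: int    # part of the first document identifier
    text: str           # the initial page data

def select_target_pages(scored_pages, threshold):
    """Return (identifier, page data) for pages whose score exceeds threshold.

    scored_pages: iterable of (Page, target_similarity) pairs.
    """
    selected = []
    for page, similarity in scored_pages:
        if similarity > threshold:
            identifier = (page.document_name, page.page_number)
            selected.append((identifier, page.text))
    return selected

# Hypothetical sample data.
scored = [
    (Page("evidence_a", 3, "loan contract signed"), 0.82),
    (Page("evidence_b", 1, "unrelated notice"), 0.31),
]
results = select_target_pages(scored, threshold=0.6)
```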

In this embodiment, the server obtains the document to be queried and the query information, screens the document to be queried according to the query information to obtain an initial document, extracts the initial page data contained in the initial document, and calculates a similarity index between the query information and the initial page data. It then obtains a first index weight corresponding to the similarity index, calculates the target similarity between the query information and the initial page data according to the first index weight and the similarity index, and judges whether the target similarity exceeds a threshold. When the target similarity exceeds the threshold, the server selects the corresponding initial page data as the target page data and outputs the first document identifier corresponding to the target page data together with the target page data, which improves document query efficiency. Moreover, because the target similarity is calculated from different similarity indexes, the inaccuracy of a target similarity obtained from a single similarity index can be avoided, and the accuracy of the retrieved documents is improved.

In one embodiment, referring to FIG. 3, a flowchart of a screening step is provided. The screening step, namely screening the document to be queried according to the query information to obtain an initial document, includes: segmenting the query information to obtain query keywords; standardizing the query keywords to obtain standardized query keywords; and acquiring a mapping relationship associated with the document to be queried, and screening the document to be queried according to the standardized query keywords and the mapping relationship to obtain an initial document.

Specifically, the mapping relationship indicates an association between a query keyword and the corresponding document. A related document can be found from a query keyword and the mapping relationship, or the query keywords contained in an initial document can be found from the initial document and the mapping relationship. For example, the mapping relationship indicates an association between a keyword to be matched and the related first document identifier, so that the document name and document page number containing the keyword to be matched can be queried according to that keyword. Specifically, the server segments the obtained query information with word segmentation logic to obtain query keywords, obtains standardization logic, and processes the query keywords according to the standardization logic to obtain standardized query keywords. The server then obtains the pre-stored mapping relationship and matches the standardized query keywords against the keywords to be matched contained in the mapping relationship; when the matching succeeds, it selects the document associated with the successfully matched keyword as the initial document.
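
One way to picture the mapping relationship is as a dictionary from each keyword to be matched to the document identifiers that contain it. The sketch below is illustrative only: the `MAPPING` contents are hypothetical, and taking the union of the documents matched by any keyword is an assumption, since the description does not say how several keywords are combined.

```python
# Hypothetical mapping relationship: keyword to be matched ->
# set of (document name, page number) identifiers containing it.
MAPPING = {
    "loan": {("evidence_a", 3), ("evidence_c", 7)},
    "contract": {("evidence_a", 3)},
}

def screen_initial_documents(standardized_keywords, mapping=MAPPING):
    """Collect every document associated with a successfully matched keyword."""
    initial_documents = set()
    for keyword in standardized_keywords:
        # A keyword absent from the mapping simply fails to match.
        initial_documents |= mapping.get(keyword, set())
    return initial_documents
```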

The server may obtain the query information and the word segmentation logic corresponding to it, divide the query information into different segmentation sequences according to the word segmentation logic, and calculate the splitting accuracy of each segmentation sequence. To calculate the splitting accuracy, the server queries the term probabilities of the segmented terms contained in each segmentation sequence and takes the product of all those term probabilities as the splitting accuracy. The terms contained in the segmentation sequence with the highest splitting accuracy are used as the query keywords. The server then obtains the corresponding standardization logic and standardizes the query keywords to obtain standardized query keywords. The standardization may consist of formatting the query keywords according to preset standardization logic, for example deleting special characters in the query keywords, converting traditional characters into simplified characters, or replacing certain characters with standard characters. Finally, the server acquires the mapping relationship and matches the standardized query keywords against the keywords to be matched in the mapping relationship; when the matching succeeds, the document associated with the matched keyword in the mapping relationship is selected as the initial document.
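
The splitting accuracy and the standardization step can be sketched as below. The term probabilities, the fallback probability for unknown terms, and the two-character traditional-to-simplified table are hypothetical values chosen for illustration; how candidate segmentation sequences are generated is outside the scope of this sketch.

```python
import math
import re

# Hypothetical term probabilities, e.g. estimated from a corpus.
TERM_PROBABILITY = {"loan": 0.02, "contract": 0.03, "loan contract": 0.0001}

# Hypothetical, deliberately tiny traditional-to-simplified character table.
TRADITIONAL_TO_SIMPLIFIED = str.maketrans({"證": "证", "據": "据"})

def splitting_accuracy(sequence, term_probability, unknown=1e-8):
    """Product of the term probabilities of all terms in one sequence."""
    return math.prod(term_probability.get(term, unknown) for term in sequence)

def best_segmentation(candidates, term_probability=TERM_PROBABILITY):
    """Pick the candidate segmentation sequence with the highest accuracy."""
    return max(candidates, key=lambda seq: splitting_accuracy(seq, term_probability))

def standardize(keyword):
    """Delete special characters, then map traditional to simplified characters."""
    cleaned = re.sub(r"[^\w]", "", keyword)
    return cleaned.translate(TRADITIONAL_TO_SIMPLIFIED)

candidates = [["loan", "contract"], ["loan contract"]]
keywords = best_segmentation(candidates)
```

With these toy probabilities the two-term sequence scores 0.02 × 0.03 = 0.0006, which is higher than 0.0001 for the single-term sequence, so its terms become the query keywords.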

In the embodiment, the server divides the query information into words to obtain the query keywords, standardizes the query keywords to obtain the standardized query keywords, obtains the mapping relation associated with the document to be queried, and screens the document to be queried according to the standardized query keywords and the mapping relation to obtain the initial document, so that the step of screening the initial document is simple and easy, and the efficiency of querying the corresponding target document is improved.

In one embodiment, calculating a similarity index of the query information and the initial page data includes: extracting an index type corresponding to the similarity index, and acquiring word segmentation logic associated with the initial page data when the index type is set similarity; performing word segmentation on the initial page data according to word segmentation logic to obtain a first keyword set, and combining the query keywords to obtain a second keyword set; and calculating set similarity according to the first keyword set and the second keyword set.

Specifically, the index type refers to the type of similarity, which may correspond to the similarity calculation method used. The set similarity refers to a similarity obtained by applying calculation logic to sets formed from the corresponding keywords. Specifically, when the server calculates the similarity index between the query information and the initial page data, it extracts the index type corresponding to the similarity index. When the extracted index type is set similarity, the server builds a set of query keywords and a set of keywords corresponding to the initial page data. That is, the server obtains the corresponding word segmentation logic, segments the initial page data with it, and forms a first keyword set from the segmented initial page data; it takes the query keywords obtained by segmenting the query information as a second keyword set. The server then computes the intersection and the union of the first keyword set and the second keyword set, and calculates the set similarity from the intersection and the union, so that the similarity index is the set similarity index.

The server may extract the index type corresponding to the similarity index and, when the index type is set similarity, obtain the word segmentation logic associated with the initial page data. It splits the initial page data with the word segmentation logic to obtain different page segmentation sequences, calculates the term probabilities of the segmented terms contained in each page segmentation sequence, and calculates the splitting accuracy of each page segmentation sequence from those term probabilities. The page segmentation sequence with the highest splitting accuracy is used as the segmented initial page data, whose terms are combined into a first keyword set. The server uses the query keywords as a second keyword set. It then computes the intersection of the first keyword set and the second keyword set and counts the first number of distinct elements in the intersection, computes the union of the two sets and counts the second number of distinct elements in the union, and calculates the set similarity as the ratio of the first number to the second number, which serves as the set similarity index.
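
The ratio of the intersection size to the union size described here is commonly known as the Jaccard coefficient. A minimal sketch, assuming both inputs are already segmented into keywords (the function name and the handling of two empty sets are assumptions):

```python
def set_similarity(page_keywords, query_keywords):
    """Ratio of distinct shared keywords to distinct keywords overall."""
    first_set = set(page_keywords)    # first keyword set (initial page data)
    second_set = set(query_keywords)  # second keyword set (query keywords)
    union = first_set | second_set
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    first_number = len(first_set & second_set)
    second_number = len(union)
    return first_number / second_number

similarity = set_similarity(["loan", "contract", "breach"], ["loan", "contract"])
```

Here the intersection holds two keywords and the union holds three, so the set similarity index is 2/3.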

In this embodiment, the server may calculate a corresponding similarity index, wherein an index type corresponding to the similarity index may be extracted, when the index type is set similarity, a segmentation logic associated with the initial page data is obtained, the initial page data is segmented according to the segmentation logic to obtain a first keyword set, the query keywords are combined to obtain a second keyword set, and set similarity is calculated according to the first keyword set and the second keyword set.

In one embodiment, calculating a similarity index of the query information and the initial page data includes: when the index type is a text matching index, acquiring a first word frequency of the query keyword contained in the query information, and acquiring a second word frequency of the query keyword contained in the initial page data; counting the number of initial documents containing the query keyword, and calculating an evaluation weight according to that number and the query keyword; and acquiring an adjustment factor, acquiring the total amount of the initial documents, and calculating the text matching index according to the adjustment factor, the total amount, the first word frequency, the second word frequency and the evaluation weight.

Specifically, the text matching index is a similarity index that measures relevance according to the word frequency of the query keyword in the initial page data and the importance of the query keyword in the initial page data. The first word frequency is the number of times the query keyword appears in the query information. The second word frequency is the number of times the query keyword appears in the initial page data. The evaluation weight is an index for evaluating the importance of the query keyword across different initial page data. The adjustment factor is a preset adjustment parameter, which can be obtained by training a preset parameter training model; during training, the sample sequencing index and the sample correlation degree can be input into the parameter training model to obtain the adjustment factor.

Specifically, when the index type extracted by the server is a text matching index, the server matches the query keyword against the terms contained in the query information one by one and counts the number of successful matches as the first word frequency, that is, the first word frequency of the query keyword in the query information. The server then matches the query keyword against the terms contained in the initial page data one by one and counts the number of successful matches as the second word frequency. Next, the server matches the query keyword against each piece of initial page data to count the number of pieces of initial page data containing the query keyword, obtains the total data amount of the initial page data, and calculates the term evaluation weight according to these two values. The server obtains the adjustment factor by training, and calculates the similarity between the query information and the initial page data using relevance calculation logic according to the adjustment factor, the total data amount of the initial page data, the first word frequency, the second word frequency and the term evaluation weight; this similarity serves as the text matching index. It should be noted that, if the server obtains several different query keywords, the sub-similarities between the different query keywords and the initial page data may be calculated separately and summed to obtain the text matching index. The evaluation weight can be calculated by the following formula:

where N represents the total data amount of the initial page data, N(q_i) represents the amount of initial page data containing the query keyword, and q_i represents the different query keywords. The text matching index can be calculated by the following formula:

where dl is the length of the initial page data, that is, the number of words contained in the initial page data; avgdl is the average length of all the initial page data, that is, the average number of words contained in the initial page data; k1, k2 and b denote scale factors; TF1 denotes the first word frequency; and TF2 denotes the second word frequency. N denotes the total data amount of the initial page data, N(q_i) denotes the amount of initial page data containing the query keyword, and q_i represents the different query keywords.
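The two formulas are not reproduced in this text. The symbols listed (N, N(q_i), dl, avgdl, k1, k2, b, TF1, TF2) match the widely used Okapi BM25 ranking function, so the sketch below uses the classic BM25 form as an assumption; it is not confirmed to be the exact formula of this application. The function names and default parameter values are hypothetical.

```python
import math
from collections import Counter


def evaluation_weight(total_pages, pages_with_term):
    """Classic BM25 inverse document frequency (assumed form).

    It can be negative when a term occurs in more than half of the pages.
    """
    return math.log((total_pages - pages_with_term + 0.5) / (pages_with_term + 0.5))


def text_matching_index(query_terms, page_terms, all_pages, k1=1.2, k2=100.0, b=0.75):
    """Sum the per-keyword sub-similarities between a query and one page.

    all_pages is assumed to be a non-empty list of non-empty term lists.
    """
    tf_query = Counter(query_terms)   # TF1: keyword frequency in the query
    tf_page = Counter(page_terms)     # TF2: keyword frequency in the page
    total = len(all_pages)            # N
    avgdl = sum(len(p) for p in all_pages) / total
    dl = len(page_terms)
    length_norm = k1 * (1 - b + b * dl / avgdl)
    score = 0.0
    for term, tf1 in tf_query.items():
        tf2 = tf_page[term]
        if tf2 == 0:
            continue  # keyword absent from this page: no contribution
        containing = sum(1 for p in all_pages if term in p)  # N(q_i)
        weight = evaluation_weight(total, containing)
        score += (weight
                  * (tf2 * (k1 + 1)) / (tf2 + length_norm)
                  * (tf1 * (k2 + 1)) / (tf1 + k2))
    return score


pages = [
    ["loan", "contract", "signed", "loan"],
    ["witness", "statement", "court"],
    ["payment", "receipt", "bank"],
    ["court", "hearing", "notice"],
]
scores = [text_matching_index(["loan", "contract"], p, pages) for p in pages]
print(scores)
```

Only the first page contains the query keywords, so it is the only page with a non-zero score in this toy corpus.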

In this embodiment, when the server determines that the index type is a text matching index, it obtains a first word frequency of the query keyword in the query information and a second word frequency of the query keyword in the initial page data, counts the number of initial documents containing the query keyword, and calculates an evaluation weight according to that number and the query keyword. It then obtains an adjustment factor and the total number of the initial documents, and calculates the document matching index according to the adjustment factor, the total number, the first word frequency, the second word frequency and the evaluation weight. The document matching index is therefore simple to calculate, and because different similarity indexes are available and are all considered when the target similarity is subsequently calculated, the accuracy of the calculated target similarity is improved.

In one embodiment, calculating a similarity index of the query information and the initial page data includes: when the index type is the inclusion degree index, acquiring a term score corresponding to the query keyword; and when the query keyword is contained in the initial document, calculating the inclusion degree index according to the term score.

Specifically, the inclusion degree index is an index for measuring similarity between the initial page data and the query keyword by using whether the initial page data contains the query keyword. The term score refers to the scores of different preset query keywords, so that when the query keywords are contained in the initial page data, the corresponding inclusion degree index can be calculated according to the term scores.

Specifically, when the index type extracted by the server is the inclusion index, the server extracts the word scores of preset different query keywords, matches the query keywords with the initial page data respectively, and when the matching is successful, adds the word scores corresponding to the successfully matched query keywords, so that the sum obtained by the addition is used as the inclusion index. It should be noted that, the server may also obtain weights corresponding to different query keywords, and when the server matches the query keywords contained in the initial page data, the product of the term score and the weight of the query keywords contained in the initial page data is calculated, and the different products are added to obtain the inclusion degree index.

In this embodiment, when calculating the inclusion degree index, the server uses the term score corresponding to the query keyword and whether the query keyword is contained in the initial document. The inclusion degree index is simple to calculate, and because different similarity indexes are available and are considered when the target similarity is subsequently calculated, the accuracy of the calculated target similarity is improved.
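The inclusion degree index described above reduces to a sum of preset term scores, optionally multiplied by per-keyword weights. The following is a minimal sketch; the function name, the score table and the default weight of 1.0 are assumptions made for illustration.

```python
def inclusion_index(query_keywords, page_terms, term_scores, weights=None):
    """Add up the term scores of the query keywords found in the page.

    When weights are given, each term score is multiplied by the keyword's
    weight before summing; keywords without a weight default to 1.0.
    """
    present = set(page_terms)
    weights = weights or {}
    return sum(
        term_scores.get(keyword, 0.0) * weights.get(keyword, 1.0)
        for keyword in set(query_keywords)
        if keyword in present
    )


term_scores = {"loan": 2.0, "contract": 3.0, "witness": 1.0}
page = ["loan", "contract", "signed"]
query = ["loan", "contract", "witness"]

plain = inclusion_index(query, page, term_scores)
weighted = inclusion_index(query, page, term_scores, weights={"loan": 0.5})
print(plain, weighted)
```

The keyword "witness" is not on the page, so its term score does not contribute in either call.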

In one embodiment, after determining whether the target similarity exceeds a threshold, the method includes: when the target similarity does not exceed the threshold, determining whether replacement information exists for the query information; when the replacement information exists, calculating a replacement index corresponding to the initial page data and the replacement information according to the replacement information; acquiring a second index weight corresponding to the similarity index, and calculating the replacement similarity of the initial page data and the replacement information according to the second index weight and the replacement index; and determining whether the replacement similarity exceeds the threshold, and when the replacement similarity exceeds the threshold, selecting the initial page data corresponding to the replacement similarity exceeding the threshold as associated page data, and outputting a second document identifier corresponding to the associated page data, the associated page data and the replacement information.

Specifically, the replacement information refers to information associated with the query information; for example, it may be obtained by replacing characters or terms in the query information with synonyms or homophones. The replacement index refers to the indexes used to evaluate the similarity between the replacement information and the initial page data, and may be obtained by different similarity calculation methods. For example, the replacement index may be obtained by calculating similarity with a deep neural network model, by calculating from parameters such as word frequency, by calculating with the keywords in different initial documents as set elements, by calculating cosine similarity, or by determining whether the queried text contains the keywords. The replacement index may be calculated using all of these indexes, or using one or several of them. The replacement similarity refers to the final degree of similarity between the replacement information and the initial page data, calculated from the replacement index. The associated page data refers to page data selected from the initial page data according to the replacement information. The second document identifier refers to an identifier of the document corresponding to the associated page data, and may be the name of that document, a page of that document, or the like.

Specifically, if the current query information is wrong, the server may check whether a replacement keyword exists and use it to calculate the related replacement index and the corresponding replacement similarity. That is, after calculating the target similarity, the server obtains the threshold and compares the target similarity with it. When the target similarity does not exceed the threshold, the server judges whether the query keyword has a replacement keyword; when a replacement keyword exists, it replaces the query keyword to obtain the replacement information. The server then queries the index type of the replacement index to be calculated and, according to the index type, calculates the replacement index between the replacement information and each piece of initial page data one by one. After obtaining the replacement indexes, the server obtains the second index weights corresponding to the different replacement indexes and calculates the replacement similarity between the initial page data and the replacement information according to the replacement indexes and the second index weights. The replacement similarity is then compared with the threshold; when it exceeds the threshold, the corresponding initial page data are selected as associated page data, and the associated page data and the corresponding second document identifier are output, optionally together with the replacement information.

It should be noted that the specific steps of calculating the replacement index may refer to the steps of calculating the similarity index, and the steps of calculating the replacement similarity may refer to the steps of calculating the target similarity. It should also be noted that, when the server finds a corresponding replacement keyword, it may first output the replacement keyword and send it to the terminal for display. The user then selects an option according to the displayed replacement keyword, for example an option agreeing to the replacement; the terminal generates a confirmation instruction according to the selection and sends it to the server; and upon receiving the confirmation instruction, the server performs the step of calculating the replacement index corresponding to the initial page data and the replacement information according to the replacement information.

In this embodiment, when the server determines that the target similarity does not exceed the threshold, it judges whether replacement information exists for the query information. When the replacement information exists, the server calculates a replacement index corresponding to the initial page data according to the replacement information, obtains a second index weight corresponding to the similarity index, and calculates the replacement similarity between the initial page data and the replacement information according to the second index weight and the replacement index. The server then determines whether the replacement similarity exceeds the threshold; if so, it selects the initial page data corresponding to the replacement similarity exceeding the threshold as associated page data, and outputs the second document identifier corresponding to the associated page data, the associated page data and the replacement information. This prevents irrelevant documents from being returned, enhances applicability, and improves screening efficiency.
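The replacement flow can be sketched as a two-pass query. Several points below are assumptions not fixed by the description: the similarity is taken as a weighted sum of the indexes, the fallback is triggered only when no page exceeds the threshold, the replacement table is a plain dictionary, and the user confirmation step is omitted. All names are hypothetical.

```python
def select_pages(keywords, pages, compute_indexes, weights, threshold):
    """Return (page_id, similarity) pairs whose weighted similarity exceeds the threshold."""
    selected = []
    for page_id, terms in pages.items():
        indexes = compute_indexes(keywords, terms)
        similarity = sum(weights[name] * value for name, value in indexes.items())
        if similarity > threshold:
            selected.append((page_id, similarity))
    return selected


def query_with_replacement(keywords, pages, compute_indexes,
                           first_weights, second_weights, replacements, threshold):
    """Query with the original keywords, then retry with replacement keywords."""
    targets = select_pages(keywords, pages, compute_indexes, first_weights, threshold)
    if targets:
        return {"pages": targets, "replacement": None}
    replaced = [replacements.get(k, k) for k in keywords]
    if replaced == keywords:
        return {"pages": [], "replacement": None}  # no replacement information exists
    associated = select_pages(replaced, pages, compute_indexes, second_weights, threshold)
    return {"pages": associated, "replacement": replaced}


def set_similarity_only(keywords, terms):
    """Toy index calculator that returns a single set similarity index."""
    a, b = set(keywords), set(terms)
    return {"set_similarity": len(a & b) / len(a | b) if a | b else 0.0}


# Page identifiers are (document name, page number) pairs in this sketch.
pages = {
    ("case_17.pdf", 3): ["loan", "contract", "signed"],
    ("case_17.pdf", 4): ["court", "notice"],
}
result = query_with_replacement(
    ["lone", "contract"], pages, set_similarity_only,
    first_weights={"set_similarity": 1.0},
    second_weights={"set_similarity": 1.0},
    replacements={"lone": "loan"},
    threshold=0.5,
)
print(result)
```

The misspelled keyword "lone" keeps every page below the threshold on the first pass; after it is replaced by "loan", page 3 exceeds the threshold and is returned together with the replacement keywords.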

It should be understood that, although the steps in the flowcharts of FIGS. 2-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 4, there is provided a document querying device 400, comprising: an obtaining module 410, an extracting module 420, a calculating module 430 and a judging module 440, wherein:

The obtaining module 410 is configured to obtain a document to be queried and query information, and to screen an initial document from the document to be queried according to the query information.

The extracting module 420 is configured to extract initial page data included in the initial document, and calculate a similarity index between the query information and the initial page data.

The calculating module 430 is configured to obtain a first index weight corresponding to the similarity index, and calculate a target similarity between the query information and the initial page data according to the first index weight and the similarity index.

The judging module 440 is configured to judge whether the target similarity exceeds a threshold, and when the target similarity exceeds the threshold, select the initial page data corresponding to the target similarity exceeding the threshold as target page data, and output the target page data and a first document identifier corresponding to the target page data.
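As a worked illustration of the calculating and judging steps, assume that the target similarity is a weighted sum of the similarity indexes. The weighted-sum form, the index values (taken to be normalized to the range 0 to 1), the weights and the threshold below are all hypothetical and are not stated in the module description above.

```latex
\[
S_{\text{target}} = \sum_{j} w_j I_j
  = 0.3 \times 0.50 + 0.5 \times 0.80 + 0.2 \times 0.60
  = 0.15 + 0.40 + 0.12 = 0.67
\]
```

Here $I_j$ stands for the set similarity, text matching and inclusion degree indexes of one page, and $w_j$ for the corresponding first index weights. With a threshold of 0.6, this page would be selected as target page data and output together with its first document identifier; with a threshold of 0.7, it would not.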

In one embodiment, the obtaining module 410 includes:

A word segmentation unit, configured to perform word segmentation on the query information to obtain query keywords.

A processing unit, configured to standardize the query keywords to obtain standardized query keywords.

A screening unit, configured to acquire the mapping relation associated with the document to be queried, and to screen the document to be queried according to the standardized query keywords and the mapping relation to obtain an initial document.
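The three units of the obtaining module can be sketched as three small functions. The sketch assumes that the mapping relation is an inverted index from keyword to document names, that standardization means lower-casing, alias replacement and de-duplication, and that a document is kept if it matches any keyword; none of these choices is fixed by the description. The regular-expression tokenizer is only a stand-in for a real word segmenter.

```python
import re


def segment(query):
    """Split the query information into raw keywords (stand-in tokenizer)."""
    return re.findall(r"\w+", query)


def standardize(keywords, aliases):
    """Lower-case, map aliases to a canonical term, and drop duplicates in order."""
    result = []
    for keyword in keywords:
        canonical = aliases.get(keyword.lower(), keyword.lower())
        if canonical not in result:
            result.append(canonical)
    return result


def screen(keywords, mapping):
    """Collect every document that the mapping relation links to any keyword."""
    documents = set()
    for keyword in keywords:
        documents |= mapping.get(keyword, set())
    return documents


mapping = {
    "loan": {"a.pdf"},
    "contract": {"a.pdf", "b.pdf"},
    "court": {"c.pdf"},
}
keywords = standardize(segment("Loan agreement, LOAN dispute"), {"agreement": "contract"})
initial_documents = screen(keywords, mapping)
print(keywords, sorted(initial_documents))
```

Screening against the mapping first keeps the later page-level similarity calculation limited to documents that mention at least one standardized keyword.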

In one embodiment, the extraction module 420 includes:

An extracting unit, configured to extract the index type corresponding to the similarity index, and to acquire the word segmentation logic associated with the initial document when the index type is set similarity.

A combination unit, configured to perform word segmentation on the initial document according to the word segmentation logic to obtain a first keyword set, and to combine the query keywords to obtain a second keyword set.

A first calculating unit, configured to calculate the set similarity according to the first keyword set and the second keyword set.

In one embodiment, the extracting module 420 may further include:

A word frequency acquiring unit, configured to acquire, when the index type is a text matching index, a first word frequency of the query keyword contained in the query information and a second word frequency of the query keyword contained in the initial page data.

A statistic unit, configured to count the number of initial documents containing the query keyword, and to calculate the evaluation weight according to that number and the query keyword.

A second calculating unit, configured to acquire the adjustment factor and the total number of the initial documents, and to calculate the document matching index according to the adjustment factor, the total number, the first word frequency, the second word frequency and the evaluation weight.

In one embodiment, the extracting module 420 may further include:

A score acquisition unit, configured to acquire the term score corresponding to the query keyword when the index type is the inclusion degree index.

A third calculating unit, configured to calculate the inclusion degree index according to the term score when the query keyword is contained in the initial document.

In one embodiment, the document querying device 400 may further include:

A replacement information acquisition module, configured to judge whether replacement information exists for the query information when the target similarity does not exceed the threshold.

A replacement index calculation module, configured to calculate, when the replacement information exists, a replacement index corresponding to the initial page data according to the replacement information.

A replacement similarity calculation module, configured to acquire a second index weight corresponding to the replacement index, and to calculate the replacement similarity of the initial page data and the replacement information according to the second index weight and the replacement index.

An output module, configured to determine whether the replacement similarity exceeds a threshold, to select, when the replacement similarity exceeds the threshold, the initial page data corresponding to the replacement similarity exceeding the threshold as associated page data, and to output the second document identifier corresponding to the associated page data, the associated page data and the replacement information.

For the specific definition of the document querying device, reference may be made to the above definition of the document querying method, which is not repeated here. The modules in the document querying device may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing document query data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a document query method.

Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, there is provided a computer device comprising a memory and a processor, the memory storing a computer program, and the processor implementing the following steps when executing the computer program: obtaining a document to be queried and query information, and screening the document to be queried according to the query information to obtain an initial document; extracting initial page data contained in the initial document, and calculating a similarity index of the query information and the initial page data; acquiring a first index weight corresponding to the similarity index, and calculating a target similarity of the query information and the initial page data according to the first index weight and the similarity index; and judging whether the target similarity exceeds a threshold, selecting, when the target similarity exceeds the threshold, the initial page data corresponding to the target similarity exceeding the threshold as target page data, and outputting the target page data and a first document identifier corresponding to the target page data.

In one embodiment, when the processor executes the computer program, screening an initial document from the document to be queried according to the query information includes: performing word segmentation on the query information to obtain query keywords; standardizing the query keywords to obtain standardized query keywords; and acquiring a mapping relation associated with the document to be queried, and screening the document to be queried according to the standardized query keywords and the mapping relation to obtain the initial document.

In one embodiment, when the processor executes the computer program, calculating a similarity index of the query information and the initial page data includes: extracting an index type corresponding to the similarity index, and acquiring word segmentation logic associated with the initial document when the index type is set similarity; performing word segmentation on the initial document according to the word segmentation logic to obtain a first keyword set, and combining the query keywords to obtain a second keyword set; and calculating the set similarity according to the first keyword set and the second keyword set.

In one embodiment, when the processor executes the computer program, calculating a similarity index of the query information and the initial page data includes: when the index type is a text matching index, acquiring a first word frequency of the query keyword contained in the query information, and acquiring a second word frequency of the query keyword contained in the initial page data; counting the number of initial documents containing the query keyword, and calculating an evaluation weight according to that number and the query keyword; and acquiring an adjustment factor, acquiring the total number of the initial documents, and calculating a document matching index according to the adjustment factor, the total number, the first word frequency, the second word frequency and the evaluation weight.

In one embodiment, when the processor executes the computer program, calculating a similarity index of the query information and the initial page data includes: when the index type is the inclusion degree index, acquiring the term score corresponding to the query keyword; and when the query keyword is contained in the initial document, calculating the inclusion degree index according to the term score.

In one embodiment, when executing the computer program, the processor further implements the following steps after judging whether the target similarity exceeds the threshold: when the target similarity does not exceed the threshold, judging whether replacement information exists for the query information; when the replacement information exists, calculating a replacement index corresponding to the initial page data and the replacement information according to the replacement information; acquiring a second index weight corresponding to the similarity index, and calculating the replacement similarity of the initial page data and the replacement information according to the second index weight and the replacement index; and determining whether the replacement similarity exceeds the threshold, selecting, when the replacement similarity exceeds the threshold, the initial page data corresponding to the replacement similarity exceeding the threshold as associated page data, and outputting a second document identifier corresponding to the associated page data, the associated page data and the replacement information.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, implements the following steps: obtaining a document to be queried and query information, and screening the document to be queried according to the query information to obtain an initial document; extracting initial page data contained in the initial document, and calculating a similarity index of the query information and the initial page data; acquiring a first index weight corresponding to the similarity index, and calculating a target similarity of the query information and the initial page data according to the first index weight and the similarity index; and judging whether the target similarity exceeds a threshold, selecting, when the target similarity exceeds the threshold, the initial page data corresponding to the target similarity exceeding the threshold as target page data, and outputting the target page data and a first document identifier corresponding to the target page data.

In one embodiment, when the computer program is executed by the processor, screening an initial document from the document to be queried according to the query information includes: performing word segmentation on the query information to obtain query keywords; standardizing the query keywords to obtain standardized query keywords; and acquiring a mapping relation associated with the document to be queried, and screening the document to be queried according to the standardized query keywords and the mapping relation to obtain the initial document.

In one embodiment, when the computer program is executed by the processor, calculating a similarity index of the query information and the initial page data includes: extracting an index type corresponding to the similarity index, and acquiring word segmentation logic associated with the initial document when the index type is set similarity; performing word segmentation on the initial document according to the word segmentation logic to obtain a first keyword set, and combining the query keywords to obtain a second keyword set; and calculating the set similarity according to the first keyword set and the second keyword set.

In one embodiment, when the computer program is executed by the processor, calculating a similarity index of the query information and the initial page data includes: when the index type is a text matching index, acquiring a first word frequency of the query keyword contained in the query information, and acquiring a second word frequency of the query keyword contained in the initial page data; counting the number of initial documents containing the query keyword, and calculating an evaluation weight according to that number and the query keyword; and acquiring an adjustment factor, acquiring the total number of the initial documents, and calculating a document matching index according to the adjustment factor, the total number, the first word frequency, the second word frequency and the evaluation weight.

In one embodiment, when the computer program is executed by the processor, calculating a similarity index of the query information and the initial page data includes: when the index type is the inclusion degree index, acquiring the term score corresponding to the query keyword; and when the query keyword is contained in the initial document, calculating the inclusion degree index according to the term score.

In one embodiment, when executed by the processor, the computer program further implements the following steps after judging whether the target similarity exceeds the threshold: when the target similarity does not exceed the threshold, judging whether replacement information exists for the query information; when the replacement information exists, calculating a replacement index corresponding to the initial page data according to the replacement information; acquiring a second index weight corresponding to the similarity index, and calculating the replacement similarity of the initial page data and the replacement information according to the second index weight and the replacement index; and determining whether the replacement similarity exceeds the threshold, selecting, when the replacement similarity exceeds the threshold, the initial page data corresponding to the replacement similarity exceeding the threshold as associated page data, and outputting a second document identifier corresponding to the associated page data, the associated page data and the replacement information.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of the present patent shall be subject to the appended claims.

Claims

1. A method of document querying, the method comprising:

2. The method according to claim 1, wherein the screening of the initial document from the documents to be queried according to the query information comprises:

performing word segmentation on the query information to obtain query keywords;

standardizing the query keywords to obtain standardized query keywords;

3. The method of claim 1, wherein the calculating a similarity index between the query information and the initial page data comprises:

extracting an index type corresponding to the similarity index, and acquiring word segmentation logic associated with the initial page data when the index type is set similarity;

4. The method of claim 1, wherein the calculating a similarity index between the query information and the initial page data comprises:

5. The method of claim 1, wherein extracting and calculating the similarity index between the query information and the initial page data comprises:

6. The method according to any one of claims 1 to 5, wherein said querying whether the target similarity exceeds a threshold value comprises:

7. A document querying device, the device comprising:

8. The device of claim 7, wherein the obtaining module comprises:

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.