CN113434710A

CN113434710A - Document retrieval method, document retrieval device, server and storage medium

Info

Publication number: CN113434710A
Application number: CN202110732780.9A
Authority: CN
Inventors: 陈晟豪
Original assignee: Ping An Puhui Enterprise Management Co Ltd
Current assignee: Ping An Puhui Enterprise Management Co Ltd
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-09-24

Abstract

The embodiments of the application provide a document retrieval method, a document retrieval device, a server and a storage medium. The method comprises the following steps: compressing and converting the pictures in a document and extracting features from them to obtain corresponding picture feature signature values; extracting features from the text information in the document to obtain a corresponding text feature signature value; obtaining a document feature signature from the picture feature signature value and the text feature signature value; and establishing an index relation between the document feature signature and the address path at which the document is stored. With this method, a document can be searched for by picture, by text, or by picture and text together.

Description

Document retrieval method, document retrieval device, server and storage medium

Technical Field

The present application relates to the field of document processing and searching technologies, and in particular, to a document retrieval method, an apparatus, a server, and a storage medium.

Background

With the development of the computer industry, electronic documents that use computer storage devices as carriers have emerged. In the prior art, however, documents are generally searched either by keywords after a keyword index has been built, or by scanning the full text for keywords. These search modes are time-consuming and place restrictive requirements on the keywords. In addition, because they all search by keyword, pictures cannot be included in the search range to obtain a more accurate search result.

Disclosure of Invention

The embodiments of the application mainly aim to provide a document retrieval method, a document retrieval device, a server and a storage medium that extract features from the pictures and text in a document to obtain a signature, and establish an index relation between the signature and the document address. When a document is searched for, the search can be carried out by picture, by text, or by picture and text together.

In a first aspect, an embodiment of the present application provides a document retrieval method, including:

acquiring a picture in a document, and compressing the picture to acquire a first picture;

acquiring color attribute information of the first picture, matching a picture conversion mode of the first picture according to the color attribute information, and converting the first picture into a second picture according to the picture conversion mode, wherein the picture attributes of the first picture and the second picture are different;

acquiring the average gray value of the second picture and the pixel gray value of each pixel point of the second picture, and determining the picture characteristic signature value of the second picture according to the average gray value and the pixel gray value;

acquiring character information in the document, and extracting a character characteristic signature value corresponding to the document according to the character information;

determining a document characteristic signature of the document according to the picture characteristic signature value and the character characteristic signature value;

acquiring an address path for storing the document, establishing a document retrieval model according to the document characteristic signature and the address path, and establishing a document retrieval database according to the document retrieval model;

when a document retrieval instruction is received, determining information to be retrieved according to the document retrieval instruction, and matching files corresponding to the information to be retrieved from the document retrieval database.

In a second aspect, an embodiment of the present application further provides a document retrieval apparatus, including: the picture compression module is used for acquiring pictures in a document and compressing the pictures to acquire a first picture;

the picture conversion module is used for acquiring color attribute information of the first picture, matching a picture conversion mode of the first picture according to the color attribute information and converting the first picture into a second picture according to the picture conversion mode, wherein the picture attributes of the first picture and the second picture are different;

the picture signature module is used for acquiring the average gray value of the second picture and the pixel gray value of each pixel point of the second picture, and determining the picture characteristic signature value of the second picture according to the average gray value and the pixel gray value;

the character signature module is used for acquiring character information in the document and extracting a character characteristic signature value corresponding to the document according to the character information;

the document signature module is used for determining a document characteristic signature of the document according to the picture characteristic signature value and the character characteristic signature value;

the index establishing module is used for acquiring an address path for storing the document, establishing a document retrieval model according to the document characteristic signature and the address path, and establishing a document retrieval database according to the document retrieval model;

and the retrieval module is used for determining the information to be retrieved according to the document retrieval instruction when receiving the document retrieval instruction, and matching the file corresponding to the information to be retrieved from the document retrieval database.

In a third aspect, embodiments of the present application further provide a server, where the server includes a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus for implementing connection communication between the processor and the memory, where the computer program, when executed by the processor, implements the steps of any of the document retrieval methods provided in this specification.

In a fourth aspect, the present application provides a storage medium for a computer-readable storage, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of any one of the document retrieval methods provided in this specification.

The embodiments of the application provide a document retrieval method, a document retrieval device, a server and a storage medium, wherein the pictures and text information in a document are extracted, and the pictures are compressed and converted for feature extraction so as to obtain corresponding picture feature signature values. Features are extracted from the text information to obtain a corresponding text feature signature value. A document feature signature is obtained from the picture feature signature value and the text feature signature value, and an index relation is established between the document feature signature and the address path at which the document is stored. With this method, a document can be searched for by picture, by text, or by picture and text together.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic diagram illustrating a document retrieval process according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating steps corresponding to one embodiment of step S1 of FIG. 1;

FIG. 3 is a flowchart illustrating steps corresponding to one embodiment of step S14 of FIG. 2;

FIG. 4 is a flowchart illustrating steps corresponding to one embodiment of step S2 of FIG. 1;

FIG. 5 is a flowchart illustrating steps corresponding to one embodiment of step S4 of FIG. 1;

FIG. 6 is a flowchart illustrating steps corresponding to one embodiment of step S7 of FIG. 1;

FIG. 7 is a flowchart illustrating steps corresponding to one embodiment of step S72 of FIG. 6;

FIG. 8 is a block diagram of a document retrieval apparatus according to an embodiment of the present application;

FIG. 9 is a block diagram illustrating a structure of a server according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

The embodiment of the application provides a document retrieval method, a document retrieval device, a server and a storage medium. Wherein, the document retrieval method can be applied to a server. In addition, the document retrieval method can also be applied by matching the mobile terminal and the server, wherein the mobile terminal is responsible for extracting the document features to obtain the document feature signature, and the server is responsible for establishing an index according to the document feature signature and responding to a retrieval request. It can be understood that the mobile terminal may be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device, and the server may be an independent server or a server cluster.

Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.

Referring to fig. 1, fig. 1 is a flowchart illustrating a document retrieval method according to an embodiment of the present disclosure.

As shown in fig. 1, the document retrieval method includes steps S1 to S7.

Step S1, acquiring pictures in the document, and compressing the pictures to obtain a first picture.

It can be understood that the first picture is obtained by compressing a picture in the document, and that the size of the first picture is smaller than that of the picture in the document. Compressing the picture before feature extraction reduces the complexity of the feature extraction and improves its efficiency.

For example, assuming that the size of the picture in the document is 1600 pixels by 1600 pixels, the size of the first picture obtained after compressing the picture may be reduced, and in some embodiments, the size of the first picture may be 16 pixels by 16 pixels. Specifically, the zoom ratio is not limited, and may be set as needed.
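The patent does not state which resampling method performs the compression. As a minimal sketch, assuming a grayscale picture stored as a list of rows and simple block averaging (both assumptions, as is the function name `downscale_gray`), the size reduction could look like this:

```python
def downscale_gray(pixels, out_w, out_h):
    """Shrink a grayscale picture (list of rows) by averaging equal blocks.

    Block averaging is an assumed resampling method; the source only says
    that the picture is compressed to a smaller size such as 16 x 16.
    """
    in_h, in_w = len(pixels), len(pixels[0])
    if in_w % out_w or in_h % out_h:
        raise ValueError("input size must be a multiple of the output size")
    bw, bh = in_w // out_w, in_h // out_h  # block width and height
    result = []
    for by in range(out_h):
        row = []
        for bx in range(out_w):
            # Collect every source pixel that falls into this output pixel.
            block = [pixels[y][x]
                     for y in range(by * bh, (by + 1) * bh)
                     for x in range(bx * bw, (bx + 1) * bw)]
            row.append(sum(block) / len(block))
        result.append(row)
    return result
```

With this sketch, a 1600 by 1600 picture would be reduced to 16 by 16 by calling `downscale_gray(pixels, 16, 16)`, each output pixel averaging a 100 by 100 block.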

Referring to fig. 2, in some embodiments, step S1 includes: step S11 to step S14.

Step S11: acquiring pictures in the document, and detecting corresponding picture attribute information of the pictures;

step S12: judging whether the storage space required by the picture storage exceeds a preset value or not according to the picture attribute information;

step S13: when the required storage space does not exceed the preset value during the picture storage, compressing the picture to obtain the first picture;

step S14: when the required storage space exceeds the preset value during the picture storage, the picture is cut into a plurality of sub-pictures, and the plurality of sub-pictures are compressed to obtain the first picture.

It is understood that the picture attribute information includes a storage space required when the picture is stored. The preset value is used for judging whether the compression process of the picture needs to be subjected to fragment compression processing. When the required storage space does not exceed a preset value during the storage of the picture, the picture can be directly compressed to obtain a first picture. When the required storage space is larger than a preset value during the storage of the picture, the picture needs to be firstly cut into a plurality of sub-pictures, and then the plurality of sub-pictures are compressed to obtain the first picture.

Compressing a picture occupies a large amount of memory on the compression device, and the larger the storage space the picture requires, the more memory its compression occupies. Therefore, when the storage space required by the picture reaches a certain size, compressing it may occupy too much memory and cause performance problems on the compression device. Such a picture therefore needs to be cut into a plurality of sub-pictures, which are then compressed.

In some embodiments, the preset value is set to 5 MB; it may also be set according to the configuration of the compression device and the application scenario.

For example, when the preset value is set to 5 MB and the storage space required for storing the picture is 2 MB, the required storage space is smaller than the preset value, so the picture may be directly compressed to obtain a compressed picture with a size of 16 pixels by 16 pixels, i.e. the first picture. When the preset value is set to 5 MB and the storage space required for storing the picture is 8 MB, the required storage space is greater than the preset value, so the picture needs to be first cut into a plurality of sub-pictures, which are then compressed respectively, so as to obtain a compressed picture with a size of 16 pixels by 16 pixels, i.e. the first picture.
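The choice between steps S13 and S14 can be sketched as a single comparison. The names `PRESET_BYTES` and `compression_strategy` are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption:

```python
# Assumed preset value: 5 MB, taken as 5 * 1024 * 1024 bytes.
PRESET_BYTES = 5 * 1024 * 1024


def compression_strategy(storage_bytes, preset_bytes=PRESET_BYTES):
    """Return "direct" (step S13) or "tiled" (step S14).

    A picture whose storage space does not exceed the preset value is
    compressed directly; a larger picture is first cut into sub-pictures.
    """
    return "tiled" if storage_bytes > preset_bytes else "direct"
```

Under these assumptions, a 2 MB picture is compressed directly and an 8 MB picture is cut into sub-pictures first.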

Referring to fig. 3, in some embodiments, step S14 includes: step S141 to step S143.

Step S141, cutting the picture into a plurality of sub-pictures, and acquiring size information of the sub-pictures;

step S142, compressing the sub-picture according to the size information and preset size information to obtain a corresponding target fragment;

step S143, the first picture is obtained according to the plurality of target fragments.

The process from step S141 to step S143 may be understood as a process of obtaining a plurality of sub-pictures after performing region segmentation on the picture, and then compressing the plurality of sub-pictures respectively.

It is understood that the size information of a sub-picture consists of the width information and the height information of the sub-picture. The preset size information comprises a target fragment width and a target fragment height, which correspond respectively to the width and the height of the target picture obtained after the sub-picture is compressed; this target picture is the target fragment. The first picture is obtained from the plurality of target fragments produced by compressing the plurality of sub-pictures.

Specifically, the compression process may be divided into the following steps:

First, the size information of the sub-picture is calculated according to the zoom ratio and the screen multiple:

Sub-picture width = original picture width × screen multiple

Sub-picture height = zoom ratio × original picture height × screen multiple

Then, the preset size information is calculated according to the zoom ratio and the screen multiple:

Target fragment width = target picture width

Target fragment height = original picture fragment height × zoom ratio

where zoom ratio = first picture height / (original picture height × screen multiple).

And acquiring a sub-picture to be compressed from an original picture according to the sub-picture width and the sub-picture height in the size information of the sub-picture, sequentially compressing the sub-picture to be compressed into a preset size according to a target fragment width and a target fragment height in the preset size, and drawing the sub-picture into a target fragment area to obtain the target picture. The target picture is the first picture.

When the compression device is a terminal device whose screen is a Retina screen, the screen multiple is the resolution corresponding to the screen width divided by the number of points corresponding to the screen width. For example, the resolution corresponding to the screen width of the iPhone 6 is 750 and the number of points corresponding to the screen width is 375, so the screen multiple of the iPhone 6 is 750/375 = 2. When the compression device is not a terminal device, or is a terminal device whose screen is not a Retina screen, the screen multiple is 1. It is understood that when the compression device is a server, the corresponding screen multiple is 1.

Because the target picture is obtained by cutting the picture and then compressing the sub-pictures one after another, the memory pressure on the compression device during picture compression is reduced and performance problems caused by the compression process are avoided.
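The screen multiple and the zoom ratio described above can be worked through numerically. The following is a sketch only: the function names are hypothetical, and the zoom ratio and target fragment height follow one reading of the formulas in this section (zoom ratio = first picture height divided by the product of original picture height and screen multiple).

```python
def screen_multiple(resolution_width, point_width, is_retina):
    """Screen multiple: resolution / points on a Retina screen, otherwise 1."""
    return resolution_width / point_width if is_retina else 1


def zoom_ratio(first_picture_height, original_height, multiple):
    """Zoom ratio = first picture height / (original height * screen multiple)."""
    return first_picture_height / (original_height * multiple)


def target_fragment_height(original_fragment_height, ratio):
    """Target fragment height = original picture fragment height * zoom ratio."""
    return original_fragment_height * ratio
```

For the iPhone 6 figures in the text, `screen_multiple(750, 375, True)` gives 2; on a server the multiple is 1, so a 1600-pixel-high picture compressed to 16 pixels has a zoom ratio of 0.01.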

Step S2, obtaining color attribute information of the first picture, and matching a picture conversion mode of the first picture according to the color attribute information to convert the first picture into a second picture according to the picture conversion mode, where picture attributes of the first picture and the second picture are different.

It is understood that the picture conversion mode is determined according to the color attribute information of the first picture, and the first picture is converted into the second picture according to that mode. The picture attributes of the first picture and the second picture differ in that the second picture is a grayscale picture obtained from the first picture and is smaller in size than the first picture.

Referring to fig. 4, in some embodiments, step S2 includes: step S21 to step S22.

Step S21, when the color attribute information of the first picture meets a first condition, dividing the first picture into a plurality of first sub-pictures, carrying out graying processing on the plurality of first sub-pictures to obtain a plurality of grayed pictures, and selecting the grayscale picture with the largest grayscale variance as a second picture;

step S22, when the color attribute information of the first picture does not meet the first condition, compressing the first picture and carrying out graying processing to obtain the second picture.

It is understood that the color attribute information of the first picture is obtained in order to determine the picture conversion mode. When the color attribute information of the first picture satisfies the first condition, the picture conversion mode corresponds to step S21; when it does not, the picture conversion mode corresponds to step S22.

In some embodiments, determining whether the color attribute information of the first picture satisfies the first condition includes:

extracting an HSV histogram of the first picture;

quantizing the H component of the HSV histogram into 16 levels and the S and V components into 4 levels each, and synthesizing a one-dimensional feature vector from the quantized components;

obtaining the maximum value in the feature vector, and calculating the mean value of the feature vector;

if the quotient obtained by dividing the maximum value by the mean value is greater than a preset vector threshold, the color attribute information of the first picture meets the first condition;

if the quotient is not greater than the preset vector threshold, the color attribute information of the first picture does not meet the first condition.

In some embodiments, the vector preset threshold is set to 1.35, and may be set as needed. It is understood that the above method can be used to determine whether the first picture is a picture close to a solid color.
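The first-condition check can be sketched with the Python standard library. The patent does not spell out how the one-dimensional feature vector is synthesized; this sketch assumes a joint histogram of 16 × 4 × 4 = 256 bins (H in 16 levels, S and V in 4 levels each), and the function name `near_solid_color` is hypothetical.

```python
import colorsys


def near_solid_color(pixels, threshold=1.35):
    """Return True when max(histogram) / mean(histogram) exceeds the threshold.

    pixels: iterable of (r, g, b) floats in [0, 1].
    The 256-bin joint HSV histogram is an assumed reading of the source.
    """
    hist = [0] * (16 * 4 * 4)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        hq = min(int(h * 16), 15)  # H quantized into 16 levels
        sq = min(int(s * 4), 3)    # S quantized into 4 levels
        vq = min(int(v * 4), 3)    # V quantized into 4 levels
        hist[hq * 16 + sq * 4 + vq] += 1
    total = sum(hist)
    if total == 0:
        raise ValueError("picture has no pixels")
    mean = total / len(hist)
    return max(hist) / mean > threshold
```

Under this reading, a single-color picture puts every pixel in one bin, so the quotient is 256 and the first condition is met; a picture whose pixels are spread evenly across all bins has a quotient of 1 and does not meet it.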

In some embodiments, the dividing the first picture into a plurality of first sub-pictures when the color attribute information of the first picture satisfies a first condition includes:

dividing the first picture into N sub-regions, and respectively acquiring a picture in each sub-region as a first sub-picture, wherein N is a positive integer greater than or equal to 2.

It is to be understood that when the color attribute information of the first picture satisfies the first condition, N sampling regions are set on the first picture to divide it into N regions, where N is a positive integer greater than or equal to 2. Taking N equal to 5 as an example, the upper-left, upper-right, lower-left, lower-right and central regions of the first picture are set as sampling regions, from which a sampling picture is then selected for picture feature extraction.

Illustratively, the upper left corner of the first picture is taken as the origin of coordinates, the direction to the right of the origin as the positive x-axis direction, and the direction below the origin as the positive y-axis direction. A region is represented by {x, y, width, height}, where x is the x-axis coordinate value, y is the y-axis coordinate value, width is the width, and height is the height. Let the coordinates of the first picture be {0,0,16,16}. Assuming that each divided region is 8 pixels by 8 pixels and that the 5 regions at the upper left, upper right, lower left, lower right and center are set as sampling regions, the coordinates of the 5 first sub-pictures are respectively:

First sub-picture coordinates corresponding to the upper left area: {0,0,8,8}.

First sub-picture coordinates corresponding to the upper right area: {8,0,8,8}.

First sub-picture coordinates corresponding to the lower left area: {0,8,8,8}.

First sub-picture coordinates corresponding to the lower right area: {8,8,8,8}.

First sub-picture coordinates corresponding to the central area: {4,4,8,8}.

It is understood that the above five first sub-pictures together cover the entire area of the first picture. The first picture is divided in this way, the resulting first sub-pictures are grayed, and the grayscale picture with the largest gray variance is selected as the second picture for subsequent extraction of the picture feature signature, so that a better picture feature sampling effect can be achieved.
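Selecting the sub-picture with the largest gray variance can be sketched as follows, assuming the 16 by 16 first picture has already been grayed and is stored as a list of rows. The names `REGIONS`, `crop` and `pick_second_picture` are hypothetical.

```python
from statistics import pvariance

# The five sampling regions from the example, as (x, y, width, height).
REGIONS = {
    "upper_left": (0, 0, 8, 8),
    "upper_right": (8, 0, 8, 8),
    "lower_left": (0, 8, 8, 8),
    "lower_right": (8, 8, 8, 8),
    "center": (4, 4, 8, 8),
}


def crop(gray, x, y, width, height):
    """Cut the region {x, y, width, height} out of a list-of-rows picture."""
    return [row[x:x + width] for row in gray[y:y + height]]


def pick_second_picture(gray):
    """Return (region name, sub-picture) with the largest gray variance."""
    best_name, best_var = None, -1.0
    for name, box in REGIONS.items():
        sub = crop(gray, *box)
        var = pvariance([p for row in sub for p in row])
        if var > best_var:
            best_name, best_var = name, var
    return best_name, crop(gray, *REGIONS[best_name])
```

Population variance is used here because every pixel of the region is measured; the source does not say which variance definition is intended.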

Step S3, obtaining the average gray value of the second picture and the pixel gray value of each pixel point of the second picture, and determining the picture characteristic signature value of the second picture according to the average gray value and the pixel gray value.

It can be understood that the average gray value is equal to the sum of the gray values of the pixels of the second picture divided by the number of the pixels of the second picture. The picture characteristic signature value serves as the feature record of the second picture and is obtained by comparing the gray value of each pixel point of the second picture with the average gray value.

In some embodiments, determining the picture characteristic signature value of the second picture according to the average gray value and the pixel gray value includes the following steps:

comparing the pixel gray value of each pixel point in the second picture with the average gray value in sequence to obtain a comparison result;

when the comparison result is that the pixel gray value is greater than the average gray value, storing 1 in the array, otherwise, storing 0 in the array;

and sequentially taking out the element values in the array to splice into a character string and convert the character string into a binary number, wherein the binary number is the picture characteristic signature value.

For example, assume that the example picture is a grayscale picture of 2 pixels by 2 pixels, i.e., the example picture has only 4 pixels. Assuming that the gray values of the pixels in the example picture are 10, 20, 40, and 50, respectively, the average gray value of the example picture is (10+20+40+50)/4 = 30. The pixel gray value of each pixel point in the example picture is compared with the average gray value in sequence: if the pixel gray value is greater than the average gray value, 1 is stored in the array; otherwise, 0 is stored in the array. The elements in the final array are therefore [0,0,1,1]. Splicing the elements in the array gives the character string "0011", and converting the character string into binary gives the binary number 0b0011, which is the picture characteristic signature value corresponding to the example picture.

Step S4, acquiring the text information in the document, and extracting the text characteristic signature value corresponding to the document according to the text information.
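The comparison against the average gray value can be sketched in a few lines. The function name `picture_signature` is hypothetical, and the pixels are assumed to be supplied as a flat list in scan order.

```python
def picture_signature(gray_values):
    """Build the picture feature signature from a flat list of gray values.

    Each pixel contributes "1" if it is greater than the average gray
    value and "0" otherwise; the bits are spliced in order.
    """
    average = sum(gray_values) / len(gray_values)
    bit_string = "".join("1" if g > average else "0" for g in gray_values)
    return bit_string, int(bit_string, 2)
```

For the four-pixel example with gray values 10, 20, 40 and 50, the average is 30 and the sketch returns the string "0011" together with the integer 0b0011.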

It is understood that the text information is the text in the document. The character characteristic signature value is obtained according to the character information.

Referring to fig. 5, in some embodiments, step S4 includes: step S41 to step S46.

Step S41, extracting characters from the text information according to a preset character set to obtain first text information;

step S42, splitting the first character information according to a preset digit number to obtain a first character splitting array;

step S43, encoding the elements of the first character split array to obtain a first coded array;

step S44, encrypting the elements in the first coding array, and intercepting the encryption result to obtain a first encryption array;

step S45, binary conversion is carried out on the elements in the first encrypted array to obtain a first characteristic array;

and step S46, obtaining the character feature signature according to the first feature array.

It can be understood that the preset character set is a feature extraction set of the text information. In some embodiments, the preset character set includes letters, numbers and symbols, and in addition, the preset character set can be set as required. The character extraction is performed on the text information according to the character set, and it can be understood that the content of the text information except for the elements of the preset character set is removed, and the obtained result is the first text information. In some embodiments, the above process may be implemented by processing using regular expressions.

The preset number of digits is a positive integer; in some embodiments it may be set to 3, and it may be set as needed. Splitting the first text information according to the preset number of digits yields a string array, namely the first character split array. The encoding in step S43 may be UTF-8 encoding, and another encoding may be selected as needed. The encryption in step S44 may be MD5, and another algorithm may be selected as needed. It can be understood that after MD5 is applied to the elements in the first coded array, an encryption result string is obtained for each element; the encryption result string is truncated, and the array formed from the truncated results is the first encrypted array. The elements in the first feature array are obtained by binary conversion of the elements in the first encrypted array.

Illustratively, assume that the preset character set includes letters, numbers, and symbols, that the preset number of digits is 3, and that the example text is a Chinese sentence meaning "Binary (binary) is a numbering system with base 2 used in mathematics and digital circuits", in which "(binary)" and "2" are the only characters belonging to the preset character set. After characters are extracted from the example text according to the preset character set, the first text information is "(binary)2"; after the first text information is split according to the preset number of digits, the corresponding first character split array is ["(bi", "nar", "y)2"]. The text feature signature is then obtained from the first character split array.

In some embodiments, each element of the first encrypted array consists of 8 bytes after the encryption result string is truncated. Because each byte corresponds to an 8-bit binary number, the binary number obtained by binary conversion of each 8-byte element has 64 bits; that is, each element in the first feature array is a 64-bit binary number.
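Steps S41 and S42 can be sketched with a regular expression. This sketch assumes the preset character set is the printable ASCII letters, digits and symbols (excluding spaces); the names `NON_PRESET` and `extract_and_split`, and the Chinese sample sentence used in the usage note, are illustrative assumptions rather than text from the patent.

```python
import re

# Anything outside printable, non-space ASCII is treated as not belonging
# to the preset character set (an assumed definition of that set).
NON_PRESET = re.compile(r"[^\x21-\x7e]")


def extract_and_split(text, width=3):
    """Keep only preset-set characters, then split into chunks of `width`."""
    first_text = NON_PRESET.sub("", text)
    return [first_text[i:i + width] for i in range(0, len(first_text), width)]
```

Applied to a hypothetical input such as "二进制(binary)在数学和数字电路中指以2为基数的记数系统", the sketch keeps "(binary)2" and returns ["(bi", "nar", "y)2"].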
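Steps S43 to S45 can be sketched with `hashlib`. The patent does not say which part of the MD5 result is kept, so taking the first 8 bytes of the digest is an assumption, as is the function name `chunk_to_feature`. MD5 is strictly a hash function; the sketch follows the source in using it as the "encryption" step.

```python
import hashlib


def chunk_to_feature(chunk):
    """UTF-8 encode a chunk, hash it with MD5, keep 8 bytes, return an int.

    Keeping the first 8 of the 16 digest bytes is an assumed truncation
    rule; the result is always a 64-bit value.
    """
    digest = hashlib.md5(chunk.encode("utf-8")).digest()  # 16 bytes
    return int.from_bytes(digest[:8], "big")
```

Mapping this function over a first character split array such as ["(bi", "nar", "y)2"] would give a first feature array of three 64-bit numbers.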

The text feature signature is obtained according to the first feature array, and specifically includes the following steps:

initializing a feature result value for each binary digit position to record the summation result for that position; sequentially taking out the elements of the first feature array and summing them bit by bit: if the value of the corresponding binary digit is 1, adding 1 to the feature result value for that position, and if it is 0, subtracting 1 from the feature result value; after the summation is finished, constructing a first feature result array from the feature result values;

and sequentially converting the elements in the first characteristic result array into binary numbers, and obtaining the character characteristic signature value according to the binary numbers.

In some embodiments, assuming that the first feature array is composed of 64-bit binary numbers, 64 feature result values are generated accordingly, one per binary digit position, each recording the summation result for that position; the first feature result array is formed from these 64 feature result values.

In some embodiments, the rule for converting the elements in the first feature result array to binary numbers is: if the element is greater than 0, it is converted to binary 1, otherwise it is converted to binary 0.

For example, assuming that the first feature array is [0b0001, 0b0011, 0b1111, 0b0111], the process of obtaining the text feature signature value from the first feature array is as follows:

First feature array = [0b0001, 0b0011, 0b1111, 0b0111]

First feature result array = [-1-1+1-1, -1-1+1+1, -1+1+1+1, 1+1+1+1] = [-2, 0, 2, 4]

Text feature signature value = 0b0011
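The bitwise summation and the conversion rule above can be sketched as follows. The function names and the `width` parameter are hypothetical; the 4-bit width is used only to reproduce the small example, whereas the embodiments described above use 64 bits.

```python
def bit_totals(features: list[int], width: int = 64) -> list[int]:
    """Sum +1 for every 1 bit and -1 for every 0 bit, per bit position (index 0 = highest bit)."""
    totals = [0] * width
    for value in features:
        for pos in range(width):
            bit = (value >> (width - 1 - pos)) & 1
            totals[pos] += 1 if bit else -1
    return totals


def text_signature(features: list[int], width: int = 64) -> int:
    """Convert each total to 1 if it is greater than 0, otherwise 0, and pack the bits."""
    signature = 0
    for total in bit_totals(features, width):
        signature = (signature << 1) | (1 if total > 0 else 0)
    return signature


example = [0b0001, 0b0011, 0b1111, 0b0111]
totals = bit_totals(example, width=4)       # [-2, 0, 2, 4]
signature = text_signature(example, width=4)  # 0b0011
```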

Step S5, determining the document feature signature of the document according to the picture feature signature value and the text feature signature value.

In some embodiments, the document feature signature is a combination of a picture feature signature value and a text feature signature value.

For example, assume that the picture feature signature value is 0x1011 and the text feature signature value is 0x0111. If the two are combined with the picture feature signature value first and the text feature signature value second, the resulting document feature signature is 0x10110111. If they are combined with the text feature signature value first and the picture feature signature value second, the resulting document feature signature is 0x01111011.

It is understood that when the picture feature signature value is a 64-bit binary number and the text feature signature value is also a 64-bit binary number, the document feature signature obtained by the combination is a 128-bit binary number.
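A minimal sketch of the combination, assuming the two signature values are held as integers and concatenated by shifting. The function name and the `width` parameter are hypothetical; a width of 16 bits reproduces the four-hex-digit example above, and a width of 64 bits gives the 128-bit document feature signature.

```python
def combine_signatures(first: int, second: int, width: int = 64) -> int:
    """Place `first` in the high bits and `second` in the low `width` bits."""
    return (first << width) | second


picture_sig, text_sig = 0x1011, 0x0111
picture_first = combine_signatures(picture_sig, text_sig, width=16)  # 0x10110111
text_first = combine_signatures(text_sig, picture_sig, width=16)     # 0x01111011

# With two full 64-bit values the result occupies 128 bits.
full = combine_signatures(2 ** 64 - 1, 2 ** 64 - 1)
```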

It is understood that the process of obtaining the corresponding document feature signature according to the document in steps S1 to S5 may be implemented by a server or a terminal device.

Step S6, acquiring an address path at which the document is stored, and establishing a document retrieval model according to the document feature signature and the address path, so as to establish a document retrieval database according to the document retrieval model.

It will be appreciated that the document retrieval model records the document feature signature and the address path at which the document is stored, i.e. it establishes a mapping relation between the document feature signature and the address path. Establishing the document retrieval database according to the document retrieval model means storing the document retrieval model in the database.

It can be understood that if the contents of two documents are similar, the two documents may yield the same document feature signature; the document retrieval model therefore allows one document feature signature to correspond to a plurality of address paths.
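The mapping from a document feature signature to one or more address paths can be sketched with an in-memory dictionary. This is only an illustration under stated assumptions: the class name, method names, and file paths are hypothetical, and an actual embodiment would persist the mapping in the document retrieval database.

```python
from collections import defaultdict


class DocumentIndex:
    """Maps a document feature signature to every address path that produced it."""

    def __init__(self) -> None:
        self._paths: dict[int, list[str]] = defaultdict(list)

    def add(self, signature: int, path: str) -> None:
        """Record a path under its signature, ignoring exact duplicates."""
        if path not in self._paths[signature]:
            self._paths[signature].append(path)

    def paths_for(self, signature: int) -> list[str]:
        """Return a copy of the paths stored under a signature (empty if unknown)."""
        return list(self._paths.get(signature, []))


index = DocumentIndex()
index.add(0x10110111, "/docs/report_v1.pdf")  # hypothetical paths
index.add(0x10110111, "/docs/report_v2.pdf")  # similar content, same signature
index.add(0x01111011, "/docs/invoice.pdf")
```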

It is understood that step S6 may be carried out jointly: the terminal device obtains the address path of the saved document, extracts the document feature signature of that document, and initiates a network request that causes the server to establish the document retrieval database. It is understood that step S6 may also be performed by the server independently.

Step S7, when receiving a document retrieval instruction, determining the information to be retrieved according to the document retrieval instruction, and matching the file corresponding to the information to be retrieved from the document retrieval database.

In some embodiments, the document retrieval instruction may be a document retrieval request initiated by the terminal device, and after receiving the document retrieval instruction, the server acquires the information to be retrieved according to the document retrieval instruction. The information to be retrieved comprises at least one of text information to be retrieved and picture information to be retrieved. It should be understood that the document retrieval instruction is not limited to a request initiated by the terminal device; it may also be an instruction received by the server in another manner, such as a retrieval instruction sent by a maintenance developer.

Referring to fig. 6, in some embodiments, step S7 includes: step S71 to step S72.

Step S71, when a document retrieval instruction is received, determining information to be retrieved according to the document retrieval instruction, and determining a search matching signature value according to the information to be retrieved.

It can be understood that the process of determining the information to be retrieved according to the document retrieval instruction and determining the matching signature value according to the information to be retrieved may be executed by the terminal device that initiated the document retrieval instruction, or may be executed by the server that received the document retrieval instruction.

In some embodiments, the search matching signature value is obtained from the information to be retrieved according to its content:

When the information to be retrieved contains only picture information, the corresponding picture feature signature value is obtained according to steps S1 to S3 and used as the search matching signature value.

When the information to be retrieved contains only text information, the corresponding text feature signature value is obtained according to step S4 and used as the search matching signature value.

When the information to be retrieved contains both picture information and text information, the corresponding document feature signature is obtained according to steps S1 to S5 and used as the search matching signature value.

Step S72, matching files corresponding to the search match signature value from the document retrieval database.

It is understood that the document retrieval model is recorded in the document retrieval database. The document feature signature is obtained from the document retrieval model and is matched with the search matching signature value to obtain a corresponding file.

Referring to fig. 7, in some embodiments, step S72 includes: step S721 to step S726.

Step S721, sequentially extracting the document retrieval model from the document retrieval database, and extracting a document feature signature value in the document retrieval model according to the information to be retrieved to obtain a first signature value;

step S722, carrying out XOR operation on the first signature value and the search matching signature value to obtain a first matching value;

step S723, performing a recursive AND operation on the first matching value to obtain a second matching value;

step S724, when the second matching value is smaller than a preset matching threshold value, acquiring an address path corresponding to the document feature signature to form an address path array;

step S725, arranging the elements of the address path array according to a preset arrangement mode to obtain a result document address array;

and step S726, acquiring a corresponding file according to the result document address array.

It can be understood that the document feature signature value includes a picture feature signature value and a text feature signature value, and the first signature value needs to be extracted from the document feature signature value according to the information to be retrieved. If the information to be retrieved contains only picture information, the picture feature signature value is extracted from the document feature signature value as the first signature value. If the information to be retrieved contains only text information, the text feature signature value is extracted from the document feature signature value as the first signature value. If the information to be retrieved contains both picture information and text information, the document feature signature value is used directly as the first signature value.
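The extraction of the first signature value can be sketched as follows. The sketch assumes that the document feature signature stores the picture feature signature value in the high bits and the text feature signature value in the low bits; the function name, the `query_kind` labels, and the `width` parameter are hypothetical.

```python
def first_signature_value(document_sig: int, query_kind: str, width: int = 64) -> int:
    """Pick the part of the document feature signature that matches the query content."""
    mask = (1 << width) - 1
    if query_kind == "picture":
        return (document_sig >> width) & mask  # high half: picture feature signature value
    if query_kind == "text":
        return document_sig & mask             # low half: text feature signature value
    if query_kind == "both":
        return document_sig                    # the whole document feature signature
    raise ValueError(f"unknown query kind: {query_kind!r}")


document_sig = 0x10110111  # picture part 0x1011, text part 0x0111, 16 bits each
picture_part = first_signature_value(document_sig, "picture", width=16)
text_part = first_signature_value(document_sig, "text", width=16)
whole = first_signature_value(document_sig, "both", width=16)
```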

In some embodiments, the recursive AND operation subtracts one from the first matching value and performs a bitwise AND of the result with the first matching value itself to obtain an operation result. The operation result is then assigned back to the first matching value, and the operation is repeated until the first matching value equals 0. The number of operations performed in this process, i.e. the number of recursions, is the second matching value. Because each operation clears exactly one 1 bit, the second matching value equals the number of bit positions in which the first signature value and the search matching signature value differ. It is understood that the smaller the second matching value, the higher the degree of matching.

In some embodiments, the preset matching threshold is set to 20, and it may be adjusted as needed.

In some embodiments, the elements of the address path array are arranged according to a preset arrangement mode, namely in ascending order of the second matching value. The preset arrangement mode can be set as needed.

It will be appreciated that from the resulting array of document addresses, a corresponding file may be obtained.
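Steps S722 to S725 can be sketched together as follows. This is a simplified illustration under stated assumptions: the index is a plain dictionary from signature to paths, the stored signature is compared as a whole without the per-field extraction of step S721, the paths are hypothetical, and `bin(...).count("1")` is used in place of the recursive AND operation since both count the 1 bits of the XOR result.

```python
def search(index: dict[int, list[str]], query_sig: int, threshold: int = 20) -> list[str]:
    """Return address paths whose signatures are close to the query, nearest first."""
    scored: list[tuple[int, str]] = []
    for doc_sig, paths in index.items():
        first_matching_value = doc_sig ^ query_sig                 # XOR marks differing bits
        second_matching_value = bin(first_matching_value).count("1")
        if second_matching_value < threshold:
            scored.extend((second_matching_value, path) for path in paths)
    scored.sort(key=lambda item: item[0])  # ascending second matching value
    return [path for _, path in scored]


index = {
    0b1110: ["/docs/b.pdf"],  # hypothetical paths
    0b1111: ["/docs/a.pdf"],
    0b0000: ["/docs/c.pdf"],
}
results = search(index, query_sig=0b1111, threshold=2)
```

With a threshold of 2, the document at distance 4 is excluded and the exact match is listed before the document at distance 1.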

For example, assuming that the first matching value is 0b1110, the process of calculating the second matching value according to the first matching value is as follows:

0b1110 & (0b1110 - 1) = 0b1110 & 0b1101 = 0b1100

0b1100 & (0b1100 - 1) = 0b1100 & 0b1011 = 0b1000

0b1000 & (0b1000 - 1) = 0b1000 & 0b0111 = 0

It can be understood that the above process takes 3 recursions, i.e. the second matching value is 3. Assuming that the preset matching threshold is 20, the second matching value is smaller than the preset matching threshold, so the document address path corresponding to the first matching value is added to the address path array for subsequent processing and retrieval of the matching file.
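The repeated AND operation in the example can be sketched as a loop. The function name is hypothetical; the loop clears the lowest 1 bit on each pass, so the count it returns is the number of 1 bits in the first matching value.

```python
def second_matching_value(first_matching_value: int) -> int:
    """Count how many times value & (value - 1) must be applied before the value reaches 0."""
    count = 0
    while first_matching_value != 0:
        first_matching_value &= first_matching_value - 1  # clears the lowest 1 bit
        count += 1
    return count


example_count = second_matching_value(0b1110)  # 3, as in the worked example
```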

According to the document retrieval method, the picture and the text information in the document are extracted; the picture is compressed and converted for feature extraction to obtain the corresponding picture feature signature value, and feature extraction is performed on the text information to obtain the corresponding text feature signature value. A document feature signature is obtained from the picture feature signature value and the text feature signature value, and an index relation is established between the document feature signature and the address path at which the document is stored. In this way, a document can be searched by picture, by text, or by picture and text together.

In the process of compressing the picture to obtain the first picture, when the storage space occupied by the picture is judged to be larger than the preset value, the picture is sliced and the slices are compressed to obtain the first picture, which can reduce the memory usage of the compression device during picture compression.

In addition, in the process of converting the first picture to obtain the second picture for feature extraction, when the color attribute information of the first picture is judged to meet the first condition, namely the first picture is a picture close to a pure color, the picture in the area with the largest gray variance is selected as the second picture for feature extraction, so that a better picture feature sampling effect can be achieved.

Referring to fig. 8, the present application further provides a document retrieving apparatus 200, including:

the picture compression module 201 is configured to obtain a picture in a document, and compress the picture to obtain a first picture;

the picture conversion module 202 is configured to acquire color attribute information of the first picture, match a picture conversion mode of the first picture according to the color attribute information, and convert the first picture into a second picture according to the picture conversion mode, where picture attributes of the first picture and the second picture are different;

the picture signature module 203 is configured to obtain an average gray value of the second picture and a pixel gray value of each pixel of the second picture, and determine a picture feature signature value of the second picture according to the average gray value and the pixel gray value;

the text signature module 204 is configured to obtain text information in the document, and extract a text feature signature value corresponding to the document according to the text information;

the document signature module 205 is configured to determine a document feature signature of the document according to the picture feature signature value and the text feature signature value;

an index establishing module 206, configured to obtain an address path for storing the document, and establish a document retrieval model according to the document feature signature and the address path, so as to establish a document retrieval database according to the document retrieval model;

and the retrieval module 207 is configured to, when a document retrieval instruction is received, determine information to be retrieved according to the document retrieval instruction, and match a file corresponding to the information to be retrieved from the document retrieval database.
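The work of the picture signature module 203 can be sketched as follows. This passage does not state the exact comparison rule, so the sketch assumes that each pixel contributes a 1 bit when its gray value is at least the average gray value and a 0 bit otherwise; the function name and the nested-list picture representation are also hypothetical.

```python
def picture_signature(gray: list[list[int]]) -> int:
    """Pack one bit per pixel: 1 if the pixel gray value is >= the average gray value."""
    pixels = [value for row in gray for value in row]
    average = sum(pixels) / len(pixels)
    signature = 0
    for value in pixels:
        # assumption: ">=" is the comparison rule; the text only names the two inputs
        signature = (signature << 1) | (1 if value >= average else 0)
    return signature


tiny = [[0, 255],
        [255, 0]]
tiny_signature = picture_signature(tiny)  # 0b0110
```

Under this assumption, an 8 x 8 grayscale second picture would yield a 64-bit picture feature signature value.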

In some embodiments, when the picture compression module 201 acquires a picture in a document and compresses the picture to obtain a first picture, the method includes:

acquiring pictures in the document, and detecting corresponding picture attribute information of the pictures;

judging whether the storage space required by the picture storage exceeds a preset value or not according to the picture attribute information;

when the required storage space does not exceed the preset value during the picture storage, compressing the picture to obtain the first picture;

when the required storage space exceeds the preset value during the picture storage, the picture is cut into a plurality of sub-pictures, and the plurality of sub-pictures are compressed to obtain the first picture.
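The decision between compressing the whole picture and slicing it first can be sketched as follows. All names and default values are hypothetical, and the required storage space is assumed to be the uncompressed size computed from the width, height, and bytes per pixel in the picture attribute information; the sketch only plans the regions to compress and does not perform the compression itself.

```python
def plan_regions(width: int, height: int, bytes_per_pixel: int = 3,
                 preset_bytes: int = 4 * 1024 * 1024,
                 band_height: int = 512) -> list[tuple[int, int, int, int]]:
    """Return the (left, top, right, bottom) regions that would be compressed."""
    required = width * height * bytes_per_pixel  # assumed estimate of the storage space
    if required <= preset_bytes:
        return [(0, 0, width, height)]           # small enough: compress in one piece
    # Too large: cut into horizontal bands so only one band is held in memory at a time.
    return [(0, top, width, min(top + band_height, height))
            for top in range(0, height, band_height)]


small_plan = plan_regions(100, 100)
large_plan = plan_regions(4000, 1200)
```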

In some embodiments, the picture compression module 201, when dividing the picture into a plurality of sub-pictures and compressing the plurality of sub-pictures to obtain the first picture, includes:

cutting the picture into a plurality of sub-pictures, and acquiring size information of the sub-pictures;

compressing the sub-picture according to the size information and preset size information to obtain a corresponding target fragment;

and acquiring the first picture according to the plurality of target fragments.

In some embodiments, the picture conversion module 202, when matching the picture conversion mode of the first picture according to the color attribute information to convert the first picture into the second picture according to the picture conversion mode, comprises:

when the color attribute information of the first picture meets a first condition, dividing the first picture into a plurality of first sub-pictures, carrying out graying processing on the plurality of first sub-pictures to obtain a plurality of grayed pictures, and selecting the grayscale picture with the largest grayscale variance as a second picture;

and when the color attribute information of the first picture does not meet a first condition, compressing the first picture and carrying out graying treatment to obtain a second picture.
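The selection of the sub-picture with the largest grayscale variance can be sketched as follows, assuming the first sub-pictures have already been grayed and are represented as nested lists of gray values. The function names are hypothetical, and the population variance is used here as one reasonable reading of "grayscale variance".

```python
from statistics import pvariance


def gray_variance(gray: list[list[int]]) -> float:
    """Population variance of all gray values in one grayed sub-picture."""
    return pvariance([value for row in gray for value in row])


def pick_second_picture(sub_pictures: list[list[list[int]]]) -> list[list[int]]:
    """Choose the grayed sub-picture whose gray values vary the most."""
    return max(sub_pictures, key=gray_variance)


flat = [[10, 10], [10, 10]]      # nearly pure color: no variation
textured = [[0, 255], [255, 0]]  # strong variation
chosen = pick_second_picture([flat, textured])
```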

In some embodiments, the text signature module 204, when extracting the text feature signature value corresponding to the document according to the text information, includes:

according to a preset character set, performing character extraction on the text information to obtain first text information;

splitting the first character information according to a preset digit to obtain a first character splitting array;

coding elements of the first character splitting array to obtain a first coding array;

encrypting the elements in the first coding array, and intercepting an encryption result to obtain a first encryption array;

binary conversion is carried out on elements in the first encrypted array to obtain a first characteristic array;

and obtaining the character feature signature according to the first feature array.

In some embodiments, when receiving a document retrieval instruction, the retrieving module 207 determines information to be retrieved according to the document retrieval instruction, and matches a file corresponding to the information to be retrieved from the document retrieval database, includes:

when a document retrieval instruction is received, determining information to be retrieved according to the document retrieval instruction, and determining a search matching signature value according to the information to be retrieved;

matching files from the document retrieval database that correspond to the search matching signature values.

In some embodiments, the retrieval module 207, when matching files corresponding to the search matching signature value from the document retrieval database, includes:

sequentially taking out the document retrieval model from the document retrieval database, and extracting a document characteristic signature value in the document retrieval model according to the information to be retrieved to obtain a first signature value;

carrying out XOR operation on the first signature value and the search matching signature value to obtain a first matching value;

performing a recursive AND operation on the first matching value to obtain a second matching value;

when the second matching value is smaller than a preset matching threshold value, acquiring an address path corresponding to the document feature signature to form an address path array;

arranging the elements of the address path array according to a preset arrangement mode to obtain a result document address array;

and acquiring a corresponding file according to the result document address array.

Referring to fig. 9, fig. 9 is a schematic block diagram of a server according to an embodiment of the present disclosure.

As shown in fig. 9, the server 300 includes a processor 301 and a memory 302, and the processor 301 and the memory 302 are connected by a bus 303 such as an I2C (Inter-integrated Circuit) bus.

In particular, processor 301 is configured to provide computational and control capabilities, supporting the operation of the entire server. The Processor 301 may be a Central Processing Unit (CPU), and the Processor 301 may also be other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

Specifically, the memory 302 may be a Flash chip, a Read-Only Memory (ROM) magnetic disk, an optical disk, a USB flash drive, or a removable hard disk.

Those skilled in the art will appreciate that the architecture shown in fig. 9 is a block diagram of only a portion of the architecture associated with embodiments of the present application and does not constitute a limitation on the servers to which embodiments of the present application may be applied, and that a particular server may include more or fewer components than those shown, or some components may be combined, or have a different arrangement of components.

The processor 301 is configured to run a computer program stored in a memory, and when executing the computer program, implement any one of the document retrieval methods provided in the embodiments of the present application.

In an embodiment, the processor is configured to run a computer program stored in the memory and to implement the following steps when executing the computer program:

In some embodiments, when the processor 301 obtains a picture in a document and performs a compression process on the picture to obtain a first picture, the method includes:

In some embodiments, when the processor 301 cuts the picture into a plurality of sub-pictures and performs compression processing on the plurality of sub-pictures to obtain the first picture, the method includes:

and acquiring the first picture according to the plurality of target fragments.

In some embodiments, the processor 301, when matching a picture conversion mode of the first picture according to the color attribute information to convert the first picture into a second picture according to the picture conversion mode, comprises:

In some embodiments, when extracting the text feature signature value corresponding to the document according to the text information, the processor 301 includes:

In some embodiments, when receiving a document retrieval instruction, the processor 301 determines information to be retrieved according to the document retrieval instruction, and matches a file corresponding to the information to be retrieved from the document retrieval database, includes:

In some embodiments, the processor 301, when matching files corresponding to the search matching signature value from the document retrieval database, includes:

It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the server described above may refer to the corresponding process in the foregoing embodiment of the document retrieval method, and is not described herein again.

Embodiments of the present application also provide a storage medium for a computer-readable storage, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of any one of the document retrieval methods provided in the specification of the embodiments of the present application.

The storage medium may be an internal storage unit of the server described in the foregoing embodiment, for example, a hard disk or a memory of the server. The storage medium may also be an external storage device of the server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the server.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware embodiment, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

It should be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. The above description is only for the specific embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A document retrieval method, comprising:

2. The method of claim 1, wherein the obtaining a picture in a document and compressing the picture to obtain a first picture comprises:

3. The method according to claim 2, wherein the dividing the picture into a plurality of sub-pictures and compressing the plurality of sub-pictures to obtain the first picture comprises:

and acquiring the first picture according to the plurality of target fragments.

4. The method according to claim 1, wherein matching the picture conversion mode of the first picture according to the color attribute information to convert the first picture into a second picture according to the picture conversion mode comprises:

5. The method of claim 1, wherein the extracting a text feature signature value corresponding to the document according to the text information comprises:

6. The method according to claim 1, wherein when receiving a document retrieval instruction, determining information to be retrieved according to the document retrieval instruction, and matching a file corresponding to the information to be retrieved from the document retrieval database, comprises:

7. The method of claim 6, wherein the retrieving a file from the document database that matches the file corresponding to the search match signature value comprises:

8. A document retrieval apparatus, comprising:

the image compression module is used for acquiring images in a document and compressing the images to acquire a first image;

9. A server, characterized in that the server comprises a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus for enabling a connection communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the steps of the document retrieval method according to any of claims 1 to 7.

10. A storage medium for computer-readable storage, wherein the storage medium stores one or more programs which are executable by one or more processors to implement the steps of the document retrieval method of any one of claims 1 to 7.