CN113850073A - Document identification method, engineering word stock construction method, electronic device and storage medium

Info

Publication number: CN113850073A
Application number: CN202110788090.5A
Authority: CN
Inventors: 李鹏程; 杨琼; 李东来; 曾云霞
Original assignee: Zhongzhi Chengdu Technology Co ltd
Current assignee: Zhongzhi Chengdu Technology Co ltd
Priority date: 2021-07-13
Filing date: 2021-07-13
Publication date: 2021-12-28

Abstract

The embodiments of the application provide a document identification method, an engineering word stock construction method, an electronic device and a storage medium, and relate to the technical field of communication. The method comprises: acquiring an image to be recognized; performing character recognition on the image to be recognized; segmenting the recognized characters into words based on a preset engineering word stock; and replacing the phrases obtained by word segmentation based on the preset engineering word stock. The method provided by the embodiments of the application can improve the accuracy of document identification.

Description

Document identification method, engineering word stock construction method, electronic device and storage medium

Technical Field

The embodiments of the application relate to the technical field of communication, and in particular to a document identification method, an engineering word stock construction method, an electronic device and a storage medium.

Background

At present, building engineering is the big data industry with the largest data volume and the largest business scale, yet it is also the industry with the least usable data among all industries. Construction projects are typically characterized by large investment amounts, long construction periods, many uncertain factors, high risk and many participants.

Currently, engineering documents in the construction industry are usually recognized by Optical Character Recognition (OCR). Because engineering documents are highly specialized and contain many professional terms, recognizing them by OCR alone gives low accuracy.

Disclosure of Invention

The embodiments of the application provide a document identification method, an engineering word stock construction method, an electronic device and a storage medium, which offer a way of recognizing engineering documents that can improve the accuracy of document identification.

In a first aspect, an embodiment of the present application provides a document identification method, including:

acquiring an image to be identified;

performing character recognition on the image to be recognized;

segmenting the recognized characters into words based on a preset engineering word stock;

and replacing the phrases obtained by word segmentation based on the preset engineering word stock.

In one possible implementation manner, after the image to be recognized is acquired, the method further includes:

and preprocessing the image to be recognized.

In one possible implementation, the preprocessing includes one or more of stamp removal, watermark removal, or image correction.

In one possible implementation manner, based on a preset engineering lexicon, performing word segmentation on the characters obtained by recognition includes:

based on a preset engineering word stock, performing word segmentation on characters of each sentence in the image to be recognized to obtain a plurality of candidate phrase sets corresponding to the sentences; each candidate phrase set comprises a plurality of phrases, and each candidate phrase set corresponds to a confidence coefficient;

and determining, based on the confidence, the phrase set of each sentence from among the plurality of candidate phrase sets corresponding to that sentence.

In one possible implementation manner, replacing a phrase obtained by segmenting the word based on the preset engineering lexicon includes:

calculating a context semantic vector of each phrase in each phrase set;

and replacing the phrases obtained by word segmentation based on the context semantic vector of each phrase and the semantic vectors of the phrases in the preset engineering word stock.

In one possible implementation manner, replacing the phrases obtained by the segmentation based on the context semantic vector of each phrase and the semantic vector of the phrase in the preset engineering lexicon includes:

calculating the similarity between the context semantic vector of each phrase and the semantic vectors of the phrases in the preset engineering word stock, and replacing the phrases obtained by word segmentation based on the similarity; or

calculating the similarity between the context semantic vector of each phrase and the semantic vectors of the phrases in the preset engineering word stock, calculating the edit distance between each phrase and the phrases in the preset engineering word stock, and replacing the phrases obtained by word segmentation based on the similarity and the edit distance.

In one possible implementation manner, calculating a context semantic vector of each phrase in each phrase set includes:

obtaining the confidence of each phrase set;

determining a first phrase set based on the confidence degrees, wherein the first phrase set comprises a plurality of phrase sets with the confidence degrees lower than a preset first threshold value;

and calculating a context semantic vector of each phrase set in the first phrase set.

The embodiment of the application further provides a method for constructing the engineering word stock, which comprises the following steps:

acquiring a plurality of engineering documents;

identifying each engineering document, and segmenting words obtained by identification;

screening phrases obtained by word segmentation in a first engineering document to obtain a first engineering phrase set, wherein the first engineering phrase set comprises one or more engineering phrases, and the engineering phrases are used for constructing an engineering word stock;

and screening the phrases obtained by word segmentation in the rest engineering documents based on the first engineering phrase set to obtain one or more engineering phrases.

In one possible implementation manner, the method further includes:

and labeling each engineering document, wherein the label is used for identifying the document category, the document name and the body position.

In one possible implementation, the screening of the phrase obtained by word segmentation in the first engineering document includes:

calculating the weight of each phrase;

and screening in the phrases based on the weight.

In one possible implementation manner, the method further includes:

and updating the weight based on the labeling position of the phrase, wherein the labeling position of the phrase comprises one or more of a document category, a document name or a body text.

In one possible implementation manner, the method further includes:

and updating the first engineering phrase set based on the engineering phrases obtained by screening the phrases in the rest engineering documents.

In one possible implementation manner, screening the phrases obtained by word segmentation in the remaining engineering documents based on the first engineering phrase set to obtain one or more engineering phrases includes:

calculating semantic vectors of phrases in the first engineering phrase set and context semantic vectors of phrases obtained by word segmentation in the rest engineering documents;

and screening the phrases obtained by word segmentation in the remaining engineering documents, based on the similarity between the semantic vectors of the phrases in the first engineering phrase set and the context semantic vectors of the phrases obtained by word segmentation in the remaining engineering documents, to obtain one or more engineering phrases.

In a second aspect, an embodiment of the present application provides an electronic device, including:

a memory, wherein the memory is used for storing a computer program code, and the computer program code includes instructions, and when the electronic device reads the instructions from the memory, the electronic device executes the following steps:

acquiring an image to be identified;

performing character recognition on the image to be recognized;

In one possible implementation manner, when the instruction is executed by the electronic device, the electronic device further executes the following step after acquiring the image to be recognized:

and preprocessing the image to be recognized.

In one possible implementation manner, when the instruction is executed by the electronic device, the step, performed by the electronic device, of segmenting the recognized characters into words based on a preset engineering word stock includes:

In one possible implementation manner, when the instruction is executed by the electronic device, the step, performed by the electronic device, of replacing the phrases obtained by word segmentation based on the preset engineering word stock includes:

calculating a context semantic vector of each phrase in each phrase set;

In one possible implementation manner, when the instruction is executed by the electronic device, the step, performed by the electronic device, of replacing the phrases obtained by word segmentation based on the context semantic vector of each phrase and the semantic vectors of the phrases in the preset engineering word stock includes:

In one possible implementation manner, when the instruction is executed by the electronic device, the step, performed by the electronic device, of calculating a context semantic vector of each phrase in each phrase set includes:

obtaining the confidence of each phrase set;

An embodiment of the present application further provides an electronic device, including:

acquiring a plurality of engineering documents;

In one possible implementation manner, when the instruction is executed by the electronic device, the electronic device further performs the following steps:

In one possible implementation manner, when the instruction is executed by the electronic device, the step, performed by the electronic device, of screening the phrases obtained by word segmentation in the first engineering document includes:

calculating the weight of each phrase;

and screening in the phrases based on the weight.

In one possible implementation manner, when the instruction is executed by the electronic device, the step, performed by the electronic device, of screening the phrases obtained by word segmentation in the remaining engineering documents based on the first engineering phrase set to obtain one or more engineering phrases includes:

In a third aspect, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, which, when run on a computer, causes the computer to perform the method according to the first aspect.

In a fourth aspect, embodiments of the present application provide a computer program, which, when executed by a computer, is configured to perform the method of the first aspect.

In a possible design, the program in the fourth aspect may be stored in whole or in part on a storage medium packaged with the processor, or in part or in whole on a memory not packaged with the processor.

Drawings

Fig. 1 is a schematic flowchart of a document identification method according to an embodiment of the present application;

Fig. 2 is a schematic diagram of an image rectification effect according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of an engineering word stock construction method according to an embodiment of the present application;

Fig. 4 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the embodiments herein, "/" means "or" unless otherwise specified; for example, A/B may mean A or B. "And/or" herein merely describes an association between associated objects and means that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.

In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, "a plurality" means two or more unless otherwise specified.

Based on the above problems, the embodiment of the application provides a document recognition method, which uses an OCR technology and combines with a word stock in the engineering field to realize the recognition of the engineering document, so that the accuracy of the recognition of the engineering document can be improved.

The document identification method provided by the embodiment of the present application will now be described with reference to fig. 1 and 2.

Fig. 1 is a schematic flowchart illustrating an embodiment of a document identification method according to an embodiment of the present application, including:

Step 101, acquiring an image to be recognized, and preprocessing the image to be recognized to obtain a preprocessed image.

Specifically, the image to be recognized may be an image obtained by scanning the engineering document paper or an image obtained by photographing the engineering document paper, and the manner of obtaining the image to be recognized is not particularly limited in the embodiment of the present application.

The preprocessing may include removing a stamp, removing a watermark, and correcting an image.

In a specific implementation, the stamp may be located in the image to be recognized by using the color of the stamp (for example, common stamps are red or blue) and the shape of the stamp (for example, common stamps are circular or square). Once the stamp is located, the area it occupies is known, and the pixels in that area can be removed. This removes the stamp, which facilitates recognition of the whole image to be recognized and avoids interference from the stamp.
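By way of illustration only, the color-based part of stamp removal might be sketched in Python as follows. The function names, the RGB thresholds and the choice to overwrite stamp pixels with white are all assumptions made for this sketch; the shape-based positioning described above is not covered.

```python
WHITE = (255, 255, 255)

def is_stamp_colour(pixel, strong=150, weak=110):
    """Return True for strongly red or strongly blue pixels (assumed thresholds)."""
    r, g, b = pixel
    is_red = r >= strong and g <= weak and b <= weak
    is_blue = b >= strong and r <= weak and g <= weak
    return is_red or is_blue

def remove_stamp(image):
    """Replace stamp-coloured pixels with white; image is a list of rows of RGB tuples."""
    return [[WHITE if is_stamp_colour(p) else p for p in row] for row in image]

# Hypothetical 2x3 page: black text pixels, two red stamp pixels, one blue stamp pixel.
page = [
    [(0, 0, 0), (220, 30, 40), (255, 255, 255)],
    [(30, 40, 200), (20, 20, 20), (230, 20, 20)],
]
cleaned = remove_stamp(page)
```

A real implementation would normally restrict the color test to the located stamp area so that colored text elsewhere on the page is left untouched.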

In addition, the watermark may be removed by locating the watermark content in the image to be recognized according to a watermark picture and removing the pixels in the image to be recognized that correspond to the watermark picture. This removes the watermark, which facilitates recognition of the whole image to be recognized and avoids interference from the watermark.

Further, the text content in the image to be recognized may be tilted. For example, if the engineering document is placed askew when it is photographed or scanned, the text in the resulting electronic document is tilted, which hinders character recognition and reduces the accuracy of content recognition for the image to be recognized. The image to be recognized may therefore be rectified by correcting the direction of its characters (for example, the row direction and the column direction), and the rectification may be performed by an edge projection method. Fig. 2 is a schematic diagram of image rectification. As shown in Fig. 2, page 200 is the image to be recognized before rectification: the row direction of the characters in page 200 is not horizontal and the column direction is not vertical, which affects character recognition. Page 201 is obtained after rectification: the row direction of the characters in page 201 is horizontal and the column direction is vertical. The influence of tilted characters on character recognition can thus be avoided.
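As a hedged illustration of projection-based rectification, the Python sketch below estimates the tilt angle of a page from the coordinates of its dark pixels. It is a simple projection-profile heuristic and may differ from the edge projection method referred to above; the function name, the candidate angle range and the synthetic test page are assumptions.

```python
import math

def estimate_skew_degrees(points, candidates):
    """Pick the candidate angle that concentrates dark pixels into the fewest rows.

    points: iterable of (x, y) coordinates of dark pixels.
    candidates: iterable of angles in degrees to try.
    """
    best_angle, best_score = None, -1
    for angle in candidates:
        rad = math.radians(angle)
        cos_a, sin_a = math.cos(rad), math.sin(rad)
        rows = {}
        for x, y in points:
            # Row index of the pixel after rotating the page back by `angle`.
            row = round(y * cos_a - x * sin_a)
            rows[row] = rows.get(row, 0) + 1
        # Sharp text rows give a few large counts, so the sum of squares is high.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Synthetic page: five text lines tilted by 5 degrees.
slope = math.tan(math.radians(5))
points = [(x, round(20 * k + x * slope)) for k in range(1, 6) for x in range(200)]
estimated = estimate_skew_degrees(points, range(-10, 11))
```

Once the angle is estimated, the image would be rotated by the opposite angle so that the rows of characters become horizontal.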

Step 102, recognizing the preprocessed image.

Specifically, the recognition may be performed by OCR. OCR is an existing character recognition technology; reference may be made to the related literature or patents, and details are not repeated here. Recognizing the preprocessed image by OCR yields a recognition result, which may include the characters recognized from the preprocessed image. The recognition result may further include information such as the position, font and size of the characters.

Step 103, segmenting the recognized characters into words by means of a preset engineering word stock to obtain word segmentation combinations, wherein each word segmentation combination comprises a plurality of phrases.

Specifically, the phrases may be obtained by segmenting the characters with a Natural Language Processing (NLP) model.

It should be noted that a commonly used NLP model is a general-purpose natural language processing model, so word segmentation with such a model yields commonly used phrases. In this embodiment of the application, a preset engineering word stock may be added to the commonly used NLP model. The preset engineering word stock may include preset engineering phrases, and the preset engineering phrases may include phrases specific to the building engineering industry. When the NLP model with the added engineering word stock performs word segmentation, it can therefore segment according to the characteristics of the building engineering industry, avoid splitting professional phrases that belong to the industry, and recognize the engineering document more accurately. Illustratively, taking "engineering management" as an example, segmenting "engineering management" with a commonly used NLP model yields the two phrases "engineering" and "management". However, "engineering management" is a professional term specific to the building industry, that is, the preset engineering word stock includes the phrase "engineering management"; segmenting "engineering management" with the NLP model to which the preset engineering word stock has been added therefore yields the single phrase "engineering management".
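The effect of adding engineering phrases to a segmenter can be illustrated with a minimal Python sketch. It uses greedy forward maximum matching against a dictionary as a stand-in for the NLP model, which is an assumption; the space-free string "engineeringmanagement" is likewise only a Latin-alphabet placeholder for unsegmented text.

```python
def forward_max_match(text, lexicon):
    """Greedy longest-match segmentation; unknown characters become single-character words."""
    longest = max(len(word) for word in lexicon)
    words, pos = [], 0
    while pos < len(text):
        for size in range(min(longest, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                pos += size
                break
    return words

# Hypothetical general-purpose vocabulary and engineering additions.
general_lexicon = {"engineering", "management"}
engineering_lexicon = general_lexicon | {"engineeringmanagement"}

without_stock = forward_max_match("engineeringmanagement", general_lexicon)
with_stock = forward_max_match("engineeringmanagement", engineering_lexicon)
```

With the general vocabulary the text is split into two words, while the extended vocabulary keeps the professional term whole.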

In the word segmentation process, word segmentation can be performed sentence by sentence; for example, each sentence in the preprocessed image can be segmented, so that one sentence corresponds to one word segmentation combination. In a specific implementation, the word segmentation combination corresponding to a sentence may be determined as follows. Segmenting any sentence can yield several different word segmentation combinations, each with its own confidence, and the confidence of a word segmentation combination represents the statistical probability of that combination. The statistical probability can be obtained from the prior probability of each phrase in the word segmentation combination. Illustratively, taking the sentence "I love China" as an example, let the prior probability of the one-character word "I" in everyday usage be P1, the prior probability of the one-character word "love" in everyday usage be P2, and the prior probability of the two-character word "China" in everyday usage be P3; combining the prior probabilities P1, P2 and P3 gives the statistical probability P (the confidence) of the word segmentation combination {I, love, China}. In a specific implementation, the statistical probability P may be obtained by addition, for example P = P1 + P2 + P3, or by multiplication, for example P = P1 × P2 × P3. By segmenting a sentence in different ways, the statistical probability P of each word segmentation combination of the sentence can be calculated, that is, the confidence of each word segmentation combination corresponding to the sentence can be obtained.
It will be appreciated that the highest statistical probability (that is, the highest confidence) corresponds to the most reasonable word segmentation combination. Taking "I love China" again as an example, word segmentation can yield a plurality of word segmentation combinations, such as {I, love, China} and other ways of splitting the same characters; if the statistical probability (confidence) of {I, love, China} is the highest, {I, love, China} is the most reasonable word segmentation combination. After the confidences of the plurality of word segmentation combinations of any sentence are obtained, the word segmentation combination with the highest confidence can be used as the word segmentation combination of that sentence.
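The selection of the most confident word segmentation combination can be sketched as follows. This is a minimal Python illustration using the multiplicative form of the statistical probability; the prior values, the space-free placeholder string "ilovechina" and the function names are assumptions, not values from the embodiment.

```python
from math import prod

def segmentations(text, lexicon):
    """Yield every way of splitting text into words that all appear in lexicon."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                yield [head] + rest

def best_segmentation(text, priors):
    """Return (confidence, words) for the candidate with the highest product of priors."""
    scored = [(prod(priors[w] for w in words), words)
              for words in segmentations(text, priors)]
    return max(scored, key=lambda pair: pair[0])

# Hypothetical prior probabilities of each word.
PRIORS = {"i": 0.05, "love": 0.02, "china": 0.01,
          "ilove": 1e-7, "lovechina": 1e-8}

candidates = list(segmentations("ilovechina", PRIORS))
confidence, words = best_segmentation("ilovechina", PRIORS)
```

Replacing `prod` with `sum` would give the additive variant. Note that `best_segmentation` raises `ValueError` if the text cannot be split with the given words, which a fuller implementation would need to handle.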

Step 104, calculating a context semantic vector of each phrase in the word segmentation combination.

Specifically, after the word segmentation combination corresponding to each sentence in the preprocessed image is obtained, the context semantic vector of each phrase in the word segmentation combination may be calculated. Preferably, in order to improve calculation efficiency, only some of the word segmentation combinations may be selected for this calculation. In a specific implementation, the selection may keep the word segmentation combinations whose confidence is lower than a preset first threshold. Illustratively, take four sentences: word segmentation combination A corresponds to sentence A, word segmentation combination B to sentence B, word segmentation combination C to sentence C, and word segmentation combination D to sentence D. The confidence of word segmentation combination A is 96%, that of B is 94%, that of C is 93%, and that of D is 90%. It will be appreciated that the confidence of each of these word segmentation combinations is the highest confidence for its sentence. Taking sentence A as an example, sentence A may have other word segmentation combinations, but their confidences are lower than the 96% of word segmentation combination A, so word segmentation combination A is taken as the word segmentation combination of sentence A. The word segmentation combinations of the other sentences (for example, sentence B, sentence C and sentence D) are determined in the same way and are not described again here.
If the preset first threshold is 95%, the confidences of word segmentation combinations B, C and D are all lower than the preset first threshold of 95%, so word segmentation combinations B, C and D are the word segmentation combinations obtained after the selection.
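The selection in this example can be written as a one-line filter. The following Python sketch uses the confidences and the 95% threshold from the example above; the function and variable names are assumptions.

```python
def select_low_confidence(confidences, first_threshold):
    """Keep the word segmentation combinations whose confidence is below the threshold."""
    return {name: value for name, value in confidences.items()
            if value < first_threshold}

# Best word segmentation combination of each sentence and its confidence.
confidences = {"A": 0.96, "B": 0.94, "C": 0.93, "D": 0.90}
selected = select_low_confidence(confidences, first_threshold=0.95)
```

Only the selected combinations would then be passed on to the context semantic vector calculation.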

Next, a context semantic vector may be calculated for each phrase in the word segmentation combination. It is understood that the word segmentation combinations for which context semantic vectors are calculated may be either the unfiltered combinations or the combinations obtained after the selection. The context semantic vector of a phrase may be calculated as follows. First, the semantic vectors of the two phrases immediately before and after the phrase are obtained. The semantic vectors may be calculated using a Bidirectional Encoder Representations from Transformers (BERT) model; the BERT model is only an example, and in a specific implementation the semantic vectors may also be calculated with other types of neural network model. The manner of calculating the semantic vectors is not particularly limited in this application. Then, the mean of the semantic vectors of the preceding and following adjacent phrases is used as the context semantic vector of the phrase. For example, if the phrase before phrase Y is phrase X and the phrase after phrase Y is phrase Z, the semantic vector of phrase X calculated by the BERT model is S1, and the semantic vector of phrase Z calculated by the BERT model is S2, then the context semantic vector of phrase Y is S = (S1 + S2)/2.

It should be noted that, when a phrase has no adjacent preceding phrase or no adjacent following phrase, its context semantic vector may be half of the semantic vector of the one adjacent phrase it does have. For example, since phrase X is the first phrase in the word segmentation combination, it has no preceding phrase; its following phrase is phrase Y, and if the semantic vector of phrase Y is S1, the context semantic vector of phrase X is S = S1/2. Similarly, since phrase Z is the last phrase in the word segmentation combination, it has no following phrase; its preceding phrase is phrase Y, and if the semantic vector of phrase Y is S1, the context semantic vector of phrase Z is S = S1/2.
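The averaging rule, including the boundary cases, can be sketched in Python as below. The semantic vectors are passed in as plain lists of numbers; in practice they would come from BERT or another model, and the two-dimensional toy vectors and the function name here are assumptions.

```python
def context_vectors(semantic_vectors):
    """Return one context vector per phrase: the sum of the adjacent phrases' vectors, halved.

    A phrase at either end of the sentence has a single neighbour, so its
    context vector is half of that neighbour's semantic vector.
    """
    result = []
    for i, own in enumerate(semantic_vectors):
        neighbours = []
        if i > 0:
            neighbours.append(semantic_vectors[i - 1])
        if i + 1 < len(semantic_vectors):
            neighbours.append(semantic_vectors[i + 1])
        summed = [sum(v[d] for v in neighbours) for d in range(len(own))]
        result.append([s / 2 for s in summed])
    return result

# Toy semantic vectors for the phrases X, Y, Z in sentence order.
x, y, z = [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]
ctx_x, ctx_y, ctx_z = context_vectors([x, y, z])
```

In this sketch a sentence consisting of a single phrase gets a zero context vector, a case the description above does not address.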

Step 105, replacing the phrases based on the context semantic vectors.

Specifically, after the context semantic vector of a phrase is obtained, it may be compared with the semantic vectors of the phrases in the preset engineering word stock. If the context semantic vector of the phrase is close to the semantic vector of a phrase in the preset engineering word stock, the phrase can be replaced by that phrase from the preset engineering word stock. Whether the two vectors are close may be determined by calculating a similarity between them; for example, the similarity may be based on a cosine similarity or a Euclidean distance. The embodiment of the present application does not specifically limit the way the similarity is calculated. If the similarity is higher than a preset second threshold, the context semantic vector of the phrase is determined to be close to the semantic vector of the phrase in the preset engineering word stock; if the similarity is lower than or equal to the preset second threshold, it is determined not to be close. For example, suppose the context semantic vector of phrase M is S1, the semantic vector of phrase N in the preset engineering word stock is S2, the similarity between S1 and S2 is 99%, and the preset second threshold is 98%. Since 99% is greater than 98%, phrase M may be replaced by phrase N.
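A minimal Python sketch of the similarity test is given below, using cosine similarity. The toy vectors, the placeholder phrase names and the rule of choosing the most similar phrase above the threshold are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_replacement(context_vector, stock_vectors, second_threshold):
    """Return the most similar word stock phrase above the threshold, or None."""
    best_phrase, best_similarity = None, second_threshold
    for phrase, vector in stock_vectors.items():
        similarity = cosine_similarity(context_vector, vector)
        if similarity > best_similarity:
            best_phrase, best_similarity = phrase, similarity
    return best_phrase

# Hypothetical semantic vectors of two phrases in the engineering word stock.
stock_vectors = {"phrase_n": [1.0, 0.0], "other_phrase": [0.0, 1.0]}

close = best_replacement([0.99, 0.05], stock_vectors, second_threshold=0.98)
far = best_replacement([1.0, 1.0], stock_vectors, second_threshold=0.98)
```

If a Euclidean distance were used instead, it would first have to be converted into a score that grows as the vectors get closer, so that the "higher than the threshold" test still applies.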

Optionally, in addition to replacing phrases based on the similarity alone, whether to replace a phrase may be determined by combining the similarity with an edit distance. The edit distance may be a word-level edit distance or a pinyin-level edit distance. The word-level edit distance is calculated with the word as the unit and represents the number of words replaced. Illustratively, taking the two phrases "high team" and "high pendant" as an example, replacing "high pendant" with "high team" only requires replacing "pendant" with "team", that is, only one word is replaced and the edit distance is 1. The pinyin-level edit distance is calculated with the pinyin as the unit and represents the number of pinyin letters replaced. Illustratively, again taking "high team" and "high pendant", the pinyin of "high team" is Gao Dui and the pinyin of "high pendant" is Gao Zhui; replacing "high pendant" with "high team" only requires replacing "Zh" with "D", that is, 2 letters are replaced and the edit distance is 2. Preferably, the pinyin-level edit distance may instead be expressed as the proportion of replaced letters to the total number of letters. If the edit distance is taken as the absolute number of replaced letters, a word whose pinyin is long may have many letters replaced even though only one word is actually replaced; the edit distance ought to be small in that case, but the letter count makes it large, which introduces a large error.
Taking the proportion of replaced letters to the total number of letters gives a relative edit distance and avoids this error. For example, taking "alpine" and "soup-stock", the pinyin of "alpine" is Gao Zhuang and the pinyin of "soup-stock" is Gao Tang; replacing "alpine" with "soup-stock" only requires replacing "Zhu" with "T", so the relative edit distance is the number of letters in "Zhu" divided by the number of letters in "Zhuang", that is, 3/6 = 50%.
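The three edit distances can be illustrated with a short Python sketch. It uses the standard Levenshtein distance, which is an assumption: the description above counts replaced letters, and the Levenshtein distance happens to give the same numbers for these examples. Each character is represented by its pinyin syllable, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (item_a != item_b))) # substitution
        previous = current
    return previous[-1]

def relative_edit_distance(recognized, candidate):
    """Edit distance as a proportion of the letters in the recognized pinyin."""
    return edit_distance(recognized, candidate) / len(recognized)

# Word level: each character is one unit, written here as its pinyin syllable.
word_level = edit_distance(["gao", "zhui"], ["gao", "dui"])

# Pinyin level: each letter of the differing syllable is one unit.
pinyin_level = edit_distance("zhui", "dui")

# Relative pinyin level for the "Zhuang" and "Tang" example.
relative = relative_edit_distance("zhuang", "tang")
```

Dividing by the length of the recognized syllable follows the 3/6 example; other normalizations, such as dividing by the longer of the two strings, are equally conceivable.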

After the edit distance (for example, the word-level edit distance or the pinyin-level edit distance) is obtained, it may be compared with a preset third threshold. If the edit distance is smaller than the preset third threshold, the edit distance is determined to meet the requirement; if the edit distance is greater than or equal to the preset third threshold, the edit distance is determined not to meet the requirement. If the similarity and the edit distance both meet the requirements, that is, the similarity is greater than the preset second threshold and the edit distance is smaller than the preset third threshold, the phrase may be replaced. The recognition of the electronic text (that is, the image to be recognized) of the whole engineering document is thus completed.
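The combined decision reduces to a conjunction of the two threshold tests, as the following Python sketch shows. The function name and the example threshold values are assumptions.

```python
def should_replace(similarity, edit_distance, second_threshold, third_threshold):
    """Replace only if the similarity is above its threshold and the edit distance is below its threshold."""
    return similarity > second_threshold and edit_distance < third_threshold

# Hypothetical thresholds: similarity above 0.98 and edit distance below 2.
accept = should_replace(0.99, 1, second_threshold=0.98, third_threshold=2)
reject_distance = should_replace(0.99, 2, second_threshold=0.98, third_threshold=2)
reject_similarity = should_replace(0.98, 1, second_threshold=0.98, third_threshold=2)
```

Both comparisons are strict, matching the description: a similarity equal to the second threshold or an edit distance equal to the third threshold does not qualify.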

The embodiment of the application also provides a construction method of the engineering word stock.

Fig. 3 is a schematic flow chart of an embodiment of the above engineering thesaurus construction method, including:

Step 301, obtaining a plurality of sample data.

Specifically, the sample data may be an image of the engineering document paper, which may be obtained by scanning the engineering document paper, or may be obtained by taking a picture of the engineering document paper, and the embodiment of the present application does not specifically limit the manner of obtaining the image of the engineering document paper.

It is understood that one sample data may be an image of one engineering document paper; thus, in the embodiment of the present application, a plurality of sample data may be acquired so that the above-mentioned engineering thesaurus can be constructed in a self-learning manner.

Step 302, performing character recognition on each sample data.

Specifically, each sample data, that is, the image of each engineering document paper, may be recognized, whereby the characters in the image can be obtained. The character recognition may adopt OCR technology. Other recognition manners may also be used, and the embodiment of the present application does not specifically limit the manner of the character recognition.

Step 303, labeling each sample data.

Specifically, the labeling may include labeling a document category, a document name, and a text position of the sample data. The document category may be used to characterize the category of the engineering document; for example, the document category may include a contract document, an economic document, a return record, an engineering document, and a review record. The document name may be used to characterize the file name of the engineering document. The text position may be used to characterize the location of the text in the engineering document; for example, the text position may include a start position, which may be identified by a start identifier, and an end position, which may be identified by an end identifier.

Step 304, performing word segmentation on the characters in the first sample data to obtain phrases.

Specifically, one sample data may be arbitrarily acquired as the first sample data from the plurality of sample data, that is, an image of an engineering document paper may be arbitrarily acquired, and the characters in the image may be segmented, thereby obtaining a phrase. The word segmentation mode may be through NLP technology or through other word segmentation tools, and the word segmentation mode is not particularly limited in the embodiments of the present application.

Step 305, calculating the weight of each phrase.

Specifically, after the phrases of the current sample data are acquired, the weight of each phrase may be calculated. The weight may be used to represent the degree of correlation between the phrase and the engineering thesaurus; that is, the higher the degree of correlation, the more likely the phrase is to be entered into the engineering thesaurus, thereby constructing the engineering thesaurus. In a specific implementation, the weight may be obtained by calculating a term frequency (TF) and an inverse document frequency (IDF); for example, the weight may be the product of the TF and the IDF.
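As an illustration of the TF × IDF weight, the following Python sketch computes a weight for each phrase of one sample against the full set of samples. The application does not specify the exact TF or IDF formula; the relative-frequency TF and the smoothed IDF used here are assumptions, and the function name `tf_idf_weights` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_weights(doc_phrases: List[str],
                   corpus: List[List[str]]) -> Dict[str, float]:
    """Return {phrase: TF * IDF} for the phrases of one sample.

    doc_phrases: phrases obtained by segmenting the current sample.
    corpus: segmented phrases of every sample (including the current one).
    """
    counts = Counter(doc_phrases)
    total = len(doc_phrases)
    n_docs = len(corpus)
    weights = {}
    for phrase, count in counts.items():
        tf = count / total  # relative frequency within the sample
        df = sum(1 for doc in corpus if phrase in doc)  # samples containing it
        # Assumed smoothed IDF; avoids division by zero and zero weights.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[phrase] = tf * idf
    return weights
```

With this choice, a phrase that is frequent in the current sample but rare across the other samples receives a higher weight than a phrase that appears in every sample.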

Step 306, updating the weight based on the labeling position of each phrase.

Specifically, the labeling position may be the position where a phrase is located in the engineering document; for example, the labeling position may include the document category, the document name, and the text. The weight can then be updated according to the labeling position of each phrase. For example, if the phrase is located in the document name, the weight of the phrase may be increased by n × the cumulative occurrence number, where the cumulative occurrence number may be obtained by counting the occurrences of the phrase in the engineering document. If the phrase is located in a preset first position of the text, for example, the first 100 words or the last 100 words, the weight of the phrase may be increased by m × the cumulative occurrence number. It should be noted that the first 100 words and the last 100 words are only exemplary positions and do not constitute a limitation to the embodiments of the present application; in some embodiments, the number of words may take other values. If the phrase is located in the document category or in a preset second position of the text, for example, the characters other than the first 100 and the last 100, the weight of the phrase may be increased by j × the cumulative occurrence number, where n > m > j > 1.
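The position-based update can be sketched as follows. The coefficient values, the position labels, and the helper names are hypothetical; the only constraint taken from the description is n > m > j > 1, and the 100-character window is the exemplary value mentioned above.

```python
# Hypothetical coefficients; the description only requires n > m > j > 1.
N_COEF, M_COEF, J_COEF = 4.0, 3.0, 2.0

# Assumed labels for the positions named in the description.
BONUS = {
    "document_name": N_COEF,
    "text_head_or_tail": M_COEF,   # preset first position of the text
    "document_category": J_COEF,
    "text_middle": J_COEF,         # preset second position of the text
}


def classify_text_offset(offset: int, text_length: int, window: int = 100) -> str:
    """Label a character offset as lying in the head/tail window or the middle."""
    if offset < window or offset >= text_length - window:
        return "text_head_or_tail"
    return "text_middle"


def update_weight(weight: float, position: str, occurrences: int) -> float:
    """Increase the weight by coefficient * cumulative occurrence number."""
    return weight + BONUS[position] * occurrences
```

For example, a phrase with weight 1.0 that appears in the document name and occurs 3 times in the document would be updated to 1.0 + 4.0 × 3 = 13.0 under these assumed coefficients.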

Step 307, sorting the phrases in the sample data based on the weight, and selecting a preset number of phrases.

Specifically, after the weight of each phrase in the sample data in step 306 is obtained, the phrases may be sorted based on the weight, and the top k phrases with the highest weight may be selected as the first candidate phrase set. Wherein the value k may be predetermined.
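A minimal sketch of the top-k selection, assuming the weights are held in a dictionary keyed by phrase (the function name `top_k_phrases` is hypothetical):

```python
import heapq
from typing import Dict, List


def top_k_phrases(weights: Dict[str, float], k: int) -> List[str]:
    """Return the k phrases with the highest weight, highest first."""
    best = heapq.nlargest(k, weights.items(), key=lambda item: item[1])
    return [phrase for phrase, _ in best]
```

The returned list would serve as the first candidate phrase set that is handed to the expert in the next step.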

Step 308, selecting phrases from the first candidate phrase set as phrases of the engineering thesaurus, and constructing a second candidate phrase set.

Specifically, the first candidate phrase set may be provided to an expert, so that the expert may select a phrase related to the engineering field from the first candidate phrase set, and may use the selected phrase as a phrase of the engineering thesaurus, thereby constructing the engineering thesaurus. It is understood that, after the expert selects a phrase from the first candidate phrase set, a second candidate phrase set may be obtained, where the second candidate phrase set may include one or more phrases selected from the first candidate phrase set.

Step 309, calculating a context semantic vector of each phrase in the second candidate phrase set at one or more positions.

Specifically, the second candidate phrase set may include one or more phrases, and thus a context semantic vector of each phrase in the second candidate phrase set at one or more positions may be calculated. In a specific implementation, all positions of any phrase in the engineering document may be obtained; the positions may be found by full-text search, and their total number may be one or more. After any position of the phrase is determined, the context semantic vector of the phrase at that position can be calculated, so that context semantic vectors at the one or more positions corresponding to the phrase can be obtained. The context semantic vectors of the other phrases at one or more positions may be obtained in the same manner, which is not described herein again. The context semantic vector can be calculated by a BERT model. It can be understood that the context semantic vector may also be calculated by other neural network models, and the manner of calculating the context semantic vector is not particularly limited in the embodiment of the present application. Table 1 is a summary table of the context semantic vectors at different positions of each phrase in the second candidate phrase set.

TABLE 1
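One way to picture such a summary table is as a mapping from each phrase to the context semantic vectors computed at its positions. The Python sketch below finds every position of a phrase by full-text search and stores one vector per position. The `toy_encoder` is only a stand-in so that the sketch runs without a model; in practice it would be replaced by a BERT model or another neural network, as the description suggests. All names here are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def find_positions(text: str, phrase: str) -> List[int]:
    """Full-text search: every start offset of phrase in text."""
    positions = []
    start = text.find(phrase)
    while start != -1:
        positions.append(start)
        start = text.find(phrase, start + 1)
    return positions


def toy_encoder(text: str, start: int, end: int,
                window: int = 10, dim: int = 16) -> Vector:
    """Stand-in for a real context encoder: hashes nearby characters into buckets."""
    context = text[max(0, start - window):start] + text[end:end + window]
    vec = [0.0] * dim
    for ch in context:
        vec[ord(ch) % dim] += 1.0
    return vec


def build_vector_table(
    text: str,
    phrases: List[str],
    encode: Callable[[str, int, int], Vector] = toy_encoder,
) -> Dict[str, List[Tuple[int, Vector]]]:
    """Map each phrase to (position, context vector) pairs for all its positions."""
    table: Dict[str, List[Tuple[int, Vector]]] = {}
    for phrase in phrases:
        for pos in find_positions(text, phrase):
            vector = encode(text, pos, pos + len(phrase))
            table.setdefault(phrase, []).append((pos, vector))
    return table
```

Keeping the encoder as a parameter means the table-building logic does not depend on which model produces the vectors.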

Step 310, acquiring the remaining sample data, acquiring phrases from the remaining sample data, and taking the phrases as phrases of the engineering thesaurus, thereby constructing the engineering thesaurus.

Specifically, as can be seen from the above, steps 304 to 308 complete the selection of phrases from one sample data, and the phrases selected from that sample data are used as phrases of the engineering thesaurus. Phrases can then be selected from the remaining sample data and likewise used as phrases of the engineering thesaurus, so that the construction of the engineering thesaurus can be completed.

In a specific implementation, when phrases are selected from the remaining sample data, each sample data in the remaining sample data can be acquired in sequence. Taking any sample data in the remaining sample data as an example (for convenience of explanation, the sample data processed in step 304 is referred to as "first sample data" and the sample data processed in step 310 is referred to as "second sample data"), after the second sample data is subjected to the character recognition and word segmentation in steps 301 to 304, a first phrase set of the second sample data can be obtained. The phrases in the first phrase set of the second sample data may then be compared with the phrases in the second candidate phrase set of the first sample data to obtain a second phrase set, where the second phrase set consists of the phrases in the first phrase set of the second sample data that differ from the phrases in the second candidate phrase set of the first sample data. Then, one or more positions of each phrase of the second phrase set in the second sample data may be obtained, and the context semantic vector of each phrase of the second phrase set at the one or more positions in the second sample data may be calculated. If the context semantic vector of any phrase of the second phrase set at any position in the second sample data is close to one of the context semantic vectors in Table 1, the phrase may be used as a phrase of the engineering thesaurus, and the context semantic vector of the phrase at the current position may be recorded in Table 1, thereby updating Table 1. It is understood that when the context semantic vectors of the remaining sample data are compared, the context semantic vectors of the phrases in the remaining sample data may be compared with the context semantic vectors in the updated Table 1.
In addition, whether two context semantic vectors are close to each other may be determined by calculating their cosine similarity or Euclidean distance. If the cosine similarity of the two context semantic vectors is higher than a preset fourth threshold, it may be determined that the two context semantic vectors are close to each other; when the Euclidean distance is used instead, a smaller distance indicates closer vectors, so the distance should be lower than the corresponding threshold. It is to be understood that calculating the cosine similarity or the Euclidean distance is only an exemplary way to determine whether two context semantic vectors are close to each other, and does not constitute a limitation to the embodiments of the present application.
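The screening of a phrase from the second sample data against the recorded vectors can be sketched as follows, using cosine similarity. The representation of the table as a dictionary from phrase to a list of vectors, the function names, and the threshold value in the usage note are assumptions made for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def screen_phrase(phrase: str, vectors: List[Vector],
                  table: Dict[str, List[Vector]],
                  fourth_threshold: float) -> bool:
    """Accept the phrase if any of its context vectors is close to a recorded one.

    On acceptance, the matching vector is recorded in the table, so that
    later samples are compared against the updated table.
    """
    known = [vec for vecs in table.values() for vec in vecs]
    for vec in vectors:
        if any(cosine_similarity(vec, k) > fourth_threshold for k in known):
            table.setdefault(phrase, []).append(vec)
            return True
    return False
```

For example, with a table containing one vector `[1.0, 0.0]` and a threshold of 0.8, a phrase whose context vector is `[0.9, 0.1]` would be accepted and added to the table, whereas a phrase whose only context vector is `[0.0, 1.0]` would be rejected.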

Then, the remaining sample data may be processed in sequence in the same manner as the second sample data, until no phrase can be selected from any of the remaining sample data as a phrase of the engineering thesaurus.

In addition, in an actual application scenario, a suitable phrase may not be found in one sample data as a phrase of the engineering thesaurus, but this does not mean that no suitable phrase can be found in the next sample data. Therefore, in order to avoid missing samples in such a scenario, a condition for ending the construction of the engineering thesaurus is preferably set. For example, a plurality of consecutive sample data (for example, 10 sample data) may be examined, and if no phrase can be selected from any of the plurality of consecutive sample data as a phrase of the engineering thesaurus, the task of constructing the engineering thesaurus is ended; that is, the construction of the engineering thesaurus is completed.
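The ending condition can be sketched as a loop that counts consecutive sample data yielding no new phrase. The callable `select_phrases` is a hypothetical placeholder for the per-sample screening described above, and `patience` corresponds to the exemplary value of 10 consecutive sample data.

```python
from typing import Callable, Iterable, List, Set


def build_lexicon(samples: Iterable[List[str]],
                  select_phrases: Callable[[List[str], Set[str]], Iterable[str]],
                  patience: int = 10) -> Set[str]:
    """Process samples in order; stop after `patience` consecutive misses."""
    lexicon: Set[str] = set()
    misses = 0
    for sample in samples:
        new_phrases = set(select_phrases(sample, lexicon)) - lexicon
        if new_phrases:
            lexicon |= new_phrases
            misses = 0          # a hit resets the consecutive-miss counter
        else:
            misses += 1
            if misses >= patience:
                break           # construction is considered complete
    return lexicon
```

Resetting the counter on every hit is what prevents a single unproductive sample from ending the construction early.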

An exemplary electronic device provided in embodiments of the present application is further described below in conjunction with Fig. 4. Fig. 4 shows a schematic structural diagram of an electronic device 400, and the electronic device 400 may be used to execute the document identification method or the engineering thesaurus construction method.

The electronic device 400 may include: at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method provided by the embodiments of fig. 1-3.

Fig. 4 shows a block diagram of an exemplary electronic device 400 suitable for implementing embodiments of the present application. The electronic device 400 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 4, electronic device 400 is embodied in the form of a general purpose computing device. The components of electronic device 400 may include, but are not limited to: one or more processors 410, a memory 420, a communication bus 440 connecting the various system components (including the memory 420 and the processors 410), and a communication interface 430.

Communication bus 440 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, to name a few.

Electronic device 400 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.

Memory 420 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory. The electronic device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. Although not shown in Fig. 4, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to the communication bus 440 by one or more data media interfaces. Memory 420 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.

A program/utility having a set (at least one) of program modules, including but not limited to an operating system, one or more application programs, other program modules, and program data, may be stored in memory 420, each of which examples or some combination may include an implementation of a network environment. The program modules generally perform the functions and/or methodologies of the embodiments described herein.

Electronic device 400 may also communicate with one or more external devices (e.g., keyboard, pointing device, display, etc.), one or more devices that enable a user to interact with the electronic device, and/or any devices (e.g., network card, modem, etc.) that enable the electronic device to communicate with one or more other computing devices. Such communication may occur via communication interface 430. Furthermore, the electronic device 400 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter (not shown in Fig. 4) that may communicate with other modules of the electronic device via the communication bus 440. It should be appreciated that although not shown in Fig. 4, other hardware and/or software modules may be used in conjunction with electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, redundant array of independent disks (RAID) systems, tape drives, and data backup storage systems, among others.

The processor 410 executes various functional applications and data processing, for example, implementing the methods provided by the embodiments of the present application, by executing programs stored in the memory 420.

It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only an illustration, and does not limit the structure of the electronic device 400. In other embodiments of the present application, the electronic device 400 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.

It is to be understood that, in order to realize the functions described above, the electronic device 400 and the like include corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.

In the embodiment of the present application, the electronic device 400 and the like may be divided into functional modules according to the method example, for example, each functional module may be divided according to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.

Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.

Each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence, or the part thereof that contributes to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: flash memory, removable hard drive, read only memory, random access memory, magnetic or optical disk, and the like.

The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of document identification, the method comprising:

acquiring an image to be identified;

performing character recognition on the image to be recognized;

2. The method of claim 1, wherein after the acquiring the image to be identified, the method further comprises:

and preprocessing the image to be recognized.

3. The method of claim 2, wherein the pre-processing comprises one or more of de-stamping, de-watermarking, or image correction.

4. The method according to any one of claims 1 to 3, wherein the segmenting the recognized characters based on the preset engineering lexicon comprises:

5. The method of claim 4, wherein the replacing the phrases obtained by word segmentation based on the preset engineering thesaurus comprises:

calculating a context semantic vector of each phrase in each phrase set;

6. The method according to claim 5, wherein the replacing the phrases obtained by word segmentation based on the context semantic vector of each phrase and the semantic vectors of the phrases in the preset engineering thesaurus comprises:

7. The method according to claim 5, wherein the replacing the phrases obtained by word segmentation based on the context semantic vector of each phrase and the semantic vectors of the phrases in the preset engineering thesaurus comprises:

8. The method according to any one of claims 5-7, wherein said computing a context semantic vector for each phrase in each said set of phrases comprises:

obtaining the confidence of each phrase set;

9. A method for constructing an engineering word stock is characterized by comprising the following steps:

acquiring a plurality of engineering documents;

10. The method of claim 9, further comprising:

11. The method of claim 10, wherein the filtering phrases obtained by word segmentation in the first engineering document comprises:

calculating the weight of each phrase;

and screening in the phrases based on the weight.

12. The method of claim 11, further comprising:

13. The method according to any one of claims 9-12, further comprising:

14. The method according to any one of claims 9-13, wherein the filtering the phrases obtained by word segmentation in the remaining engineering documents based on the first set of engineering phrases, and obtaining one or more engineering phrases comprises:

15. An electronic device, comprising: a memory for storing computer program code, the computer program code comprising instructions that, when read from the memory by the electronic device, cause the electronic device to perform the method of any of claims 1-14.

16. A computer-readable storage medium comprising computer instructions that, when executed on the electronic device, cause the electronic device to perform the method of any of claims 1-14.