CN112507968B - Document text recognition method and device based on feature association - Google Patents

Document text recognition method and device based on feature association

Publication number
CN112507968B
Authority
CN
China
Prior art keywords
text
recognition
identification
vector
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011551817.XA
Other languages
Chinese (zh)
Other versions
CN112507968A (en)
Inventor
李巧
朱永强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Wangan Technology Development Co ltd
Original Assignee
Chengdu Wangan Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Wangan Technology Development Co ltd
Priority to CN202011551817.XA
Publication of CN112507968A
Application granted
Publication of CN112507968B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The application provides a document text recognition method and device based on feature association, and relates to the technical field of text recognition. In the present application, first, a text to be recognized is recognized based on the recognition elements included in document text, and a recognition result is obtained for each recognition element. Second, a target text vector is constructed based on the obtained recognition results. Then, the target text vector is updated based on target position information and on a weight coefficient, respectively, to obtain a first text vector and a second text vector, wherein the target position information comprises the position, in the text to be recognized, of the recognition element corresponding to each first recognition value in the target text vector, and the weight coefficient is obtained by processing document text samples. Finally, whether the text to be recognized belongs to document text is determined based on the first text vector, the second text vector, and a text probability threshold. In this way, the problem that document text is difficult to recognize effectively with the prior art can be solved.

Description

Document text recognition method and device based on feature association
Technical Field
The application relates to the technical field of text recognition, in particular to a document text recognition method and device based on feature association.
Background
Document text (official documents) refers to the files that state organs, enterprises, public institutions, and people's organizations use to handle public affairs. It is an important tool for conveying and implementing guidelines and policies, issuing regulations, making and replying to requests, guiding and negotiating work, reporting on conditions, and exchanging experience. Such documents come in many varieties and in very large quantities.
In existing text recognition technology, most neural networks can perform text classification, for example into categories such as finance, sports, entertainment, and games. However, the inventors found through research that such neural networks cannot reliably determine whether a text is a document text and offer no interpretability, which makes it difficult to recognize document text effectively.
Disclosure of Invention
In view of the foregoing, an object of the present application is to provide a method and an apparatus for recognizing document text based on feature association, so as to solve the problem that it is difficult to recognize document text effectively based on the prior art.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical scheme:
A document text recognition method based on feature association comprises the following steps:
performing recognition processing on a text to be recognized based on a plurality of recognition elements of document text to obtain a recognition result corresponding to each recognition element, wherein the recognition result comprises a first recognition value or a second recognition value, the first recognition value indicating that the text to be recognized has the corresponding recognition element, and the second recognition value indicating that the text to be recognized does not have the corresponding recognition element;
constructing a target text vector based on the obtained plurality of recognition results, wherein the number of dimensions of the target text vector is the number of the plurality of recognition elements;
updating the target text vector based on pre-obtained target position information and on a weight coefficient, respectively, to obtain a corresponding first text vector and a corresponding second text vector, wherein the target position information comprises the position, in the text to be recognized, of the recognition element corresponding to each first recognition value in the target text vector, and the weight coefficient is obtained by processing document text samples;
and determining whether the text to be recognized belongs to document text based on the first text vector, the second text vector, and a predetermined text probability threshold.
In a preferred option of the embodiment of the present application, in the document text recognition method based on feature association, the step of performing recognition processing on the text to be recognized based on the plurality of recognition elements of document text to obtain a recognition result corresponding to each recognition element includes:
for each recognition element of the plurality of recognition elements of document text, creating at least one corresponding text recognition thread;
and for each text recognition thread, performing recognition processing on the corresponding recognition element in the text to be recognized through the text recognition thread, to obtain a recognition result corresponding to the recognition element.
In a preferred option of the embodiment of the present application, in the document text recognition method based on feature association, the plurality of recognition elements include a copy number, a security classification, a confidentiality period, an urgency level, an issuing authority mark, a document number, a title, and an attachment description, and the step of performing, for each text recognition thread, recognition processing on the corresponding recognition element in the text to be recognized through the text recognition thread to obtain a recognition result corresponding to the recognition element includes:
recognizing, through the text recognition thread corresponding to the copy number, the copy number in the line head area of each line of the text to be recognized according to a predetermined first regular expression, to obtain a recognition result corresponding to the copy number;
recognizing, through the text recognition thread corresponding to the security classification, the security classification in the line head area of each line of the text to be recognized according to a predetermined second regular expression, to obtain a recognition result corresponding to the security classification;
recognizing, through the text recognition thread corresponding to the confidentiality period, the confidentiality period in the line head area of each line of the text to be recognized according to a predetermined third regular expression, to obtain a recognition result corresponding to the confidentiality period;
recognizing, through the text recognition thread corresponding to the urgency level, the urgency level in the line head area of each line of the text to be recognized according to a predetermined fourth regular expression, to obtain a recognition result corresponding to the urgency level;
recognizing, through the text recognition thread corresponding to the issuing authority mark, the issuing authority mark in the line head area of each line of the text to be recognized according to a predetermined fifth regular expression, to obtain a recognition result corresponding to the issuing authority mark;
recognizing, through the text recognition thread corresponding to the document number, the document number in the line head area of each line of the text to be recognized according to a predetermined sixth regular expression, to obtain a recognition result corresponding to the document number;
recognizing, through the text recognition thread corresponding to the title, the title in the line head area of each line of the text to be recognized according to a predetermined seventh regular expression, to obtain a recognition result corresponding to the title;
and recognizing, through the text recognition thread corresponding to the attachment description, the attachment description in the line head area of each line of the text to be recognized according to a predetermined eighth regular expression, to obtain a recognition result corresponding to the attachment description.
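The line-head matching described above can be sketched as follows. This is only an illustrative sketch: the patent does not disclose the first to eighth regular expressions, so the patterns below are hypothetical English stand-ins, the element keys are invented names, and the width of the "line head area" (`HEAD_WIDTH`) is an assumption. The values 1 and 0 are assumed encodings of the first and second recognition values.

```python
import re

# Hypothetical patterns: the source does not disclose the actual regular expressions.
PATTERNS = {
    "copy_number": re.compile(r"\d{6}"),
    "security_classification": re.compile(r"(TOP SECRET|SECRET|CONFIDENTIAL)"),
    "urgency_level": re.compile(r"(EXTRA URGENT|URGENT)"),
}

HEAD_WIDTH = 20  # assumed size of the line head area, in characters


def recognize(text, pattern, head_width=HEAD_WIDTH):
    """Search the head area of every line for the pattern.

    Returns (value, line_index): value is 1 (assumed first recognition value)
    if the head area of some line matches, otherwise 0 (assumed second
    recognition value) with line_index None.
    """
    for index, line in enumerate(text.splitlines()):
        head = line.strip()[:head_width]
        if pattern.match(head):
            return 1, index
    return 0, None


sample = "000001\nSECRET 1 year\nNotice on annual safety inspection\n"
results = {name: recognize(sample, p) for name, p in PATTERNS.items()}
```

Returning the line index alongside the value is a design choice of this sketch; it keeps the position available for any later position-based processing.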
In a preferred option of the embodiment of the present application, in the document text recognition method based on feature association, the plurality of recognition elements include an issuing authority mark, a main recipient authority, an issuing authority signature, a copied authority, a signer, a date of writing, and a print date, and the step of performing, for each text recognition thread, recognition processing on the corresponding recognition element in the text to be recognized through the text recognition thread to obtain a recognition result corresponding to the recognition element includes:
recognizing, through the text recognition thread corresponding to the issuing authority mark, the issuing authority mark in the text to be recognized according to authority names, and generating a recognition result corresponding to the issuing authority mark in response to a user's identification operation on the result of the recognition processing;
recognizing, through the text recognition thread corresponding to the main recipient authority, the main recipient authority in the text to be recognized according to authority names, and generating a recognition result corresponding to the main recipient authority in response to a user's identification operation on the result of the recognition processing;
recognizing, through the text recognition thread corresponding to the issuing authority signature, the issuing authority signature in the text to be recognized according to authority names, and generating a recognition result corresponding to the issuing authority signature in response to a user's identification operation on the result of the recognition processing;
recognizing, through the text recognition thread corresponding to the copied authority, the copied authority in the text to be recognized according to authority names, and generating a recognition result corresponding to the copied authority in response to a user's identification operation on the result of the recognition processing;
recognizing, through the text recognition thread corresponding to the signer, the signer in the text to be recognized according to personal names, to obtain a recognition result corresponding to the signer;
recognizing, through the text recognition thread corresponding to the date of writing, the date of writing in the text to be recognized according to dates, and generating a recognition result corresponding to the date of writing in response to a user's identification operation on the result of the recognition processing;
and recognizing, through the text recognition thread corresponding to the print date, the print date in the text to be recognized according to dates, and generating a recognition result corresponding to the print date in response to a user's identification operation on the result of the recognition processing.
In a preferred option of the embodiment of the present application, in the document text recognition method based on feature association, the step of updating the target text vector based on the pre-obtained target position information and the weight coefficient to obtain a corresponding first text vector and second text vector includes:
for each first recognition value in the target text vector, obtaining the position information, in the text to be recognized, of the recognition element corresponding to the first recognition value;
for the position information of each recognition element, obtaining a corresponding Gaussian distribution value based on the position information and the Gaussian distribution formula corresponding to the recognition element, wherein the mean parameter and the standard deviation parameter of the Gaussian distribution formula are determined based on the position information of the recognition element in a plurality of document text samples;
and for each obtained Gaussian distribution value, updating the corresponding first recognition value based on that Gaussian distribution value, to obtain the corresponding first text vector.
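A minimal sketch of this position-based update is given below. Several points are assumptions not stated in the source: positions are taken to be line indices, the first and second recognition values are taken to be 1 and 0, the "update" is taken to mean replacing each 1 with the Gaussian density at the element's position, and the standard deviation is estimated as a population standard deviation. All function names are hypothetical.

```python
import math
import statistics


def fit_params(sample_positions):
    """Estimate (mean, std) for one element from its positions in the samples.

    Assumption: population standard deviation; the source only says the
    parameters are determined from the sample positions.
    """
    return statistics.mean(sample_positions), statistics.pstdev(sample_positions)


def gaussian_value(position, mean, std):
    """Gaussian probability density at `position` (std must be non-zero)."""
    exponent = -((position - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2 * math.pi))


def first_text_vector(target_vector, positions, params):
    """Replace each 1 in the target vector with the Gaussian value of its position.

    positions[i] is None when element i is absent; such entries stay 0.
    """
    updated = []
    for value, position, (mean, std) in zip(target_vector, positions, params):
        if value == 1 and position is not None:
            updated.append(gaussian_value(position, mean, std))
        else:
            updated.append(0.0)
    return updated


vector = first_text_vector(
    [1, 0, 1],
    [2, None, 7],
    [(2.0, 1.0), (5.0, 1.0), (4.0, 3.0)],
)
```

An element found close to its typical position in the samples keeps a larger value than one found far from it, which is the intuition behind weighting by position.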
In a preferred option of the embodiment of the present application, in the document text recognition method based on feature association, the step of updating the target text vector based on the pre-obtained target position information and the weight coefficient to obtain a corresponding first text vector and second text vector includes:
Processing a plurality of document text samples to obtain weight coefficients;
and updating the target text vector based on the weight coefficient to obtain a corresponding second text vector, wherein the updating comprises multiplying the weight coefficient and the target text vector.
In a preferred option of the embodiment of the present application, in the method for identifying a document text based on feature association, the step of processing a plurality of document text samples to obtain a weight coefficient includes:
for each of a plurality of document text samples, constructing an element list corresponding to the document text sample based on the recognition elements included in the document text sample;
constructing frequent n-item sets based on the recognition elements included in the constructed element lists to obtain a plurality of frequent n-item sets, wherein n takes each integer from 1 to the number of the plurality of recognition elements;
for each frequent n-item set, obtaining the support of the frequent n-item set based on the number of times the frequent n-item set occurs in the element lists and the number of element lists;
determining, among the plurality of frequent n-item sets, target frequent n-item sets based on each first recognition value in the target text vector;
and summing the supports of the target frequent n-item sets to obtain the weight coefficient.
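The weight computation can be sketched as below, under stated assumptions. The source does not give a minimum support for an item set to count as "frequent", so `min_support` is a hypothetical parameter defaulting to 0. It also does not spell out how the target item sets are selected; this sketch assumes they are the item sets made up only of elements present in the text to be recognized. Support is computed as occurrences divided by the number of element lists, and the second text vector is the weight coefficient multiplied by the target text vector.

```python
from itertools import combinations


def support_table(element_lists, min_support=0.0):
    """Map each item set occurring in the samples to its support.

    Enumerates every n-item set for n = 1..number of distinct elements,
    which is exponential but workable for a small, fixed element inventory.
    """
    universe = sorted(set().union(*map(set, element_lists)))
    total = len(element_lists)
    table = {}
    for n in range(1, len(universe) + 1):
        for itemset in combinations(universe, n):
            count = sum(1 for lst in element_lists if set(itemset) <= set(lst))
            support = count / total
            if count > 0 and support >= min_support:
                table[frozenset(itemset)] = support
    return table


def weight_coefficient(table, present_elements):
    """Sum the supports of item sets contained in the present elements (assumed rule)."""
    present = set(present_elements)
    return sum(s for itemset, s in table.items() if itemset <= present)


def second_text_vector(target_vector, weight):
    """Multiply the weight coefficient by each component of the target vector."""
    return [weight * value for value in target_vector]


samples = [
    ["title", "doc_number"],
    ["title"],
    ["title", "doc_number", "copy_number"],
    ["copy_number"],
]
table = support_table(samples)
weight = weight_coefficient(table, ["title", "doc_number"])
```

With these four hypothetical samples, the item sets {title}, {doc_number} and {title, doc_number} have supports 0.75, 0.5 and 0.5, so the weight for a text containing a title and a document number is their sum.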
In a preferred option of the embodiment of the present application, in the feature-association-based document text recognition method, the step of determining whether the text to be recognized belongs to a document text based on the first text vector, the second text vector and a predetermined text probability threshold includes:
vector merging processing is carried out on the basis of the first text vector and the second text vector, and a third text vector is obtained;
obtaining a probability value of the text to be identified belonging to a document text based on the third text vector;
and judging whether the text to be identified belongs to the document text or not based on the probability value and a predetermined text probability threshold value, wherein if the probability value is greater than or equal to the text probability threshold value, the text to be identified is judged to belong to the document text.
In a preferred option of the embodiment of the present application, in the method for identifying a document text based on feature association, the step of obtaining a probability value of the text to be identified belonging to the document text based on the third text vector includes:
determining whether each vector value in the third text vector is smaller than a preset threshold value;
Updating each vector value smaller than the preset threshold value to 0, and updating each vector value larger than or equal to the preset threshold value;
and calculating the sum of the updated vector values, and taking the sum as the probability value that the text to be identified belongs to the document text.
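The decision steps above leave two details open, so the following sketch fills them with explicit assumptions rather than the patent's own rules: the vector merge is assumed to be an element-wise product of the first and second text vectors, and vector values at or above the preset threshold are assumed to be kept unchanged. All names are hypothetical.

```python
def merge(first_vector, second_vector):
    """Assumed merge rule: element-wise product of the two text vectors."""
    return [a * b for a, b in zip(first_vector, second_vector)]


def probability(third_vector, value_threshold):
    """Zero out values below the threshold, keep the rest (assumption), and sum."""
    return sum(v if v >= value_threshold else 0.0 for v in third_vector)


def is_document_text(first_vector, second_vector, value_threshold, probability_threshold):
    """Return True when the summed value reaches the text probability threshold."""
    third_vector = merge(first_vector, second_vector)
    return probability(third_vector, value_threshold) >= probability_threshold


first = [0.5, 0.25, 0.0]
second = [0.5, 0.5, 0.0]
decision = is_document_text(first, second, value_threshold=0.2, probability_threshold=0.25)
```

Here the merged vector is [0.25, 0.125, 0.0]; only the first value survives the 0.2 cut-off, so the summed value is 0.25 and the text is accepted at a threshold of 0.25.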
The embodiment of the application also provides a document text recognition device based on characteristic association, which comprises:
the text recognition module to be recognized is used for recognizing the text to be recognized based on a plurality of recognition elements of the document text to obtain a recognition result corresponding to each recognition element, wherein the recognition result comprises a first recognition value or a second recognition value, the first recognition value is used for representing that the text to be recognized has the corresponding recognition element, and the second recognition value is used for representing that the text to be recognized does not have the corresponding recognition element;
the text vector construction module is used for constructing a target text vector based on the obtained multiple recognition results, wherein the number of dimensions of the target text vector is the number of the multiple recognition elements;
the text vector updating module is used for updating the target text vector based on the pre-obtained target position information and weight coefficients respectively to obtain corresponding first text vectors and second text vectors, wherein the target position information comprises the position information of identification elements corresponding to each first identification value in the target text vector in the text to be identified, and the weight coefficients are obtained by processing document text samples;
and the document text determining module is used for determining whether the text to be recognized belongs to document text based on the first text vector, the second text vector, and a predetermined text probability threshold.
According to the document text recognition method and device based on feature association provided by the present application, the text to be recognized is recognized using the recognition elements of document text to obtain the corresponding recognition results, and a target text vector is constructed from those results. The target text vector is then processed based on the position information of the recognition elements in the text to be recognized and on the weight coefficient obtained from document text samples, yielding a first text vector and a second text vector. Whether the text to be recognized belongs to document text can then be determined from the first text vector and the second text vector in combination with a predetermined text probability threshold. Because the weight coefficient is obtained from document text samples, the method establishes a feature association between the text to be recognized and the document text samples, so that document text is recognized effectively. This addresses the problem that document text is difficult to recognize effectively with the prior art, and the method has high practical value.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present application.
Fig. 2 is a flow chart of a document text recognition method based on feature association according to an embodiment of the present application.
Fig. 3 is a flow chart illustrating the sub-steps included in step S120 in fig. 2.
Fig. 4 is a flow chart illustrating the sub-steps included in step S130 in fig. 2.
Fig. 5 is a flow chart illustrating other sub-steps included in step S130 in fig. 2.
Fig. 6 is a flow chart illustrating the sub-steps included in step S140 in fig. 2.
Fig. 7 is a schematic block diagram of a document text recognition device based on feature association according to an embodiment of the present application.
Reference numerals: 10-electronic device; 12-memory; 14-processor; 100-document text recognition device based on feature association; 110-text recognition module to be recognized; 120-text vector construction module; 130-text vector updating module; 140-document text determining module.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
As shown in fig. 1, an embodiment of the present application provides an electronic device 10, which may include a memory 12, a processor 14, and a document text recognition device 100 based on feature association.
The memory 12 and the processor 14 are electrically connected to each other, directly or indirectly, to enable data transmission or interaction. For example, they may be connected through one or more communication buses or signal lines. The document text recognition device 100 based on feature association includes at least one software functional module that may be stored in the memory 12 in the form of software or firmware. The processor 14 is configured to execute an executable computer program stored in the memory 12, for example the software functional modules and computer programs included in the document text recognition device 100 based on feature association, so as to implement the document text recognition method based on feature association provided in the embodiments of the present application.
Optionally, the memory 12 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), etc.
The processor 14 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), a system on chip (SoC), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It will be appreciated that the structure shown in fig. 1 is merely illustrative. The electronic device 10 may include more or fewer components than those shown in fig. 1, or may have a different configuration from that shown in fig. 1. For example, it may also include a communication unit for exchanging information with other devices (e.g., a mobile phone or a computer, when the electronic device 10 is a server).
Referring to fig. 2, an embodiment of the present application further provides a document text recognition method based on feature association, which is applicable to the electronic device 10 described above. Wherein the method steps defined by the flow related to the feature-association-based document text recognition method may be implemented by the electronic device 10.
The specific flow shown in fig. 2 will be described in detail.
Step S110, based on a plurality of recognition elements of the document text, the text to be recognized is recognized, and a recognition result corresponding to each recognition element is obtained.
In this embodiment, after obtaining the text to be recognized, the electronic device may perform recognition processing on the text to be recognized based on a plurality of recognition elements included in the document text, to obtain a recognition result corresponding to each of the plurality of recognition elements, so that a plurality of recognition results may be obtained.
The recognition result comprises a first recognition value or a second recognition value, wherein the first recognition value is used for representing that the text to be recognized has corresponding recognition elements, and the second recognition value is used for representing that the text to be recognized does not have the corresponding recognition elements.
That is, for each of the plurality of recognition elements, if the text to be recognized has the recognition element, the corresponding recognition result may be a first recognition value, and if the text to be recognized does not have the recognition element, the corresponding recognition result may be a second recognition value.
Step S120, constructing a target text vector based on the obtained plurality of recognition results.
In this embodiment, after the plurality of recognition results are obtained based on step S110, the electronic device may construct a target text vector based on the plurality of recognition results.
The number of dimensions of the target text vector is the number of the plurality of recognition elements.
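As a small sketch of steps S110 to S120, the target text vector can be built by fixing an order for the recognition elements and writing one value per element. The element names and their order below are hypothetical, and the values 1 and 0 are assumed encodings of the first and second recognition values.

```python
# Hypothetical, shortened element inventory in a fixed order.
ELEMENTS = ["copy_number", "security_classification", "title"]


def build_target_vector(results, elements=ELEMENTS):
    """Build the target text vector from per-element recognition results.

    `results` maps an element name to 1 (present) or 0 (absent); an element
    missing from the mapping is treated as absent. The vector has one
    dimension per recognition element.
    """
    return [1 if results.get(name) == 1 else 0 for name in elements]


target_vector = build_target_vector({"title": 1, "copy_number": 0})
```

Keeping the element order fixed matters because later steps address vector components by position.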
Step S130, updating the target text vector based on the pre-obtained target position information and on the weight coefficient, respectively, to obtain a corresponding first text vector and a corresponding second text vector.
In this embodiment, after the target text vector is obtained in step S120, the target text vector may be updated based on the target position information and the weight coefficient obtained in advance, so that a corresponding first text vector and a corresponding second text vector may be obtained.
That is, the target text vector may be updated based on the target position information to obtain the first text vector. And updating the target text vector based on the weight coefficient to obtain a second text vector.
The target position information includes the position, in the text to be recognized, of the recognition element corresponding to each first recognition value in the target text vector (that is, each recognition element that is present in the text to be recognized; a recognition element that is absent from the text has no position information). The weight coefficient is obtained by processing document text samples.
Step S140, determining whether the text to be recognized belongs to a document text based on the first text vector, the second text vector and a predetermined text probability threshold.
In this embodiment, after the first text vector and the second text vector are obtained based on step S130, it may be determined whether the text to be recognized belongs to a document text based on the first text vector and the second text vector and based on a predetermined text probability threshold.
Based on the method, the feature association between the text to be identified and the document text sample can be realized due to the adoption of the weight coefficient obtained based on the document text sample, so that the document text is effectively identified, and the problem that the document text is difficult to effectively identify based on the prior art is solved.
In the first aspect, it should be noted that, in step S110, a specific manner of performing the recognition processing on the text to be recognized is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, each recognition element may be queried (traversed) in the text to be recognized in turn, so as to implement the recognition processing of the text to be recognized.
For another example, in another alternative example, in order to improve the efficiency of the identification process, in conjunction with fig. 3, step S110 may include step S111 and step S112, which are described in detail below.
Step S111, for each of a plurality of recognition elements included in a document text, creating at least one corresponding text recognition thread for the recognition element.
In this embodiment, before recognition processing is performed based on the recognition elements, at least one corresponding text recognition thread may be created for each of the plurality of recognition elements included in the document text.
Thus, for a plurality of recognition elements, a plurality of text recognition threads can be obtained. A thread is the smallest unit of execution that an operating system can schedule; it is contained in a process and is the actual unit of operation within the process. One thread is a single sequential flow of control in the process, multiple threads can run concurrently in one process, and each thread can execute a different task in parallel. A text recognition thread is a thread configured for text recognition.
Step S112, for each text recognition thread, performing recognition processing on the corresponding recognition element in the text to be recognized through the text recognition thread to obtain a recognition result corresponding to the recognition element.
In this embodiment, after the text recognition thread is created based on step S111, for each text recognition thread, recognition processing may be performed on a recognition element corresponding to the text recognition thread in the text to be recognized by the text recognition thread, so as to obtain a recognition result corresponding to the recognition element. In this way, a plurality of recognition results can be obtained for a plurality of recognition elements.
Optionally, in the above example, the specific manner of creating the text recognition thread in step S111 is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, a text recognition thread may be created for each of the recognition elements. That is, there is a one-to-one correspondence between the recognition elements and the text recognition threads.
For another example, in another alternative example, one or more text recognition threads may be created for each of the recognition elements.
That is, each recognition element corresponds to one or more text recognition threads. When one recognition element corresponds to a plurality of text recognition threads, a plurality of recognition results are obtained for that recognition element, and a final recognition result can be obtained by a comprehensive determination based on them. For example, when each recognition result is 0 or 1 and one recognition element has two recognition results, the exclusive OR of the two recognition results can be used as the final recognition result.
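Steps S111 and S112 can be sketched as follows, assuming one text recognition thread per recognition element. The element names, the patterns in PATTERNS, and the helper names recognize and recognize_all are hypothetical and are not the expressions of this application; a thread pool from the Python standard library stands in for the text recognition threads.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical patterns, one per recognition element (illustrative only).
PATTERNS = {
    "copy_number": re.compile(r"\d{6}"),
    "urgency": re.compile(r"(extra )?urgent"),
    "attachment": re.compile(r"attachment: \d"),
}

def recognize(pattern, text):
    """Return 1 (first recognition value) if any line starts with a match, else 0."""
    return int(any(pattern.match(line) for line in text.splitlines()))

def recognize_all(text):
    """Run one worker thread per recognition element and collect the results."""
    with ThreadPoolExecutor(max_workers=len(PATTERNS)) as pool:
        futures = {name: pool.submit(recognize, pattern, text)
                   for name, pattern in PATTERNS.items()}
        return {name: future.result() for name, future in futures.items()}

sample = "000123\nurgent\nbody text"
results = recognize_all(sample)
```

Each entry of results is the recognition result (1 or 0) of one recognition element, which is the input needed to construct the target text vector.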
Optionally, in the above example, the specific manner of performing the recognition processing on the recognition element in the text to be recognized in step S112 is not limited, and may be selected according to the actual application requirement.
For example, in an alternative example, the plurality of recognition elements includes a copy number, a security level, a secrecy period, an urgency level, an issuing authority mark, a document number, a title, and an attachment description (where the copy number, security level, secrecy period, urgency level, issuing authority mark, and document number are terms, familiar to those skilled in the art, for elements in the header portion of an official document, and the title and attachment description are such terms for elements in the body portion of an official document).
Based on this, step S112 may include the steps of:
performing recognition processing on the copy number in the head-of-line region of each line of the text to be recognized according to a predetermined first regular expression, through the text recognition thread corresponding to the copy number, to obtain a recognition result corresponding to the copy number;
performing recognition processing on the security level in the head-of-line region of each line of the text to be recognized according to a predetermined second regular expression, through the text recognition thread corresponding to the security level, to obtain a recognition result corresponding to the security level;
performing recognition processing on the secrecy period in the head-of-line region of each line of the text to be recognized according to a predetermined third regular expression, through the text recognition thread corresponding to the secrecy period, to obtain a recognition result corresponding to the secrecy period;
performing recognition processing on the urgency level in the head-of-line region of each line of the text to be recognized according to a predetermined fourth regular expression, through the text recognition thread corresponding to the urgency level, to obtain a recognition result corresponding to the urgency level;
performing recognition processing on the issuing authority mark in the head-of-line region of each line of the text to be recognized according to a predetermined fifth regular expression, through the text recognition thread corresponding to the issuing authority mark, to obtain a recognition result corresponding to the issuing authority mark;
performing recognition processing on the document number in the head-of-line region of each line of the text to be recognized according to a predetermined sixth regular expression, through the text recognition thread corresponding to the document number, to obtain a recognition result corresponding to the document number;
performing recognition processing on the title in the head-of-line region of each line of the text to be recognized according to a predetermined seventh regular expression, through the text recognition thread corresponding to the title, to obtain a recognition result corresponding to the title; and
performing recognition processing on the attachment description in the head-of-line region of each line of the text to be recognized according to a predetermined eighth regular expression, through the text recognition thread corresponding to the attachment description, to obtain a recognition result corresponding to the attachment description.
It will be appreciated that, in the above example, the eight text recognition threads corresponding to the copy number, the security level, the secrecy period, the urgency level, the issuing authority mark, the document number, the title, and the attachment description may be executed in parallel, so that the efficiency of the recognition processing can be substantially improved.
Also, in the above example, the specific content of the first regular expression, the second regular expression, the third regular expression, the fourth regular expression, the fifth regular expression, the sixth regular expression, the seventh regular expression, and the eighth regular expression is not limited, and may be configured according to actual application requirements.
In a specific application example, the first regular expression (i.e., the regular expression corresponding to the copy number) is ? D {6}, for identifying 6-digit Arabic numerals such as "No 123456" or "123456". The second regular expression (i.e., the regular expression corresponding to the security level) is secret, and is used for identifying "secret" and the like. The third regular expression (i.e., the regular expression corresponding to the secrecy period) is ≡secret ≡annual month, and is used for identifying "confidential 10 months" and the like. The fourth regular expression (i.e., the regular expression corresponding to the urgency level) is ∈therly [ terglapine ] urgent, and is used for identifying "urgent" and the like. The fifth regular expression (i.e., the regular expression corresponding to the issuing authority mark) is file $, and is used for identifying "common central office file" and the like. The sixth regular expression (i.e., the regular expression corresponding to the document number) is {,9} [ [ (the [ { ] \d {4} [) ] ], no., and is used for identifying "national (2012) No. 12" and the like. The seventh regular expression is the regular expression corresponding to the title. The eighth regular expression (i.e., the regular expression corresponding to the attachment description) is an: \d, and is used for identifying "attachment: 1" and the like.
In the above example, the head-of-line region refers to a certain number of elements (such as characters, digits, and letters) at the front of each line. Because each text recognition thread performs recognition processing on the head-of-line region of each line of the text to be recognized rather than on the whole line, recognition efficiency can be further improved.
On this basis, to further improve recognition efficiency and recognition accuracy, recognition within the head-of-line region may start from the head of each line (the first element of the line, such as its first character, digit, or letter). That is, only when the element at the first position matches the recognition element is it checked whether the element at the second position matches the recognition element.
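The head-of-line matching described above can be sketched as follows. The region length HEAD_LEN and the pattern COPY_NUMBER are hypothetical values chosen for illustration; the application does not fix either. The sketch relies on re.match, which only matches at the start of the string, so a line whose first element does not fit the pattern is rejected without scanning the rest of the line.

```python
import re

HEAD_LEN = 12                   # hypothetical size of the head-of-line region
COPY_NUMBER = r"(No )?\d{6}"    # hypothetical pattern, not the application's own

def found_in_line_heads(pattern, text, head_len=HEAD_LEN):
    """Return 1 if the pattern matches at the head of any line, else 0."""
    compiled = re.compile(pattern)
    for line in text.splitlines():
        head = line[:head_len]       # only the front of the line is examined
        if compiled.match(head):     # match is anchored at the line head
            return 1
    return 0

with_number = "No 123456\nsome title\nplain body text"
without_number = "see file 654321 later"
```

Here found_in_line_heads(COPY_NUMBER, with_number) reports a hit because the first line begins with the number, while the digits in without_number are ignored because they do not start at the head of the line.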
For another example, the plurality of recognition elements may include an issuing authority mark, a main recipient authority, an issuing authority signature, a copy recipient authority, a sender, a document date, and a printing date (where the issuing authority mark and the sender are terms, familiar to those skilled in the art, for elements in the header portion of an official document; the main recipient authority, the issuing authority signature, and the document date are such terms for elements in the body portion; and the copy recipient authority and the printing date are such terms for elements in the imprint portion).
Based on this, step S112 may include the steps of:
performing recognition processing on the issuing authority mark in the text to be recognized according to organization names, through the text recognition thread corresponding to the issuing authority mark, and generating a recognition result corresponding to the issuing authority mark in response to an identification operation performed by the user on the result of the recognition processing;
performing recognition processing on the main recipient authority in the text to be recognized according to organization names, through the text recognition thread corresponding to the main recipient authority, and generating a recognition result corresponding to the main recipient authority in response to an identification operation performed by the user on the result of the recognition processing;
performing recognition processing on the issuing authority signature in the text to be recognized according to organization names, through the text recognition thread corresponding to the issuing authority signature, and generating a recognition result corresponding to the issuing authority signature in response to an identification operation performed by the user on the result of the recognition processing;
performing recognition processing on the copy recipient authority in the text to be recognized according to organization names, through the text recognition thread corresponding to the copy recipient authority, and generating a recognition result corresponding to the copy recipient authority in response to an identification operation performed by the user on the result of the recognition processing;
performing recognition processing on the sender in the text to be recognized according to person names, through the text recognition thread corresponding to the sender, to obtain a recognition result corresponding to the sender;
performing recognition processing on the document date in the text to be recognized according to dates, through the text recognition thread corresponding to the document date, and generating a recognition result corresponding to the document date in response to an identification operation performed by the user on the result of the recognition processing; and
performing recognition processing on the printing date in the text to be recognized according to dates, through the text recognition thread corresponding to the printing date, and generating a recognition result corresponding to the printing date in response to an identification operation performed by the user on the result of the recognition processing.
It is to be understood that, in the above example, performing recognition processing according to organization names may mean searching the text to be recognized to determine whether it contains an organization name, such as "national sports office" or "city building". Performing recognition processing according to person names may mean searching the text to be recognized to determine whether it contains a person name, such as "Zhang San". Performing recognition processing according to dates may mean searching the text to be recognized to determine whether it contains a date, such as "October 5, 2019".
When an organization name is recognized, the user's identification operation further determines whether the recognized organization name belongs to the issuing authority mark, the main recipient authority, the issuing authority signature, or the copy recipient authority. When a date is recognized, the user's identification operation further determines whether the recognized date is the document date or the printing date.
It will be appreciated that the two examples of recognition processing described above may be adopted at the same time. In that case, two recognition results are obtained for the issuing authority mark, and the two results need to be processed further, for example by the exclusive OR processing described above, to obtain the final recognition result for the issuing authority mark.
Moreover, because the issuing authority mark then corresponds to two text recognition threads, a mutual exclusion lock (mutex) may be set for the two threads, so that only one of them is allowed to run at a time and the other runs after it has finished.
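A minimal sketch of the two-thread case for the issuing authority mark is given below, assuming a shared mutex and the exclusive OR combination mentioned earlier. The recognizers by_regex and by_org_name are hypothetical stand-ins for the regular-expression-based and organization-name-based recognition; the user identification step is omitted.

```python
import re
import threading

def by_regex(text):
    """Hypothetical regex-based recognizer: a line ending in 'file'."""
    return int(any(re.search(r"file$", line) for line in text.splitlines()))

def by_org_name(text):
    """Hypothetical name-based recognizer: a known organization keyword."""
    return int("Office" in text)

def recognize_issuing_mark(text):
    lock = threading.Lock()      # mutex shared by the two recognition threads
    results = {}

    def run(name, recognizer):
        with lock:               # only one thread runs its recognition at a time
            results[name] = recognizer(text)

    threads = [threading.Thread(target=run, args=(name, fn))
               for name, fn in (("regex", by_regex), ("org", by_org_name))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Exclusive OR of the two 0/1 results, as in the example given above.
    return results["regex"] ^ results["org"]
```

Note that with exclusive OR the final result is 1 only when exactly one of the two recognizers reports the element; other ways of combining the two results are equally compatible with the description.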
In the second aspect, it should be noted that, in step S120, a specific manner of constructing the target text vector is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, the first recognition value may be 1 and the second recognition value may be 0, so that the target text vector constructed may be a one-dimensional vector, such as [1,0,1,1,1,1,1,0,0,0,1,1,1,1].
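As a small illustration of step S120, the target text vector can be built by placing the recognition results in a fixed element order. The element names and their order in ELEMENTS are hypothetical; only the use of 1 and 0 as the first and second recognition values comes from the example above.

```python
# Hypothetical, fixed ordering of recognition elements (one dimension each).
ELEMENTS = ["copy_number", "security_level", "title"]

def build_target_vector(results, elements=ELEMENTS):
    """Map per-element recognition results to a 0/1 vector in a fixed order."""
    return [1 if results.get(name) else 0 for name in elements]

vector = build_target_vector({"copy_number": 1, "title": 1})
```

The number of dimensions of the vector equals the number of recognition elements, and an element with no result is treated as absent.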
In the third aspect, it should be noted that, in step S130, a specific manner of performing the update process is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, when performing the update process to obtain the first text vector based on the target position information, in conjunction with fig. 4, step S130 may include step S131, step S132, and step S133, which are described below.
Step S131, for each first recognition value in the target text vector, obtaining the position information of the recognition element corresponding to the first recognition value in the text to be recognized.
In this embodiment, after the target text vector is obtained in step S120, for each first recognition value (1 in the above example) in the target text vector, the position information in the text to be recognized of the recognition element corresponding to that first recognition value (i.e., a recognition element present in the text to be recognized) may be obtained. For example, in an alternative example, the position information may be the ordinal position, counted in words (elements or characters) within the text to be recognized, of the last word (element or character) of the recognition element.
Step S132, for the position information of each identification element, obtaining a corresponding gaussian distribution value based on the position information and a gaussian distribution formula corresponding to the identification element.
In this embodiment, after the position information is obtained in step S131, for the position information of each identification element, a corresponding gaussian distribution value may be calculated based on the position information and a gaussian distribution formula corresponding to the identification element.
The mean value parameter and standard deviation parameter of the gaussian distribution formula are determined based on the position information of the identification element in the document text samples (for example, for the identification element "number of copies", the position information of the identification element "number of copies" in the document text samples may be determined first to obtain a plurality of position information, and then mean value calculation and standard deviation calculation are performed based on the plurality of position information, so as to obtain the mean value parameter and standard deviation parameter of the gaussian distribution formula corresponding to the identification element "number of copies").
That is, there may be a one-to-one correspondence between the identification elements and the Gaussian distribution formulas. The Gaussian distribution formula may be:
$$P(X) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(X-\mu)^{2}}{2\sigma^{2}}\right)$$
where P(X) represents the Gaussian distribution value of the position information X of the identification element, μ represents the mean parameter corresponding to the identification element, and σ represents the standard deviation parameter corresponding to the identification element. It will be appreciated that one identification element may have several pieces of position information in the text to be identified, so that several Gaussian distribution values are obtained; in that case the maximum of these values may be selected for the update.
Step S133, for each obtained Gaussian distribution value, updating the first recognition value corresponding to the Gaussian distribution value based on the Gaussian distribution value to obtain a corresponding first text vector.
In this embodiment, after the Gaussian distribution values are obtained in step S132, each first identification value may be updated based on its corresponding Gaussian distribution value; for example, the Gaussian distribution value may be multiplied by the corresponding first identification value to obtain an updated first identification value. An identification element corresponding to a second identification value is not in the text to be identified and therefore has no position information, so the second identification values are not updated. With the updated first identification values, the first text vector corresponding to the target text vector is obtained.
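Steps S131 to S133 can be sketched as follows, assuming the standard Gaussian density with mean μ and standard deviation σ. The function names are hypothetical, and the use of the population standard deviation (pstdev) when fitting the parameters from sample positions is an assumption, since the description does not say which estimator is used. The sketch also assumes σ is non-zero.

```python
import math
from statistics import mean, pstdev

def fit_gaussian(sample_positions):
    """Estimate (mu, sigma) of one element from its positions in the samples."""
    return mean(sample_positions), pstdev(sample_positions)

def gaussian(x, mu, sigma):
    """Gaussian distribution value of position x for parameters mu and sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def first_text_vector(target_vector, positions, params):
    """positions[i]: positions of element i in the text (empty if absent);
    params[i]: (mu, sigma) fitted for element i."""
    updated = []
    for value, found, (mu, sigma) in zip(target_vector, positions, params):
        if value == 1 and found:
            # Several occurrences: keep the largest Gaussian distribution value.
            updated.append(value * max(gaussian(x, mu, sigma) for x in found))
        else:
            updated.append(value)   # second recognition value stays unchanged
    return updated

vec = first_text_vector([1, 0], [[10, 12], []], [(10, 2), (5, 1)])
```

In this usage the first element occurs at positions 10 and 12; the value at position 10, which coincides with the mean, is the larger one and is the one kept.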
For another example, in another alternative example, when performing the update process based on the weight coefficient to obtain the second text vector, in conjunction with fig. 5, step S130 may include step S134 and step S135, which are described in detail below.
Step S134, processing the plurality of document text samples to obtain weight coefficients.
In this embodiment, when the target text vector obtained in step S120 needs to be updated based on the weight coefficient, a plurality of document text samples may be first processed to obtain the weight coefficient.
And step S135, updating the target text vector based on the weight coefficient to obtain a corresponding second text vector.
In this embodiment, after the weight coefficient is obtained in step S134, the update process may be performed on the target text vector based on the weight coefficient, so as to obtain a corresponding second text vector.
Wherein the updating process includes multiplying the weight coefficient and the target text vector.
Optionally, in the above example, the specific manner of obtaining the weight coefficient in step S134 is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, to make the weight coefficient more reliable, step S134 may include the following steps:
firstly, for each of a plurality of document text samples, constructing an element list corresponding to the document text sample based on the identification elements included in the document text sample;
secondly, constructing frequent n-item sets based on the identification elements included in the constructed element lists to obtain a plurality of frequent n-item sets, where n takes each integer from 1 to the number of identification elements;
then, for each frequent n-item set, obtaining the support degree of the frequent n-item set based on the number of element lists in which the frequent n-item set occurs and the total number of element lists;
then, determining target frequent n-item sets from the plurality of frequent n-item sets based on the first recognition values in the target text vector;
and finally, summing the support degrees of the target frequent n-item sets to obtain the weight coefficient.
For the above steps, the present application provides a specific application example, which is specifically described below. In this application example, 4 document text samples are included, a first document text sample, a second document text sample, a third document text sample, and a fourth document text sample, respectively, and the element list of each document text sample is shown in the following table.
Document text sample | Element list
First document text sample | t1, t3, t7, t9, t13, t14
Second document text sample | t1, t2, t3, t4, t6, t8, t9, t10, t11, t12, t13, t14
Third document text sample | t3, t4, t7, t8, t10, t11, t12, t14
Fourth document text sample | t1, t2, t3, t4, t5, t7, t8, t12, t13, t14
Based on this, the frequent 1-item sets can be obtained and the support degree of each frequent 1-item set can be calculated (for example, the identification element t1 occurs in 3 of the 4 element lists, so its support degree is 3/4), as shown in the following table:
Frequent 1-item set | Support degree
{t1} | 3/4 = 0.75
{t2} | 2/4 = 0.5
... | ...
{t14} | 4/4 = 1
Likewise, the frequent 2-item sets can be obtained and the support degree of each frequent 2-item set can be calculated (for example, the identification elements t1 and t2 occur together in 2 of the 4 element lists, so the support degree is 2/4), as shown in the following table:
Frequent 2-item set | Support degree
{t1, t2} | 2/4 = 0.5
{t1, t3} | 3/4 = 0.75
... | ...
{t13, t14} | 3/4 = 0.75
In the same way, the frequent 3-item sets, the frequent 4-item sets, and so on up to the frequent 14-item sets, together with the support degree of each, can also be obtained.
Then, the target frequent n-item sets are determined based on the first recognition values in the target text vector. Each target frequent n-item set is a non-empty subset of the elements whose values in the target text vector are non-zero, where ti denotes the i-th element of the target text vector. For example, if the target text vector is [1,0,1,1,1,1,1,0,0,0,0,0,0,0], the determined target frequent n-item sets may include: {t1, t3, t4, t5, t6, t7}, {t1, t3, t4, t5, t6}, {t1, t3, t4, t5, t7}, {t1, t3, t4, t6, t7}, {t1, t3, t5, t6, t7}, {t1, t4, t5, t6, t7}, {t1, t3, t4, t5}, {t1, t3, t4, t7}, {t1, t6, t7}, {t1, t5, t6, t7}, {t1}, {t3}, {t4}, {t5}, and {t6}.
Thus, the support degree corresponding to each target frequent n item set can be obtained, and then summation processing is carried out to obtain the weight coefficient.
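The worked example can be reproduced with the sketch below, which uses the four element lists from the table (t1 to t14 written as the integers 1 to 14). The function names are hypothetical. As an assumption, every non-empty subset of the present elements is treated as a target item set and no minimum support threshold is applied, since none is stated in the description.

```python
from itertools import combinations

# Element lists of the four document text samples, as index sets.
SAMPLES = [
    {1, 3, 7, 9, 13, 14},
    {1, 2, 3, 4, 6, 8, 9, 10, 11, 12, 13, 14},
    {3, 4, 7, 8, 10, 11, 12, 14},
    {1, 2, 3, 4, 5, 7, 8, 12, 13, 14},
]

def support(itemset, samples=SAMPLES):
    """Fraction of element lists that contain every element of the item set."""
    itemset = set(itemset)
    return sum(itemset <= sample for sample in samples) / len(samples)

def weight_coefficient(target_vector, samples=SAMPLES):
    """Sum the support of every non-empty subset of the present elements."""
    present = [i + 1 for i, value in enumerate(target_vector) if value == 1]
    total = 0.0
    for n in range(1, len(present) + 1):
        for subset in combinations(present, n):
            total += support(subset, samples)
    return total

def second_text_vector(target_vector, samples=SAMPLES):
    """Multiply the target text vector by the weight coefficient."""
    weight = weight_coefficient(target_vector, samples)
    return [weight * value for value in target_vector]
```

For a target text vector in which only t1 and t2 are present, the target sets are {t1}, {t2}, and {t1, t2}, so the weight coefficient is 0.75 + 0.5 + 0.5 = 1.75. Because the number of subsets grows exponentially with the number of present elements, a practical implementation would likely prune or cache item sets.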
In the fourth aspect, it should be noted that, in step S140, a specific manner of determining whether the text to be identified belongs to the document text is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, in order to improve the reliability of identifying the text to be identified, in conjunction with fig. 6, step S140 may include step S141, step S142, and step S143, which are described below.
And step S141, carrying out vector merging processing based on the first text vector and the second text vector to obtain a third text vector.
In this embodiment, after the first text vector and the second text vector are obtained based on step S130, vector merging processing (such as adding two vectors) may be performed on the first text vector and the second text vector, so that a third text vector may be obtained.
Step S142, obtaining a probability value of the text to be identified belonging to the document text based on the third text vector.
In this embodiment, after the third text vector is obtained based on step S141, a probability value that the text to be recognized belongs to a document text may be obtained based on the third text vector.
And step S143, judging whether the text to be recognized belongs to a document text or not based on the probability value and a predetermined text probability threshold value.
In this embodiment, after the probability value is obtained based on step S142, it may be determined whether the text to be recognized belongs to a document text based on the probability value and a predetermined text probability threshold (the specific value of the text probability threshold is not limited, for example, in an alternative example, the text probability threshold may be 3).
And if the probability value is greater than or equal to the text probability threshold value, judging that the text to be recognized belongs to a document text. And if the probability value is smaller than the text probability threshold value, judging that the text to be identified does not belong to the document text.
Optionally, the specific manner of obtaining the probability value of the text to be identified belonging to the document text based on step S142 is not limited, and may be selected according to the actual application requirement.
For example, in an alternative example, step S142 may include the steps of:
firstly, determining whether each vector value in the third text vector is smaller than a preset threshold; secondly, updating each vector value smaller than the preset threshold to 0, and updating each vector value greater than or equal to the preset threshold to 1; and then, calculating the sum of the updated vector values, and taking the sum as the probability value that the text to be identified belongs to the document text.
The specific value of the preset threshold is not limited, and for example, in an alternative example, the preset threshold may be 1. That is, if the vector value is less than 1, the vector value is updated to 0; if the vector value is greater than or equal to 1, the vector value is updated to 1.
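Steps S141 to S143 can be sketched as follows, assuming that vector merging is element-wise addition and using the example values given above (preset threshold 1, text probability threshold 3). The function name and the input vectors in the usage lines are hypothetical.

```python
def is_document_text(first_vec, second_vec, value_threshold=1.0, prob_threshold=3.0):
    """Return (decision, probability value) for the text to be recognized."""
    # S141: merge the two vectors by element-wise addition.
    third = [a + b for a, b in zip(first_vec, second_vec)]
    # S142: set values below the preset threshold to 0 and the rest to 1, then sum.
    binarized = [1 if value >= value_threshold else 0 for value in third]
    probability = sum(binarized)
    # S143: compare the sum with the text probability threshold.
    return probability >= prob_threshold, probability

decision, probability = is_document_text([0.2, 0, 0.1, 0.05], [1.75, 0, 1.75, 1.75])
```

In this usage three of the four merged values reach the preset threshold, so the probability value is 3 and the text is judged to be document text. The "probability value" here is a count rather than a number between 0 and 1, which is why the example threshold of 3 is meaningful.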
Referring to fig. 7, an embodiment of the present application further provides a document text recognition device 100 based on feature association, which is applicable to the above-mentioned electronic device 10. The document text recognition device 100 based on feature association includes a to-be-recognized text recognition module 110, a text vector construction module 120, a text vector update module 130, and a document text determination module 140.
The to-be-recognized text recognition module 110 may be configured to perform recognition processing on a text to be recognized based on a plurality of recognition elements of a document text, and obtain a recognition result corresponding to each recognition element, where the recognition result includes a first recognition value or a second recognition value, the first recognition value indicates that the text to be recognized has the corresponding recognition element, and the second recognition value indicates that the text to be recognized does not have the corresponding recognition element. In this embodiment, the to-be-recognized text recognition module 110 may be used to perform step S110 shown in fig. 2; for details of the module, refer to the description of step S110 above.
The text vector construction module 120 may be configured to construct a target text vector based on the obtained plurality of recognition results, where the number of dimensions of the target text vector is the number of the plurality of recognition elements. In this embodiment, the text vector construction module 120 may be used to perform step S120 shown in fig. 2; for details of the module, refer to the description of step S120 above.
The text vector update module 130 may be configured to update the target text vector based on pre-obtained target position information and a weight coefficient, to obtain a corresponding first text vector and a corresponding second text vector, where the target position information includes the position information, in the text to be recognized, of the recognition element corresponding to each first recognition value in the target text vector, and the weight coefficient is obtained by processing document text samples. In this embodiment, the text vector update module 130 may be used to perform step S130 shown in fig. 2; for details of the module, refer to the description of step S130 above.
The document text determination module 140 may be configured to determine whether the text to be recognized belongs to document text based on the first text vector, the second text vector, and a predetermined text probability threshold. In this embodiment, the document text determination module 140 may be used to perform step S140 shown in fig. 2; for details of the module, refer to the description of step S140 above.
In summary, in the method and the device for recognizing document text based on feature association provided by the present application, the text to be recognized is recognized based on the recognition elements of a document text to obtain recognition results, and a target text vector is constructed from the recognition results. The target text vector is then processed based on the position information of the recognition elements in the text to be recognized and on the weight coefficient obtained from document text samples, yielding the first text vector and the second text vector. Finally, whether the text to be recognized belongs to document text is determined based on the first text vector and the second text vector in combination with a predetermined text probability threshold. Because a weight coefficient obtained from document text samples is used, the features of the text to be recognized are associated with those of the document text samples, so that document text is identified effectively, the difficulty of identifying document text effectively with the prior art is addressed, and the method has high practical value.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, an electronic device, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is only a description of the preferred embodiments of the present application and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (7)

1. A document text recognition method based on feature association, characterized by comprising the following steps:
recognizing the text to be recognized based on a plurality of recognition elements of document text to obtain a recognition result corresponding to each recognition element, wherein the recognition result comprises a first recognition value or a second recognition value, the first recognition value indicating that the text to be recognized has the corresponding recognition element, and the second recognition value indicating that the text to be recognized does not have the corresponding recognition element;
constructing a target text vector based on the obtained plurality of recognition results, wherein the number of dimensions of the target text vector is the number of the plurality of recognition elements;
updating the target text vector based on pre-obtained target position information and on a weight coefficient, respectively, to obtain a corresponding first text vector and a corresponding second text vector, wherein the target position information comprises the position information, in the text to be recognized, of the recognition element corresponding to each first recognition value in the target text vector, and the weight coefficient is obtained by processing document text samples;
determining whether the text to be recognized belongs to document text based on the first text vector, the second text vector and a predetermined text probability threshold;
wherein the step of updating the target text vector based on the pre-obtained target position information and on the weight coefficient, respectively, to obtain a corresponding first text vector and a corresponding second text vector comprises:
for each first recognition value in the target text vector, obtaining the position information, in the text to be recognized, of the recognition element corresponding to the first recognition value;
for the position information of each recognition element, obtaining a corresponding Gaussian distribution value based on the position information and a Gaussian distribution formula corresponding to the recognition element, wherein the mean parameter and the standard deviation parameter of the Gaussian distribution formula are determined based on the position information of the recognition element in a plurality of document text samples;
for each obtained Gaussian distribution value, updating the first recognition value corresponding to the Gaussian distribution value based on the Gaussian distribution value, to obtain the corresponding first text vector;
for each document text sample of a plurality of document text samples, constructing an element list corresponding to the document text sample based on the recognition elements included in the document text sample;
constructing frequent n-item sets based on the recognition elements included in the constructed element lists to obtain a plurality of frequent n-item sets, wherein n takes each integer between 1 and the number of the recognition elements;
for each frequent n-item set, obtaining the support of the frequent n-item set based on the number of times the frequent n-item set occurs in the plurality of element lists and the number of the element lists;
determining, among the plurality of frequent n-item sets, target frequent n-item sets based on each first recognition value in the target text vector;
summing the supports of the target frequent n-item sets to obtain the weight coefficient;
and updating the target text vector based on the weight coefficient to obtain the corresponding second text vector, wherein the updating comprises multiplying the weight coefficient by the target text vector.
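The position-based update recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes that the first and second recognition values are 1 and 0, that a position is a relative line position between 0 and 1, and that "updating" a first recognition value means replacing it with the Gaussian density at that position. The function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def fit_position_params(sample_positions):
    """Estimate the mean and (population) standard deviation of one
    element's position across several document text samples."""
    n = len(sample_positions)
    mean = sum(sample_positions) / n
    variance = sum((p - mean) ** 2 for p in sample_positions) / n
    return mean, math.sqrt(variance)

def first_text_vector(target_vector, positions, params):
    """Replace each first recognition value (assumed to be 1) with the
    Gaussian value of the element's position; absent elements stay 0.

    positions[i] is the position of element i in the text (ignored if absent);
    params[i] is the (mean, std) pair fitted for element i.
    """
    result = []
    for i, value in enumerate(target_vector):
        if value == 1:
            mean, std = params[i]
            result.append(gaussian_pdf(positions[i], mean, std))
        else:
            result.append(0.0)
    return result
```

An element found close to where it usually appears in the samples receives a larger value than one found far from its usual position. A standard deviation of zero would need separate handling, which this sketch omits.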
2. The document text recognition method based on feature association according to claim 1, wherein the step of recognizing the text to be recognized based on a plurality of recognition elements of document text to obtain a recognition result corresponding to each recognition element comprises:
creating at least one corresponding text recognition thread for each of the plurality of recognition elements of document text;
and, for each text recognition thread, performing recognition processing on the corresponding recognition element in the text to be recognized through the text recognition thread, to obtain the recognition result corresponding to that recognition element.
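As a rough illustration of the per-element threads in claim 2, the sketch below runs one recognizer per recognition element on a thread pool and assembles the target text vector from the results. The recognizer callables, the element names, and the 1/0 encoding of the first and second recognition values are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_all(text, recognizers):
    """Run every recognizer on its own worker thread.

    recognizers maps an element name to a callable(text) -> bool.
    Returns a dict mapping each element name to 1 (present) or 0 (absent).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(recognizers))) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in recognizers.items()}
        return {name: 1 if future.result() else 0 for name, future in futures.items()}

def build_target_vector(results, element_order):
    """One dimension per recognition element, in a fixed order."""
    return [results[name] for name in element_order]
```

For example, `build_target_vector(recognize_all(text, recognizers), ["title", "secret_level"])` yields a two-dimensional vector whose entries record which of the two hypothetical elements were found.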
3. The document text recognition method based on feature association according to claim 2, wherein the plurality of recognition elements include a copy number, a secret level, a confidentiality period, an emergency level, an issuing authority mark, a document number, a title, and an attachment description, and the step of, for each text recognition thread, performing recognition processing on the corresponding recognition element in the text to be recognized through the text recognition thread to obtain the recognition result corresponding to that recognition element comprises:
performing, through the text recognition thread corresponding to the copy number, recognition processing on the copy number in the line head area of each line of the text to be recognized according to a predetermined first regular expression, to obtain a recognition result corresponding to the copy number;
performing, through the text recognition thread corresponding to the secret level, recognition processing on the secret level in the line head area of each line of the text to be recognized according to a predetermined second regular expression, to obtain a recognition result corresponding to the secret level;
performing, through the text recognition thread corresponding to the confidentiality period, recognition processing on the confidentiality period in the line head area of each line of the text to be recognized according to a predetermined third regular expression, to obtain a recognition result corresponding to the confidentiality period;
performing, through the text recognition thread corresponding to the emergency level, recognition processing on the emergency level in the line head area of each line of the text to be recognized according to a predetermined fourth regular expression, to obtain a recognition result corresponding to the emergency level;
performing, through the text recognition thread corresponding to the issuing authority mark, recognition processing on the issuing authority mark in the line head area of each line of the text to be recognized according to a predetermined fifth regular expression, to obtain a recognition result corresponding to the issuing authority mark;
performing, through the text recognition thread corresponding to the document number, recognition processing on the document number in the line head area of each line of the text to be recognized according to a predetermined sixth regular expression, to obtain a recognition result corresponding to the document number;
performing, through the text recognition thread corresponding to the title, recognition processing on the title in the line head area of each line of the text to be recognized according to a predetermined seventh regular expression, to obtain a recognition result corresponding to the title;
and performing, through the text recognition thread corresponding to the attachment description, recognition processing on the attachment description in the line head area of each line of the text to be recognized according to a predetermined eighth regular expression, to obtain a recognition result corresponding to the attachment description.
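The regular-expression matching in the line head area described in claim 3 might look like the following sketch. The claims do not disclose the actual regular expressions or the width of the line head area, so the three patterns and the `HEAD_WIDTH` value below are purely hypothetical placeholders.

```python
import re

HEAD_WIDTH = 20  # assumed width (in characters) of the "line head area"

# Hypothetical patterns; the patent's first to eighth regular expressions are not given.
PATTERNS = {
    "copy_number": re.compile(r"^\s*\d{6}(?!\d)"),       # six-digit number
    "secret_level": re.compile(r"^\s*(绝密|机密|秘密)"),   # top secret / secret / confidential
    "emergency_level": re.compile(r"^\s*(特急|加急)"),     # extra urgent / urgent
}

def recognize_in_line_heads(text, pattern, head_width=HEAD_WIDTH):
    """Search the head of each line for the pattern.

    Returns (recognition value, index of the first matching line or None),
    where the recognition value is 1 if any line head matches and 0 otherwise.
    """
    for index, line in enumerate(text.splitlines()):
        if pattern.search(line[:head_width]):
            return 1, index
    return 0, None
```

Returning the matching line index alongside the recognition value is a design choice of this sketch: it keeps the position information that a later position-based update would need.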
4. The document text recognition method based on feature association according to claim 2 or 3, wherein the plurality of recognition elements include an issuing authority mark, a main recipient authority, an issuing authority signature, a copied-to authority, a signer, a document date and a printing date, and the step of, for each text recognition thread, performing recognition processing on the corresponding recognition element in the text to be recognized through the text recognition thread to obtain the recognition result corresponding to that recognition element comprises:
performing, through the text recognition thread corresponding to the issuing authority mark, recognition processing on the issuing authority mark in the text to be recognized according to organization names, and generating a recognition result corresponding to the issuing authority mark in response to a user's marking operation on the result of the recognition processing;
performing, through the text recognition thread corresponding to the main recipient authority, recognition processing on the main recipient authority in the text to be recognized according to organization names, and generating a recognition result corresponding to the main recipient authority in response to a user's marking operation on the result of the recognition processing;
performing, through the text recognition thread corresponding to the issuing authority signature, recognition processing on the issuing authority signature in the text to be recognized according to organization names, and generating a recognition result corresponding to the issuing authority signature in response to a user's marking operation on the result of the recognition processing;
performing, through the text recognition thread corresponding to the copied-to authority, recognition processing on the copied-to authority in the text to be recognized according to organization names, and generating a recognition result corresponding to the copied-to authority in response to a user's marking operation on the result of the recognition processing;
performing, through the text recognition thread corresponding to the signer, recognition processing on the signer in the text to be recognized according to person names, to obtain a recognition result corresponding to the signer;
performing, through the text recognition thread corresponding to the document date, recognition processing on the document date in the text to be recognized according to dates, and generating a recognition result corresponding to the document date in response to a user's marking operation on the result of the recognition processing;
and performing, through the text recognition thread corresponding to the printing date, recognition processing on the printing date in the text to be recognized according to dates, and generating a recognition result corresponding to the printing date in response to a user's marking operation on the result of the recognition processing.
5. The document text recognition method based on feature association according to claim 1, wherein the step of determining whether the text to be recognized belongs to document text based on the first text vector, the second text vector and a predetermined text probability threshold comprises:
performing vector merging processing based on the first text vector and the second text vector to obtain a third text vector;
obtaining, based on the third text vector, a probability value that the text to be recognized belongs to document text;
and judging whether the text to be recognized belongs to document text based on the probability value and the predetermined text probability threshold, wherein if the probability value is greater than or equal to the text probability threshold, the text to be recognized is judged to belong to document text.
6. The document text recognition method based on feature association according to claim 5, wherein the step of obtaining, based on the third text vector, a probability value that the text to be recognized belongs to document text comprises:
determining whether each vector value in the third text vector is smaller than a preset threshold;
updating each vector value smaller than the preset threshold to 0, and updating each vector value greater than or equal to the preset threshold;
and calculating the sum of the updated vector values, and taking the sum as the probability value that the text to be recognized belongs to document text.
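The decision procedure of claims 5 and 6 can be sketched as follows, under two loudly stated assumptions. First, the claims do not say how the vectors are merged, so concatenation is assumed here. Second, the claims do not state what a value at or above the preset threshold is updated to, so the sketch takes that rule as a caller-supplied function `update_kept` and defaults to leaving the value unchanged. All names are hypothetical.

```python
def merge_vectors(first, second):
    """Assumed merge: concatenate the first and second text vectors."""
    return list(first) + list(second)

def probability_value(third, value_threshold, update_kept=lambda v: v):
    """Zero out values below value_threshold, apply update_kept to the rest
    (identity by default, an assumption), and sum the updated values."""
    return sum(0.0 if v < value_threshold else update_kept(v) for v in third)

def is_document_text(first, second, value_threshold, probability_threshold):
    """Judge the text to be document text when the summed value reaches
    the text probability threshold (greater than or equal, as in claim 5)."""
    third = merge_vectors(first, second)
    return probability_value(third, value_threshold) >= probability_threshold
```

Zeroing small entries before summing means that elements found far from their usual position, or carrying little weight, contribute nothing to the final score.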
7. A document text recognition device based on feature association, comprising:
a to-be-recognized text recognition module, configured to recognize the text to be recognized based on a plurality of recognition elements of document text to obtain a recognition result corresponding to each recognition element, wherein the recognition result comprises a first recognition value or a second recognition value, the first recognition value indicating that the text to be recognized has the corresponding recognition element, and the second recognition value indicating that the text to be recognized does not have the corresponding recognition element;
a text vector construction module, configured to construct a target text vector based on the obtained plurality of recognition results, wherein the number of dimensions of the target text vector is the number of the plurality of recognition elements;
a text vector updating module, configured to update the target text vector based on pre-obtained target position information and on a weight coefficient, respectively, to obtain a corresponding first text vector and a corresponding second text vector, wherein the target position information comprises the position information, in the text to be recognized, of the recognition element corresponding to each first recognition value in the target text vector, and the weight coefficient is obtained by processing document text samples;
a document text determining module, configured to determine whether the text to be recognized belongs to document text based on the first text vector, the second text vector and a predetermined text probability threshold;
wherein the text vector updating module is specifically configured to perform the following:
for each first recognition value in the target text vector, obtaining the position information, in the text to be recognized, of the recognition element corresponding to the first recognition value;
for the position information of each recognition element, obtaining a corresponding Gaussian distribution value based on the position information and a Gaussian distribution formula corresponding to the recognition element, wherein the mean parameter and the standard deviation parameter of the Gaussian distribution formula are determined based on the position information of the recognition element in a plurality of document text samples;
for each obtained Gaussian distribution value, updating the first recognition value corresponding to the Gaussian distribution value based on the Gaussian distribution value, to obtain the corresponding first text vector;
for each document text sample of a plurality of document text samples, constructing an element list corresponding to the document text sample based on the recognition elements included in the document text sample;
constructing frequent n-item sets based on the recognition elements included in the constructed element lists to obtain a plurality of frequent n-item sets, wherein n takes each integer between 1 and the number of the recognition elements;
for each frequent n-item set, obtaining the support of the frequent n-item set based on the number of times the frequent n-item set occurs in the plurality of element lists and the number of the element lists;
determining, among the plurality of frequent n-item sets, target frequent n-item sets based on each first recognition value in the target text vector;
summing the supports of the target frequent n-item sets to obtain the weight coefficient;
and updating the target text vector based on the weight coefficient to obtain the corresponding second text vector, wherein the updating comprises multiplying the weight coefficient by the target text vector.
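The weight-coefficient steps performed by the text vector updating module can be sketched as below. Several points are assumptions rather than disclosures: a minimum support (`min_support`) is used to decide which item sets count as "frequent", the target frequent n-item sets are taken to be those made up entirely of elements present in the text to be recognized, and item sets are enumerated by brute force rather than with a pruning algorithm such as Apriori.

```python
from itertools import combinations

def frequent_itemsets(element_lists, min_support):
    """Enumerate every n-item set (n = 1 .. number of distinct elements) and
    keep those whose support = occurrences / number of lists >= min_support.
    Returns a dict mapping frozenset -> support."""
    transactions = [frozenset(elements) for elements in element_lists]
    universe = sorted(set().union(*transactions))
    total = len(transactions)
    result = {}
    for n in range(1, len(universe) + 1):
        for combo in combinations(universe, n):
            items = frozenset(combo)
            count = sum(1 for t in transactions if items <= t)
            support = count / total
            if support >= min_support:
                result[items] = support
    return result

def weight_coefficient(itemsets, present_elements):
    """Sum the supports of the frequent item sets contained in the elements
    found in the text (the assumed meaning of 'target' item sets)."""
    present = frozenset(present_elements)
    return sum(support for items, support in itemsets.items() if items <= present)

def second_text_vector(target_vector, weight):
    """Multiply the weight coefficient by the target text vector."""
    return [weight * value for value in target_vector]
```

Brute-force enumeration grows exponentially with the number of recognition elements, which is tolerable for a dozen or so elements but would call for pruning on larger element sets.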
CN202011551817.XA 2020-12-24 2020-12-24 Document text recognition method and device based on feature association Active CN112507968B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011551817.XA CN112507968B (en) 2020-12-24 2020-12-24 Document text recognition method and device based on feature association

Publications (2)

Publication Number Publication Date
CN112507968A CN112507968A (en) 2021-03-16
CN112507968B true CN112507968B (en) 2024-03-05

Family

ID=74923389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011551817.XA Active CN112507968B (en) 2020-12-24 2020-12-24 Document text recognition method and device based on feature association

Country Status (1)

Country Link
CN (1) CN112507968B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815500A (en) * 2019-01-25 2019-05-28 杭州绿湾网络科技有限公司 Management method, device, computer equipment and the storage medium of unstructured official document
CN110909122A (en) * 2019-10-10 2020-03-24 重庆金融资产交易所有限责任公司 Information processing method and related equipment
CN111460131A (en) * 2020-02-18 2020-07-28 平安科技(深圳)有限公司 Method, device and equipment for extracting official document abstract and computer readable storage medium
CN111626057A (en) * 2020-07-28 2020-09-04 南京中孚信息技术有限公司 Official document judgment method and judgment system based on named entity
CN111681670A (en) * 2019-02-25 2020-09-18 北京嘀嘀无限科技发展有限公司 Information identification method and device, electronic equipment and storage medium
CN111680490A (en) * 2020-06-10 2020-09-18 东南大学 Cross-modal document processing method and device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10068132B2 (en) * 2016-05-25 2018-09-04 Ebay Inc. Document optical character recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张维冲 ; 王芳 ; 赵洪 ; 张建光 ; .基于政府公文结构解析的科技政策主题抽取与分析.科学学研究.2020,(第07期),全文. *
黄良友 ; .再论发文字号第四要素发文形式――以国务院和重庆市等省级政府及其办公厅发文为例.档案学通讯.2015,(第03期),全文. *

Also Published As

Publication number Publication date
CN112507968A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
US20190166161A1 (en) User Model-Based Data Loss Prevention
US8725711B2 (en) Systems and methods for information categorization
CN111383101B (en) Post-credit risk monitoring method, post-credit risk monitoring device, post-credit risk monitoring equipment and computer readable storage medium
US9654510B1 (en) Match signature recognition for detecting false positive incidents and improving post-incident remediation
US8095547B2 (en) Method and apparatus for detecting spam user created content
CN107018062B (en) System and method for identifying spam messages using topic information
KR20040088036A (en) Real time data warehousing
CN109508373B (en) Method and device for calculating enterprise public opinion index and computer readable storage medium
CN114760149B (en) Data cross-border compliance management and control method and device, computer equipment and storage medium
US20240048514A1 (en) Method for electronic impersonation detection and remediation
CN112052891A (en) Machine behavior recognition method, device, equipment and computer readable storage medium
CN111813946A (en) Medical information feedback method, device, equipment and readable storage medium
CN114186275A (en) Privacy protection method and device, computer equipment and storage medium
CN113408281A (en) Mailbox account abnormity detection method and device, electronic equipment and storage medium
US11908035B2 (en) System and method for authenticated mail
CN112507968B (en) Document text recognition method and device based on feature association
CN110489434B (en) Information processing method and related equipment
US11886467B2 (en) Method, apparatus, and computer-readable medium for efficiently classifying a data object of unknown type
US11681966B2 (en) Systems and methods for enhanced risk identification based on textual analysis
CN115017269A (en) Data processing system for determining similar texts
CN113821594A (en) Text processing method and device and readable storage medium
CN113111153A (en) Data analysis method, device, equipment and storage medium
CN115564313B (en) Shared service data management method and device based on shared management platform
CN113434672B (en) Text type intelligent recognition method, device, equipment and medium
CN113705200B (en) Analysis method, analysis device, analysis equipment and analysis storage medium for complaint behavior data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant