CN110955796B - Case feature information extraction method and device based on stroke information - Google Patents

Publication number
CN110955796B
Authority
CN
China
Prior art keywords
document
sample
stroke
case feature
feature information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911176959.XA
Other languages
Chinese (zh)
Other versions
CN110955796A (en)
Inventor
茹渑博
贠盟洲
李亮
孙德毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN201911176959.XA
Publication of CN110955796A
Application granted
Publication of CN110955796B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a method and device for extracting case feature information from transcript information (the written records of police interviews). The method comprises: acquiring a target transcript picture containing transcript information, and performing character recognition on the target transcript picture to obtain a target transcript document; preprocessing the target transcript document; determining the transcript document type of the preprocessed target transcript document according to a preset keyword library corresponding to each transcript document type; and inputting the target transcript document into the case feature information extraction model corresponding to that transcript document type to obtain the case feature information of the transcript information.

Description

Case feature information extraction method and device based on transcript information
Technical Field
The application relates to the field of data processing, and in particular to a method and device for extracting case feature information from transcript information.
Background
Transcripts are legal documents produced when police officers of public security authorities question and take evidence from victims, witnesses and suspects during case investigation; they faithfully record the facts of the case. A transcript contains many elements, such as the course of the incident, the timing of the offense, the tools used, the method used and the physical characteristics of suspects. Transcripts play an important role in obtaining evidence, comprehensively analyzing and studying the case, supporting conviction, summarizing case-handling experience and checking case-handling quality.
At present, transcripts in the public security system are mainly stored in case-handling systems as scans or photos. As a result, the content of the transcripts cannot be searched, the elements they contain cannot be mined or associated with one another, and the value of the transcripts cannot be fully realized.
In the prior art, transcript elements are extracted from a transcript scan by first converting the scan into text and then applying word segmentation and part-of-speech tagging with a custom dictionary. On the one hand, building and maintaining the custom dictionary takes a great deal of manual effort; on the other hand, the elements are identified from the part-of-speech tagging result, and the accuracy of that identification remains to be verified.
Disclosure of Invention
In view of this, the present application aims to provide a method and device for extracting case feature information from transcript information, so as to solve the prior-art problem of how to extract the case features in transcript information effectively.
In a first aspect, an embodiment of the present application provides a case feature information extraction method based on transcript information, where the method includes:
acquiring a target transcript picture containing transcript information, and performing character recognition on the target transcript picture to obtain a target transcript document;
preprocessing the target transcript document;
determining the transcript document type of the preprocessed target transcript document according to a preset keyword library corresponding to each transcript document type;
inputting the target transcript document into the case feature information extraction model corresponding to the transcript document type to obtain the case feature information of the transcript information.
According to the first aspect, the present application provides a first possible implementation manner of the first aspect, wherein constructing the case feature information extraction model includes:
obtaining, according to the transcript document type corresponding to the case feature information extraction model, sample transcript pictures corresponding to that transcript document type;
performing character recognition on the sample transcript pictures to obtain a sample transcript document;
sorting the sample transcript pages according to the page number of each sample transcript page in the sample transcript document, and deleting the interference information from the sorted sample transcript pages to obtain a preprocessed sample transcript document; the interference information comprises page numbers, signatures and line breaks;
acquiring the case feature information base corresponding to the transcript document type from a pre-stored correspondence between transcript document types and case feature information bases;
extracting, from the sample transcript document, a sample case feature information base matched with the acquired case feature information base;
and training a case feature information training model based on the sample transcript document and the sample case feature information base to obtain the case feature information extraction model corresponding to the transcript document type.
According to the first possible implementation manner of the first aspect, the present application provides a second possible implementation manner of the first aspect, wherein the sample transcript document includes a sample transcript training document and a sample transcript test document, and the sample case feature information base includes a training sample case feature information base and a test sample case feature information base; training the case feature information training model based on the sample transcript document and the sample case feature information base to obtain the case feature information extraction model corresponding to the transcript document type includes:
taking the sample transcript training document as the input of the case feature information training model, taking the training sample case feature information base corresponding to the sample transcript training document as the output of the case feature information training model, and training the case feature information training model;
and verifying the trained case feature information training model by using the sample transcript test document and the test sample case feature information base, and obtaining the case feature information extraction model once verification passes.
According to the first possible implementation manner of the first aspect, the present application provides a third possible implementation manner of the first aspect, wherein extracting, from the sample transcript document, a sample case feature information base matched with the acquired case feature information base includes:
extracting, from the sample transcript document, sample case feature words matched with the acquired case feature information base;
calculating, according to the sample transcript document, the association degree between each sample case feature word and the other sample case feature words, and combining sample case feature words whose association degree exceeds a preset threshold into sample case feature phrases;
and taking the set consisting of the sample case feature phrases and the sample case feature words that were not combined into any sample case feature phrase as the sample case feature information base.
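The text does not define how the association degree is computed. The following is a minimal sketch under one possible assumption: the association degree of two feature words is the Jaccard overlap of the sets of sentences in which each word occurs. The names `split_sentences`, `association` and `build_sample_library`, the 0.5 threshold and the sample data are all hypothetical.

```python
import re
from itertools import combinations
from typing import List, Set, Tuple


def split_sentences(text: str) -> List[str]:
    """Split on common Western and Chinese sentence terminators."""
    return [s for s in re.split(r"[.!?。！？]", text) if s.strip()]


def association(a: str, b: str, sentences: List[str]) -> float:
    """ASSUMED measure: Jaccard overlap of the sentences containing a and b."""
    sa = {i for i, s in enumerate(sentences) if a in s}
    sb = {i for i, s in enumerate(sentences) if b in s}
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_sample_library(
    text: str, feature_library: Set[str], threshold: float = 0.5
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (ungrouped feature words, feature phrases) found in the document."""
    sentences = split_sentences(text)
    # Feature words from the type's library that actually occur in the document.
    words = sorted(w for w in feature_library if w in text)
    phrases: List[Tuple[str, str]] = []
    grouped: Set[str] = set()
    for a, b in combinations(words, 2):
        if association(a, b, sentences) > threshold:
            phrases.append((a, b))
            grouped.update((a, b))
    singles = [w for w in words if w not in grouped]
    return singles, phrases


text = (
    "Li Si used a WeChat account. "
    "The WeChat account asked for a transfer. "
    "The victim lost 5000 yuan."
)
library = {"WeChat", "account", "transfer", "5000 yuan", "knife"}
singles, phrases = build_sample_library(text, library)
print(singles, phrases)
```

Words grouped into a phrase are dropped from the list of single words, so the resulting base is the union of the phrases and the remaining single words, as the step above describes.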
According to the first aspect, the present application provides a fourth possible implementation manner of the first aspect, wherein preprocessing the target transcript document includes:
sorting the target transcript pages according to the page number of each target transcript page in the target transcript document, and deleting the interference information from the sorted target transcript pages to obtain a preprocessed target transcript document; the interference information includes page numbers, signatures and line breaks.
In a second aspect, an embodiment of the present application provides a case feature information extraction device based on transcript information, where the device includes:
a recognition module, used for acquiring a target transcript picture containing transcript information, and performing character recognition on the target transcript picture to obtain a target transcript document;
a preprocessing module, used for preprocessing the target transcript document;
a type module, used for determining the transcript document type of the preprocessed target transcript document according to a preset keyword library corresponding to each transcript document type;
an extraction module, used for inputting the target transcript document into the case feature information extraction model corresponding to the transcript document type to obtain the case feature information of the transcript information.
According to the second aspect, an embodiment of the present application provides a first possible implementation manner of the second aspect, where the extraction module includes a model building unit, used for:
obtaining, according to the transcript document type corresponding to the case feature information extraction model, sample transcript pictures corresponding to that transcript document type;
performing character recognition on the sample transcript pictures to obtain a sample transcript document;
sorting the sample transcript pages according to the page number of each sample transcript page in the sample transcript document, and deleting the interference information from the sorted sample transcript pages to obtain a preprocessed sample transcript document; the interference information comprises page numbers, signatures and line breaks;
acquiring the case feature information base corresponding to the transcript document type from a pre-stored correspondence between transcript document types and case feature information bases;
extracting, from the sample transcript document, a sample case feature information base matched with the acquired case feature information base;
and training a case feature information training model based on the sample transcript document and the sample case feature information base to obtain the case feature information extraction model corresponding to the transcript document type.
According to the second aspect, an embodiment of the present application provides a second possible implementation manner of the second aspect, wherein the preprocessing module includes:
a processing unit, used for sorting the target transcript pages according to the page number of each target transcript page in the target transcript document, and deleting the interference information from the sorted target transcript pages to obtain the preprocessed target transcript document; the interference information includes page numbers, signatures and line breaks.
In a third aspect, an embodiment of the present application provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of the first aspect or any of its possible implementations when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect or any of its possible implementations.
With the case feature information extraction method and device based on transcript information provided by the embodiments of the present application, the characters in a target transcript picture are recognized, the recognized target transcript document is preprocessed, and the preprocessed target transcript document is then input into the case feature information extraction model that corresponds to its identified transcript document type, so that the case feature information of the target transcript document is obtained. The method can accurately extract the case features in a transcript, improves the efficiency of entering transcript information, and makes case features easier to query and use.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a case feature information extraction method based on transcript information according to an embodiment of the present application;
Fig. 2 is a flow chart of a case feature information extraction method based on transcript information according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a case feature information extraction device based on transcript information according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
An embodiment of the application provides a case feature information extraction method based on transcript information which, as shown in Fig. 1, comprises the following steps S101 to S104:
Step S101, acquiring a target transcript picture containing transcript information, and performing character recognition on the target transcript picture to obtain a target transcript document;
Step S102, preprocessing the target transcript document;
Step S103, determining the transcript document type of the preprocessed target transcript document according to a preset keyword library corresponding to each transcript document type;
Step S104, inputting the target transcript document into the case feature information extraction model corresponding to the transcript document type to obtain the case feature information of the transcript information.
A transcript is usually a paper file, so a scan of it (i.e. the transcript picture) must first be obtained with a scanning device before it can be processed by computer equipment.
After the target transcript picture is obtained, character recognition is first performed on it with a character recognition model to obtain a target transcript document that the computer equipment can process. The character recognition model commonly used is an OCR model. The OCR model can be trained so that it is suited to transcripts and accurately recognizes the Chinese characters, English letters, numbers, symbols and so on that appear in them.
The target transcript document obtained through character recognition contains much useless information that would interfere with the case feature information extraction model. The target transcript document therefore needs to be preprocessed to eliminate this interference and improve the validity of the information the model extracts.
Public security cases are of many types, and the subjects of transcripts include suspects, witnesses and victims. For example, in a phishing case, the case features to be obtained from a suspect's transcript include the suspect's name, sex, identity card number and mobile phone number, all social platform accounts used by the suspect, and other case-related details. Therefore, in order to process different case types and different subject types in a targeted manner, the transcript document type of the target transcript document must be identified. The target transcript document is matched, one library at a time, against the pre-stored keyword library of each transcript document type, and its type is determined to be the transcript document type with the highest matching degree. The matching degree may be the number of keywords in a type's keyword library that can be matched in the target transcript document, or the ratio of that number to the total number of keywords in that keyword library.
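The two matching-degree definitions above can be sketched as follows. This is a minimal illustration that assumes plain substring matching; the type names, keyword libraries and sample sentence are hypothetical and not taken from the application.

```python
from typing import Dict, Set


def match_count(text: str, keywords: Set[str]) -> int:
    """Number of keywords from one type's library that occur in the document."""
    return sum(1 for kw in keywords if kw in text)


def match_ratio(text: str, keywords: Set[str]) -> float:
    """Matched keywords divided by the total number of keywords in the library."""
    return match_count(text, keywords) / len(keywords) if keywords else 0.0


def classify_transcript(
    text: str, libraries: Dict[str, Set[str]], use_ratio: bool = True
) -> str:
    """Return the transcript document type with the highest matching degree."""
    score = match_ratio if use_ratio else match_count
    return max(libraries, key=lambda doc_type: score(text, libraries[doc_type]))


# Hypothetical keyword libraries, one per transcript document type.
KEYWORD_LIBRARIES = {
    "fraud_suspect": {"suspect", "transfer", "account", "defraud"},
    "fraud_victim": {"victim", "transfer", "deceived", "loss"},
}

doc = (
    "The suspect admitted using a social platform account "
    "to defraud others via bank transfer."
)
print(classify_transcript(doc, KEYWORD_LIBRARIES))
```

The ratio form keeps a type with a large keyword library from winning purely because it has more keywords to match, while the count form favors the type with the most absolute evidence.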
The target transcript document is then input into the case feature information extraction model for its transcript document type, and the model extracts the case feature information of the target transcript document. The resulting case feature information can take various forms, such as a table, a data table or a text document, and the output form can be chosen as required.
In summary, the characters in the target transcript picture, obtained by scanning the physical transcript with a scanner, are recognized to obtain the corresponding target transcript document. The target transcript document, which still contains many interfering characters or symbols, is preprocessed so that the case feature information extraction model can extract from it accurately. The transcript document type of the target transcript document is identified using the keyword library of each transcript document type, the case feature information extraction model matching that type is selected, and the preprocessed target transcript document is input into it to obtain the case feature information of the target transcript document. The method can accurately extract the case features in a transcript, improves the efficiency of entering transcript information, and makes case features easier to query and use.
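Steps S101 to S104 can be sketched as one function that chains the four stages. This is only an outline under stated assumptions: the OCR step, the preprocessing step and the per-type models are passed in as callables, and the stand-ins used in the demonstration (`fake_ocr`, `join_pages`, `suspect_model`, `victim_model`) are hypothetical placeholders rather than the trained models the application describes.

```python
import re
from typing import Callable, Dict, List, Set

Features = Dict[str, str]


def extract_case_features(
    page_images: List[bytes],
    ocr: Callable[[bytes], str],
    preprocess: Callable[[List[str]], str],
    keyword_libraries: Dict[str, Set[str]],
    models: Dict[str, Callable[[str], Features]],
) -> Dict[str, object]:
    pages = [ocr(image) for image in page_images]   # S101: character recognition
    text = preprocess(pages)                        # S102: preprocessing
    doc_type = max(                                 # S103: type with most keyword hits
        keyword_libraries,
        key=lambda t: sum(kw in text for kw in keyword_libraries[t]),
    )
    features = models[doc_type](text)               # S104: type-specific extraction
    return {"type": doc_type, "features": features}


# Hypothetical stand-ins so that the sketch runs end to end.
def fake_ocr(image: bytes) -> str:
    return image.decode("utf-8")


def join_pages(pages: List[str]) -> str:
    return " ".join(pages)


def suspect_model(text: str) -> Features:
    phone = re.search(r"\b1\d{10}\b", text)
    return {"phone": phone.group(0) if phone else ""}


def victim_model(text: str) -> Features:
    amount = re.search(r"(\d+) yuan", text)
    return {"loss": amount.group(1) if amount else ""}


result = extract_case_features(
    [b"The suspect stated that his phone number is 13800000000."],
    fake_ocr,
    join_pages,
    {"suspect_transcript": {"suspect", "confess"},
     "victim_transcript": {"victim", "loss"}},
    {"suspect_transcript": suspect_model, "victim_transcript": victim_model},
)
print(result)
```

Keeping one model per transcript document type in a dictionary mirrors the idea that each type has its own extraction model selected after classification.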
In an optional embodiment, constructing the case feature information extraction model, as shown in Fig. 2, includes:
Step S201, obtaining, according to the transcript document type corresponding to the case feature information extraction model, sample transcript pictures corresponding to that transcript document type;
Step S202, performing character recognition on the sample transcript pictures to obtain a sample transcript document;
Step S203, sorting the sample transcript pages according to the page number of each sample transcript page in the sample transcript document, and deleting the interference information from the sorted sample transcript pages to obtain a preprocessed sample transcript document; the interference information comprises page numbers, signatures and line breaks;
Step S204, acquiring the case feature information base corresponding to the transcript document type from a pre-stored correspondence between transcript document types and case feature information bases;
Step S205, extracting, from the sample transcript document, a sample case feature information base matched with the acquired case feature information base;
Step S206, training a case feature information training model based on the sample transcript document and the sample case feature information base to obtain the case feature information extraction model corresponding to the transcript document type.
Specifically, the case feature information extraction model used in step S104 must be built in advance through training, and a separate model must be trained and built for each transcript document type.
For each transcript document type, a plurality of sample transcript pictures must be collected, and character recognition is performed on them with an OCR model to obtain the sample transcript document corresponding to each sample transcript picture.
To make the sample case feature information used for training more standardized, the transcript pages of the sample transcript document are first sorted by page number, and the page numbers are then deleted from the sorted pages so that the content runs continuously from page to page. The sample transcript document obtained through the OCR model also contains a line break between lines. Because the computer equipment recognizes sentences by format symbols and punctuation marks, these line breaks would break the continuity of sentences, so they are removed. Finally, the sample transcript document contains signatures (of the transcript subject, of the person who took the transcript, or of another person involved in the case). A signature is not a case feature of the transcript document, so the recognized signature text is removed to eliminate this interference.
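The preprocessing described above can be sketched with regular expressions. The page-number layout (`Page N` on its own line) and the signature marker (`Signature:`) are assumptions made for this illustration, since the text does not say how either appears in the OCR output; the `joiner` parameter is also hypothetical (an empty string suits Chinese text, a space suits the English demonstration).

```python
import re
from typing import List

# ASSUMED formats: real OCR output would need patterns fitted to actual transcripts.
PAGE_NO = re.compile(r"(?m)^\s*Page\s+(\d+)\s*$")
SIGNATURE = re.compile(r"(?m)^\s*Signature:.*$")


def preprocess(pages: List[str], joiner: str = "") -> str:
    """Sort pages by page number, then drop page numbers, signatures and line breaks."""

    def page_number(page: str) -> float:
        match = PAGE_NO.search(page)
        # Pages without a recognizable number are placed last.
        return int(match.group(1)) if match else float("inf")

    lines: List[str] = []
    for page in sorted(pages, key=page_number):
        page = PAGE_NO.sub("", page)      # remove the page number
        page = SIGNATURE.sub("", page)    # remove signature lines
        lines.extend(ln.strip() for ln in page.splitlines() if ln.strip())
    return joiner.join(lines)             # remove line breaks


pages = [
    "he transferred the money.\nSignature: Zhang San\nPage 2",
    "Q: What happened?\nA: After we chatted online,\nPage 1",
]
print(preprocess(pages, joiner=" "))
```

Sorting before cleaning matters here: once the page numbers are deleted there is nothing left to restore the page order from.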
This processing yields a standardized sample transcript document that the computer equipment can recognize well. The case feature information to be extracted is then determined by calling the pre-stored case feature information base of the transcript document type, and the corresponding case feature information is matched in the sample transcript document. The matched case feature information serves as the sample case feature information base corresponding to that sample transcript document.
The sample transcript documents and their corresponding sample case feature information bases are input into the case feature information training model for training. The case feature information training model extracts case feature information from a sample transcript document, and the extracted information is compared with the sample case feature information base to obtain an accuracy. If the accuracy does not reach a preset threshold, for example 95%, the parameters of the case feature information training model are adjusted and the case feature information of the sample transcript document is extracted again. If the accuracy reaches the preset threshold, training is deemed complete: the model is able to extract case feature information from the sample transcript documents, and the trained model is taken as the case feature information extraction model for that transcript document type.
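The extract, compare and adjust loop above can be illustrated with a deliberately simple stand-in model. The text does not specify the model architecture or how accuracy is computed, so both are assumptions here: `VocabularyModel` is a toy whose only "parameters" are a learned word list, and accuracy is taken as the average Jaccard overlap between extracted and expected features.

```python
from typing import List, Set, Tuple

Sample = Tuple[str, Set[str]]  # (sample transcript document, expected case features)


class VocabularyModel:
    """Toy stand-in for the training model; its parameters are a word list."""

    def __init__(self) -> None:
        self.vocab: Set[str] = set()

    def extract(self, doc: str) -> Set[str]:
        return {word for word in self.vocab if word in doc}

    def update(self, doc: str, expected: Set[str]) -> None:
        # Adjust parameters: learn one feature word that was missed on this document.
        missed = sorted(expected - self.extract(doc))
        if missed:
            self.vocab.add(missed[0])


def accuracy(model: VocabularyModel, samples: List[Sample]) -> float:
    """ASSUMED metric: mean Jaccard overlap of extracted and expected features."""
    scores = []
    for doc, expected in samples:
        predicted = model.extract(doc)
        union = predicted | expected
        scores.append(len(predicted & expected) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def train(model: VocabularyModel, samples: List[Sample],
          threshold: float = 0.95, max_rounds: int = 100) -> float:
    """Repeat extract, compare and adjust until accuracy reaches the threshold."""
    for _ in range(max_rounds):
        score = accuracy(model, samples)
        if score >= threshold:
            return score
        for doc, expected in samples:
            model.update(doc, expected)
    return accuracy(model, samples)


samples = [
    ("Wang Wu used a knife near the station.", {"Wang Wu", "knife"}),
    ("Wang Wu fled by motorcycle.", {"Wang Wu", "motorcycle"}),
]
model = VocabularyModel()
final_accuracy = train(model, samples)
print(final_accuracy, sorted(model.vocab))
```

The `max_rounds` cap is an addition for safety in this sketch, so that the loop ends even if the threshold is never reached.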
In this way a case feature information extraction model is constructed for each transcript document type. Once the models are in use, when the case feature information of a target transcript document needs to be extracted, the transcript document type of the target transcript document is confirmed and the corresponding case feature information extraction model is called. Because each model is trained for one specific transcript document type, the accuracy of the case feature information extracted from the target transcript document is improved.
In the embodiment of the application, the case feature information extraction model of each transcript document type is trained separately, which yields a library of case feature information extraction models that can handle many kinds of transcript documents. For each transcript document type, character recognition is performed on a plurality of sample transcript pictures to obtain the corresponding sample transcript documents. The transcript pages in each sample transcript document are sorted, and the page numbers, line breaks and signatures are then deleted to eliminate interference. The corresponding case feature information in the sample transcript document is matched against the case feature information base of the transcript document type to form the sample case feature information base for that document. The sample transcript document is input into the case feature information training model, and training is repeated, comparing the case feature information output by the model with the sample case feature information base each time, until the training result meets the requirement. The result is a case feature information extraction model for that transcript document type.
In an optional embodiment, the sample transcript document includes a sample transcript training document and a sample transcript test document, and the sample case feature information base includes a training sample case feature information base and a test sample case feature information base; step S206, training the case feature information training model based on the sample transcript document and the sample case feature information base to obtain the case feature information extraction model corresponding to the transcript document type, includes:
Step 2061, taking the sample transcript training document as the input of the case feature information training model, taking the training sample case feature information base corresponding to the sample transcript training document as the output of the case feature information training model, and training the case feature information training model;
Step 2062, verifying the trained case feature information training model by using the sample transcript test document and the test sample case feature information base, and obtaining the case feature information extraction model once verification passes.
In the training of the case feature information training model, the judgment of the training completion degree of the case feature information training model is based on the same or several sample stroke documents, so that the trained case feature information extraction model can only meet the requirements on the extraction accuracy of the sample stroke documents for training, but can not meet the requirements on other stroke documents.
Therefore, training the case feature information training model requires two kinds of sample transcript documents: sample transcript training documents and sample transcript test documents. The case feature information training model is first trained with the sample transcript training documents. When the accuracy of the case feature information output by the model reaches a preset threshold, the model is tested with the sample transcript test documents. The sample transcript test documents are likewise preprocessed, and a sample case feature information base corresponding to each of them is extracted. The sample transcript test document is input into the case feature information training model, and the output case feature information is compared with the sample case feature information base corresponding to the sample transcript test document to obtain the test accuracy. If this accuracy also reaches the preset threshold, the case feature information training model is confirmed as the case feature information extraction model for that transcript document type; if it does not, the case feature information training model is trained further with new sample transcript training documents.
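The train-then-verify procedure described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the patent does not specify the model architecture or the accuracy metric, so `KeywordModel`, `accuracy`, `mean_accuracy`, `train_and_verify`, the default threshold of 0.9 and the round limit are all hypothetical stand-ins.

```python
from typing import List, Set

FeatureBase = Set[str]  # hypothetical representation of a sample case feature information base


def accuracy(predicted: FeatureBase, expected: FeatureBase) -> float:
    """Fraction of the expected feature items that were reproduced (assumed metric)."""
    if not expected:
        return 1.0
    return len(predicted & expected) / len(expected)


def mean_accuracy(model, docs: List[str], bases: List[FeatureBase]) -> float:
    """Average accuracy of the model over a set of sample transcript documents."""
    scores = [accuracy(model.extract(doc), base) for doc, base in zip(docs, bases)]
    return sum(scores) / len(scores)


class KeywordModel:
    """Toy stand-in for the training model: it memorises the feature words it has seen."""

    def __init__(self) -> None:
        self.vocabulary: Set[str] = set()

    def fit(self, docs: List[str], bases: List[FeatureBase]) -> None:
        for base in bases:
            self.vocabulary |= base

    def extract(self, doc: str) -> FeatureBase:
        return {word for word in self.vocabulary if word in doc}


def train_and_verify(model, train_docs, train_bases, test_docs, test_bases,
                     threshold: float = 0.9, max_rounds: int = 10) -> bool:
    """Train until the training accuracy reaches the threshold, then verify on test documents."""
    for _ in range(max_rounds):
        model.fit(train_docs, train_bases)
        if mean_accuracy(model, train_docs, train_bases) < threshold:
            continue  # keep training before attempting verification
        if mean_accuracy(model, test_docs, test_bases) >= threshold:
            return True  # the training model becomes the extraction model
        # Verification failed: the text calls for further training with new
        # sample training documents; this sketch simply retrains on the same set.
    return False
```

In this sketch the test documents never influence `fit`, which mirrors the point of the text: a model judged only on its own training documents may not generalize to other transcript documents.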
In the embodiment of the present application, a case feature information extraction model is trained independently for each transcript document type, yielding a library of case feature information extraction models that can handle various transcript documents. For each transcript document type, a plurality of sample transcript training documents and a plurality of sample transcript test documents are prepared. After the transcript pages in a sample transcript training document are sorted, the page numbers, line feed characters and signatures are deleted to eliminate interference; the training case feature information in the sample transcript training document is matched against the case feature information base of the transcript document type to form the training sample case feature information base corresponding to the sample transcript training document. The sample transcript training document is input into the case feature information training model, and training is repeated, each time comparing the case feature information output by the model with the training sample case feature information base, until the result meets the requirement. The model is then tested with the sample transcript test documents: after the transcript pages in a sample transcript test document are sorted, the page numbers, line feed characters and signatures are deleted to eliminate interference, and the test case feature information in the sample transcript test document is matched against the case feature information base of the transcript document type to form the test sample case feature information base corresponding to the sample transcript test document. The case feature information output by the model for the sample transcript test document is compared with the test sample case feature information base. If the test result does not meet the requirement, the case feature information training model is trained further and tested again, until the test result meets the requirement, at which point the case feature information training model is determined to be the case feature information extraction model for that transcript document type.
In an optional embodiment, the step S205 of extracting, from the sample transcript document, a sample case feature information base that matches the obtained case feature information base includes:
step 2051, extracting sample case feature words matched with the acquired case feature information base from the sample transcript document;
step 2052, calculating the association degree between each sample case feature word and the other sample case feature words according to the sample transcript document, and forming the case feature words whose association degree exceeds a preset threshold into sample case feature phrases;
step 2053, taking the set of the sample case feature phrases and the sample case feature words that do not form a sample case feature phrase as the sample case feature information base.
Specifically, to train the case feature information training model to extract case feature information completely, when the sample case feature information base is extracted, case feature words whose association degree reaches the preset threshold need to be combined into phrases; these phrases, together with the remaining case feature words that do not form phrases, make up the sample case feature information base of the sample transcript document.
For example, the transcript contains: "Question: What is your name? Who are your partners? What is your telephone number? Answer: My name is Zhang San, my partners are Li Si and Wang Wu, and my telephone number is 137XXXXXXXX." The case feature words extracted according to the case feature information base are "Zhang San", "Li Si", "Wang Wu" and "telephone number 137XXXXXXXX". The association degree between "Zhang San" and "telephone number 137XXXXXXXX" is calculated to reach the preset threshold, so the phrase "Zhang San, telephone number 137XXXXXXXX" is obtained.
The above example is relatively simple; more complex association rules may also be employed, and this application places no limitation on them.
The sample case feature information base obtained in this way enables the trained case feature information extraction model to extract related case feature words under the correct case feature items, which avoids confusion among items of case feature information and improves the accuracy of case feature information extraction.
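Steps 2051 to 2053 can be illustrated with a small sketch. The patent does not define how the association degree is computed, so the sketch below takes the association function as a parameter; `build_sample_feature_base`, the score table and the threshold of 0.5 are hypothetical and serve only to reproduce the Zhang San example.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def build_sample_feature_base(
    words: List[str],
    association: Callable[[str, str], float],
    threshold: float,
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Group feature words whose association degree exceeds the threshold into phrases.

    Returns the phrases plus the feature words that did not join any phrase;
    together they form the sample case feature information base.
    """
    phrases: List[Tuple[str, str]] = []
    grouped = set()
    for first, second in combinations(words, 2):
        if association(first, second) > threshold:
            phrases.append((first, second))
            grouped.update((first, second))
    singles = [word for word in words if word not in grouped]
    return phrases, singles


# Hypothetical association scores for the example transcript; a real system
# would derive them from the sample transcript document itself.
SCORES = {frozenset(("Zhang San", "telephone number 137XXXXXXXX")): 0.9}


def example_association(first: str, second: str) -> float:
    return SCORES.get(frozenset((first, second)), 0.1)


words = ["Zhang San", "Li Si", "Wang Wu", "telephone number 137XXXXXXXX"]
phrases, singles = build_sample_feature_base(words, example_association, 0.5)
```

With these assumed scores, `phrases` holds the single pair ("Zhang San", "telephone number 137XXXXXXXX") and `singles` holds "Li Si" and "Wang Wu", matching the example in the text.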
In an alternative embodiment, preprocessing the target transcript document in step S102 includes:
step 1021, sorting the target transcript pages according to the page numbers of the target transcript pages in the target transcript document, and deleting the interference information in the sorted target transcript pages to obtain a preprocessed target transcript document; the interference information includes page numbers, signatures and line feed characters.
So that the case feature information in the target transcript document can be better extracted by the case feature information extraction model, the transcript pages in the target transcript document need to be sorted by page number, and the page numbers in the sorted transcript pages are then deleted, so that the content runs continuously from one transcript page to the next.
Line feed characters are present between the lines of the target transcript document obtained through the OCR character recognition model. Because the computer device recognizes sentences according to format characters and punctuation marks, retaining the line feed characters would cause a large amount of case feature information in the target transcript document to be split apart; the line feed characters in the target transcript document are therefore removed.
The target transcript document contains signatures of the person questioned, the person recording the transcript, or public security personnel involved in the case. These signatures cannot serve as case features of the transcript document, so the recognized signature text is removed to eliminate this interference.
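The preprocessing in step 1021 can be sketched as below. The patent does not say how page numbers and signatures appear in the OCR output, so the `Page N` footer line and the `Signature:` line prefix are assumptions made purely for illustration, as are the names `page_number` and `preprocess`.

```python
import re
from typing import List

# Assumed OCR conventions (not taken from the patent): each page carries a
# footer line such as "Page 2", and signatures appear on lines starting
# with "Signature:".
PAGE_NUMBER = re.compile(r"^\s*Page\s+(\d+)\s*$", re.MULTILINE)
SIGNATURE = re.compile(r"^\s*Signature:.*$", re.MULTILINE)


def page_number(page: str) -> int:
    """Read the page number from a page's footer line."""
    match = PAGE_NUMBER.search(page)
    if match is None:
        raise ValueError("page has no recognisable page number")
    return int(match.group(1))


def preprocess(pages: List[str]) -> str:
    """Sort pages by page number, then delete page numbers, signatures and line feeds."""
    cleaned = []
    for page in sorted(pages, key=page_number):
        text = PAGE_NUMBER.sub("", page)   # delete the page number
        text = SIGNATURE.sub("", text)     # delete signature lines
        text = text.replace("\r", "").replace("\n", "")  # delete line feeds
        cleaned.append(text)
    return "".join(cleaned)


pages = [
    "is Zhang San.\nSignature: Zhang San\nPage 2",
    "Q: What is your name?\nA: My name \nPage 1",
]
document = preprocess(pages)
```

Here the sentence split across the page break ("My name" / "is Zhang San.") is rejoined once the pages are ordered and the page numbers, signature line and line feeds are gone. Deleting line feeds outright suits Chinese text, which has no spaces between words; for space-delimited languages a space might be substituted instead.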
In this method, the characters in a target transcript picture, obtained by scanning the physical transcript with a scanner, are recognized to obtain the target transcript document corresponding to the target transcript picture. Because the target transcript document contains many interfering characters or symbols, the page numbers, line feed characters and signatures in it are deleted, yielding a preprocessed target transcript document from which the case feature information extraction model can extract case feature information accurately. The type of the target transcript document is identified according to the transcript document type keyword library corresponding to each transcript document type. The preprocessed target transcript document is then input into the case feature information extraction model matching that transcript document type, and the model extracts the case feature information of the target transcript document.
In the case feature information extraction method based on transcript information, target transcript documents of different transcript document types are standardized and interference information is eliminated; case feature information is then extracted in a targeted manner by the case feature information extraction model corresponding to the transcript document type of the target transcript document. Case feature information with a high recall rate and high accuracy is thereby obtained, the case feature information in the transcript can be extracted accurately, and the efficiency of entering transcript information is improved. When public security personnel later retrieve the case feature information, querying is easier and takes less time, which improves their working efficiency.
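Type identification by keyword library followed by dispatch to the matching model can be sketched as follows. The scoring rule (counting keyword occurrences), the type names and the stub models are assumptions for illustration; the patent states only that each transcript document type has a preset keyword library and its own extraction model.

```python
from typing import Callable, Dict, List

FeatureTable = List[Dict[str, str]]  # assumed tabular form of the extracted information


def identify_type(document: str, keyword_libraries: Dict[str, List[str]]) -> str:
    """Pick the transcript document type whose keywords occur most often (assumed rule)."""
    scores = {
        doc_type: sum(document.count(keyword) for keyword in keywords)
        for doc_type, keywords in keyword_libraries.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        raise ValueError("no transcript document type matched")
    return best


def extract_case_features(
    document: str,
    keyword_libraries: Dict[str, List[str]],
    models: Dict[str, Callable[[str], FeatureTable]],
) -> FeatureTable:
    """Route the preprocessed document to the model of its identified type."""
    return models[identify_type(document, keyword_libraries)](document)


# Hypothetical keyword libraries and stub models, keyed by case type / object type.
KEYWORDS = {
    "theft/suspect": ["stole", "steal"],
    "fraud/victim": ["cheated", "transferred"],
}
MODELS = {
    "theft/suspect": lambda doc: [{"item": "case type", "value": "theft"}],
    "fraud/victim": lambda doc: [{"item": "case type", "value": "fraud"}],
}

table = extract_case_features("I stole a phone from the shop", KEYWORDS, MODELS)
```

Keeping the models in a dictionary keyed by type reflects the design described in the text: a new transcript document type is supported by adding one keyword library and one trained model, without touching the others.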
The embodiment of the application provides a case feature information extraction device based on stroke information, as shown in fig. 3, the device includes:
the recognition module 30 is configured to obtain a target transcript picture containing transcript information, and perform text recognition on the target transcript picture to obtain a target transcript document;
a preprocessing module 31, configured to preprocess the target transcript document;
the type module 32 is configured to determine the transcript document type of the preprocessed target transcript document according to a preset transcript document type keyword library corresponding to each transcript document type;
the extracting module 33 is configured to input the target transcript document into a case feature information extracting model corresponding to the transcript document type, so as to obtain case feature information of the transcript information.
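The four modules listed above can be pictured as one object that chains them in order. This is a structural sketch only: `ExtractionDevice` and its field names are hypothetical, and each module is represented by a plain callable rather than a real OCR engine or trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ExtractionDevice:
    """Hypothetical wiring of modules 30 to 33; every callable is a stand-in."""

    recognize: Callable[[List[bytes]], List[str]]   # recognition module 30
    preprocess: Callable[[List[str]], str]          # preprocessing module 31
    determine_type: Callable[[str], str]            # type module 32
    models: Dict[str, Callable[[str], dict]]        # extracting module 33

    def run(self, pictures: List[bytes]) -> dict:
        pages = self.recognize(pictures)            # pictures -> transcript pages
        document = self.preprocess(pages)           # pages -> cleaned document
        doc_type = self.determine_type(document)    # document -> document type
        return self.models[doc_type](document)      # type-specific extraction


device = ExtractionDevice(
    recognize=lambda pictures: [p.decode("utf-8") for p in pictures],
    preprocess=lambda pages: "".join(pages),
    determine_type=lambda document: "theft/suspect",
    models={"theft/suspect": lambda document: {"text": document}},
)
result = device.run([b"My name is ", b"Zhang San"])
```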
In an optional embodiment, the extracting module 33 includes a model building unit 331, configured to: obtain, according to the transcript document type corresponding to the case feature information extraction model, a sample transcript picture corresponding to that transcript document type;
perform text recognition on the sample transcript picture to obtain a sample transcript document;
sort the sample transcript pages according to the page numbers of the sample transcript pages in the sample transcript document, and delete the interference information in the sorted sample transcript pages to obtain a preprocessed sample transcript document; the interference information includes page numbers, signatures and line feed characters;
acquire the case feature information base corresponding to the transcript document type from the pre-stored correspondence between each transcript document type and its case feature information base;
extract a sample case feature information base matched with the acquired case feature information base from the sample transcript document;
and train a case feature information training model based on the sample transcript document and the sample case feature information base to obtain a case feature information extraction model corresponding to the transcript document type.
In an alternative embodiment, the preprocessing module 31 includes:
the processing unit 311 is configured to sort the target transcript pages according to the page numbers of the target transcript pages in the target transcript document, and delete the interference information in the sorted target transcript pages to obtain a preprocessed target transcript document; the interference information includes page numbers, signatures and line feed characters.
Corresponding to a case feature information extraction method based on the transcript information in fig. 1, the embodiment of the present application further provides a computer device 400, as shown in fig. 4, where the device includes a memory 401, a processor 402, and a computer program stored in the memory 401 and capable of running on the processor 402, where the processor 402 implements the case feature information extraction method based on the transcript information when executing the computer program.
Specifically, the memory 401 and the processor 402 can be general-purpose memories and processors, which are not limited herein, and when the processor 402 runs the computer program stored in the memory 401, the method for extracting case feature information based on the transcript information can be executed, so that the problem of how to effectively extract the case feature in the transcript information in the prior art is solved.
Corresponding to the case feature information extraction method based on the transcript information in fig. 1, the embodiment of the present application further provides a computer readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the case feature information extraction method based on the transcript information.
Specifically, the storage medium can be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, the case feature information extraction method based on the transcript information can be executed, which solves the prior-art problem of how to effectively extract case features from transcript information. The case feature information extraction method based on the transcript information can accurately extract the case features in a transcript, improves the efficiency of entering transcript information, and makes the case features easier to query and use.
In the embodiments provided in the present application, it should be understood that the disclosed methods and apparatuses may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments provided in the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
It should be noted that: like reference numerals and letters in the following figures denote like items, and thus once an item is defined in one figure, no further definition or explanation of it is required in the following figures, and furthermore, the terms "first," "second," "third," etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the foregoing examples are merely specific embodiments of the present application, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing examples, those skilled in the art will appreciate that anyone familiar with the art may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be encompassed within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. A case feature information extraction method based on transcript information, characterized by comprising the following steps:
acquiring a target transcript picture containing transcript information, and performing text recognition on the target transcript picture to obtain a target transcript document;
preprocessing the target transcript document;
determining the transcript document type of the preprocessed target transcript document according to a preset transcript document type keyword library corresponding to each transcript document type; the transcript document types are divided according to the case type and the object type of the target transcript document;
inputting the target transcript document into a case feature information extraction model corresponding to the transcript document type to obtain case feature information of the transcript information; the case feature information takes the form of a table or a data table;
wherein constructing the case feature information extraction model comprises: obtaining, according to the transcript document type corresponding to the case feature information extraction model, a sample transcript picture corresponding to the transcript document type; performing text recognition on the sample transcript picture to obtain a sample transcript document; sorting the sample transcript pages according to the page numbers of the sample transcript pages in the sample transcript document, and deleting the interference information in the sorted sample transcript pages to obtain a preprocessed sample transcript document; the interference information comprises page numbers, signatures and line feed characters; acquiring the case feature information base corresponding to the transcript document type from the pre-stored correspondence between each transcript document type and its case feature information base; extracting sample case feature words matched with the acquired case feature information base from the sample transcript document; calculating the association degree between each sample case feature word and the other sample case feature words according to the sample transcript document, and forming the case feature words whose association degree exceeds a preset threshold into sample case feature phrases; taking the set of the sample case feature phrases and the sample case feature words that do not form a sample case feature phrase as a sample case feature information base; and training a case feature information training model based on the sample transcript document and the sample case feature information base to obtain the case feature information extraction model corresponding to the transcript document type.
2. The method of claim 1, wherein the sample transcript document includes a sample transcript training document and a sample transcript test document, and the sample case feature information base includes a training sample case feature information base and a test sample case feature information base; training the case feature information training model based on the sample transcript document and the sample case feature information base to obtain the case feature information extraction model corresponding to the transcript document type comprises:
taking the sample transcript training document as the input of the case feature information training model, taking the training sample case feature information base corresponding to the sample transcript training document as the output of the case feature information training model, and training the case feature information training model;
and verifying the trained case feature information training model by using the sample transcript test document and the test sample case feature information base, and obtaining the case feature information extraction model once verification passes.
3. The method of claim 1, wherein preprocessing the target transcript document comprises:
sorting the target transcript pages according to the page numbers of the target transcript pages in the target transcript document, and deleting the interference information in the sorted target transcript pages to obtain a preprocessed target transcript document; the interference information includes page numbers, signatures and line feed characters.
4. A case feature information extraction device based on transcript information, characterized by comprising:
a recognition module, configured to acquire a target transcript picture containing transcript information, and perform text recognition on the target transcript picture to obtain a target transcript document;
a preprocessing module, configured to preprocess the target transcript document;
a type module, configured to determine the transcript document type of the preprocessed target transcript document according to a preset transcript document type keyword library corresponding to each transcript document type; the transcript document types are divided according to the case type and the object type of the target transcript document;
an extraction module, configured to input the target transcript document into a case feature information extraction model corresponding to the transcript document type to obtain case feature information of the transcript information; the case feature information takes the form of a table or a data table;
the extraction module comprises a model construction unit and a file extraction unit, wherein the model construction unit is configured to: obtain, according to the transcript document type corresponding to the case feature information extraction model, a sample transcript picture corresponding to the transcript document type; perform text recognition on the sample transcript picture to obtain a sample transcript document; sort the sample transcript pages according to the page numbers of the sample transcript pages in the sample transcript document, and delete the interference information in the sorted sample transcript pages to obtain a preprocessed sample transcript document; the interference information comprises page numbers, signatures and line feed characters; acquire the case feature information base corresponding to the transcript document type from the pre-stored correspondence between each transcript document type and its case feature information base; extract sample case feature words matched with the acquired case feature information base from the sample transcript document; calculate the association degree between each sample case feature word and the other sample case feature words according to the sample transcript document, and form the case feature words whose association degree exceeds a preset threshold into sample case feature phrases; take the set of the sample case feature phrases and the sample case feature words that do not form a sample case feature phrase as a sample case feature information base; and train a case feature information training model based on the sample transcript document and the sample case feature information base to obtain the case feature information extraction model corresponding to the transcript document type.
5. The apparatus of claim 4, wherein the preprocessing module comprises:
a processing unit, configured to sort the target transcript pages according to the page numbers of the target transcript pages in the target transcript document, and delete the interference information in the sorted target transcript pages to obtain a preprocessed target transcript document; the interference information includes page numbers, signatures and line feed characters.
6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of the preceding claims 1-3 when the computer program is executed.
7. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the method of any of the preceding claims 1-3.
CN201911176959.XA 2019-11-26 2019-11-26 Case feature information extraction method and device based on stroke information Active CN110955796B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911176959.XA CN110955796B (en) 2019-11-26 2019-11-26 Case feature information extraction method and device based on stroke information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911176959.XA CN110955796B (en) 2019-11-26 2019-11-26 Case feature information extraction method and device based on stroke information

Publications (2)

Publication Number Publication Date
CN110955796A CN110955796A (en) 2020-04-03
CN110955796B CN110955796B (en) 2023-05-02

Family

ID=69977030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911176959.XA Active CN110955796B (en) 2019-11-26 2019-11-26 Case feature information extraction method and device based on stroke information

Country Status (1)

Country Link
CN (1) CN110955796B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639479A (en) * 2020-04-30 2020-09-08 广州华资软件技术有限公司 Intelligent auxiliary case handling method based on deep learning
CN113111829B (en) * 2021-04-23 2023-04-07 杭州睿胜软件有限公司 Method and device for identifying document

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005043990A (en) * 2003-07-23 2005-02-17 Toshiba Corp Document processor and document processing method
CN107766371A (en) * 2016-08-19 2018-03-06 中兴通讯股份有限公司 A kind of text message sorting technique and its device
CN109472722A (en) * 2017-09-08 2019-03-15 北京国双科技有限公司 Obtain the method and device that judgement document to be generated finds out section relevant information through trying
CN109800304A (en) * 2018-12-29 2019-05-24 北京奇安信科技有限公司 Processing method, device, equipment and the medium of case notes
CN109871452A (en) * 2019-01-31 2019-06-11 深度好奇(北京)科技有限公司 Determine the method, apparatus and storage medium of characteristics of crime
CN110020424A (en) * 2019-01-04 2019-07-16 阿里巴巴集团控股有限公司 Extracting method, the extracting method of device and text information of contract information

Also Published As

Publication number Publication date
CN110955796A (en) 2020-04-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant