CN114003750A - Material online method, device, equipment and storage medium - Google Patents

Material online method, device, equipment and storage medium

Info

Publication number
CN114003750A
Authority
CN
China
Prior art keywords
text
label
character
information
text information
Prior art date
Legal status
Granted
Application number
CN202111273705.7A
Other languages
Chinese (zh)
Other versions
CN114003750B (en)
Inventor
李志宏
刘雪毅
王浩
卢春辉
范宜发
王芳
Current Assignee
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202111273705.7A
Publication of CN114003750A
Application granted
Publication of CN114003750B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/103Workflow collaboration or project management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02Banking, e.g. interest calculation or account maintenance

Abstract

The invention relates to the field of online material management, and discloses a material online method, device, equipment and storage medium. The method acquires the material a user has uploaded for audit; performs text recognition on its pictures with an OCR text recognition technology to obtain picture-recognition text information; runs a word segmentation tool over the picture-recognition text information and the material text information to obtain word segments; filters the word segments against a sensitive-word lexicon; and, if filtering succeeds, matches the word segments against a big data label library to obtain the corresponding label, adds that label to the material's audit page, and publishes the material online once the label has been signed off. By applying OCR text recognition to uploaded picture content to obtain its text, labelling that text through a label library and filtering sensitive words, the method quickly labels both picture and text content and verifies that the content is legal and compliant, reducing manual steps and thereby putting materials online faster.

Description

Material online method, device, equipment and storage medium
Technical Field
The invention relates to the field of online material management, in particular to a material online method, a device, equipment and a storage medium.
Background
With the development of artificial intelligence technology, and especially of business online-audit applications, intelligent online auditing has become a trend. In the existing approach, once a picture and its text material to be put online are displayed in the material library and other relevant operation positions, the picture and the text material are labelled separately. However, existing labelling only assigns labels based on the literal semantics of individual words in the picture or text material: if the material contains several words with different semantics, it receives several labels, and those multiple labels impose multiple audit standards, so the final online classification of the service is inaccurate and audit efficiency is low.
Disclosure of Invention
The invention mainly aims to solve the technical problem of low labelling accuracy in the existing material online process.
A first aspect of the invention provides a material online method, comprising the following steps: acquiring material information uploaded to a material system by a user terminal, wherein the material information comprises at least material pictures and material text information; performing text recognition on the material pictures through a preset optical character recognition technology to obtain picture-recognition text information; performing a word segmentation operation on the picture-recognition text information and the material text information with a word segmentation tool to obtain word segments; filtering the word segments against a preset sensitive-word lexicon to obtain a filtering result, the filtering result being either success or failure; if the filtering result is failure, marking the word segments identified as sensitive words in the picture-recognition text information and the material text information, and returning the marked picture-recognition text information and material text information to the user terminal; if the filtering result is success, screening big data labels from a big data label library, via the feature words and exclusion words preset in the library, as candidate labels for the word segments; counting the positions and occurrence counts of the feature words and exclusion words of each candidate label within the picture-recognition text information and the material text information; calculating a weight score for each candidate label from a preset weight formula, the positions and the counts; taking the candidate label with the largest weight score as the label of the material; and adding the label to the corresponding material information and sending the labelled material information to the material system for material online publication.
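The patent leaves the weight formula itself open. A minimal sketch of one plausible scoring rule in Python: occurrences of a label's feature words near the start of the text (a stand-in for "position") weigh more than later ones, and the highest-scoring candidate wins. The title span and the two weights are assumptions, not the patent's formula:

```python
def score_labels(text, candidates, title_len=20, title_weight=2.0, body_weight=1.0):
    """Score each candidate label by where and how often its feature words
    appear in the text, then return the best label. The weight formula is
    illustrative; the patent only says position and frequency feed a
    preset formula."""
    scores = {}
    for label, feature_words in candidates.items():
        score = 0.0
        for word in feature_words:
            start = 0
            while (pos := text.find(word, start)) != -1:
                # Occurrences inside the leading "title" span weigh more.
                score += title_weight if pos < title_len else body_weight
                start = pos + len(word)
        scores[label] = score
    # The candidate label with the largest weight score labels the material.
    return max(scores, key=scores.get) if scores else None
```

For example, a text mentioning "credit card" twice (once in the leading span) outscores a label whose single feature word appears only once in the body.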
Optionally, in a first implementation manner of the first aspect of the present invention, performing text recognition on the material picture through a preset optical character recognition technology to obtain picture-recognition text information comprises: recognising the characters in the material picture through a preset optical character recognition technology to obtain a character set for the picture; converting each character in the set into a character vector and feeding the vector into a preset semantic correction model to obtain a predicted character for each character; and concatenating the predicted characters in order to obtain the picture-recognition text information of the material picture.
Optionally, in a second implementation manner of the first aspect of the present invention, converting each character in the character set into a character vector and feeding it into a preset semantic correction model to obtain a predicted character comprises: converting each character into a character vector and inputting it into the semantic correction model; within the model, multiplying each character vector by predetermined matrices WQ, WK and WV to obtain the query, key and value vectors for that character; and performing encoding and decoding operations on the query, key and value vectors to obtain the predicted character for each character.
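The WQ/WK/WV multiplication described above is the query/key/value projection of standard self-attention. A NumPy sketch of that step, assuming scaled dot-product attention for the unspecified "encoding and decoding operations" (the patent does not fix the shapes or the normalisation):

```python
import numpy as np

def correct_characters(char_vectors, wq, wk, wv):
    """One self-attention pass over OCR character vectors, as in the
    semantic-correction step: each character vector is multiplied by
    WQ, WK and WV, then attention mixes context into every position."""
    q = char_vectors @ wq          # query vectors
    k = char_vectors @ wk          # key vectors
    v = char_vectors @ wv          # value vectors
    scale = np.sqrt(k.shape[-1])
    scores = q @ k.T / scale       # pairwise attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per row
    return weights @ v             # context-aware character representations
```

A decoder head on top of these representations would then emit the predicted character per position; that head is omitted here.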
Optionally, in a third implementation manner of the first aspect of the present invention, performing the word segmentation operation on the picture-recognition text information and the material text information to obtain word segments comprises: calling a preset word segmentation tool and, via the tool's dictionary, dividing the picture-recognition text information and the material text information into N reference segments of preset character lengths, N being a natural number of at least 1; for each reference segment produced by the tool, looking up the preset dictionary corresponding to its character length and judging whether the segment exists in that dictionary; and, for reference segments not present in the preset dictionary, screening them with a fuzzy matching algorithm to obtain the final word segments.
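The segmentation-and-screening step can be sketched as follows, using forward maximum matching for the dictionary pass and Python's difflib as the "fuzzy matching algorithm"; the length-keyed dictionary layout and the 0.8 cutoff are assumptions the patent does not fix:

```python
from difflib import get_close_matches

def segment(text, dictionaries, max_len=4):
    """Forward maximum matching against length-keyed dictionaries, then a
    fuzzy pass for segments the dictionaries do not contain."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary match starting at position i.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in dictionaries.get(length, set()):
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    vocabulary = set().union(*dictionaries.values())
    screened = []
    for tok in tokens:
        if tok in vocabulary:
            screened.append(tok)
        else:
            # Keep only the nearest dictionary word, if any is close enough.
            screened.extend(get_close_matches(tok, vocabulary, n=1, cutoff=0.8))
    return screened
```

Off-dictionary fragments that have no close dictionary neighbour are dropped, which is one reasonable reading of "screening" the unstored segments.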
Optionally, in a fourth implementation manner of the first aspect of the present invention, screening big data labels from the big data label library as candidate labels for the word segments, via the feature words and exclusion words preset in the library, comprises: traversing the big data label library with the feature words and exclusion words and matching out the big data labels corresponding to the word segments; judging whether the word segments contain an exclusion word of a matched big data label; if so, determining that the word segments have no candidate label for it; and if not, taking that big data label as a candidate label for the word segments.
Optionally, in a fifth implementation manner of the first aspect of the present invention, before acquiring the material information uploaded to the material system by the user terminal, the method further comprises: obtaining historical audit texts, performing word segmentation on them to obtain historical word segments, and converting the segments into word vectors; randomly selecting the vectors of N historical word segments as initial cluster centres, N being a positive integer; calculating the cosine distance from every historical word segment to each initial cluster centre; assigning each historical word segment to the label category of its nearest cluster centre; computing the mean vector of the word vectors in each label category, reselecting the current cluster centre from that mean vector, and calculating the cosine distance between the current centre and the corresponding initial centre; if that distance is at most a preset threshold, outputting the N label categories covering all historical word segments together with their cluster centres; if it exceeds the threshold, re-clustering from the current centres until the distance between the current centre and the previous centre is at most the threshold, then outputting the N label categories and their centres; taking each cluster centre as a big data label and the historical word segments in its category as that label's feature words; and building the big data label library from the big data labels and feature words.
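The loop above is k-means with cosine distance. A pure-Python sketch with random initial centres and the description's convergence test (every centre moves at most a preset threshold); the threshold value and vector dimensionality are illustrative:

```python
import random

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return 1.0 - dot / (na * nb)

def cluster_labels(vectors, n, threshold=1e-3, seed=0):
    """K-means over word vectors with cosine distance: assign each vector
    to its nearest centre, recompute mean vectors, stop once every centre
    moves by at most the threshold."""
    rng = random.Random(seed)
    centres = rng.sample(vectors, n)
    while True:
        groups = [[] for _ in range(n)]
        for vec in vectors:
            idx = min(range(n), key=lambda i: cosine_distance(vec, centres[i]))
            groups[idx].append(vec)
        new_centres = []
        for grp, old in zip(groups, centres):
            if not grp:
                new_centres.append(old)   # keep an empty cluster's old centre
                continue
            new_centres.append([sum(dim) / len(grp) for dim in zip(*grp)])
        if all(cosine_distance(c, o) <= threshold for c, o in zip(new_centres, centres)):
            return groups, new_centres    # label categories and their centres
        centres = new_centres
```

Each returned centre would then stand for one big data label, with the word segments in its group as that label's feature words.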
Optionally, in a sixth implementation manner of the first aspect of the present invention, adding the label to the corresponding material information and sending the labelled material information to the material system for material online publication comprises: adding the label to the corresponding material information and uploading the labelled material information to the material system; submitting the labelled material information, via the material system, to a preset office automation system for process sign-off; and publishing the labelled material information online once the office automation system has completed the sign-off.
A second aspect of the present invention provides a material online apparatus, comprising: an acquisition module for acquiring material information uploaded to a material system by a user terminal, the material information comprising at least material pictures and material text information; a recognition module for performing text recognition on the material pictures through a preset optical character recognition technology to obtain picture-recognition text information; a word segmentation module for performing a word segmentation operation on the picture-recognition text information and the material text information with a word segmentation tool to obtain word segments; a filtering module for filtering the word segments against a preset sensitive-word lexicon to obtain a filtering result of success or failure; a marking module for, when filtering fails, marking the word segments identified as sensitive words in the picture-recognition text information and the material text information and returning the marked information to the user terminal; a screening module for, when filtering succeeds, screening big data labels from a big data label library, via the feature words and exclusion words preset in the library, as candidate labels for the word segments; a counting module for counting the positions and occurrence counts of the feature words and exclusion words of each candidate label within the picture-recognition text information and the material text information; a calculating module for calculating a weight score for each candidate label from a preset weight formula, the positions and the counts; a labelling module for taking the candidate label with the largest weight score as the label of the material; and an online module for adding the label to the corresponding material information and sending the labelled material information to the material system for material online publication.
Optionally, in a first implementation manner of the second aspect of the present invention, the recognition module specifically comprises: a character-set recognition unit for recognising the characters in the material picture through a preset optical character recognition technology to obtain the picture's character set; a character correction unit for converting each character in the set into a character vector and feeding it into a preset semantic correction model to obtain a predicted character for each character; and a character summarising unit for concatenating the predicted characters in order to obtain the picture-recognition text information of the material picture.
Optionally, in a second implementation manner of the second aspect of the present invention, the character correction unit is specifically configured to: convert each character into a character vector and input it into the semantic correction model; within the model, multiply each character vector by predetermined matrices WQ, WK and WV to obtain the query, key and value vectors for that character; and perform encoding and decoding operations on the query, key and value vectors to obtain the predicted character for each character.
Optionally, in a third implementation manner of the second aspect of the present invention, the word segmentation module is configured to: call a preset word segmentation tool and, via the tool's dictionary, divide the picture-recognition text information and the material text information into N reference segments of preset character lengths, N being a natural number of at least 1; for each reference segment produced by the tool, look up the preset dictionary corresponding to its character length and judge whether the segment exists in that dictionary; and screen the reference segments absent from the preset dictionary with a fuzzy matching algorithm to obtain the final word segments.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the screening module is specifically configured to: traverse the big data label library with the feature words and exclusion words and match out the big data labels corresponding to the word segments; judge whether the word segments contain an exclusion word of a matched big data label; if so, determine that the word segments have no candidate label for it; and if not, take that big data label as a candidate label for the word segments.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the material online apparatus further comprises a label clustering module configured to: obtain historical audit texts, perform word segmentation on them to obtain historical word segments, and convert the segments into word vectors; randomly select the vectors of N historical word segments as initial cluster centres, N being a positive integer; calculate the cosine distance from every historical word segment to each initial cluster centre; assign each historical word segment to the label category of its nearest cluster centre; compute the mean vector of the word vectors in each label category, reselect the current cluster centre from that mean vector, and calculate the cosine distance between the current centre and the corresponding initial centre; if that distance is at most a preset threshold, output the N label categories covering all historical word segments together with their cluster centres; if it exceeds the threshold, re-cluster from the current centres until the distance between the current centre and the previous centre is at most the threshold, then output the N label categories and their centres; take each cluster centre as a big data label and the historical word segments in its category as that label's feature words; and build the big data label library from the big data labels and feature words.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the online module is specifically configured to: add the label to the corresponding material information and upload the labelled material information to the material system; submit the labelled material information, via the material system, to a preset office automation system for process sign-off; and publish the labelled material information online once the office automation system has completed the sign-off.
A third aspect of the invention provides a material online device, comprising a memory storing instructions and at least one processor, the memory and the at least one processor being interconnected by a line; the at least one processor calls the instructions in the memory to cause the material online device to perform the steps of the material online method above.
A fourth aspect of the present invention provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the steps of the material online method above.
According to the technical scheme provided by the invention, material information uploaded to a material system by a user terminal is acquired, the material information comprising at least material pictures and material text information; text recognition is performed on the material pictures through a preset optical character recognition technology to obtain picture-recognition text information; a word segmentation operation is performed on the picture-recognition text information and the material text information with a word segmentation tool to obtain word segments; the word segments are filtered against a preset sensitive-word lexicon to obtain a filtering result of success or failure; if the result is failure, the word segments identified as sensitive words are marked in the picture-recognition text information and the material text information and the marked information is returned to the user terminal; if the result is success, big data labels are screened from a big data label library, via the feature words and exclusion words preset in the library, as candidate labels for the word segments; the positions and occurrence counts of each candidate label's feature words and exclusion words within the picture-recognition text information and the material text information are counted; a weight score is calculated for each candidate label from a preset weight formula, the positions and the counts; the candidate label with the largest weight score is taken as the label of the material; and the label is added to the corresponding material information, which is then sent to the material system for material online publication.
By applying optical character recognition to uploaded picture content to obtain its text, labelling that text through a label library and filtering sensitive words, the scheme quickly labels both picture and text content and verifies that the content is legal and compliant, reducing manual steps and thereby putting materials online faster.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of the material online method in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a second embodiment of the material online method in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a third embodiment of the material online method in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a fourth embodiment of the material online method in an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of the material online apparatus in an embodiment of the present invention;
FIG. 6 is a schematic diagram of another embodiment of the material online apparatus in an embodiment of the present invention;
FIG. 7 is a schematic diagram of an embodiment of the material online device in an embodiment of the present invention.
Detailed Description
An embodiment of the invention provides a material online method, apparatus, device and storage medium, the method comprising: acquiring material information uploaded to a material system by a user terminal, the material information comprising at least material pictures and material text information; performing text recognition on the material pictures through a preset optical character recognition technology to obtain picture-recognition text information; performing a word segmentation operation on the picture-recognition text information and the material text information with a word segmentation tool to obtain word segments; filtering the word segments against a preset sensitive-word lexicon to obtain a filtering result of success or failure; if the result is failure, marking the word segments identified as sensitive words in the picture-recognition text information and the material text information and returning the marked information to the user terminal; if the result is success, screening big data labels from a big data label library, via the feature words and exclusion words preset in the library, as candidate labels for the word segments; counting the positions and occurrence counts of each candidate label's feature words and exclusion words within the picture-recognition text information and the material text information; calculating a weight score for each candidate label from a preset weight formula, the positions and the counts; taking the candidate label with the largest weight score as the label of the material; and adding the label to the corresponding material information and sending the labelled material information to the material system for material online publication. By applying optical character recognition to uploaded picture content to obtain its text, labelling that text through a label library and filtering sensitive words, the method quickly labels both picture and text content and verifies that the content is legal and compliant, reducing manual steps and thereby putting materials online faster.
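The sensitive-word filtering step above can be sketched as a lexicon lookup over the word segments; the success/failure result shape follows the description, while the plain set lookup (rather than, say, an Aho-Corasick automaton over raw text) is an implementation assumption:

```python
def filter_sensitive(segments, lexicon):
    """Check word segments against a preset sensitive-word lexicon.
    Returns ("success", []) when nothing matches, otherwise
    ("failure", hits) with the offending segments, which the caller
    would mark and return to the user terminal."""
    hits = [seg for seg in segments if seg in lexicon]
    return ("success", []) if not hits else ("failure", hits)
```

Only a success result lets the pipeline proceed to label screening; a failure result short-circuits straight back to the uploader.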
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a specific flow of an embodiment of the present invention is described below, and referring to fig. 1, a first embodiment of a material online method in an embodiment of the present invention includes:
Step 101, acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises a material picture and material text information;
In this embodiment, the material is an advertisement material; the basic advertisement material forms are material data of the character, picture and Flash types. To ensure that the content of the advertisement material data is legal and can be accessed normally after going online, the material needs to be audited. When uploading materials to the audit page, a user can upload material pictures and material text information separately; alternatively, an office suite interface can be connected to the system background so that the user can upload a whole document, including but not limited to the PPT and DOC formats, at one time, and the system parses it online to obtain the material pictures and material text information contained in the document.
Step 102, performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information;
In this embodiment, because material pictures differ in size and style, the regions where the text information of a material picture is located also differ. The text region in the material picture therefore needs to be determined through an image detection algorithm first, and the characters in the text region are then recognized through the optical character recognition technology.
In this step, the image detection algorithm locates the region where characters of an unspecified font appear in a given picture containing such characters, so as to obtain the character region. Optionally, the located text region is a rectangular frame. For example, the image detection algorithm frames the character region in an original image containing characters of an unspecified font, while regions without such characters (such as a blank background) are not considered. The image detection algorithm may be the dlib image detection algorithm, an opencv library image detection algorithm, or an image detection algorithm based on deep learning.
Preferably, in this embodiment, an image detection algorithm based on deep learning is adopted to detect the original image set and extract the text regions of characters with unspecified fonts. By acquiring these text regions, the characters can be processed more efficiently in the subsequent steps.
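As a minimal illustration of the idea of locating rectangular character regions before OCR, the toy sketch below finds bounding boxes of connected dark-pixel components in a binarized 0/1 grid. This is not the dlib, opencv, or deep-learning detector the embodiment mentions; it only demonstrates the general region-localization step on made-up data.

```python
# Toy text-region detection on a binarized image: find the bounding box of
# every connected blob of 1-pixels (hypothetical stand-in for character areas).
def find_text_regions(binary):
    """Return bounding boxes (top, left, bottom, right) of connected 1-pixel blobs."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == 1 and not seen[r][c]:
                # flood-fill one component, tracking its bounding box
                stack, top, left, bottom, right = [(r, c)], r, c, r, c
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    top, left = min(top, y), min(left, x)
                    bottom, right = max(bottom, y), max(right, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and binary[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes

image = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
print(find_text_regions(image))  # two blobs -> two rectangular regions
```

Each returned rectangle would then be cropped and passed to the character recognition model described in step 102.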
Step 103, performing a word segmentation operation on the picture recognition text information and the material text information by using a word segmentation tool to obtain text participles;
In this embodiment, the system performs word segmentation with the jieba word segmentation tool: by looking up a dictionary, the tool generates a directed acyclic graph of the sentence, then searches for the best path according to the dictionary and the selected mode, and segments the sentence accordingly.
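Before the DAG-based approach is detailed in the third embodiment, a much simpler dictionary-driven segmentation scheme (greedy forward maximum matching) can illustrate the role the dictionary plays. This is a sketch, not jieba's actual algorithm, and the dictionary entries are hypothetical; English tokens stand in for Chinese characters.

```python
# Forward maximum matching: at each position, greedily take the longest phrase
# found in the dictionary. A toy stand-in for dictionary-based segmentation.
DICTIONARY = {"material", "picture", "text", "online", "material picture"}  # hypothetical

def forward_max_match(tokens, dictionary, max_len=3):
    """Greedily take the longest dictionary phrase starting at each position."""
    result, i = [], 0
    while i < len(tokens):
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + size])
            if size == 1 or phrase in dictionary:
                # single tokens always fall through as their own segment
                result.append(phrase)
                i += size
                break
    return result

print(forward_max_match("material picture text online".split(), DICTIONARY))
# greedy longest match: ['material picture', 'text', 'online']
```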
Step 104, filtering the text participles for sensitive words according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result is either filtering success or filtering failure;
In this embodiment, the text participles are compared with the sensitive word lexicon by traversal: if any text participle is found in the sensitive word lexicon, the filtering fails; if none of the text participles matches a word in the sensitive word lexicon, the filtering succeeds.
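The traversal comparison can be sketched in a few lines. The lexicon contents below are placeholders; on failure the offending words are returned so they can be marked for the user as in step 105.

```python
# Traverse every text participle against the sensitive-word lexicon; the
# filter succeeds only when no participle appears in the lexicon.
SENSITIVE_LEXICON = {"scam", "guaranteed profit"}  # hypothetical entries

def filter_sensitive(words, lexicon):
    hits = [w for w in words if w in lexicon]  # traverse every segment
    return ("success", []) if not hits else ("failure", hits)

print(filter_sensitive(["gold", "price", "scam"], SENSITIVE_LEXICON))  # failure, ['scam']
print(filter_sensitive(["gold", "price"], SENSITIVE_LEXICON))          # success, []
```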
Step 105, if the filtering result is filtering failure, marking the text participles identified as sensitive words in the picture recognition text information and the material text information, and returning the marked picture recognition text information and material text information to the user terminal;
In this step, a mapping table is built for the text participles of the picture recognition text information and the material text information, and the position of each participle in the material uploaded by the user is recorded synchronously. When the subsequent filtering fails, this information is fed back in real time into the picture recognition text information and material text information submitted by the user, so that the user can conveniently locate and modify the offending content.
In this embodiment, when the system detects that a text participle is a sensitive word, its specific position in the material text information or material picture submitted by the user is found through the mapping table and highlighted. The user can then edit and modify the content online. After the user modifies the material, the system performs secondary text recognition on the modified content in real time, segments the picture recognition text information and the material text information again, compares the updated text participles with the sensitive word lexicon by traversal, and performs a second round of filtering.
Step 106, if the filtering result is filtering success, screening big data labels from the big data label library, through the feature words and exclusion words in the preset big data label library, as candidate labels corresponding to the text participles;
In practical application, in order to improve the material characterization capability of the labels and make it easy for a user to search related materials by label, the system attaches a weight to each match in the process of matching labels to text participles: each matched label carries a weight value that represents the degree of matching or correlation between the label and the material.
In this embodiment, each big data label in the big data label library needs to be configured with feature words and exclusion words. A feature word is one of a series of words with the same or similar meaning as the label; it may be another name for the label, a subclass under the label, or another word related to the label, and usually the label itself is also one of its feature words. For example, the feature words of the label "gold" include "gold", "gold price", "buy gold", and so on.
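The screening of candidate labels can be sketched as follows. This combines the feature-word match described here with the exclusion-word constraint detailed in the fourth embodiment (an exclusion word such as "golden week" invalidating the "gold" hit); the label library contents are illustrative placeholders, and the policy of dropping a label when any of its exclusion words appears is an assumption about one reasonable reading of the text.

```python
# Candidate-label screening: a label qualifies when one of its feature words
# appears among the text participles and none of its exclusion words does.
TAG_LIBRARY = {  # hypothetical label configuration
    "gold": {"features": {"gold", "gold price", "buy gold"},
             "exclusions": {"golden week", "golden holiday"}},
    "loan": {"features": {"loan", "mortgage"}, "exclusions": set()},
}

def candidate_tags(participles, library):
    segs = set(participles)
    candidates = []
    for tag, cfg in library.items():
        if cfg["features"] & segs and not (cfg["exclusions"] & segs):
            candidates.append(tag)
    return sorted(candidates)

print(candidate_tags(["gold price", "rises"], TAG_LIBRARY))            # ['gold']
print(candidate_tags(["gold", "golden week", "travel"], TAG_LIBRARY))  # [] - hit excluded
```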
Step 107, counting the positions and times of the feature words and the exclusion words corresponding to the candidate labels in the image recognition text information and the material text information;
In this embodiment, a mapping set is established over the full text; it contains the position information and total number of occurrences of each feature word and its corresponding exclusion words. Establishing this mapping set prepares the statistics needed for the weight formula calculated in the next step.
Step 108, calculating a weight score corresponding to the candidate label according to a preset weight formula, the position and the times;
In this embodiment, through analysis of historical materials and their labels, a plurality of different calculation modes can be preset, and the score is obtained from the weight distribution over different positions. For example, each occurrence of a feature word in the title scores A, each occurrence in a subtitle scores B, each occurrence in the first paragraph scores C, each occurrence in the last paragraph scores D, each occurrence elsewhere scores E, and each occurrence of an exclusion word corresponding to the feature word subtracts F. One possible weight formula can then simply be stated as: total score of the feature word = A + B + C + D + E - F. By adjusting the weight scores of the different positions, the resulting total score changes accordingly, so different labels can be obtained from the same material information.
In this step, in order to improve the material characterization capability of the big data label and provide a more flexible label service, a weight needs to be attached to each label of the material. That is, for each big data label the system generates for the material, the weight value of that label represents the degree of matching or correlation between the label and the material. A specific implementation is as follows. The positions at which a feature word appears in the material are distinguished by importance: a feature word appearing in the title is the most important, followed in order by the abstract, the beginning of the body text, and so on. Accordingly, the title weight is set to 3, the abstract weight to 2 and the weight of the first sentence of the body text to 1; the weight of each subsequent sentence decreases linearly, and the weight of the last sentence is 0 (since the last sentence of a resource is usually an extended remark that may be far from the subject of the whole text). The number of occurrences of each feature word in the material and the weight sum calculated from those occurrence positions are then counted, and the label weight is obtained by dividing the weight sum of the hit feature words by the product of the total number of feature words of the label and the weight sum of the material, that is: label weight = weight sum of hit feature words / (number of feature words in the label × weight sum of the material) × 100. The final label weight ranges between 0 and 100, and a higher label weight indicates a higher correlation between the label and the material.
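The weight computation above can be made concrete with a small numeric sketch. The position weights follow the example in the text (title 3, abstract 2, first body sentence 1); the occurrence data, feature-word count and material weight sum are made-up values, and the formula is our reading of the one stated in this step: label weight = hit weight sum / (feature word count × material weight sum) × 100.

```python
# Numeric sketch of the label-weight formula with illustrative position weights.
POSITION_WEIGHT = {"title": 3, "abstract": 2, "body_first": 1, "body_rest": 0.5}

def tag_weight(hits, feature_word_count, material_weight_sum):
    """hits: list of (feature_word, position) occurrences counted in step 107."""
    hit_sum = sum(POSITION_WEIGHT[pos] for _, pos in hits)
    return hit_sum / (feature_word_count * material_weight_sum) * 100

hits = [("gold", "title"), ("gold price", "abstract"), ("buy gold", "body_rest")]
score = tag_weight(hits, feature_word_count=3, material_weight_sum=6.5)
print(round(score, 1))  # (3 + 2 + 0.5) / (3 * 6.5) * 100 = 28.2
```

In step 109, the candidate label whose score is the largest becomes the material's label.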
Step 109, taking the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
In this embodiment, after the total weight score of each candidate label is calculated in step 108, the candidate label with the largest weight score is determined to be the label of the material.
Step 110, adding the label into the corresponding material information, and sending the material information with the added label to the material system for bringing the material online.
In this embodiment, after the label is added on the material audit page and the system audit passes, the automated audit and filtering step is essentially complete. The subsequent process still requires leader review, sign-off and confirmation according to the company's organizational structure and procedures before the material can be released online, so as to guard against wrongly written characters or certain policy risks.
According to the technical scheme, material information uploaded to a material system by a user terminal is acquired, wherein the material information at least comprises a material picture and material text information; text recognition is performed on the material picture through a preset optical character recognition technology to obtain picture recognition text information; a word segmentation operation is performed on the picture recognition text information and the material text information by a word segmentation tool to obtain text participles; the text participles are filtered for sensitive words according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result is either filtering success or filtering failure; if the filtering result is filtering failure, the text participles identified as sensitive words are marked in the picture recognition text information and the material text information, and the marked picture recognition text information and material text information are returned to the user terminal; if the filtering result is filtering success, big data labels are screened from the big data label library, through the preset feature words and exclusion words in the big data label library, as candidate labels corresponding to the text participles; the positions and numbers of times at which the feature words and exclusion words corresponding to the candidate labels appear in the picture recognition text information and the material text information are counted; a weight score corresponding to each candidate label is calculated according to a preset weight formula, the positions and the numbers of times; the candidate label with the largest weight score among all the weight scores is taken as the label corresponding to the material; and the label is added into the corresponding material information, and the material information with the added label is sent to the material system for bringing the material online.
The method adopts an optical character recognition technology: text content is obtained by performing OCR text recognition on the uploaded picture content, the text content is labelled through a label library, and sensitive words are filtered out, so that labelling of picture and text content and verification of the legality and compliance of the content can be realized quickly. The number of manual operations is reduced, thereby achieving the purpose of bringing materials online quickly.
Referring to fig. 2, a second embodiment of the material on-line method in the embodiment of the present invention includes:
Step 201, acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises a material picture and material text information;
Step 202, performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information;
In this embodiment, the material picture is any picture that needs character recognition, and the obtained picture recognition text information may be empty or may include at least one character recognized from the material picture; when the picture recognition text information includes at least one recognized character, step 203 is triggered. The character recognition model described in the embodiment of the present invention is based on connectionist temporal classification (CTC) and mainly includes two parts: a picture feature extraction network (such as CNN, VGG or ResNet) and a text classification network (such as RNN or LSTM).
Step 203, converting each character into a corresponding character vector, and inputting the character vector into the semantic correction model;
In the embodiment of the invention, after the characters in the material picture are recognized, the character recognition device converts each recognized character into a corresponding character vector through word embedding, with each character corresponding to one character vector. For example, when two characters are recognized, the character recognition device converts them into a 2 × 512 matrix; that is, each character is replaced by a 512-dimensional character vector.
Step 204, multiplying the character vector corresponding to each character by a predetermined matrix WQ, matrix WK and matrix WV respectively through the semantic correction model to obtain a query character vector, a key character vector and a value character vector corresponding to the character;
In this step, the matrix WQ is a query matrix, and multiplying a character vector by it yields the query character vector; the matrix WK is a key matrix, and multiplying a character vector by it yields the key character vector; the matrix WV is a value matrix, and multiplying a character vector by it yields the value character vector. Multiplying the input character vector by the three predetermined matrices WQ, WK and WV can be understood as extracting three corresponding character features.
Specifically, the values of the matrices WQ, WK and WV are initialized randomly at the beginning of model training; the semantic correction model is then trained on a large number of training samples, during which the matrices are continuously adjusted until the predetermined matrices WQ, WK and WV are obtained.
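The Q/K/V projection and the way the value vectors are subsequently mixed can be sketched as standard scaled dot-product self-attention. This is a toy pure-Python illustration, not the patent's actual model: the dimensions are 2 rather than 512, and the identity matrices stand in for the learned WQ, WK and WV.

```python
import math

# Toy scaled dot-product self-attention: Q = X·WQ, K = X·WK, V = X·WV,
# output = softmax(Q·K^T / sqrt(d)) · V.
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    exps = [math.exp(v - max(row)) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(x, wq, wk, wv):
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    scores = [[sum(qi * ki for qi, ki in zip(qr, kr)) / math.sqrt(d) for kr in k]
              for qr in q]
    weights = [softmax(row) for row in scores]  # attention weights per character
    return matmul(weights, v)

X = [[1.0, 0.0], [0.0, 1.0]]             # two 2-d "character vectors"
WQ = WK = WV = [[1.0, 0.0], [0.0, 1.0]]  # identity stand-ins for learned matrices
out = self_attention(X, WQ, WK, WV)
print([[round(v, 3) for v in row] for row in out])
```

Each output row is a mixture of all value vectors, which is how the encoding step lets the features of neighbouring characters inform each predicted character.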
Step 205, performing encoding and decoding operations on the query character vector, the key character vector and the value character vector corresponding to the characters to obtain a predicted character corresponding to each character;
In this step, encoding and decoding operations are performed on the obtained query, key and value character vectors, so that the features of the initial characters extracted through the three matrices are fused, and decoding finally yields the predicted character corresponding to each character.
In a specific application, because existing optical character recognition technology produces recognition errors with a certain probability, a semantic correction model is added to reduce the probability of errors: by linking the character vectors before and after the one to be corrected, the model can correct recognized character vectors of lower recognition precision. Training the semantic correction module requires pairs of manual annotation results and detection-recognition results from historical optical character recognition scenarios; for example, the manual annotation may be "5/3/2019" while the deep-learning text detection and recognition model outputs "3/5/2019/m", and the two together form one training sample. The semantic correction module learns a decoding mode from this category information and finally outputs standard text information that conforms to the semantics. For example, when the pre-trained character recognition model mis-recognizes a phrase as a visually similar but semantically invalid string (in the original Chinese example, a phrase meaning "at risk" is mis-read as a near-identical character string), the character recognition device inputs the corresponding vector into the trained semantic correction module, obtains the corrected characters meaning "at risk" after semantic correction, and takes them as the final character recognition result of the image.
Step 206, summarizing the predicted characters corresponding to all the characters in sequence to obtain picture identification text information of the material picture;
step 207, performing word segmentation operation on the image recognition text information and the material text information by using a word segmentation tool to obtain text word segmentation;
Step 208, filtering the text participles for sensitive words according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result is either filtering success or filtering failure;
step 209, if the filtering result is failure, marking the text participles which are identified as sensitive words in the picture identification text information and the material text information, and returning the marked picture identification text information and the marked material text information to the user terminal;
step 210, if the filtering result is successful, screening the big data label from the big data label library as a candidate label corresponding to the text word segmentation through the feature words and the excluded words in the preset big data label library;
step 211, counting positions and times of the feature words and the exclusion words corresponding to the candidate tags in the image recognition text information and the material text information;
step 212, calculating a weight score corresponding to the candidate label according to a preset weight formula, a position and times;
step 213, taking the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
Step 214, adding the label into the corresponding material information, and sending the material information with the added label to the material system for bringing the material online.
On the basis of the previous embodiment, this embodiment describes in detail how the characters in the material picture are recognized through the preset optical character recognition technology to obtain a character set corresponding to the material picture; each character in the character set is converted into a corresponding character vector, which is input into the semantic correction model of the optical character recognition technology to obtain the predicted character corresponding to each character; and the predicted characters corresponding to all the characters are assembled in order to obtain the picture recognition text information of the material picture. By adopting the semantic correction model, the accuracy of character recognition is improved and the probability of wrongly written characters is reduced.
Referring to fig. 3, a third embodiment of the material online method in the embodiment of the present invention includes:
Step 301, acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises a material picture and material text information;
step 302, performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information;
step 303, calling a preset word segmentation tool, and dividing the image recognition text information and the material text information into N reference words with preset character lengths through a dictionary of the word segmentation tool, wherein N is a natural number not less than 1;
preferably, the basic process of jieba word segmentation is as follows: efficient word graph scanning is achieved based on a Trie tree (dictionary tree) structure, and a Directed Acyclic Graph (DAG) formed by all possible word forming conditions of Chinese characters in a sentence is generated. And generating the trie tree according to the dictionary file. The dictionary converts the occurrence frequency of each word into frequency while generating a trie tree; the method comprises the steps of generating a DAG (DAG) according to a trie tree generated by a dictionary file for a sentence to be participled, and generating several possible sentence segmentations by performing dictionary lookup operation according to a given dictionary in a popular way in practice. jieba records the starting position of a word in a sentence in DAG, from 0 to n-1(n is the length of the sentence), each starting position is used as a key of a dictionary, and value is a list, in which the ending positions of possible words are stored (the word is obtained by looking up the dictionary, and the starting position + the length of the word is obtained as the ending position). And searching a maximum probability path by adopting dynamic programming, and finding out a maximum segmentation combination based on the word frequency. Searching for the word which is already segmented in the sentence to be segmented, searching for the frequency (times/total number) of occurrence of the word for the word, and if the word does not exist, taking the frequency of the word with the minimum frequency of occurrence in the dictionary as the frequency of the word, namely, P (a word) ═ freq. 
In the dynamic programming search for the maximum-probability path, the maximum probability is computed backwards over the sentence, from right to left: P(NodeN) = 1.0, P(NodeN-1) = P(NodeN) × max(P(word ending at the penultimate position)), and so on; the maximum-probability path obtained at the end gives the segmentation combination of maximum probability.
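The DAG construction and right-to-left dynamic programming described above can be sketched as follows. This is a simplified reading of jieba's approach, not its actual code; the dictionary words and counts are toy values, and log-probabilities are used so products become sums.

```python
import math

# Toy word counts; real jieba loads these from its dictionary file.
FREQ = {"中": 5, "国": 4, "中国": 20, "人": 6, "中国人": 10}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """dag[i] = list of end indices j such that sentence[i:j+1] is a known word."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence)) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # unknown single char falls back to itself
    return dag

def best_cut(sentence):
    """Pick the maximum-probability path right-to-left by dynamic programming."""
    dag, n = build_dag(sentence), len(sentence)
    route = {n: (0.0, 0)}  # route[i] = (best log-prob from i to end, chosen end index)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[i])
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words

print(best_cut("中国人"))  # the frequent three-character word beats per-character cuts
```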
Step 304, acquiring the reference participles divided by the participle dividing tool, searching a corresponding preset dictionary according to the character length of the reference participles, and judging whether the reference participles exist in the preset dictionary or not;
In this embodiment, the picture recognition text information and the material text information are obtained, and the word segmentation tool is called to divide them into a plurality of character strings of preset lengths. The corresponding preset dictionary is looked up according to a string's character length to judge whether the string exists in the preset dictionary. When the string does not exist in the preset dictionary, the recognition of those characters is abnormal; in that case, the target characters are screened out from the missing strings through a fuzzy matching algorithm, so that character recognition is achieved through fuzzy matching and the efficiency of character recognition is improved.
Step 305, when a reference participle is not stored in the preset dictionary, screening the unstored reference participles through a fuzzy matching algorithm to obtain the text participles;
In practical application, the fuzzy matching algorithm uses the minimum number of editing operations required to convert one string into another, taking this number of operations as the edit distance: two strings with a smaller edit distance are more similar, and when the edit distance is 0 the two strings are equal. Character recognition is thereby achieved.
In a specific implementation, words whose edit distance from the unrecognized participle does not exceed the word length are searched for through a BK-tree algorithm; for example, when the misrecognized word "Zhongyuan" does not exist in the dictionary, the BK-tree search within that edit-distance bound can return "China". Here the edit distance from string A to string B is the minimum number of insert, delete and replace operations needed to change A into B. For example, FAME to GATE takes two steps (two replacements), and GAME to ACM takes three steps (delete G and E, insert C). The screened word "China" is displayed as the target characters, so text recognition is realized through the fuzzy matching algorithm and the accuracy of text recognition is improved.
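The edit distance underlying this fuzzy match is the standard Levenshtein distance (insert, delete and replace each costing 1); the sketch below reproduces the FAME→GATE and GAME→ACM examples from the text. The BK-tree that indexes a dictionary by this metric is omitted for brevity.

```python
# Levenshtein edit distance with a single rolling row of the DP table.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # replace (or match)
        prev = curr
    return prev[-1]

print(edit_distance("FAME", "GATE"))  # 2: two replacements
print(edit_distance("GAME", "ACM"))   # 3: delete G, delete E, insert C
```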
Step 306, performing sensitive word filtering on the text participles according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
step 307, if the filtering result is failure, marking the text participles which are identified as sensitive words in the picture identification text information and the material text information, and returning the marked picture identification text information and the marked material text information to the user terminal;
step 308, if the filtering result is successful, screening the big data label from the big data label library as a candidate label corresponding to the text word segmentation through the feature words and the excluded words in the preset big data label library;
step 309, counting the positions and times of the feature words and the excluding words corresponding to the candidate tags in the image recognition text information and the material text information;
step 310, calculating a weight score corresponding to the candidate label according to a preset weight formula, a position and times;
step 311, taking the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
and step 312, adding the label into the corresponding material information, and sending the material information added with the label to a material system for material online.
On the basis of the previous embodiment, this embodiment describes in detail how a preset word segmentation tool is called and the picture recognition text information and the material text information are divided, through the dictionary of the word segmentation tool, into N reference participles of preset character lengths, N being a natural number not less than 1; the reference participles divided by the word segmentation tool are acquired, the corresponding preset dictionary is looked up according to the character length of each reference participle, and whether the reference participle exists in the preset dictionary is judged; when a reference participle is not stored in the preset dictionary, the unstored reference participles are screened through a fuzzy matching algorithm to obtain the text participles, which are then displayed. By adopting the preset word segmentation tool, the recognized characters quickly form text participles with practical meaning, and the efficiency of character recognition is improved.
Referring to fig. 4, a fourth embodiment of the material on-line method in the embodiment of the present invention includes:
step 401, acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises material pictures and material text information;
step 402, performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information;
step 403, performing word segmentation operation on the image recognition text information and the material text information by using a word segmentation tool to obtain text word segmentation;
step 404, performing sensitive word filtering on the text participles according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
step 405, if the filtering result is failure, marking the text participles which are identified as sensitive words in the picture identification text information and the material text information, and returning the marked picture identification text information and the marked material text information to the user terminal;
step 406, traversing the big data label database based on the feature words and the excluded words, and matching out big data labels corresponding to the text participles;
In this embodiment, because of the flexibility of language and the diversity of word meanings, the meaning of a word can sometimes be confounded. For example, the "gold" label in a banking scenario refers to business related to buying and selling gold, but a resource in which the word "gold" appears is not necessarily related to that business. Exclusion words are therefore set as constraints: for example, the exclusion words of "gold" include "golden week", "golden holiday" and the like, and when the exclusion word "golden week" appears among the text participles, the feature-word hit on "gold" is known to be invalid, thereby avoiding generation of a wrong "gold" label.
In this embodiment, the big data tag library is mainly obtained through the following steps:
obtaining a historical audit text, performing word segmentation processing on the historical audit text to obtain historical text participles, converting the historical text participles into corresponding word vectors, and setting the number of clusters to N; randomly selecting N historical text participles as initial clustering centers; respectively calculating the cosine distances from all historical text participles to each initial clustering center; dividing the historical text participles whose cosine distance from an initial clustering center is within a preset error range into the same label category; calculating the mean vector of the word vectors in each label category, reselecting a current clustering center according to the mean vector, and calculating the cosine distance between the current clustering center and the corresponding initial clustering center; if the cosine distance between the current clustering center and the corresponding initial clustering center is smaller than or equal to a preset threshold, obtaining the N label categories corresponding to all historical text participles and the clustering centers corresponding to the label categories; if the cosine distance is larger than the preset threshold, re-clustering with the current clustering center until the cosine distance between the current clustering center and the previous clustering center is smaller than or equal to the preset threshold, thereby obtaining the N label categories corresponding to all historical text participles and their clustering centers; taking each clustering center as a big data label, and taking the historical text participles in the corresponding label category as the feature words corresponding to that big data label; and obtaining the big data label library according to the big data labels and the feature words.
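The clustering procedure above is essentially k-means with cosine distance. A minimal sketch follows; the normalisation step, stopping threshold and random initialisation are illustrative assumptions the text does not fix:

```python
import numpy as np

def cosine_kmeans(vectors, n_clusters, tol=1e-4, max_iter=100, seed=0):
    """Cluster word vectors by cosine distance, as in the steps above."""
    rng = np.random.default_rng(seed)
    # Normalise so that cosine distance reduces to 1 - dot product.
    X = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(max_iter):
        dist = 1.0 - X @ centers.T            # cosine distance to each center
        labels = dist.argmin(axis=1)          # assign to the nearest center
        new_centers = np.stack([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        new_centers /= np.linalg.norm(new_centers, axis=1, keepdims=True)
        shift = 1.0 - np.sum(new_centers * centers, axis=1)
        centers = new_centers
        if np.all(shift <= tol):              # centers stopped moving
            break
    return labels, centers

# Four toy word vectors forming two obvious semantic groups.
words = np.array([[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]])
labels, centers = cosine_kmeans(words, n_clusters=2)
print(labels)  # the first two and the last two participles share a label
```

The participles assigned to each converged center become that big data label's feature words.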
In this step, before the cosine distance is calculated, the tag corpus needs to be converted into word-vector form: the tag corpus is input into a word2vec model, and the model outputs the corresponding word vectors. word2vec, also known as word embedding (Chinese name "word vector"), converts words in natural language into dense vectors that a computer can process. word2vec has two main modes: CBOW (Continuous Bag of Words) and Skip-Gram. CBOW infers the target word from its surrounding context; Skip-Gram does the opposite, inferring the surrounding context from the target word. CBOW is more suitable for small corpora, while Skip-Gram performs better on large corpora.
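The CBOW/Skip-Gram distinction can be illustrated by the training pairs each mode generates; in practice a library such as gensim trains the embeddings (there, `sg=0` selects CBOW and `sg=1` Skip-Gram). This sketch only builds the (input, predicted) pairs:

```python
def training_pairs(tokens, window=1, mode="cbow"):
    """Generate word2vec-style training pairs for a token sequence."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((tuple(context), target))   # context -> target
        else:
            pairs.extend((target, c) for c in context)  # target -> context
    return pairs

sent = ["buy", "gold", "today"]
print(training_pairs(sent, mode="cbow"))
# [(('gold',), 'buy'), (('buy', 'today'), 'gold'), (('gold',), 'today')]
print(training_pairs(sent, mode="skipgram"))
# [('buy', 'gold'), ('gold', 'buy'), ('gold', 'today'), ('today', 'gold')]
```

CBOW predicts each word from its neighbours, so it smooths over rare words and suits small corpora; Skip-Gram emits one pair per (target, context) combination, which is why it extracts more signal from large corpora.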
In the embodiment, in order to accurately divide the feature words and exclusion words in the big data labels and improve the accuracy of the big data labels, historical materials are used to train and update the big data label library. After a clustering center is obtained through clustering, it can serve as the standard for its big data label category, i.e., as the big data label. Clustering divides a sample set into clusters according to the distance between samples, so that the points within a cluster are as close together as possible, and the point closest to the average of the points in a cluster is the clustering center; the feature word corresponding to the clustering center is therefore semantically closest to the other feature words in the same category, can serve as the standard of that category, i.e., the big data label, and a mapping relation between feature words and big data labels is thereby established. For example, a database table may be maintained that contains two columns: keyword, holding all the text participles to be extracted, and index_name, holding the big data label, so that participles with the same meaning map to the same big data label.
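The two-column mapping table described above can be sketched with an in-memory database. The column names `keyword` and `index_name` follow the description, while the table name and the rows are illustrative assumptions:

```python
import sqlite3

# In-memory sketch of the keyword -> index_name mapping table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag_map (keyword TEXT PRIMARY KEY, index_name TEXT)")
conn.executemany(
    "INSERT INTO tag_map VALUES (?, ?)",
    [("bullion", "gold"),      # different participles with the same meaning
     ("gold bar", "gold"),     # map to the same big data label
     ("house loan", "mortgage")],
)

def lookup_label(participle):
    """Return the big data label mapped to a text participle, if any."""
    row = conn.execute(
        "SELECT index_name FROM tag_map WHERE keyword = ?", (participle,)
    ).fetchone()
    return row[0] if row else None

print(lookup_label("bullion"))   # gold
print(lookup_label("unknown"))   # None
```

Synonymous participles such as "bullion" and "gold bar" resolve to one `index_name`, which is the "same meaning, same big data label" behaviour the table is meant to provide.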
step 407, judging whether the text participles contain an exclusion word corresponding to the big data label;
in this step, in order to prevent stop words, such as high-frequency auxiliary words, conjunctions and adverbs with no specific meaning, from being recognized as labels, the system also provides a flow for judging whether such words exist, which effectively improves the accuracy and relevance of automatic label generation.
step 408, if yes, determining that the text participles have no corresponding candidate label;
step 409, if not, using the big data label as a candidate label of the text word segmentation;
step 410, counting the positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate labels in the picture recognition text information and the material text information;
step 411, calculating a weight score corresponding to the candidate tag according to a preset weight formula, the position and the times;
step 412, taking the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
step 413, adding the label to the corresponding material information, and uploading the material information added with the label to a material system;
step 414, submitting the material information added with the label to a preset office automation system for process signing through a material system;
step 415, when the office automation system finishes signing the labeled material information, releasing the labeled material information online.
On the basis of the previous embodiment, this embodiment of the invention describes the label matching process in detail. Big data labels are screened from the big data label library through the feature words and exclusion words as candidate labels corresponding to the text participles; the positions and times of occurrence of the feature words and exclusion words corresponding to the candidate labels in the picture recognition text information and the material text information are counted; a weight score corresponding to each candidate label is calculated according to a preset weight formula, the positions and the times; and the candidate label corresponding to the largest weight score among all the weight scores is taken as the label corresponding to the material. Specifically, the feature words in the big data label library are traversed and matched against the text participles to determine the big data labels corresponding to the text participles; whether a text participle contains an exclusion word corresponding to the big data label is judged; if so, it is determined that the text participle has no corresponding candidate label; if not, the big data label is taken as a candidate label of the text participle; and these steps are executed for all text participles to obtain the candidate labels corresponding to all of them. By adopting feature words and exclusion words and weighting the screened candidate labels, labels with higher relevance can be obtained more effectively, improving automation efficiency and accuracy.
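The weighting in steps 410 to 412 can be sketched as follows. The preset weight formula is not given in the text, so a simple position-weighted count, with title hits weighted above body hits, is assumed purely for illustration:

```python
# Assumed position weights; the patent only states that a preset weight
# formula combines positions and occurrence counts.
POSITION_WEIGHTS = {"title": 2.0, "body": 1.0}

def weight_score(occurrences):
    """occurrences: list of (position, count) pairs for one candidate label."""
    return sum(POSITION_WEIGHTS[pos] * count for pos, count in occurrences)

def pick_label(candidates):
    """candidates: {label: [(position, count), ...]} -> label with max score."""
    return max(candidates, key=lambda lbl: weight_score(candidates[lbl]))

stats = {
    "gold":     [("title", 1), ("body", 3)],   # score 2*1 + 1*3 = 5
    "mortgage": [("body", 2)],                 # score 1*2 = 2
}
print(pick_label(stats))  # gold
```

A single title occurrence can thus outrank several body occurrences, reflecting the idea that where a feature word appears matters as much as how often.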
The material online method in the embodiment of the present invention is described above. Referring to fig. 5, a material online device in the embodiment of the present invention is described below, where an embodiment of the material online device includes:
the acquiring module 501 is configured to acquire material information uploaded to a material system by a user terminal, where the material information at least includes material pictures and material text information;
the identification module 502 is configured to perform text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information;
a word segmentation module 503, configured to perform a word segmentation operation on the picture recognition text information and the material text information by using a word segmentation tool, so as to obtain text participles;
the filtering module 504 is configured to perform sensitive word filtering on the text segmentation words according to a preset sensitive word lexicon to obtain a filtering result, where the filtering result includes filtering success or filtering failure;
a marking module 505, configured to mark, when the filtering result is a filtering failure, the text participles identified as sensitive words in the picture recognition text information and the material text information, and return the marked picture recognition text information and material text information to the user terminal;
a screening module 506, configured to, when the filtering result is that filtering is successful, screen a big data tag from a big data tag library as a candidate tag corresponding to the text segmentation word through a feature word and an exclusion word in the preset big data tag library;
a counting module 507, configured to count the positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate tags in the picture recognition text information and the material text information;
a calculating module 508, configured to calculate a weight score corresponding to the candidate tag according to a preset weight formula, the position, and the number of times;
the labeling module 509 is configured to use the candidate tag corresponding to the weight score with the largest value among all the weight scores as the tag corresponding to the material;
and the online module 510 is configured to add the tag to the corresponding material information, and send the material information with the tag added to the material system for online material.
In the embodiment of the invention, the material online device runs the material online method described above: material information uploaded to a material system by a user terminal is obtained, wherein the material information at least comprises material pictures and material text information; text recognition is performed on the material picture through a preset optical character recognition technology to obtain picture recognition text information; a word segmentation operation is performed on the picture recognition text information and the material text information by using a word segmentation tool to obtain text participles; sensitive word filtering is performed on the text participles according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure; if the filtering result is a failure, the text participles identified as sensitive words in the picture recognition text information and the material text information are marked, and the marked picture recognition text information and material text information are returned to the user terminal; if the filtering result is a success, a big data label is screened from a big data label library through the feature words and exclusion words in the preset big data label library as a candidate label corresponding to the text participles; the positions and times of occurrence of the feature words and exclusion words corresponding to the candidate labels in the picture recognition text information and the material text information are counted; a weight score corresponding to each candidate label is calculated according to a preset weight formula, the positions and the times; the candidate label corresponding to the largest weight score among all the weight scores is taken as the label corresponding to the material; and the label is added to the corresponding material information, and the material information with the label added is sent to the material system for material online. This solves the technical problem of low labeling accuracy in the existing material online process.
Referring to fig. 6, a second embodiment of the material online device in the embodiment of the present invention includes:
the acquiring module 501 is configured to acquire material information uploaded to a material system by a user terminal, where the material information at least includes material pictures and material text information;
the identification module 502 is configured to perform text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information;
a word segmentation module 503, configured to perform a word segmentation operation on the picture recognition text information and the material text information by using a word segmentation tool, so as to obtain text participles;
the filtering module 504 is configured to perform sensitive word filtering on the text segmentation words according to a preset sensitive word lexicon to obtain a filtering result, where the filtering result includes filtering success or filtering failure;
a marking module 505, configured to mark, when the filtering result is a filtering failure, the text participles identified as sensitive words in the picture recognition text information and the material text information, and return the marked picture recognition text information and material text information to the user terminal;
a screening module 506, configured to, when the filtering result is that filtering is successful, screen a big data tag from a big data tag library as a candidate tag corresponding to the text segmentation word through a feature word and an exclusion word in the preset big data tag library;
a counting module 507, configured to count the positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate tags in the picture recognition text information and the material text information;
a calculating module 508, configured to calculate a weight score corresponding to the candidate tag according to a preset weight formula, the position, and the number of times;
the labeling module 509 is configured to use the candidate tag corresponding to the weight score with the largest value among all the weight scores as the tag corresponding to the material;
and the online module 510 is configured to add the tag to the corresponding material information, and send the material information with the tag added to the material system for online material.
In this embodiment, the identification module 502 includes:
the character set recognition unit 5021 is used for recognizing characters in the material picture through a preset optical character recognition technology to obtain a character set corresponding to the material picture;
a character correction unit 5022, configured to convert each character in the character set into a corresponding character vector, and input the character vector into a preset semantic correction model to obtain a predicted character corresponding to each character;
and the character summarizing unit 5023 is used for summarizing the predicted characters corresponding to all the characters in sequence to obtain the picture recognition text information of the material picture.
In this embodiment, the character correcting unit 5022 is specifically configured to: converting each character into a corresponding character vector, and inputting the character vector into the semantic correction model; multiplying the character vector corresponding to each character by a predetermined matrix WQ, matrix WK and matrix WV respectively through the semantic correction model to obtain a query character vector, a key character vector and a value character vector corresponding to the character; and executing coding and decoding operations on the query character vector, the key character vector and the value character vector corresponding to the characters to obtain a predicted character corresponding to each character.
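The query/key/value computation described above can be sketched numerically as scaled dot-product attention. The dimensions and random weights below are placeholders, since a trained semantic correction model would supply the learned matrices WQ, WK and WV:

```python
import numpy as np

# Illustrative shapes: 5 recognized characters, 8-dim character vectors,
# 4-dim query/key/value vectors. All weights are random placeholders.
rng = np.random.default_rng(0)
d_model, d_k, n_chars = 8, 4, 5

chars = rng.normal(size=(n_chars, d_model))        # one vector per character
WQ, WK, WV = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q = chars @ WQ                                     # query character vectors
K = chars @ WK                                     # key character vectors
V = chars @ WV                                     # value character vectors

scores = Q @ K.T / np.sqrt(d_k)                    # scaled dot-product scores
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)            # softmax over characters
context = attn @ V                                 # attended representation

print(context.shape)  # (5, 4)
```

Each character's context vector mixes information from every other character, which is what lets the decoder of the semantic correction model output a corrected predicted character instead of the raw OCR result.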
In this embodiment, the word segmentation module 503 is configured to: calling a preset word segmentation tool, and dividing the picture recognition text information and the material text information into N reference participles of preset character length through a dictionary of the word segmentation tool, wherein N is a natural number not less than 1; acquiring the reference participles divided by the word segmentation tool, searching a corresponding preset dictionary according to the character length of each reference participle, and judging whether the reference participle exists in the preset dictionary; and when a reference participle is not stored in the preset dictionary, screening the unstored reference participles through a fuzzy matching algorithm to obtain the text participles.
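The dictionary lookup with a fuzzy-matching fallback can be sketched as follows; the dictionary contents, the whitespace tokenisation and the similarity cutoff are illustrative assumptions, not the tool's actual configuration:

```python
import difflib

# Hypothetical dictionary; the real module would load the word segmentation
# tool's dictionaries keyed by character length.
DICTIONARY = ["gold", "bullion", "trading", "service", "holiday"]

def segment(text, cutoff=0.8):
    """Keep participles found in the dictionary; screen unstored participles
    through fuzzy matching against the dictionary, as described above."""
    tokens = []
    for word in text.split():
        if word in DICTIONARY:
            tokens.append(word)
            continue
        # Fuzzy matching: substitute the closest dictionary entry, if any.
        close = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=cutoff)
        tokens.append(close[0] if close else word)
    return tokens

print(segment("bullion tradng service"))  # ['bullion', 'trading', 'service']
```

The misspelled OCR output "tradng" is screened back to the dictionary entry "trading", which is the role the fuzzy matching algorithm plays for participles absent from the preset dictionary.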
In this embodiment, the screening module 506 is specifically configured to: traversing the big data label library based on the feature words and the exclusion words, and matching out the big data labels corresponding to the text participles; judging whether the text participles contain an exclusion word corresponding to the big data label;
if yes, determining that the text word segmentation does not have a corresponding candidate label; and if not, taking the big data label as a candidate label of the text word segmentation.
In this embodiment, the material online device further includes a tag clustering module 511, where the tag clustering module 511 is configured to: obtaining a historical audit text, performing word segmentation processing on the historical audit text to obtain historical text participles, and converting the historical text participles into corresponding word vectors; randomly selecting N historical text participles from the word vectors as initial clustering centers, wherein N is a positive integer; respectively calculating the cosine distances from all historical text participles to each initial clustering center; dividing the historical text participles closest in cosine distance to an initial clustering center into the same label category; calculating the mean vector of the word vectors in each label category, reselecting a current clustering center according to the mean vector, and calculating the cosine distance between the current clustering center and the corresponding initial clustering center; if the cosine distance between the current clustering center and the corresponding initial clustering center is smaller than or equal to a preset threshold, obtaining the N label categories corresponding to all historical text participles and the clustering centers corresponding to the label categories; if the cosine distance is larger than the preset threshold, re-clustering with the current clustering center until the cosine distance between the current clustering center and the previous clustering center is smaller than or equal to the preset threshold, thereby obtaining the N label categories corresponding to all historical text participles and their clustering centers; taking each clustering center as a big data label, and taking the historical text participles in the corresponding label category as the feature words corresponding to that big data label; and obtaining the big data label library according to the big data labels and the feature words.
In this embodiment, the online module 510 is specifically configured to: adding the label into corresponding material information, and uploading the material information added with the label to the material system; submitting the material information added with the label to a preset office automation system for process signing through the material system; and when the office automation system finishes signing the material information added with the label, releasing the material information added with the label on line.
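The sign-off flow of the online module can be sketched as a small state machine; the state names and methods are illustrative assumptions, not the office automation system's actual interface:

```python
# Hypothetical sketch: material is submitted to an office automation (OA)
# system and may only be released online after sign-off completes.
class MaterialFlow:
    def __init__(self, material):
        self.material = material
        self.state = "uploaded"

    def submit_to_oa(self):
        assert self.state == "uploaded"
        self.state = "signing"          # process signing started in the OA system

    def finish_signing(self):
        assert self.state == "signing"
        self.state = "signed"           # OA system finished signing

    def release_online(self):
        if self.state != "signed":      # unsigned material must not go online
            raise RuntimeError("material not signed off yet")
        self.state = "online"
        return self.material

flow = MaterialFlow({"title": "gold promo", "label": "gold"})
flow.submit_to_oa()
flow.finish_signing()
print(flow.release_online()["label"])  # gold
```

Calling `release_online()` before `finish_signing()` raises, which mirrors the constraint that release happens only when the office automation system finishes signing.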
On the basis of the previous embodiment, the specific functions of each module and the unit composition of some modules are described in detail, and the big data label library is trained and updated with historical materials through the newly added module. After a clustering center is obtained through clustering, it can serve as the standard for its big data label category, i.e., as the big data label: clustering divides a sample set into clusters according to the distance between samples, so that the points within a cluster are as close together as possible, and the point closest to the average of the points in a cluster is the clustering center; the feature word corresponding to the clustering center is therefore semantically closest to the other feature words in the same category, can serve as the standard of that category, i.e., the big data label, and a mapping relation between feature words and big data labels is established. This improves the relevance and accuracy of the labels with respect to the materials.
Fig. 5 and fig. 6 describe the material online device in the embodiment of the present invention in detail from the perspective of modular functional entities; the material online device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 7 is a schematic structural diagram of a material online device 700 according to an embodiment of the present invention. The material online device 700 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing applications 733 or data 732. The memory 720 and the storage medium 730 may be transient storage or persistent storage. The program stored on the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations on the material online device 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and execute the series of instruction operations in the storage medium 730 on the material online device 700 to implement the steps of the material online method.
The material online device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. Those skilled in the art will appreciate that the material online device configuration illustrated in fig. 7 does not constitute a limitation on the material online device provided herein, which may include more or fewer components than shown, combine some components, or arrange the components differently.
The present invention also provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium, having stored therein instructions which, when executed on a computer, cause the computer to perform the steps of the material online method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses, and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A material online method is characterized by comprising the following steps:
acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises material pictures and material text information;
performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information;
performing a word segmentation operation on the picture recognition text information and the material text information by using a word segmentation tool to obtain text participles;
performing sensitive word filtering on the text participles according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
if the filtering result is a filtering failure, marking the text participles identified as sensitive words in the picture recognition text information and the material text information, and returning the marked picture recognition text information and the marked material text information to the user terminal;
if the filtering result is successful, screening a big data label from a big data label library through a preset feature word and an exclusion word in the big data label library as a candidate label corresponding to the text word segmentation;
counting the positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate labels in the picture recognition text information and the material text information;
calculating a weight score corresponding to the candidate label according to a preset weight formula, the position and the times;
taking the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
and adding the label into the corresponding material information, and sending the material information added with the label to the material system for material online.
2. The material online method according to claim 1, wherein the text recognition of the material picture by a preset optical character recognition technology to obtain picture recognition text information comprises:
recognizing characters in the material picture through a preset optical character recognition technology to obtain a character set corresponding to the material picture;
converting each character in the character set into a corresponding character vector, and inputting the character vector into a preset semantic correction model to obtain a predicted character corresponding to each character;
and summarizing the predicted characters corresponding to all the characters in sequence to obtain the picture recognition text information of the material picture.
3. The material online method according to claim 2, wherein the converting each character in the character set into a corresponding character vector and inputting the character vector into a preset semantic correction model to obtain a predicted character corresponding to each character comprises:
converting each of the characters into a corresponding character vector;
inputting the character vectors into the semantic correction model, and multiplying the character vectors corresponding to the characters by a predetermined matrix WQ, a predetermined matrix WK and a predetermined matrix WV through the semantic correction model to obtain query character vectors, key character vectors and value character vectors corresponding to the characters;
and executing coding and decoding operations on the query character vector, the key character vector and the value character vector corresponding to the characters to obtain a predicted character corresponding to each character.
4. The material online method according to claim 1, wherein the performing a word segmentation operation on the picture recognition text information and the material text information by using a word segmentation tool to obtain text participles comprises:
calling a preset word segmentation tool, and dividing the picture recognition text information and the material text information into N reference participles of preset character length through a dictionary of the word segmentation tool, wherein N is a natural number not less than 1;
acquiring the reference participles divided by the word segmentation tool, searching a corresponding preset dictionary according to the character length of each reference participle, and judging whether the reference participle exists in the preset dictionary;
and when the reference participles are not stored in the preset dictionary, screening the reference participles which are not stored by a fuzzy matching algorithm to obtain the text participles.
5. The material online method according to claim 1, wherein the screening of the big data tag from the big data tag library as the candidate tag corresponding to the text segmentation through the feature words and the exclusion words in the preset big data tag library comprises:
traversing the big data label library based on the feature words and the exclusion words, and matching out the big data labels corresponding to the text participles;
judging whether the text participles contain an exclusion word corresponding to the big data label;
if yes, determining that the text word segmentation does not have a corresponding candidate label;
and if not, taking the big data label as a candidate label of the text word segmentation.
6. The material online method according to claim 1, wherein before the acquiring of the material information uploaded to the material system by the user terminal, the method further comprises:
obtaining a history audit text, performing word segmentation processing on the history audit text to obtain a history text word segmentation, and converting the history text word segmentation into a corresponding word vector;
randomly selecting N historical text participles from the word vector as an initial clustering center, wherein N is a positive integer;
respectively calculating cosine distances from all historical text participles to each initial clustering center;
dividing the history text participles which are closest to the cosine distance of the initial clustering center into the same label category;
calculating a mean vector of word vectors in the label category, reselecting a current clustering center according to the mean vector, and calculating the cosine distance between the current clustering center and a corresponding initial clustering center;
if the cosine distance between the current clustering center and the corresponding initial clustering center is smaller than or equal to a preset threshold value, obtaining N label categories corresponding to all historical text participles and clustering centers corresponding to the label categories;
if the cosine distance between the current clustering center and the corresponding initial clustering center is larger than a preset threshold, re-clustering is carried out by using the current clustering center until the cosine distance between the current clustering center and the last clustering center is smaller than or equal to the preset threshold, and N label categories corresponding to all historical text participles and clustering centers corresponding to the label categories are obtained;
taking the clustering center as a big data label, and taking historical text participles in the corresponding label category as feature words corresponding to the big data label;
and obtaining a big data label database according to the big data labels and the feature words.
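The clustering steps of claim 6 amount to k-means with cosine distance as the metric and centre movement as the stopping rule. A minimal sketch, assuming dense list-of-floats word vectors and a made-up convergence threshold (the patent leaves the threshold "preset"):

```python
import math
import random

def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def cluster_word_vectors(vectors, n, threshold=1e-4, seed=0):
    """K-means-style clustering as in claim 6: assign each vector to the
    nearest centre by cosine distance, recompute centres as mean vectors,
    and stop when every centre moves no more than `threshold`."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, n)  # N randomly chosen initial centres
    while True:
        # Assign each vector to its nearest centre (one label category each).
        clusters = [[] for _ in range(n)]
        for v in vectors:
            i = min(range(n), key=lambda k: cosine_distance(v, centers[k]))
            clusters[i].append(v)
        # Recompute each centre as the mean vector of its category.
        new_centers = []
        for i, members in enumerate(clusters):
            if not members:
                new_centers.append(centers[i])  # keep an empty centre as-is
                continue
            dim = len(members[0])
            new_centers.append([sum(m[d] for m in members) / len(members)
                                for d in range(dim)])
        # Converged when every centre stayed within the threshold.
        if all(cosine_distance(c, p) <= threshold
               for c, p in zip(new_centers, centers)):
            return clusters, new_centers
        centers = new_centers
```

Per the claim, each converged centre then serves as a big data label, and the tokens in its category become that label's feature words.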
7. The material online method according to claim 1, wherein the adding of the label to the corresponding material information and sending of the labeled material information to the material system for material online comprises:
adding the label to the corresponding material information, and uploading the labeled material information to the material system;
submitting, through the material system, the labeled material information to a preset office automation system for workflow sign-off;
and when the office automation system completes the sign-off of the labeled material information, publishing the labeled material information online.
8. A material online apparatus, characterized in that the material online apparatus comprises:
an acquisition module, configured to acquire material information uploaded to a material system by a user terminal, the material information comprising at least a material picture and material text information;
a recognition module, configured to perform text recognition on the material picture through a preset optical character recognition (OCR) technique to obtain picture recognition text information;
a word segmentation module, configured to perform word segmentation on the picture recognition text information and the material text information with a word segmentation tool to obtain text tokens;
a filtering module, configured to filter the text tokens for sensitive words according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result is either filtering success or filtering failure;
a marking module, configured to, when the filtering result is filtering failure, mark the text tokens identified as sensitive words in the picture recognition text information and the material text information, and return the marked picture recognition text information and material text information to the user terminal;
a screening module, configured to, when the filtering result is filtering success, screen big data labels from a preset big data label library as candidate labels corresponding to the text tokens by means of the feature words and exclusion words in the library;
a counting module, configured to count the positions and the number of occurrences, in the picture recognition text information and the material text information, of the feature words and exclusion words corresponding to each candidate label;
a calculating module, configured to calculate a weight score for each candidate label according to a preset weight formula, the positions, and the number of occurrences;
a labeling module, configured to take the candidate label with the largest weight score among all the weight scores as the label corresponding to the material;
and an online module, configured to add the label to the corresponding material information and send the labeled material information to the material system for material online.
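The counting, calculating, and label-selection modules of claim 8 can be illustrated together. The weight formula is merely "preset" in the patent and never disclosed, so the position weights and the sum-of-weighted-counts formula below are assumptions for illustration only:

```python
# Assumed position weights: a feature-word hit in the material title is
# worth more than one in OCR'd picture text or the body. These values
# are illustrative, not from the patent.
POSITION_WEIGHTS = {"title": 3.0, "picture_text": 2.0, "body": 1.0}

def weight_score(occurrences):
    """occurrences: list of (position, count) pairs counted for one
    candidate label's feature words. Score = sum of weight * count."""
    return sum(POSITION_WEIGHTS.get(pos, 1.0) * count
               for pos, count in occurrences)

def pick_label(candidates):
    """Claim 8's labeling module: choose the candidate label whose
    weight score is largest among all candidates."""
    return max(candidates, key=lambda tag: weight_score(candidates[tag]))
```

For instance, a label whose feature words appear once in the title and twice in the body would score 3.0 + 2.0 = 5.0 under these assumed weights, and would beat a label with a single body occurrence.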
9. A material online device, characterized in that the material online device comprises: a memory and at least one processor, the memory storing instructions, and the memory and the at least one processor being interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the material online device to perform the steps of the material online method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the material online method according to any one of claims 1 to 7.
CN202111273705.7A 2021-10-29 2021-10-29 Material online method, device, equipment and storage medium Active CN114003750B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111273705.7A CN114003750B (en) 2021-10-29 2021-10-29 Material online method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN114003750A true CN114003750A (en) 2022-02-01
CN114003750B CN114003750B (en) 2024-03-26

Family

ID=79925245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111273705.7A Active CN114003750B (en) 2021-10-29 2021-10-29 Material online method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114003750B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649818A (en) * 2016-12-29 2017-05-10 北京奇虎科技有限公司 Recognition method and device for application search intentions and application search method and server
CN106709040A (en) * 2016-12-29 2017-05-24 北京奇虎科技有限公司 Application search method and server
CN108334489A (en) * 2017-01-19 2018-07-27 百度在线网络技术(北京)有限公司 Text core word recognition method and device
CN112148852A (en) * 2020-09-10 2020-12-29 深圳康佳电子科技有限公司 Intelligent customer service method and device, storage medium and computer equipment
CN113076748A (en) * 2021-04-16 2021-07-06 平安国际智慧城市科技股份有限公司 Method, device and equipment for processing bullet screen sensitive words and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994772A (en) * 2023-02-22 2023-04-21 中信联合云科技有限责任公司 Book data processing method and system, book rapid goods laying method and electronic equipment
CN115994772B (en) * 2023-02-22 2024-03-08 中信联合云科技有限责任公司 Book data processing method and system, book rapid goods laying method and electronic equipment

Also Published As

Publication number Publication date
CN114003750B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
US9183226B2 (en) Image classification
US8229883B2 (en) Graph based re-composition of document fragments for name entity recognition under exploitation of enterprise databases
CN102542014B (en) Image searching feedback method based on contents
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
RU2613846C2 (en) Method and system for extracting data from images of semistructured documents
CN111563384B (en) Evaluation object identification method and device for E-commerce products and storage medium
CN109446885B (en) Text-based component identification method, system, device and storage medium
CN112818093B (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN112434691A (en) HS code matching and displaying method and system based on intelligent analysis and identification and storage medium
CN116244410B (en) Index data analysis method and system based on knowledge graph and natural language
CN110795942B (en) Keyword determination method and device based on semantic recognition and storage medium
CN114003750A (en) Material online method, device, equipment and storage medium
CN116737967B (en) Knowledge graph construction and perfecting system and method based on natural language
US20190095525A1 (en) Extraction of expression for natural language processing
CN112380848A (en) Text generation method, device, equipment and storage medium
CN112131341A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN111681731A (en) Method for automatically marking colors of inspection report
CN114970554B (en) Document checking method based on natural language processing
US20230134218A1 (en) Continuous learning for document processing and analysis
US20230138491A1 (en) Continuous learning for document processing and analysis
CN113822013B (en) Labeling method and device for text data, computer equipment and storage medium
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium
CN115203445A (en) Multimedia resource searching method, device, equipment and medium
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN115272649A (en) Image recognition, retrieval and entry method and system of molecular structure diagram and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant