CN114003750A - Material online method, device, equipment and storage medium - Google Patents
- Publication number
- CN114003750A CN114003750A CN202111273705.7A CN202111273705A CN114003750A CN 114003750 A CN114003750 A CN 114003750A CN 202111273705 A CN202111273705 A CN 202111273705A CN 114003750 A CN114003750 A CN 114003750A
- Authority
- CN
- China
- Prior art keywords
- text
- label
- character
- information
- text information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 239000000463 material Substances 0.000 title claims abstract description 398
- 238000000034 method Methods 0.000 title claims abstract description 67
- 230000011218 segmentation Effects 0.000 claims abstract description 129
- 238000001914 filtration Methods 0.000 claims abstract description 118
- 238000005516 engineering process Methods 0.000 claims abstract description 30
- 230000008569 process Effects 0.000 claims abstract description 23
- 238000012550 audit Methods 0.000 claims abstract description 15
- 239000013598 vector Substances 0.000 claims description 106
- 239000011159 matrix material Substances 0.000 claims description 31
- 238000012937 correction Methods 0.000 claims description 28
- 230000007717 exclusion Effects 0.000 claims description 28
- 238000012015 optical character recognition Methods 0.000 claims description 25
- 238000012216 screening Methods 0.000 claims description 25
- 230000003287 optical effect Effects 0.000 claims description 7
- 238000012545 processing Methods 0.000 claims description 6
- 238000004590 computer program Methods 0.000 claims 1
- 238000012795 verification Methods 0.000 abstract description 4
- 238000007726 management method Methods 0.000 abstract description 2
- 239000010931 gold Substances 0.000 description 13
- 238000001514 detection method Methods 0.000 description 10
- 238000010586 diagram Methods 0.000 description 8
- 238000013507 mapping Methods 0.000 description 7
- 238000002372 labelling Methods 0.000 description 4
- 238000012549 training Methods 0.000 description 4
- 238000013135 deep learning Methods 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 238000012512 characterization method Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000011068 loading method Methods 0.000 description 2
- 230000002159 abnormal effect Effects 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000008520 organization Effects 0.000 description 1
- 230000002085 persistent effect Effects 0.000 description 1
- 238000013439 planning Methods 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000001052 transient effect Effects 0.000 description 1
- 230000001960 triggered effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5846—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/103—Workflow collaboration or project management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/02—Banking, e.g. interest calculation or account maintenance
Abstract
The invention relates to the field of online material management and discloses a material online method, apparatus, device, and storage medium. The method acquires the materials a user uploads for audit; performs text recognition on the material picture with OCR to obtain picture recognition text information; applies a word segmentation tool to the picture recognition text information and the material text information to obtain text word segments; filters the text word segments against a sensitive word lexicon; and, if filtering succeeds, matches the text word segments against a big data label library to obtain the corresponding label, adds that label to the material's audit page, and publishes the material online once the label is signed off. By running OCR text recognition on the uploaded picture content to obtain its text, labeling the text through the label library, and filtering sensitive words, the method quickly labels both picture and text content and verifies that the content is legal and compliant. Manual steps are reduced, so materials are brought online faster.
Description
Technical Field
The invention relates to the field of online material management, and in particular to a material online method, device, equipment, and storage medium.
Background
With the development of artificial intelligence technology, and especially of online business auditing applications, intelligent online auditing has become a trend. In the existing intelligent online auditing approach, after online pictures and text materials are displayed in a material library or other relevant locations, the pictures and text materials are labeled separately. However, existing labeling considers only the literal semantics of the words actually found in the picture or text material: if a material contains several words with different semantics, it receives several labels, and those multiple labels impose multiple conflicting standards on the online audit. As a result, the final online classification of the service is inaccurate and auditing efficiency is low.
Disclosure of Invention
The invention mainly aims to solve the technical problem of low labeling accuracy in the existing material online process.
The first aspect of the invention provides a material online method, comprising the following steps: acquiring material information uploaded to a material system by a user terminal, the material information comprising at least a material picture and material text information; performing text recognition on the material picture through a preset optical character recognition (OCR) technology to obtain picture recognition text information; performing a word segmentation operation on the picture recognition text information and the material text information with a word segmentation tool to obtain text word segments; filtering the text word segments for sensitive words against a preset sensitive word lexicon to obtain a filtering result of either filtering success or filtering failure; if the filtering result is failure, marking the text word segments identified as sensitive words in the picture recognition text information and the material text information, and returning the marked picture recognition text information and material text information to the user terminal; if the filtering result is success, screening big data labels from a big data label library, through the feature words and exclusion words preset in the library, as candidate labels corresponding to the text word segments; counting the positions and numbers of occurrences of the feature words and exclusion words corresponding to each candidate label in the picture recognition text information and the material text information; calculating a weight score for each candidate label from a preset weight formula, the positions, and the numbers of occurrences; taking the candidate label with the largest weight score as the label of the material; and adding the label to the corresponding material information and sending the labeled material information to the material system to bring the material online.
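As a rough illustration of the candidate-label scoring step above, the sketch below assumes a simple linear weight formula in which occurrences in a material's title count more than occurrences in its body. The patent leaves the actual "preset weight formula" and position weights unspecified, so `POSITION_WEIGHTS`, the label names, and the formula itself are illustrative assumptions:

```python
# Hedged sketch of candidate-label weight scoring: the patent only says a
# "preset weight formula" combines the position and count of each label's
# feature-word hits; the position weights and linear form here are assumptions.

POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}  # assumed position weights

def weight_score(occurrences):
    """occurrences: (position, count) pairs for one candidate label's
    feature words, e.g. [("title", 1), ("body", 4)]."""
    return sum(POSITION_WEIGHTS.get(pos, 1.0) * count for pos, count in occurrences)

def pick_label(candidates):
    """candidates: dict label -> occurrence list; returns the label with the
    largest weight score, mirroring the 'largest weight score wins' step."""
    return max(candidates, key=lambda lbl: weight_score(candidates[lbl]))

candidates = {
    "wealth-management": [("title", 1), ("body", 2)],  # 3*1 + 1*2 = 5
    "insurance":         [("body", 3)],                # 1*3     = 3
}
print(pick_label(candidates))  # wealth-management
```

A title hit outweighing several body hits is only one plausible weighting; any monotone combination of position and count would fit the claim as written.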
Optionally, in a first implementation manner of the first aspect of the present invention, the performing of text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information comprises: recognizing the characters in the material picture through the preset optical character recognition technology to obtain a character set corresponding to the material picture; converting each character in the character set into a corresponding character vector, and inputting the character vectors into a preset semantic correction model to obtain a predicted character for each character; and concatenating the predicted characters of all the characters in order to obtain the picture recognition text information of the material picture.
Optionally, in a second implementation manner of the first aspect of the present invention, the converting of each character in the character set into a corresponding character vector and the inputting of the character vectors into a preset semantic correction model to obtain a predicted character for each character comprises: converting each character into a corresponding character vector and inputting it into the semantic correction model; multiplying, in the semantic correction model, the character vector of each character by predetermined matrices WQ, WK, and WV to obtain the query, key, and value vectors of that character; and performing encoding and decoding operations on the query, key, and value vectors of the characters to obtain the predicted character for each character.
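The query/key/value projection described in this implementation is the standard self-attention projection. A minimal numpy sketch, with placeholder dimensions and random weights standing in for the trained matrices WQ, WK, and WV of the semantic correction model, might look like:

```python
# Sketch of the Q/K/V step: each character vector is multiplied by WQ, WK, WV.
# Dimensions and random weights are placeholders; in the patent these matrices
# belong to the trained semantic correction model.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
WQ, WK, WV = (rng.standard_normal((d_model, d_head)) for _ in range(3))

chars = rng.standard_normal((5, d_model))     # 5 character vectors
Q, K, V = chars @ WQ, chars @ WK, chars @ WV  # query/key/value vectors

# Scaled dot-product attention, the usual encoding step built on Q, K, V.
scores = Q @ K.T / np.sqrt(d_head)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)      # softmax over key positions
context = attn @ V                            # one context vector per character
print(context.shape)                          # (5, 4)
```

The decoding that maps each context vector back to a predicted character is model-specific and not detailed in the patent, so it is omitted here.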
Optionally, in a third implementation manner of the first aspect of the present invention, the performing of a word segmentation operation on the picture recognition text information and the material text information with a word segmentation tool to obtain text word segments comprises: invoking a preset word segmentation tool, and dividing the picture recognition text information and the material text information through the tool's dictionary into N reference segments of preset character lengths, where N is a natural number not less than 1; obtaining the reference segments produced by the word segmentation tool, looking up the preset dictionary corresponding to each reference segment's character length, and judging whether the reference segment exists in that dictionary; and, for reference segments not stored in the preset dictionary, screening them with a fuzzy matching algorithm to obtain the text word segments.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the screening of big data labels from the big data label library through the preset feature words and exclusion words as candidate labels corresponding to the text word segments comprises: traversing the big data label library based on the feature words and exclusion words, and matching the big data labels corresponding to the text word segments; judging whether the text word segments contain an exclusion word corresponding to a matched big data label; if so, determining that the text word segments have no corresponding candidate label; and if not, taking the big data label as a candidate label for the text word segments.
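This screening step can be sketched as two set intersections: a label whose feature words appear among the text word segments becomes a candidate unless one of its exclusion words also appears. The label names and word lists below are invented for illustration:

```python
# Sketch of feature-word/exclusion-word screening; LABEL_DB is an invented,
# toy stand-in for the patent's big data label library.

LABEL_DB = {
    "gold-investment": {"features": {"gold", "bullion"}, "excludes": {"gold medal"}},
    "sports":          {"features": {"medal", "match"},  "excludes": set()},
}

def candidate_labels(segments):
    segs = set(segments)
    out = []
    for label, entry in LABEL_DB.items():
        # feature word present AND no exclusion word present -> candidate
        if entry["features"] & segs and not entry["excludes"] & segs:
            out.append(label)
    return out

print(candidate_labels(["gold", "price"]))       # ['gold-investment']
print(candidate_labels(["gold", "gold medal"]))  # exclusion word hit -> []
```

The exclusion words let one surface form ("gold") map to different labels depending on context, which is exactly the ambiguity problem the background section describes.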
Optionally, in a fifth implementation manner of the first aspect of the present invention, before the obtaining of the material information uploaded to the material system by the user terminal, the method further includes:
obtaining historical audit texts, performing word segmentation on them to obtain historical text word segments, and converting the historical text word segments into corresponding word vectors; randomly selecting the word vectors of N historical text word segments as initial clustering centers, where N is a positive integer; calculating the cosine distance from every historical text word segment to each initial clustering center; assigning each historical text word segment to the label category of its nearest initial clustering center by cosine distance; calculating the mean vector of the word vectors in each label category, reselecting the current clustering center from the mean vector, and calculating the cosine distance between the current clustering center and the corresponding initial clustering center; if the cosine distance between the current clustering center and the corresponding initial clustering center is less than or equal to a preset threshold, obtaining the N label categories of all historical text word segments and the clustering center of each label category; if that cosine distance is greater than the preset threshold, re-clustering with the current clustering centers until the cosine distance between each current clustering center and the previous clustering center is less than or equal to the preset threshold, thereby obtaining the N label categories of all historical text word segments and the clustering center of each label category; taking each clustering center as a big data label, and the historical text word segments in the corresponding label category as that label's feature words; and building the big data label library from the big data labels and feature words.
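The clustering procedure above is essentially k-means under cosine distance with a center-movement stopping threshold. A small numpy sketch, with synthetic word vectors and a deterministic initialization in place of the patent's random selection, might look like:

```python
# Sketch of the tag-library clustering: k-means-style assignment under cosine
# distance, stopping when no center moves more than the threshold. Word-vector
# source and N are unspecified in the patent; the data here is synthetic.
import numpy as np

def cosine_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def cluster(vectors, n, threshold=1e-4, max_iter=100):
    # Patent: initial centers are chosen at random; evenly spaced picks are
    # used here so the example is reproducible.
    centers = vectors[:: max(1, len(vectors) // n)][:n].copy()
    labels = np.zeros(len(vectors), dtype=int)
    for _ in range(max_iter):
        # assign each word vector to the nearest center by cosine distance
        labels = np.array([min(range(n), key=lambda j: cosine_dist(v, centers[j]))
                           for v in vectors])
        # the mean vector of each label category becomes the current center
        new_centers = np.array([vectors[labels == j].mean(axis=0) for j in range(n)])
        moved = any(cosine_dist(c, nc) > threshold
                    for c, nc in zip(centers, new_centers))
        centers = new_centers
        if not moved:            # every center moved at most the threshold
            break                # -> re-clustering stops
    return centers, labels

# two well-separated synthetic blobs of 3-d "word vectors"
rng = np.random.default_rng(1)
vecs = np.vstack([rng.normal(loc, 0.05, (10, 3)) for loc in ([1, 0, 0], [0, 1, 0])])
centers, labels = cluster(vecs, 2)
print(sorted(set(labels.tolist())))  # [0, 1]
```

Each resulting center would serve as a big data label and its cluster members as that label's feature words, per the step above; an empty-cluster guard, omitted here for brevity, would be needed in practice.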
Optionally, in a sixth implementation manner of the first aspect of the present invention, the adding of the label to the corresponding material information and the sending of the labeled material information to the material system to bring the material online comprises: adding the label to the corresponding material information, and uploading the labeled material information to the material system; submitting the labeled material information through the material system to a preset office automation system for process sign-off; and, when the office automation system completes sign-off of the labeled material information, publishing the labeled material information online.
The second aspect of the present invention provides a material online apparatus, comprising: an acquisition module for acquiring material information uploaded to a material system by a user terminal, the material information comprising at least a material picture and material text information; a recognition module for performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information; a word segmentation module for performing a word segmentation operation on the picture recognition text information and the material text information with a word segmentation tool to obtain text word segments; a filtering module for filtering the text word segments for sensitive words against a preset sensitive word lexicon to obtain a filtering result of either filtering success or filtering failure; a marking module for, when the filtering result is failure, marking the text word segments identified as sensitive words in the picture recognition text information and the material text information, and returning the marked picture recognition text information and material text information to the user terminal; a screening module for, when the filtering result is success, screening big data labels from the big data label library through the preset feature words and exclusion words as candidate labels corresponding to the text word segments; a counting module for counting the positions and numbers of occurrences of the feature words and exclusion words corresponding to each candidate label in the picture recognition text information and the material text information; a calculating module for calculating a weight score for each candidate label from a preset weight formula, the positions, and the numbers of occurrences; a labeling module for taking the candidate label with the largest weight score as the label of the material; and an online module for adding the label to the corresponding material information and sending the labeled material information to the material system to bring the material online.
Optionally, in a first implementation manner of the second aspect of the present invention, the recognition module specifically includes: a character set recognition unit for recognizing the characters in the material picture through a preset optical character recognition technology to obtain a character set corresponding to the material picture; a character correction unit for converting each character in the character set into a corresponding character vector and inputting the character vectors into a preset semantic correction model to obtain a predicted character for each character; and a character summarizing unit for concatenating the predicted characters of all the characters in order to obtain the picture recognition text information of the material picture.
Optionally, in a second implementation manner of the second aspect of the present invention, the character correction unit is specifically configured to: convert each character into a corresponding character vector and input it into the semantic correction model; multiply, in the semantic correction model, the character vector of each character by predetermined matrices WQ, WK, and WV to obtain the query, key, and value vectors of that character; and perform encoding and decoding operations on the query, key, and value vectors of the characters to obtain the predicted character for each character.
Optionally, in a third implementation manner of the second aspect of the present invention, the word segmentation module is configured to: invoke a preset word segmentation tool, and divide the picture recognition text information and the material text information through the tool's dictionary into N reference segments of preset character lengths, where N is a natural number not less than 1; obtain the reference segments produced by the word segmentation tool, look up the preset dictionary corresponding to each reference segment's character length, and judge whether the reference segment exists in that dictionary; and, for reference segments not stored in the preset dictionary, screen them with a fuzzy matching algorithm to obtain the text word segments.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the screening module is specifically configured to: traverse the big data label library based on the feature words and exclusion words, and match the big data labels corresponding to the text word segments; judge whether the text word segments contain an exclusion word corresponding to a matched big data label; if so, determine that the text word segments have no corresponding candidate label; and if not, take the big data label as a candidate label for the text word segments.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the material online apparatus further includes a label clustering module configured to: obtain historical audit texts, perform word segmentation on them to obtain historical text word segments, and convert the historical text word segments into corresponding word vectors; randomly select the word vectors of N historical text word segments as initial clustering centers, where N is a positive integer; calculate the cosine distance from every historical text word segment to each initial clustering center; assign each historical text word segment to the label category of its nearest initial clustering center by cosine distance; calculate the mean vector of the word vectors in each label category, reselect the current clustering center from the mean vector, and calculate the cosine distance between the current clustering center and the corresponding initial clustering center; if the cosine distance between the current clustering center and the corresponding initial clustering center is less than or equal to a preset threshold, obtain the N label categories of all historical text word segments and the clustering center of each label category; if that cosine distance is greater than the preset threshold, re-cluster with the current clustering centers until the cosine distance between each current clustering center and the previous clustering center is less than or equal to the preset threshold, thereby obtaining the N label categories of all historical text word segments and the clustering center of each label category; take each clustering center as a big data label and the historical text word segments in the corresponding label category as that label's feature words; and build the big data label library from the big data labels and feature words.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the online module is specifically configured to: add the label to the corresponding material information and upload the labeled material information to the material system; submit the labeled material information through the material system to a preset office automation system for process sign-off; and, when the office automation system completes sign-off of the labeled material information, publish the labeled material information online.
The third aspect of the invention provides a material online device, comprising: a memory storing instructions and at least one processor, the memory and the at least one processor being interconnected by a line; the at least one processor calls the instructions in the memory to cause the material online device to perform the steps of the material online method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the steps of the material online method described above.
According to the technical scheme provided by the invention, material information uploaded to a material system by a user terminal is acquired, the material information comprising at least a material picture and material text information; text recognition is performed on the material picture through a preset optical character recognition technology to obtain picture recognition text information; a word segmentation operation is performed on the picture recognition text information and the material text information with a word segmentation tool to obtain text word segments; the text word segments are filtered for sensitive words against a preset sensitive word lexicon to obtain a filtering result of either filtering success or filtering failure; if the filtering result is failure, the text word segments identified as sensitive words in the picture recognition text information and the material text information are marked, and the marked picture recognition text information and material text information are returned to the user terminal; if the filtering result is success, big data labels are screened from a big data label library, through the feature words and exclusion words preset in the library, as candidate labels corresponding to the text word segments; the positions and numbers of occurrences of the feature words and exclusion words corresponding to each candidate label in the picture recognition text information and the material text information are counted; a weight score is calculated for each candidate label from a preset weight formula, the positions, and the numbers of occurrences; the candidate label with the largest weight score is taken as the label of the material; and the label is added to the corresponding material information, and the labeled material information is sent to the material system to bring the material online.
The method adopts optical character recognition: OCR text recognition is performed on the uploaded picture content to obtain its text, the text is labeled through the label library, and sensitive words are filtered, so that labeling of both picture and text content and verification that the content is legal and compliant are achieved quickly. Manual steps are reduced, so materials are brought online faster.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of the material online method in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a second embodiment of the material online method in an embodiment of the invention;
FIG. 3 is a schematic diagram of a third embodiment of the material online method in an embodiment of the invention;
FIG. 4 is a schematic diagram of a fourth embodiment of the material online method in an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of the material online apparatus in an embodiment of the invention;
FIG. 6 is a schematic diagram of another embodiment of the material online apparatus in an embodiment of the invention;
FIG. 7 is a schematic diagram of an embodiment of the material online device in an embodiment of the invention.
Detailed Description
The embodiment of the invention provides a material online method, apparatus, device, and storage medium, wherein the method comprises: acquiring material information uploaded to a material system by a user terminal, the material information comprising at least a material picture and material text information; performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information; performing a word segmentation operation on the picture recognition text information and the material text information with a word segmentation tool to obtain text word segments; filtering the text word segments for sensitive words against a preset sensitive word lexicon to obtain a filtering result of either filtering success or filtering failure; if the filtering result is failure, marking the text word segments identified as sensitive words in the picture recognition text information and the material text information, and returning the marked picture recognition text information and material text information to the user terminal; if the filtering result is success, screening big data labels from a big data label library, through the feature words and exclusion words preset in the library, as candidate labels corresponding to the text word segments; counting the positions and numbers of occurrences of the feature words and exclusion words corresponding to each candidate label in the picture recognition text information and the material text information; calculating a weight score for each candidate label from a preset weight formula, the positions, and the numbers of occurrences; taking the candidate label with the largest weight score as the label of the material; and adding the label to the corresponding material information and sending the labeled material information to the material system to bring the material online. The method adopts optical character recognition: OCR text recognition is performed on the uploaded picture content to obtain its text, the text is labeled through the label library, and sensitive words are filtered, so that labeling of both picture and text content and verification that the content is legal and compliant are achieved quickly. Manual steps are reduced, so materials are brought online faster.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a specific flow of an embodiment of the present invention is described below, and referring to fig. 1, a first embodiment of a material online method in an embodiment of the present invention includes:
101, acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises material pictures and material text information;
In this embodiment, the material is advertising material; basic advertising material takes the form of text, picture, and Flash data. To ensure that the content of the advertising material is legal and can be accessed normally after going online, the material needs to be audited. When uploading material to the audit page, a user can upload material pictures and material text information separately; alternatively, an office suite interface can be connected to the system background, so that the user can upload a whole document at one time, in formats including but not limited to PPT and DOC, and the system parses it online to obtain the material pictures and material text information contained in the document.
In this embodiment, regarding the text information in the material picture: because material pictures differ in size and style, the region containing the text differs from picture to picture. The text region in the material picture therefore needs to be located by an image detection algorithm first, and the characters within that region are then recognized by the optical character recognition technology.
In this step, the image detection algorithm locates, in a given picture containing characters of unspecified fonts, the region where those characters appear. Optionally, the located text region is a rectangular box. For example, the algorithm frames the character region in the original picture and ignores the non-character region (such as a blank background). The image detection algorithm may be the dlib image detection algorithm, an opencv-library image detection algorithm, or a deep-learning-based image detection algorithm.
Preferably, in this embodiment, a deep-learning-based image detection algorithm is adopted to detect the original picture set and extract the text regions of characters with unspecified fonts. Obtaining these text regions allows the subsequent processing of the text to proceed more efficiently.
103, performing word segmentation operation on the image recognition text information and the material text information by using a word segmentation tool to obtain text word segmentation;
In this embodiment, the system performs word segmentation with the jieba word segmentation tool: the tool builds a directed acyclic graph of the sentence by dictionary lookup, then searches for the maximum-probability path according to the selected mode, and cuts the sentence accordingly.
104, performing sensitive word filtering on the text participles according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
In this embodiment, the text participles are compared against the sensitive word lexicon by traversal. If any text participle is found in the sensitive word lexicon, filtering fails; if none of the text participles appears in the sensitive word lexicon, filtering succeeds.
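The traversal comparison above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the lexicon and tokens are hypothetical.

```python
# Each text participle is checked against a preset sensitive-word lexicon;
# the filter fails as soon as any token is found in the lexicon.
SENSITIVE_LEXICON = {"forbidden", "banned"}

def filter_sensitive(tokens, lexicon=SENSITIVE_LEXICON):
    """Return ("success", []) or ("failure", hits) for a token list."""
    hits = [t for t in tokens if t in lexicon]
    return ("failure", hits) if hits else ("success", hits)

status, hits = filter_sensitive(["hello", "banned", "world"])
```

A set gives O(1) membership tests, so the whole pass is linear in the number of participles.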
105, if the filtering result is filtering failure, marking the text participles which are identified as sensitive words in the picture identification text information and the material text information, and returning the marked picture identification text information and the marked material text information to the user terminal;
In this step, a mapping table is built for the text participles of the picture recognition text information and the material text information, and the positions of the participles identified as sensitive words within the material uploaded by the user are recorded synchronously. When filtering subsequently fails, this information is fed back in real time into the picture recognition text information and material text information submitted by the user, making it easy for the user to locate and modify the offending content.
In this embodiment, when the system detects that a text participle is a sensitive word, the mapping table is used to find and highlight its exact position in the material text information or material picture submitted by the user. The user can then edit and modify the material online; after modification, the system performs secondary text recognition on the modified content in real time, re-segments the picture recognition text information and material text information, compares the updated participles against the sensitive word lexicon by traversal again, and thereby performs a second round of filtering.
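The mapping-table idea above can be sketched as a map from each flagged participle to every character offset at which it occurs in the submitted text, which is enough for a front end to locate and highlight it. Names and inputs are illustrative, not from the patent.

```python
# Build a position map: flagged token -> list of character offsets in text.
def build_position_map(text, flagged_tokens):
    positions = {}
    for tok in flagged_tokens:
        start, hits = 0, []
        while (i := text.find(tok, start)) != -1:
            hits.append(i)
            start = i + len(tok)
        positions[tok] = hits
    return positions

pm = build_position_map("bad word here, bad again", ["bad"])
# pm["bad"] -> [0, 15]
```

On re-submission the same map can be rebuilt against the edited text to drive the second filtering pass.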
In practical application, to improve the labels' ability to characterize the material and to help users search for related materials by label, the system attaches a weight to each match made while matching labels to text participles: every matched label carries a weight value representing the degree of match, or correlation, between the label and the material.
In this embodiment, each big data label in the big data label library needs to be configured with feature words and exclusion words. A feature word is one of a series of words having the same or a similar meaning as the label: it may be another name for the label, a subclass under the label, or another word related to the label, and usually the label itself is also one of its feature words. For example, the feature words of the label "gold" include "gold", "gold price", "buy gold", and so on.
In this embodiment, a mapping set is built over the full text; it records the position and total number of occurrences of each feature word and its corresponding exclusion words. Building this mapping table prepares the statistics needed to evaluate the weight formula in the next step.
In this embodiment, by analyzing historical materials against their labels, several different calculation modes can be preset, each obtained from a different distribution of weights over positions. For example, each occurrence of a feature word in the title scores A, each occurrence in a subtitle scores B, each occurrence in the first paragraph scores C, each occurrence in the last paragraph scores D, each occurrence elsewhere scores E, and each occurrence of an exclusion word corresponding to the feature word subtracts F. One possible weight formula is then simply: total score of the feature word = A + B + C + D + E - F, with each term multiplied by the corresponding number of occurrences. Adjusting the scores assigned to the different positions changes the resulting totals, so the same material information can yield different labels.
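The positional weight formula above can be sketched as follows. The score values (stand-ins for A through F) and position names are hypothetical; the patent only fixes the structure, not the numbers.

```python
# Per-position scores: occurrences at each position are multiplied by that
# position's score; each exclusion-word hit subtracts the penalty (score F).
SCORES = {"title": 5, "subtitle": 4, "first": 3, "last": 2, "body": 1}
EXCLUDE_PENALTY = 3  # score F

def weight_score(feature_hits, exclusion_hits):
    """feature_hits: dict position -> occurrence count; exclusion_hits: int."""
    total = sum(SCORES[pos] * n for pos, n in feature_hits.items())
    return total - EXCLUDE_PENALTY * exclusion_hits

s = weight_score({"title": 1, "body": 2}, exclusion_hits=1)
# 5*1 + 1*2 - 3*1 = 4
```

Tuning the per-position scores changes the ranking of candidate labels, which is exactly the adjustability the paragraph describes.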
In this step, to improve the big data labels' ability to characterize the material and to provide a more flexible label service, a weight needs to be attached to each label of the material: for every big data label the system generates for the material, a weight value represents the degree of match, or correlation, between that label and the material. A specific implementation is as follows. The position at which a feature word appears in the material matters: an appearance in the title is the most significant, followed by the abstract, the beginning of the body text, and so on. Accordingly, the title weight is set to 3, the abstract weight to 2, and the first sentence of the body to 1, with the weight of each subsequent sentence decreasing linearly until the last sentence has weight 0 (the last sentence of a resource is usually an extended remark and may stray far from the subject of the text). The occurrences of the feature words in the material and the weight sum determined by their positions and counts are then computed, and the label weight is obtained by dividing the weight sum of the hit feature words by the product of the total number of the label's feature words and the material's weight sum, i.e., label weight = (weight sum of hit feature words) / (number of feature words in the label x material weight sum) x 100. The resulting label weight falls between 0 and 100; a higher label weight indicates a stronger correlation between the label and the material.
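The second weighting scheme above can be sketched as two small helpers: one for the linearly decreasing sentence weight (first body sentence 1.0, last sentence 0.0) and one for the 0-to-100 label weight. All inputs below are illustrative numbers, not values from the patent.

```python
# Linearly decreasing sentence weight and the 0-100 label weight formula:
# label weight = hit weight sum / (label feature count * material weight sum) * 100.
def sentence_weight(idx, n_sentences):
    """Weight of the idx-th body sentence: 1.0 for the first, 0.0 for the last."""
    if n_sentences <= 1:
        return 1.0
    return 1.0 - idx / (n_sentences - 1)

def label_weight(hit_weight_sum, n_label_features, material_weight_sum):
    if n_label_features == 0 or material_weight_sum == 0:
        return 0.0
    return hit_weight_sum / (n_label_features * material_weight_sum) * 100

w = label_weight(hit_weight_sum=6.0, n_label_features=4, material_weight_sum=3.0)
# 6 / (4 * 3) * 100 = 50.0
```

Because the numerator can never exceed the denominator times the feature count, the result stays in the 0-100 range the text specifies.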
In this embodiment, after the total weight score of each feature word is calculated in step 108, the candidate label with the largest weight score is taken as the label of the material.
And 110, adding the label into the corresponding material information, and sending the material information added with the label to a material system for material online.
In this embodiment, after the label is added on the material audit page and the system audit passes, the audit-filtering step is essentially complete. The follow-up process still requires leader review, countersigning, and confirmation according to the company's organizational structure and workflow before the material can be released online, which helps prevent typographical errors and certain policy risks.
According to the technical scheme, material information uploaded to a material system by a user terminal is obtained, the material information at least comprising a material picture and material text information; text recognition is performed on the material picture through a preset optical character recognition technology to obtain picture recognition text information; a word segmentation operation is performed on the picture recognition text information and the material text information by a word segmentation tool to obtain text participles; sensitive word filtering is performed on the text participles according to a preset sensitive word lexicon to obtain a filtering result, the filtering result being either filtering success or filtering failure; if the filtering result is failure, the text participles identified as sensitive words in the picture recognition text information and the material text information are marked, and the marked picture recognition text information and material text information are returned to the user terminal; if the filtering result is success, big data labels are screened from a big data label library through the preset feature words and exclusion words in the library as candidate labels corresponding to the text participles; the positions and numbers of occurrences of the feature words and exclusion words corresponding to each candidate label in the picture recognition text information and the material text information are counted; a weight score for each candidate label is calculated according to a preset weight formula, the positions, and the counts; the candidate label with the largest weight score is taken as the label of the material; and the label is added to the corresponding material information, and the labeled material information is sent to the material system for material online.
The method adopts optical character recognition technology: OCR text recognition is performed on the uploaded picture content to obtain its text, the text content is labeled through a label library, and sensitive words are filtered, so that labeling of the picture and text content and verification that the content is legal and compliant can be completed quickly. Fewer manual operations are required, achieving the aim of bringing materials online quickly.
Referring to fig. 2, a second embodiment of the material on-line method in the embodiment of the present invention includes:
202, performing text recognition on a material picture through a preset optical character recognition technology to obtain picture recognition text information;
In this embodiment, the material picture is any picture requiring character recognition. The resulting picture recognition text information may be empty, or may include at least one character recognized from the material picture; when it includes at least one recognized character, execution of step 203 is triggered. The character recognition model described in the embodiment of the present invention is based on Connectionist Temporal Classification (CTC) and mainly comprises two parts: a picture feature extraction network (such as CNN, VGG, or ResNet) and a text classification network (such as RNN or LSTM).
In the embodiment of the invention, after the characters in the material picture are recognized, the character recognition device can convert each recognized character into a corresponding character vector through word-embedding technology, with each character corresponding to one character vector. For example, when the recognized string consists of two characters, the character recognition device converts it into a 2 x 512 matrix, i.e., each character is replaced by a 512-dimensional character vector.
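The embedding step above can be sketched as a lookup table mapping each character to a fixed-length vector, so a string of k characters becomes a k x d matrix (the patent's example uses d = 512; d = 4 here for brevity). The random table is a stand-in for a trained embedding layer.

```python
import random

# Build a toy embedding table: one fixed-length random vector per character.
def build_embedding_table(vocab, dim=4, seed=0):
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1, 1) for _ in range(dim)] for ch in vocab}

def embed(text, table):
    """Return a len(text) x dim matrix (list of per-character vectors)."""
    return [table[ch] for ch in text]

table = build_embedding_table("abc", dim=4)
matrix = embed("ab", table)  # 2 x 4 matrix: one row per character
```

In a real system the table would come from a trained word-embedding model rather than a seeded random generator.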
In this step, the matrix WQ is a query matrix: multiplying a character vector by WQ yields a query character vector. The matrix WK is a key matrix: multiplying a character vector by WK yields a key character vector. The matrix WV is a value matrix: multiplying a character vector by WV yields a value character vector. Multiplying the input character vectors by the three predetermined matrices WQ, WK, and WV can thus be understood as producing three corresponding character features.
Specifically, the values of the matrix WQ, the matrix WK, and the matrix WV are set at random at the beginning of model training, and the semantic correction model is trained and continuously adjusted through a large number of training samples to obtain the predetermined matrix WQ, matrix WK, and matrix WV.
In this step, encoding and decoding operations are performed on the obtained query, key, and value character vectors, so that the features corresponding to the initial characters are extracted through the three matrices; finally, the feature values of the three vectors are fused and decoded to obtain the predicted character corresponding to each character.
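The Q/K/V mechanism described above is standard scaled dot-product self-attention; a minimal pure-Python sketch follows. The tiny random matrices stand in for the trained WQ, WK, WV parameters, and the fusion shown is the attention-weighted sum of value vectors (the patent's full encode/decode pipeline is not reproduced).

```python
import math, random

def matvec(m, v):
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

def softmax(xs):
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(xs, wq, wk, wv):
    """xs: list of character vectors; returns one fused vector per input."""
    qs = [matvec(wq, x) for x in xs]
    ks = [matvec(wk, x) for x in xs]
    vs = [matvec(wv, x) for x in xs]
    d = len(qs[0])
    out = []
    for q in qs:
        # attention weights: softmax of scaled query-key dot products
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
        attn = softmax(scores)
        # fuse: weighted sum of value vectors
        out.append([sum(a * v[j] for a, v in zip(attn, vs)) for j in range(d)])
    return out

rng = random.Random(0)
dim = 4
rand_matrix = lambda: [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
wq, wk, wv = rand_matrix(), rand_matrix(), rand_matrix()
xs = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
ys = self_attention(xs, wq, wk, wv)
```

In training, WQ, WK, and WV start random (as here) and are adjusted on labeled samples, matching the initialization described in the surrounding text.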
In a specific application, because existing optical character recognition technology makes recognition errors with some probability, a semantic correction model is added to reduce the error rate: the model corrects low-confidence recognized character vectors by linking them with the character vectors before and after them. Training the semantic correction module requires pairs of manual annotations and detection-recognition results from historical optical character recognition scenarios; for example, a manual annotation of "5/3/2019" paired with a text detection and recognition model output of "3/5/2019/m" constitutes one training sample. The semantic correction module learns a decoding scheme from such category information and finally produces standard text that conforms to the semantics. For example, when the pre-trained character recognition model outputs a string of visually similar but semantically wrong characters, the character recognition device inputs the corresponding vectors into the trained semantic correction module, which corrects them to the characters "at risk"; "at risk" is then taken as the final character recognition result for the image.
208, performing sensitive word filtering on the text participles according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
and 214, adding the label into the corresponding material information, and sending the material information added with the label to a material system for material online.
On the basis of the previous embodiment, this embodiment describes in detail how the characters in the material picture are recognized through the preset optical character recognition technology to obtain a character set corresponding to the material picture; how each character in the character set is converted into a corresponding character vector, which is input into the semantic correction model of the optical character recognition technology to obtain a predicted character for each character; and how the predicted characters for all characters are assembled in order to obtain the picture recognition text information of the material picture. Adopting the semantic correction model improves the accuracy of character recognition and reduces the probability of typographical errors.
Referring to fig. 3, a third embodiment of the material online method in the embodiment of the present invention includes:
301, acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises material pictures and material text information;
Preferably, the basic jieba word segmentation process is as follows. Efficient word-graph scanning is implemented on a Trie (dictionary tree) structure, generating a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in the sentence. The trie is generated from the dictionary file; while generating it, the dictionary converts each word's occurrence count into a frequency. For the sentence to be segmented, a DAG is generated from the trie built from the dictionary file; in plain terms, this is a dictionary-lookup operation that produces the several possible segmentations of the sentence given the dictionary. In the DAG, jieba records the starting position of each word in the sentence, from 0 to n-1 (n being the sentence length); each starting position serves as a dictionary key whose value is a list storing the possible end positions (the word is found by dictionary lookup, and starting position + word length gives the end position). Dynamic programming is then used to find the maximum-probability path, i.e., the maximum-probability segmentation combination based on word frequency. For each already-segmented word in the sentence, its frequency of occurrence (count/total) is looked up; if the word is absent from the dictionary, the frequency of the rarest word in the dictionary is used instead, i.e., P(word) = the word's frequency, with the minimum dictionary frequency as the fallback.
In this dynamic programming search for the maximum-probability path, the sentence is evaluated in reverse, from right to left: P(NodeN) = 1.0, P(NodeN-1) = P(NodeN) x max(P(word ending at NodeN)), and so on, finally yielding the maximum-probability path and thus the maximum-probability segmentation combination.
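The DAG construction and right-to-left dynamic programming described above can be sketched in pure Python. The tiny dictionary, its frequencies, and the input string are illustrative stand-ins for jieba's real dictionary file.

```python
# Toy dictionary: word -> occurrence count.
FREQ = {"a": 5, "ab": 20, "b": 5, "c": 10, "bc": 2}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start position, list the end positions of dictionary words."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag

def segment(sentence):
    n = len(sentence)
    dag = build_dag(sentence)
    min_freq = min(FREQ.values())  # unknown words get the minimum frequency
    route = {n: (1.0, 0)}          # right-to-left DP: P(NodeN) = 1.0
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (FREQ.get(sentence[i:j], min_freq) / TOTAL * route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

seg = segment("abc")  # "ab" outweighs "a"+"b", so the cut is ["ab", "c"]
```

Real jieba works in log space to avoid underflow on long sentences; the raw products here keep the sketch short.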
In this embodiment, the picture recognition text information and the material text information are obtained, and the word segmentation tool is called to divide them into a number of character strings of preset lengths. The corresponding preset dictionary is looked up according to each string's preset length to judge whether the string exists in the preset dictionary. If a string does not exist in the preset dictionary, its recognition is considered abnormal; in that case, the target characters are screened out from the missing strings through the fuzzy matching algorithm, so that character recognition is achieved through fuzzy matching and recognition efficiency is improved.
305, when a reference participle is not found in the preset dictionary, screening the missing reference participle through a fuzzy matching algorithm to obtain the text participle;
In practical application, the fuzzy matching algorithm measures the minimum number of editing operations required to convert one string into another; this number of operations is the edit distance. Two strings with a smaller edit distance are more similar, and when the edit distance is 0 the two strings are equal. Character recognition is realized on this basis.
In a specific implementation, words whose edit distance does not exceed a threshold are looked up through a BK-tree algorithm. For example, if "Zhongyuan" does not exist in the dictionary, the BK-tree lookup may return "China". Here the edit distance from string A to string B is the minimum number of insert, delete, and replace operations needed to change A into B, using as few steps as possible. For example, FAME to GATE takes two steps (two replacements), and GAME to ACM takes three steps (delete G and E, add C). The screened word "China" is displayed as the target characters, so text recognition is realized through the fuzzy matching algorithm and its accuracy is improved.
and step 312, adding the label into the corresponding material information, and sending the material information added with the label to a material system for material online.
On the basis of the previous embodiment, this embodiment describes in detail how the preset word segmentation tool is called to divide the picture recognition text information and the material text information, through the tool's dictionary, into N reference participles of preset character lengths, N being a natural number not less than 1; how the reference participles produced by the tool are obtained, the corresponding preset dictionary is looked up according to each participle's character length, and whether the participle exists in the preset dictionary is judged; and how, when a reference participle is not found in the preset dictionary, the missing participle is screened through the fuzzy matching algorithm to obtain and display the text participle. Adopting the preset word segmentation tool quickly assembles the recognized characters into meaningful text participles and improves character recognition efficiency.
Referring to fig. 4, a fourth embodiment of the material on-line method in the embodiment of the present invention includes:
step 401, acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises material pictures and material text information;
step 406, traversing the big data label database based on the feature words and the excluded words, and matching out big data labels corresponding to the text participles;
In this embodiment, because of the flexibility of language and the diversity of word meanings, a word's sense is sometimes ambiguous. For example, the "gold" label in a banking scenario refers to business related to buying and selling gold, but a resource in which the word "gold" appears is not necessarily related to that business. Exclusion words are therefore set as constraints: for example, the exclusion words of "gold" include "golden week", "golden holiday", and the like, and when the exclusion word "golden week" appears among the text participles, the feature word "gold" is known to be invalid, thereby avoiding generation of a wrong "gold" label.
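The exclusion-word constraint above can be sketched as follows: a feature-word hit produces a candidate label only if none of the label's exclusion words also appears among the text participles. The word lists are illustrative (they mirror the "gold" example).

```python
# Label config: feature words trigger the label, exclusion words veto it.
LABELS = {
    "gold": {
        "features": {"gold", "gold price", "buy gold"},
        "exclusions": {"golden week", "golden holiday"},
    }
}

def candidate_labels(tokens, labels=LABELS):
    toks = set(tokens)
    out = []
    for name, cfg in labels.items():
        if toks & cfg["features"] and not toks & cfg["exclusions"]:
            out.append(name)
    return out

candidate_labels(["buy gold", "today"])        # label produced
candidate_labels(["gold", "golden week"])      # vetoed by the exclusion word
```

In the full method the surviving candidates then go on to the position-and-count weighting step rather than being emitted directly.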
In this embodiment, the big data tag library is mainly obtained through the following steps:
Obtaining a historical audit text, performing word segmentation on it to obtain historical text participles, converting the participles into corresponding word vectors, and setting the number of clusters to N. Randomly selecting N historical text participles as initial clustering centers. Calculating the cosine distance from every historical text participle to each initial clustering center. Assigning historical text participles whose cosine distance to an initial clustering center falls within a preset error range to the same label category. Calculating the mean vector of the word vectors in each label category, reselecting the current clustering center according to the mean vector, and calculating the cosine distance between the current clustering center and the corresponding initial clustering center. If that cosine distance is less than or equal to a preset threshold, obtaining the N label categories covering all historical text participles and the clustering center of each category; if it is greater than the preset threshold, re-clustering with the current clustering centers until the cosine distance between the current clustering center and the previous clustering center is less than or equal to the preset threshold, likewise obtaining the N label categories and their clustering centers. Taking each clustering center as a big data label, and taking the historical text participles in the corresponding label category as the feature words of that big data label. Obtaining the big data label library from the big data labels and their feature words.
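The clustering procedure above is essentially k-means with cosine distance; a condensed sketch follows. The toy 2-D vectors stand in for word2vec output, and the convergence test (center movement below a tolerance) mirrors the threshold check in the text.

```python
import math

def cosine_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def mean_vec(vs):
    return [sum(col) / len(vs) for col in zip(*vs)]

def cluster(vectors, centers, max_iter=20, tol=1e-6):
    """Assign vectors to nearest center, recompute centers, repeat."""
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for v in vectors:
            i = min(range(len(centers)), key=lambda i: cosine_dist(v, centers[i]))
            groups[i].append(v)
        new_centers = [mean_vec(g) if g else c for g, c in zip(groups, centers)]
        moved = max(cosine_dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= tol:  # centers stopped moving: clustering has converged
            break
    return centers, groups

vecs = [[1, 0.1], [1, 0.2], [0.1, 1], [0.2, 1]]
centers, groups = cluster(vecs, centers=[[1, 0], [0, 1]])
```

Each final center would then serve as a big data label, with the participles in its group as that label's feature words.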
In this step, before the cosine distance is calculated, the labeled text segments need to be converted into word vector form: they are input into a word2vec model, which outputs the corresponding word vectors. word2vec, also known as word embedding, transforms words in natural language into dense vectors that a computer can work with. word2vec has two main modes, CBOW (Continuous Bag of Words) and Skip-Gram: CBOW infers the target word from the surrounding sentence, while Skip-Gram does the opposite, inferring the surrounding words from the target word. CBOW is better suited to small corpora, while Skip-Gram performs better on large corpora.
In this embodiment, to divide feature words and exclusion words among the big data labels accurately and improve label accuracy, historical materials are used to train and update the big data label library. After a clustering center is obtained, it can serve as the standard for its big data label category, i.e., as the big data label itself. Clustering partitions the sample set into clusters by inter-sample distance so that points within a cluster are as close together as possible, and the point closest to the average distance of the points in a cluster is its center; the feature word corresponding to the clustering center is therefore semantically closest to the other feature words in the same category, can stand for the category as its big data label, and establishes the mapping between feature words and big data labels. For instance, a database table can be maintained with two columns, where keyword holds all text participles to be extracted and index_name holds the big data label, serving as big data labels of equivalent meaning.
Step 407, judging whether the text participles contain an excluded word corresponding to the big data label;
In this step, to prevent stop words (auxiliary words, conjunctions, adverbs, and other high-frequency words with no specific meaning) from being matched, the system also provides a flow for judging whether a participle is a stop word, which effectively improves the accuracy and relevance of automatically generated labels.
Step 411, calculating a weight score corresponding to the candidate label according to a preset weight formula, the position and the times;
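The patent does not disclose the concrete weight formula, so the sketch below assumes a simple scheme in which each occurrence is weighted by where it appears; the position names and weight values are illustrative assumptions only.

```python
# Illustrative position weights (title counts more than picture text or
# body); the patent's actual preset weight formula is not disclosed.
POSITION_WEIGHTS = {"title": 3.0, "picture_text": 2.0, "body": 1.0}

def weight_score(occurrences):
    """Score one candidate label from (position, count) pairs of its
    feature-word hits; unknown positions default to weight 1.0."""
    return sum(POSITION_WEIGHTS.get(pos, 1.0) * count
               for pos, count in occurrences)

def best_label(scores):
    """Take the candidate label whose weight score is the maximum."""
    return max(scores, key=scores.get)
```

Exclusion-word hits could subtract in the same position-weighted way; taking the maximum over all candidates' scores then yields the label assigned to the material.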
Step 415, when the office automation system finishes signing the labeled material information, the labeled material information is released online.
On the basis of the previous embodiment, this embodiment of the invention describes the label matching process in detail. Big data labels are screened from the big data label database through the feature words and exclusion words as candidate labels corresponding to the text participles; the positions and numbers of occurrences of the feature words and exclusion words corresponding to the candidate labels in the picture recognition text information and the material text information are counted; a weight score corresponding to each candidate label is calculated according to a preset weight formula, the positions, and the numbers of occurrences; and the candidate label with the largest weight score among all the weight scores is taken as the label corresponding to the material. Concretely, the feature words in the big data label library are traversed and matched against the text participles to determine the big data labels corresponding to the text participles; whether a text participle contains an exclusion word corresponding to its big data label is judged; if so, it is determined that the text participle has no corresponding candidate label; if not, the big data label is taken as a candidate label of the text participle; and these steps are executed over all text participles to obtain the candidate labels corresponding to all of them. By combining feature words with exclusion words and weighting the screened candidate labels, labels with higher relevance are obtained more effectively, and automation efficiency and accuracy are improved.
The material online method in the embodiment of the present invention is described above. Referring to fig. 5, a material online device in the embodiment of the present invention is described below; one embodiment of the material online device includes:
the acquiring module 501 is configured to acquire material information uploaded to a material system by a user terminal, where the material information at least includes material pictures and material text information;
the identification module 502 is configured to perform text identification on the material picture through a preset optical character identification technology to obtain picture identification text information;
a word segmentation module 503, configured to perform word segmentation on the image recognition text information and the material text information by using a word segmentation tool, so as to obtain text word segmentation;
the filtering module 504 is configured to perform sensitive word filtering on the text segmentation words according to a preset sensitive word lexicon to obtain a filtering result, where the filtering result includes filtering success or filtering failure;
a marking module 505, configured to mark, when the filtering result is a filtering failure, text segments identified as sensitive words in the picture identification text information and the material text information, and return the marked picture identification text information and the material text information to the user terminal;
a screening module 506, configured to, when the filtering result is that filtering is successful, screen a big data tag from a big data tag library as a candidate tag corresponding to the text segmentation word through a feature word and an exclusion word in the preset big data tag library;
a counting module 507, configured to count positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate tags in the image recognition text information and the material text information;
a calculating module 508, configured to calculate a weight score corresponding to the candidate tag according to a preset weight formula, the position, and the number of times;
the marking module 509 is configured to use the candidate tag corresponding to the weight score with the largest value among all the weight scores as the tag corresponding to the material;
and the online module 510 is configured to add the tag to the corresponding material information, and send the material information with the tag added to the material system for online material.
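The behavior of the filtering module 504 and the marking module 505 can be sketched together as follows; the lexicon contents and function name are hypothetical, and a production system would typically use a multi-pattern matcher rather than a plain set lookup.

```python
SENSITIVE_LEXICON = {"badword"}  # hypothetical preset sensitive-word lexicon

def filter_sensitive(participles):
    """Return ("success", []) when no participle is a sensitive word, or
    ("failure", hits) where hits are the participles to be marked in the
    text information and returned to the user terminal."""
    hits = [p for p in participles if p in SENSITIVE_LEXICON]
    return ("failure", hits) if hits else ("success", [])
```

On "failure" the flow stops and the marked text goes back to the terminal; on "success" the flow continues to the screening module 506.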
In the embodiment of the invention, the material online device runs the material online method, and material information uploaded to a material system by a user terminal is obtained, wherein the material information at least comprises material pictures and material text information; performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information; performing word segmentation operation on the image identification text information and the material text information by using a word segmentation tool to obtain text word segmentation; performing sensitive word filtering on the text participles according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure; if the filtering result is failure, marking the text participles which are identified as sensitive words in the picture identification text information and the material text information, and returning the marked picture identification text information and the material text information to the user terminal; if the filtering result is successful, screening a big data label from a big data label library through a preset feature word and an exclusion word in the big data label library as a candidate label corresponding to the text word segmentation; counting the positions and times of the feature words and the excluding words corresponding to the candidate labels in the image recognition text information and the material text information; calculating a weight score corresponding to the candidate label according to a preset weight formula, the position and the times; taking the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material; and adding the label into the corresponding material information, and sending the material information added with the label to 
the material system for material online. This solves the technical problem of low labeling accuracy in the existing material online process.
Referring to fig. 6, a second embodiment of the material online device in the embodiment of the present invention includes:
the acquiring module 501 is configured to acquire material information uploaded to a material system by a user terminal, where the material information at least includes material pictures and material text information;
the identification module 502 is configured to perform text identification on the material picture through a preset optical character identification technology to obtain picture identification text information;
a word segmentation module 503, configured to perform word segmentation on the image recognition text information and the material text information by using a word segmentation tool, so as to obtain text word segmentation;
the filtering module 504 is configured to perform sensitive word filtering on the text segmentation words according to a preset sensitive word lexicon to obtain a filtering result, where the filtering result includes filtering success or filtering failure;
a marking module 505, configured to mark, when the filtering result is a filtering failure, text segments identified as sensitive words in the picture identification text information and the material text information, and return the marked picture identification text information and the material text information to the user terminal;
a screening module 506, configured to, when the filtering result is that filtering is successful, screen a big data tag from a big data tag library as a candidate tag corresponding to the text segmentation word through a feature word and an exclusion word in the preset big data tag library;
a counting module 507, configured to count positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate tags in the image recognition text information and the material text information;
a calculating module 508, configured to calculate a weight score corresponding to the candidate tag according to a preset weight formula, the position, and the number of times;
the marking module 509 is configured to use the candidate tag corresponding to the weight score with the largest value among all the weight scores as the tag corresponding to the material;
and the online module 510 is configured to add the tag to the corresponding material information, and send the material information with the tag added to the material system for online material.
In this embodiment, the identifying module 502 includes:
the character set recognition unit 5021 is used for recognizing characters in the material picture through a preset optical character recognition technology to obtain a character set corresponding to the material picture;
a character correction unit 5022, configured to convert each character in the character set into a corresponding character vector, and input the character vector into a preset semantic correction model to obtain a predicted character corresponding to each character;
and the character summarizing unit 5023 is used for summarizing the predicted characters corresponding to all the characters in sequence to obtain the picture identification text information of the material picture.
In this embodiment, the character correcting unit 5022 is specifically configured to: converting each character into a corresponding character vector, and inputting the character vector into the semantic correction model; multiplying the character vector corresponding to each character by a predetermined matrix WQ, matrix WK and matrix WV respectively through the semantic correction model to obtain a query character vector, a key character vector and a value character vector corresponding to the character; and executing coding and decoding operations on the query character vector, the key character vector and the value character vector corresponding to the characters to obtain a predicted character corresponding to each character.
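The projection of a character vector into query, key, and value vectors is an ordinary vector-matrix product. Below is a toy sketch with hand-picked 2x2 matrices; in a real semantic correction model the matrices WQ, WK, and WV are learned parameters, not fixed values.

```python
def matmul(vec, mat):
    """Multiply a row vector by a matrix given as a list of rows."""
    cols = len(mat[0])
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(cols)]

# Toy 2x2 projection matrices for illustration only; real models learn
# WQ, WK and WV during training.
WQ = [[1, 0], [0, 1]]
WK = [[0, 1], [1, 0]]
WV = [[2, 0], [0, 2]]

def project(char_vec):
    """Return the query, key and value vectors for one character vector."""
    return matmul(char_vec, WQ), matmul(char_vec, WK), matmul(char_vec, WV)
```

The encoding and decoding operations mentioned in the unit then combine these query/key/value vectors across all characters to predict each corrected character.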
In this embodiment, the word segmentation module 503 is configured to: call a preset word segmentation tool, and divide the image recognition text information and the material text information into N reference words of preset character lengths through a dictionary of the word segmentation tool, where N is a natural number not less than 1; acquire the reference participles divided by the word segmentation tool, search a corresponding preset dictionary according to the character length of each reference participle, and judge whether the reference participle exists in the preset dictionary; and when a reference participle does not exist in the preset dictionary, screen the non-stored reference participles through a fuzzy matching algorithm to obtain the text participles.
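A rough sketch of this dictionary-lookup-plus-fuzzy-screening flow follows. The patent does not name the fuzzy matching algorithm, so the use of Python's standard difflib here is an assumption, as are the dictionary contents and the function name.

```python
import difflib

# Hypothetical preset dictionaries keyed by character length, matching the
# module's "search a corresponding preset dictionary according to length".
DICTIONARIES = {4: {"data", "text"}, 5: {"label"}}

def screen(reference_words):
    """Keep words found in the length-matched dictionary; screen the rest
    through a fuzzy match and keep sufficiently close corrections."""
    result = []
    for w in reference_words:
        dictionary = DICTIONARIES.get(len(w), set())
        if w in dictionary:
            result.append(w)
        else:
            close = difflib.get_close_matches(w, dictionary, n=1, cutoff=0.7)
            if close:
                result.append(close[0])
    return result
```

Words with no dictionary entry and no close match are dropped, so OCR noise that survives the semantic correction step still gets filtered before tagging.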
In this embodiment, the screening module 506 is specifically configured to: traversing the big data label database based on the feature words and the exclusion words, and matching out big data labels corresponding to the text participles; judging whether the text word segmentation contains an exclusion word corresponding to the big data label;
if yes, determining that the text word segmentation does not have a corresponding candidate label; and if not, taking the big data label as a candidate label of the text word segmentation.
In this embodiment, the material online device further includes a tag clustering module 511, where the tag clustering module 511 is configured to: obtaining a history audit text, performing word segmentation processing on the history audit text to obtain a history text word segmentation, and converting the history text word segmentation into a corresponding word vector; randomly selecting N historical text participles from the word vector as an initial clustering center, wherein N is a positive integer; respectively calculating cosine distances from all historical text participles to each initial clustering center; dividing the history text participles which are closest to the cosine distance of the initial clustering center into the same label category; calculating a mean vector of word vectors in the label category, reselecting a current clustering center according to the mean vector, and calculating the cosine distance between the current clustering center and a corresponding initial clustering center; if the cosine distance between the current clustering center and the corresponding initial clustering center is smaller than or equal to a preset threshold value, obtaining N label categories corresponding to all historical text participles and clustering centers corresponding to the label categories; if the cosine distance between the current clustering center and the corresponding initial clustering center is larger than a preset threshold, re-clustering is carried out by using the current clustering center until the cosine distance between the current clustering center and the last clustering center is smaller than or equal to the preset threshold, and N label categories corresponding to all historical text participles and clustering centers corresponding to the label categories are obtained; taking the clustering center as a big data label, and taking historical text participles in the corresponding label category as feature words corresponding to the big data 
label; and obtaining a big data label database according to the big data labels and the feature words.
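The clustering procedure described above is essentially k-means with cosine distance. Below is a compact pure-Python sketch under stated assumptions: word vectors are non-zero, and the init parameter for fixing the initial centers is added only for reproducibility and is not part of the disclosure.

```python
import math
import random

def cos_dist(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def mean_vec(vecs):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def cluster(word_vectors, n, threshold=1e-4, init=None):
    """K-means with cosine distance, following the steps above: pick initial
    centers, assign each vector to its nearest center, recompute each center
    as the cluster mean, and stop once no center moves more than threshold."""
    centers = init if init is not None else random.sample(word_vectors, n)
    while True:
        groups = [[] for _ in range(n)]
        for v in word_vectors:
            nearest = min(range(n), key=lambda k: cos_dist(v, centers[k]))
            groups[nearest].append(v)
        new_centers = [mean_vec(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if all(cos_dist(c, nc) <= threshold
               for c, nc in zip(centers, new_centers)):
            return groups, new_centers
        centers = new_centers
```

Each returned group is one label category; its center (or the participle closest to it) then serves as the big data label, with the group's historical text participles as that label's feature words.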
In this embodiment, the online module 510 is specifically configured to: adding the label into corresponding material information, and uploading the material information added with the label to the material system; submitting the material information added with the label to a preset office automation system for process signing through the material system; and when the office automation system finishes signing the material information added with the label, releasing the material information added with the label on line.
On the basis of the previous embodiment, this embodiment describes the specific functions of each module and the unit composition of some modules in detail, and trains and updates the big data label library with historical materials through the newly added module. After clustering, each cluster center can serve as the standard of a big data label category, i.e., as a big data label. This is because clustering divides the sample set into clusters according to the distances between samples, so that points within a cluster are as close together as possible, and the point closest to the mean of the points in a cluster is the cluster center; the feature word corresponding to the cluster center is therefore semantically closest to the other feature words in the same index category, so it can serve as the standard of that index category and as the big data label, and the mapping relationship between feature words and big data labels is established. This improves the relevance and accuracy of the labels with respect to the materials.
Figs. 5 and 6 describe the material online device in the embodiment of the present invention in detail from the perspective of modular functional entities; the material online device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 7 is a schematic structural diagram of a material online device 700 according to an embodiment of the present invention. The material online device 700 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing applications 733 or data 732. The memory 720 and the storage medium 730 may be transient storage or persistent storage. The program stored on the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations on the material online device 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and execute the series of instruction operations in the storage medium 730 on the material online device 700 to implement the steps of the material online method.
The material online device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input-output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will appreciate that the material online device configuration illustrated in fig. 7 does not constitute a limitation of the material online device provided herein, which may include more or fewer components than shown, combine some components, or arrange the components differently.
The present invention also provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium, having instructions stored therein which, when executed on a computer, cause the computer to perform the steps of the material online method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses, and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A material online method is characterized by comprising the following steps:
acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises material pictures and material text information;
performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information;
performing word segmentation operation on the image identification text information and the material text information by using a word segmentation tool to obtain text word segmentation;
performing sensitive word filtering on the text participles according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
if the filtering result is failure, marking the text participles which are identified as sensitive words in the picture identification text information and the material text information, and returning the marked picture identification text information and the material text information to the user terminal;
if the filtering result is successful, screening a big data label from a big data label library through a preset feature word and an exclusion word in the big data label library as a candidate label corresponding to the text word segmentation;
counting the positions and times of the feature words and the excluding words corresponding to the candidate labels in the image recognition text information and the material text information;
calculating a weight score corresponding to the candidate label according to a preset weight formula, the position and the times;
taking the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
and adding the label into the corresponding material information, and sending the material information added with the label to the material system for material online.
2. The material online method according to claim 1, wherein the text recognition of the material picture by a preset optical character recognition technology to obtain picture recognition text information comprises:
recognizing characters in the material picture through a preset optical character recognition technology to obtain a character set corresponding to the material picture;
converting each character in the character set into a corresponding character vector, and inputting the character vector into a preset semantic correction model to obtain a predicted character corresponding to each character;
and summarizing the predicted characters corresponding to all the characters in sequence to obtain the picture identification text information of the material picture.
3. The material online method according to claim 2, wherein the converting each character in the character set into a corresponding character vector and inputting the character vector into a preset semantic correction model to obtain a predicted character corresponding to each character comprises:
converting each of the characters into a corresponding character vector;
inputting the character vectors into the semantic correction model, and multiplying the character vectors corresponding to the characters by a predetermined matrix WQ, a predetermined matrix WK and a predetermined matrix WV through the semantic correction model to obtain query character vectors, key character vectors and value character vectors corresponding to the characters;
and executing coding and decoding operations on the query character vector, the key character vector and the value character vector corresponding to the characters to obtain a predicted character corresponding to each character.
4. The material online method according to claim 1, wherein the performing a word segmentation operation on the image recognition text information and the material text information by using a word segmentation tool to obtain a text word comprises:
calling a preset word segmentation tool, and dividing the image recognition text information and the material text information into N reference words with preset character lengths through a dictionary of the word segmentation tool, wherein N is a natural number not less than 1;
acquiring reference participles divided by the participle dividing tool, searching a corresponding preset dictionary according to the character length of the reference participle, and judging whether the reference participle exists in the preset dictionary or not;
and when the reference participles are not stored in the preset dictionary, screening the reference participles which are not stored by a fuzzy matching algorithm to obtain the text participles.
5. The material online method according to claim 1, wherein the screening of the big data tag from the big data tag library as the candidate tag corresponding to the text segmentation through the feature words and the exclusion words in the preset big data tag library comprises:
traversing the big data label database based on the feature words and the exclusion words, and matching out big data labels corresponding to the text participles;
judging whether the text word segmentation contains an excluded word corresponding to the big data label;
if yes, determining that the text word segmentation does not have a corresponding candidate label;
and if not, taking the big data label as a candidate label of the text word segmentation.
6. The material online method according to claim 1, before the obtaining the material information uploaded to the material system by the user terminal, further comprising:
obtaining a history audit text, performing word segmentation processing on the history audit text to obtain a history text word segmentation, and converting the history text word segmentation into a corresponding word vector;
randomly selecting N historical text participles from the word vector as an initial clustering center, wherein N is a positive integer;
respectively calculating cosine distances from all historical text participles to each initial clustering center;
dividing the history text participles which are closest to the cosine distance of the initial clustering center into the same label category;
calculating a mean vector of word vectors in the label category, reselecting a current clustering center according to the mean vector, and calculating the cosine distance between the current clustering center and a corresponding initial clustering center;
if the cosine distance between the current clustering center and the corresponding initial clustering center is smaller than or equal to a preset threshold value, obtaining N label categories corresponding to all historical text participles and clustering centers corresponding to the label categories;
if the cosine distance between the current clustering center and the corresponding initial clustering center is larger than a preset threshold, re-clustering is carried out by using the current clustering center until the cosine distance between the current clustering center and the last clustering center is smaller than or equal to the preset threshold, and N label categories corresponding to all historical text participles and clustering centers corresponding to the label categories are obtained;
taking the clustering center as a big data label, and taking historical text participles in the corresponding label category as feature words corresponding to the big data label;
and obtaining a big data label database according to the big data labels and the feature words.
7. The material online method according to claim 1, wherein the adding the tag to the corresponding material information and sending the tagged material information to the material system for material online comprises:
adding the label into corresponding material information, and uploading the material information added with the label to the material system;
submitting the material information added with the label to a preset office automation system for process signing through the material system;
and when the office automation system finishes signing the material information added with the label, releasing the material information added with the label on line.
8. A material online device, characterized in that the material online device comprises:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring material information uploaded to a material system by a user terminal, and the material information at least comprises material pictures and material text information;
the identification module is used for carrying out text identification on the material picture through a preset optical character identification technology to obtain picture identification text information;
the word segmentation module is used for performing word segmentation operation on the image identification text information and the material text information by using a word segmentation tool to obtain text word segmentation;
the filtering module is used for filtering the sensitive words of the text participles according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
the marking module is used for marking the text participles which are identified as sensitive words in the picture identification text information and the material text information when the filtering result is filtering failure, and returning the marked picture identification text information and the material text information to the user terminal;
the screening module is used for screening the big data labels from the big data label library as candidate labels corresponding to the text word segmentation through the preset feature words and the exclusion words in the big data label library when the filtering result is that the filtering is successful;
the counting module is used for counting the positions and times of the feature words and the exclusion words corresponding to the candidate labels in the image identification text information and the material text information;
the calculating module is used for calculating a weight score corresponding to the candidate label according to a preset weight formula, the position and the times;
the marking module is used for taking the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
and the online module is used for adding the label into the corresponding material information and sending the material information added with the label to the material system for material online.
9. A material online apparatus, characterized in that the material online apparatus comprises: a memory having instructions stored therein and at least one processor, the memory and the at least one processor being interconnected by a line;
the at least one processor invoking the instructions in the memory to cause the material online apparatus to perform the steps of the material online method of any one of claims 1-7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the material online method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111273705.7A CN114003750B (en) | 2021-10-29 | 2021-10-29 | Material online method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111273705.7A CN114003750B (en) | 2021-10-29 | 2021-10-29 | Material online method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114003750A true CN114003750A (en) | 2022-02-01 |
CN114003750B CN114003750B (en) | 2024-03-26 |
Family
ID=79925245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111273705.7A Active CN114003750B (en) | 2021-10-29 | 2021-10-29 | Material online method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114003750B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649818A (en) * | 2016-12-29 | 2017-05-10 | 北京奇虎科技有限公司 | Recognition method and device for application search intentions and application search method and server |
CN106709040A (en) * | 2016-12-29 | 2017-05-24 | 北京奇虎科技有限公司 | Application search method and server |
CN108334489A (en) * | 2017-01-19 | 2018-07-27 | 百度在线网络技术(北京)有限公司 | Text core word recognition method and device |
CN112148852A (en) * | 2020-09-10 | 2020-12-29 | 深圳康佳电子科技有限公司 | Intelligent customer service method and device, storage medium and computer equipment |
CN113076748A (en) * | 2021-04-16 | 2021-07-06 | 平安国际智慧城市科技股份有限公司 | Method, device and equipment for processing bullet screen sensitive words and storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115994772A (en) * | 2023-02-22 | 2023-04-21 | 中信联合云科技有限责任公司 | Book data processing method and system, book rapid goods laying method and electronic equipment |
CN115994772B (en) * | 2023-02-22 | 2024-03-08 | 中信联合云科技有限责任公司 | Book data processing method and system, book rapid goods laying method and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN114003750B (en) | 2024-03-26 |
Similar Documents
Publication | Title |
---|---|
US9183226B2 (en) | Image classification |
US8229883B2 (en) | Graph based re-composition of document fragments for name entity recognition under exploitation of enterprise databases |
CN102542014B (en) | Image searching feedback method based on contents |
Wilkinson et al. | Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections |
RU2613846C2 (en) | Method and system for extracting data from images of semistructured documents |
CN111563384B (en) | Evaluation object identification method and device for E-commerce products and storage medium |
CN109446885B (en) | Text-based component identification method, system, device and storage medium |
CN112818093B (en) | Evidence document retrieval method, system and storage medium based on semantic matching |
CN112434691A (en) | HS code matching and displaying method and system based on intelligent analysis and identification and storage medium |
CN116244410B (en) | Index data analysis method and system based on knowledge graph and natural language |
CN110795942B (en) | Keyword determination method and device based on semantic recognition and storage medium |
CN114003750A (en) | Material online method, device, equipment and storage medium |
CN116737967B (en) | Knowledge graph construction and perfecting system and method based on natural language |
US20190095525A1 (en) | Extraction of expression for natural language processing |
CN112380848A (en) | Text generation method, device, equipment and storage medium |
CN112131341A (en) | Text similarity calculation method and device, electronic equipment and storage medium |
CN111681731A (en) | Method for automatically marking colors of inspection report |
CN114970554B (en) | Document checking method based on natural language processing |
US20230134218A1 (en) | Continuous learning for document processing and analysis |
US20230138491A1 (en) | Continuous learning for document processing and analysis |
CN113822013B (en) | Labeling method and device for text data, computer equipment and storage medium |
CN113408287B (en) | Entity identification method and device, electronic equipment and storage medium |
CN115203445A (en) | Multimedia resource searching method, device, equipment and medium |
CN115203206A (en) | Data content searching method and device, computer equipment and readable storage medium |
CN115272649A (en) | Image recognition, retrieval and entry method and system of molecular structure diagram and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||