CN114003750B - Material online method, device, equipment and storage medium - Google Patents

Info

Publication number
CN114003750B
Authority
CN
China
Prior art keywords
text
character
word
word segmentation
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111273705.7A
Other languages
Chinese (zh)
Other versions
CN114003750A (en)
Inventor
李志宏
刘雪毅
王浩
卢春辉
范宜发
王芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202111273705.7A
Publication of CN114003750A
Application granted
Publication of CN114003750B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance

Abstract

The invention relates to the field of online material management and discloses a material online method, device, equipment and storage medium. The method acquires the material that a user uploads for audit; performs text recognition on the picture using OCR to obtain picture recognition text information; segments the picture recognition text information and the material text information with a word segmentation tool to obtain text word segments; and filters the text word segments for sensitive words against a sensitive word lexicon. If the filtering succeeds, the text word segments are matched against a big data tag library, and the matching tag is added to the audit page of the material, which is released online after submission. By recognizing the text in uploaded pictures with OCR, tagging the text content through the tag library and filtering sensitive words, the method quickly tags both picture and text content and checks the content for legal compliance. This shortens the operation flow so that material can go online quickly.

Description

Material online method, device, equipment and storage medium
Technical Field
The present invention relates to the field of online material management, and in particular to a material online method, device, equipment and storage medium.
Background
With the development of artificial intelligence technology, intelligent online auditing has become a trend, especially in auditing business content before it goes online. In the existing intelligent online auditing mode, the pictures and text materials to be put online are displayed through operation positions such as a material library and are then tagged separately. However, the existing tagging is based only on the literal semantics of the words in the picture or text material. If the material contains several words with different semantics, it receives several tags, and multiple tags can imply multiple auditing standards for the same business. As a result, the final online classification of the business is inaccurate and auditing efficiency is low.
Disclosure of Invention
The invention mainly aims to solve the technical problem of low labeling accuracy in the existing material online process.
The first aspect of the invention provides a material online method, which comprises the following steps: acquiring material information uploaded to a material system by a user terminal, wherein the material information comprises at least a material picture and material text information; performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information; performing a word segmentation operation on the picture recognition text information and the material text information using a word segmentation tool to obtain text word segments; filtering the text word segments for sensitive words according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result is either filtering success or filtering failure; if the filtering result is filtering failure, marking the text word segments identified as sensitive words in the picture recognition text information and the material text information, and returning the marked picture recognition text information and material text information to the user terminal; if the filtering result is filtering success, screening big data tags from a preset big data tag library, using the feature words and exclusion words in the library, as candidate tags corresponding to the text word segments; counting the positions and number of occurrences of the feature words and exclusion words corresponding to the candidate tags in the picture recognition text information and the material text information; calculating a weight score for each candidate tag according to a preset weight formula, the positions and the number of occurrences; taking the candidate tag with the largest weight score as the tag corresponding to the material; and adding the tag to the corresponding material information, and sending the tagged material information to the material system so that the material goes online.
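The steps from sensitive-word filtering to tag selection can be sketched as follows. This is a minimal illustration, not the patented formula: the text only says the lexicon and the weight formula are "preset", so the lexicon contents, the per-position weights, the exclusion penalty and all function names below are assumptions made for the example.

```python
from collections import Counter

# Hypothetical presets; the source does not disclose the real lexicon or formula.
SENSITIVE_WORDS = {"guaranteed", "risk-free"}
POSITION_WEIGHTS = {"text": 2.0, "picture": 1.0}  # assumed: material text outweighs OCR text
EXCLUSION_PENALTY = 1.5  # assumed penalty factor per exclusion-word occurrence


def filter_sensitive(tokens):
    """Return (passed, hits); passed is False when any token is a sensitive word."""
    hits = [t for t in tokens if t in SENSITIVE_WORDS]
    return (not hits, hits)


def weight_score(tag, tokens_by_position):
    """Score one candidate tag from where and how often its words occur."""
    score = 0.0
    for position, tokens in tokens_by_position.items():
        counts = Counter(tokens)
        weight = POSITION_WEIGHTS[position]
        score += weight * sum(counts[w] for w in tag["features"])
        score -= weight * EXCLUSION_PENALTY * sum(counts[w] for w in tag["exclusions"])
    return score


def pick_tag(candidates, tokens_by_position):
    """Choose the candidate tag with the largest weight score."""
    return max(candidates, key=lambda name: weight_score(candidates[name], tokens_by_position))


tokens_by_position = {
    "picture": ["credit", "card", "cashback"],
    "text": ["credit", "card", "annual", "fee", "waiver"],
}
candidates = {
    "credit_card": {"features": {"credit", "card"}, "exclusions": {"loan"}},
    "promotion": {"features": {"cashback", "waiver"}, "exclusions": set()},
}

all_tokens = [t for tokens in tokens_by_position.values() for t in tokens]
passed, hits = filter_sensitive(all_tokens)
# On failure the hits would be marked and returned to the user terminal instead.
tag = pick_tag(candidates, tokens_by_position) if passed else None
print(passed, hits, tag)
```

With these toy inputs no sensitive word is found, and the `credit_card` candidate scores 6.0 against 3.0 for `promotion`, so it is selected as the tag.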
Optionally, in a first implementation of the first aspect of the present invention, performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information comprises: recognizing the characters in the material picture through the preset optical character recognition technology to obtain a character set corresponding to the material picture; converting each character in the character set into a corresponding character vector, and inputting the character vectors into a preset semantic correction model to obtain a predicted character for each character; and concatenating the predicted characters of all characters in order to obtain the picture recognition text information of the material picture.
Optionally, in a second implementation of the first aspect of the present invention, converting each character in the character set into a corresponding character vector and inputting the character vectors into a preset semantic correction model to obtain a predicted character for each character comprises: converting each character into a corresponding character vector, and inputting the character vector into the semantic correction model; multiplying, through the semantic correction model, the character vector of each character by a predetermined matrix WQ, a predetermined matrix WK and a predetermined matrix WV to obtain a query character vector, a key character vector and a value character vector for the character; and performing encoding and decoding operations on the query, key and value character vectors of the characters to obtain the predicted character for each character.
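The projection step in this implementation, multiplying each character vector by WQ, WK and WV, can be illustrated with a small pure-Python sketch. The 2-dimensional vectors and the matrix values are toy assumptions, and the subsequent encoding and decoding operations are not shown because the source does not specify them.

```python
def matvec(vec, matrix):
    """Multiply a row vector of length d by a d x k matrix."""
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(len(matrix[0]))]


# Toy 2 x 2 stand-ins for the predetermined matrices (values are assumptions).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 2.0]]


def project(char_vec):
    """Return the query, key and value vectors for one character vector."""
    return matvec(char_vec, W_Q), matvec(char_vec, W_K), matvec(char_vec, W_V)


q, k, v = project([1.0, 3.0])
print(q, k, v)  # [1.0, 3.0] [3.0, 1.0] [2.0, 6.0]
```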
Optionally, in a third implementation of the first aspect of the present invention, performing the word segmentation operation on the picture recognition text information and the material text information using a word segmentation tool to obtain text word segments comprises: calling a preset word segmentation tool and, through the dictionary of the word segmentation tool, dividing the picture recognition text information and the material text information into N reference word segments of preset character lengths, wherein N is a natural number not smaller than 1; obtaining the reference word segments produced by the word segmentation tool, looking up the preset dictionary corresponding to the character length of each reference word segment, and judging whether the reference word segment exists in that preset dictionary; and when a reference word segment does not exist in the preset dictionary, screening it with a fuzzy matching algorithm to obtain the text word segments.
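The dictionary lookup and fuzzy-matching fallback can be sketched as below. The sketch assumes the word segmentation tool has already produced the reference word segments, uses English tokens for readability, and substitutes Python's standard `difflib.get_close_matches` for the unspecified fuzzy matching algorithm; the dictionaries and the 0.75 cutoff are hypothetical.

```python
import difflib

# Hypothetical preset dictionaries, keyed by the character length of their entries.
DICTS_BY_LENGTH = {
    4: {"card", "loan"},
    6: {"credit"},
    8: {"cashback"},
}


def resolve(reference_segments, cutoff=0.75):
    """Keep segments found in the length-specific dictionary; fuzzy-match the rest."""
    vocabulary = sorted(w for words in DICTS_BY_LENGTH.values() for w in words)
    resolved = []
    for segment in reference_segments:
        if segment in DICTS_BY_LENGTH.get(len(segment), set()):
            resolved.append(segment)  # exact hit in the dictionary for this length
            continue
        close = difflib.get_close_matches(segment, vocabulary, n=1, cutoff=cutoff)
        if close:
            resolved.append(close[0])  # nearest dictionary word above the cutoff
        # Segments with no close match are dropped in this sketch.
    return resolved


print(resolve(["credit", "cardd", "cashbak", "zzz"]))
```

Here `cardd` and `cashbak` stand in for OCR or typing errors that are mapped back to dictionary words, while `zzz` has no close match and is discarded.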
Optionally, in a fourth implementation of the first aspect of the present invention, screening big data tags from the big data tag library as the candidate tags corresponding to the text word segments comprises: traversing the big data tag library based on the feature words and the exclusion words, and matching the big data tags corresponding to the text word segments; judging whether the text word segments contain an exclusion word corresponding to the big data tag; if so, determining that the text word segments have no corresponding candidate tag; and if not, taking the big data tag as a candidate tag of the text word segments.
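A minimal sketch of this screening rule follows: a tag becomes a candidate when at least one of its feature words occurs in the text word segments and none of its exclusion words does. The tag library contents and the dictionary layout are assumptions for illustration only.

```python
# Hypothetical tag library: each big data tag has feature words and exclusion words.
TAG_LIBRARY = {
    "credit_card": {"features": {"credit", "card"}, "exclusions": {"debit"}},
    "wealth": {"features": {"fund", "yield"}, "exclusions": {"loan"}},
    "loan": {"features": {"loan", "installment"}, "exclusions": set()},
}


def candidate_tags(segments, library=TAG_LIBRARY):
    """Return the tags matched by a feature word and not vetoed by an exclusion word."""
    segment_set = set(segments)
    candidates = []
    for name, tag in library.items():
        if not segment_set & tag["features"]:
            continue  # no feature word of this tag appears in the text
        if segment_set & tag["exclusions"]:
            continue  # an exclusion word appears, so the tag is rejected
        candidates.append(name)
    return candidates


print(candidate_tags(["credit", "card", "fund", "loan"]))  # ['credit_card', 'loan']
```

In the example, `wealth` matches the feature word `fund` but is rejected because its exclusion word `loan` also appears.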
Optionally, in a fifth implementation of the first aspect of the present invention, before acquiring the material information uploaded to the material system by the user terminal, the method further includes:
acquiring historical audit texts, performing word segmentation on the historical audit texts to obtain historical text word segments, and converting the historical text word segments into corresponding word vectors; randomly selecting N historical text word segments from the word vectors as initial clustering centers, wherein N is a positive integer; calculating the cosine distance from every historical text word segment to each initial clustering center; assigning each historical text word segment to the tag category of the initial clustering center to which its cosine distance is smallest; calculating the mean vector of the word vectors in each tag category, taking the mean vector as the current clustering center, and calculating the cosine distance between the current clustering center and the corresponding initial clustering center; if the cosine distance between the current clustering center and the corresponding initial clustering center is less than or equal to a preset threshold, obtaining the N tag categories corresponding to all historical text word segments and the clustering center of each tag category; if the cosine distance between the current clustering center and the corresponding initial clustering center is greater than the preset threshold, re-clustering with the current clustering centers until the cosine distance between the current clustering center and the previous clustering center is less than or equal to the preset threshold, and obtaining the N tag categories corresponding to all historical text word segments and the clustering center of each tag category; taking each clustering center as a big data tag, and taking the historical text word segments in the corresponding tag category as the feature words of that big data tag; and obtaining the big data tag library from the big data tags and the feature words.
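The clustering procedure described here resembles k-means with cosine distance, and a pure-Python sketch under that reading is given below. The 2-dimensional word vectors, the threshold value, and the `initial`, `seed` and `max_iter` parameters are assumptions added so the example is small and repeatable; by default the initial centers are still chosen at random as in the text.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster(word_vectors, n, threshold=1e-6, initial=None, seed=0, max_iter=100):
    """Group words into n tag categories; returns (groups, centers)."""
    words = list(word_vectors)
    if initial is None:
        initial = random.Random(seed).sample(words, n)  # random initial centers
    centers = [list(word_vectors[w]) for w in initial]
    groups = []
    for _ in range(max_iter):
        # Assign every word to the nearest center by cosine distance.
        groups = [[] for _ in range(n)]
        for w in words:
            nearest = min(range(n), key=lambda i: cosine_distance(word_vectors[w], centers[i]))
            groups[nearest].append(w)
        # Recompute each center as the mean vector of its category.
        new_centers = [
            mean_vector([word_vectors[w] for w in group]) if group else centers[i]
            for i, group in enumerate(groups)
        ]
        shift = max(cosine_distance(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= threshold:  # centers stopped moving: clustering has converged
            break
    return groups, centers


vectors = {
    "credit": [1.0, 0.1],
    "card": [0.9, 0.2],
    "fund": [0.1, 1.0],
    "yield": [0.2, 0.9],
}
# Fixed initial centers keep the demo repeatable; omit `initial` to pick them randomly.
groups, centers = cluster(vectors, 2, initial=["credit", "fund"])
# Each center acts as a big data tag; its category members are the feature words.
tag_library = [{"tag": center, "features": group} for center, group in zip(centers, groups)]
print(tag_library)
```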
Optionally, in a sixth implementation of the first aspect of the present invention, adding the tag to the corresponding material information and sending the tagged material information to the material system so that the material goes online comprises: adding the tag to the corresponding material information, and uploading the tagged material information to the material system; submitting the tagged material information through the material system to a preset office automation system for workflow sign-off; and when the office automation system completes the sign-off of the tagged material information, releasing the tagged material information online.
The second aspect of the present invention provides a material online device, comprising: an acquisition module, configured to acquire material information uploaded to a material system by a user terminal, wherein the material information comprises at least a material picture and material text information; a recognition module, configured to perform text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information; a word segmentation module, configured to perform a word segmentation operation on the picture recognition text information and the material text information using a word segmentation tool to obtain text word segments; a filtering module, configured to filter the text word segments for sensitive words according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result is either filtering success or filtering failure; a marking module, configured to mark, when the filtering result is filtering failure, the text word segments identified as sensitive words in the picture recognition text information and the material text information, and to return the marked picture recognition text information and material text information to the user terminal; a screening module, configured to screen, when the filtering result is filtering success, big data tags from a preset big data tag library, using the feature words and exclusion words in the library, as candidate tags corresponding to the text word segments; a statistics module, configured to count the positions and number of occurrences of the feature words and exclusion words corresponding to the candidate tags in the picture recognition text information and the material text information; a calculation module, configured to calculate a weight score for each candidate tag according to a preset weight formula, the positions and the number of occurrences; a tagging module, configured to take the candidate tag with the largest weight score as the tag corresponding to the material; and an online module, configured to add the tag to the corresponding material information and send the tagged material information to the material system so that the material goes online.
Optionally, in a first implementation of the second aspect of the present invention, the recognition module specifically comprises: a character set recognition unit, configured to recognize the characters in the material picture through a preset optical character recognition technology to obtain a character set corresponding to the material picture; a character correction unit, configured to convert each character in the character set into a corresponding character vector and input the character vectors into a preset semantic correction model to obtain a predicted character for each character; and a character summarizing unit, configured to concatenate the predicted characters of all characters in order to obtain the picture recognition text information of the material picture.
Optionally, in a second implementation of the second aspect of the present invention, the character correction unit is specifically configured to: convert each character into a corresponding character vector, and input the character vector into the semantic correction model; multiply, through the semantic correction model, the character vector of each character by a predetermined matrix WQ, a predetermined matrix WK and a predetermined matrix WV to obtain a query character vector, a key character vector and a value character vector for the character; and perform encoding and decoding operations on the query, key and value character vectors of the characters to obtain the predicted character for each character.
Optionally, in a third implementation of the second aspect of the present invention, the word segmentation module is specifically configured to: call a preset word segmentation tool and, through the dictionary of the word segmentation tool, divide the picture recognition text information and the material text information into N reference word segments of preset character lengths, wherein N is a natural number not smaller than 1; obtain the reference word segments produced by the word segmentation tool, look up the preset dictionary corresponding to the character length of each reference word segment, and judge whether the reference word segment exists in that preset dictionary; and when a reference word segment does not exist in the preset dictionary, screen it with a fuzzy matching algorithm to obtain the text word segments.
Optionally, in a fourth implementation of the second aspect of the present invention, the screening module is specifically configured to: traverse the big data tag library based on the feature words and the exclusion words, and match the big data tags corresponding to the text word segments; judge whether the text word segments contain an exclusion word corresponding to the big data tag; if so, determine that the text word segments have no corresponding candidate tag; and if not, take the big data tag as a candidate tag of the text word segments.
Optionally, in a fifth implementation of the second aspect of the present invention, the material online device further includes a tag clustering module, which is configured to: acquire historical audit texts, perform word segmentation on the historical audit texts to obtain historical text word segments, and convert the historical text word segments into corresponding word vectors; randomly select N historical text word segments from the word vectors as initial clustering centers, wherein N is a positive integer; calculate the cosine distance from every historical text word segment to each initial clustering center; assign each historical text word segment to the tag category of the initial clustering center to which its cosine distance is smallest; calculate the mean vector of the word vectors in each tag category, take the mean vector as the current clustering center, and calculate the cosine distance between the current clustering center and the corresponding initial clustering center; if the cosine distance between the current clustering center and the corresponding initial clustering center is less than or equal to a preset threshold, obtain the N tag categories corresponding to all historical text word segments and the clustering center of each tag category; if the cosine distance between the current clustering center and the corresponding initial clustering center is greater than the preset threshold, re-cluster with the current clustering centers until the cosine distance between the current clustering center and the previous clustering center is less than or equal to the preset threshold, and obtain the N tag categories corresponding to all historical text word segments and the clustering center of each tag category; take each clustering center as a big data tag, and take the historical text word segments in the corresponding tag category as the feature words of that big data tag; and obtain the big data tag library from the big data tags and the feature words.
Optionally, in a sixth implementation of the second aspect of the present invention, the online module is specifically configured to: add the tag to the corresponding material information, and upload the tagged material information to the material system; submit the tagged material information through the material system to a preset office automation system for workflow sign-off; and when the office automation system completes the sign-off of the tagged material information, release the tagged material information online.
A third aspect of the present invention provides material online equipment comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the instructions in the memory to cause the material online equipment to perform the steps of the material online method described above.
A fourth aspect of the invention provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the steps of the material online method described above.
According to the technical scheme provided by the invention, material information uploaded to a material system by a user terminal is acquired, wherein the material information comprises at least a material picture and material text information; text recognition is performed on the material picture through a preset optical character recognition technology to obtain picture recognition text information; a word segmentation operation is performed on the picture recognition text information and the material text information using a word segmentation tool to obtain text word segments; the text word segments are filtered for sensitive words according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result is either filtering success or filtering failure; if the filtering result is filtering failure, the text word segments identified as sensitive words in the picture recognition text information and the material text information are marked, and the marked picture recognition text information and material text information are returned to the user terminal; if the filtering result is filtering success, big data tags are screened from a preset big data tag library, using the feature words and exclusion words in the library, as candidate tags corresponding to the text word segments; the positions and number of occurrences of the feature words and exclusion words corresponding to the candidate tags in the picture recognition text information and the material text information are counted; a weight score is calculated for each candidate tag according to a preset weight formula, the positions and the number of occurrences; the candidate tag with the largest weight score is taken as the tag corresponding to the material; and the tag is added to the corresponding material information, and the tagged material information is sent to the material system so that the material goes online. The method uses optical character recognition to extract text from the uploaded picture content, tags the text content through a tag library and filters sensitive words, so that pictures and text can be tagged quickly and the content checked for legal compliance. This shortens the operation flow so that material can go online quickly.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of the material online method in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a second embodiment of the material online method in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a third embodiment of the material online method in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a fourth embodiment of the material online method in an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of the material online device in an embodiment of the present invention;
FIG. 6 is a schematic diagram of another embodiment of the material online device in an embodiment of the present invention;
FIG. 7 is a schematic diagram of an embodiment of the material online equipment in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a material online method, device, equipment and storage medium, wherein the method comprises the following steps: acquiring material information uploaded to a material system by a user terminal, wherein the material information comprises at least a material picture and material text information; performing text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information; performing a word segmentation operation on the picture recognition text information and the material text information using a word segmentation tool to obtain text word segments; filtering the text word segments for sensitive words according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result is either filtering success or filtering failure; if the filtering result is filtering failure, marking the text word segments identified as sensitive words in the picture recognition text information and the material text information, and returning the marked picture recognition text information and material text information to the user terminal; if the filtering result is filtering success, screening big data tags from a preset big data tag library, using the feature words and exclusion words in the library, as candidate tags corresponding to the text word segments; counting the positions and number of occurrences of the feature words and exclusion words corresponding to the candidate tags in the picture recognition text information and the material text information; calculating a weight score for each candidate tag according to a preset weight formula, the positions and the number of occurrences; taking the candidate tag with the largest weight score as the tag corresponding to the material; and adding the tag to the corresponding material information, and sending the tagged material information to the material system so that the material goes online. The method uses optical character recognition to extract text from the uploaded picture content, tags the text content through a tag library and filters sensitive words, so that pictures and text can be tagged quickly and the content checked for legal compliance. This shortens the operation flow so that material can go online quickly.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, the following describes a specific flow of an embodiment of the present invention. Referring to fig. 1, a first embodiment of the material online method in an embodiment of the present invention includes:
Step 101, acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises material pictures and material text information;
In this embodiment, the material is advertisement material, whose basic forms are text, picture and Flash material data. To ensure that the content of the advertisement material is legal and that the material can be accessed normally after going online, the material needs to be audited. When uploading material to the audit page, the user can upload the material picture and the material text information separately. An office suite interface can also be connected in the system background, so that the user can upload a whole document in PPT or DOC format at one time, and the system parses the whole document online to obtain the material picture and the material text information it contains.
Step 102, carrying out text recognition on a material picture through a preset optical character recognition technology to obtain picture recognition text information;
In this embodiment, because material pictures differ in size and style, the area where the text information is located differs from picture to picture. The text area in the material picture therefore needs to be determined by an image detection algorithm first, and the characters in that text area are then recognized by the optical character recognition technology.
In this step, the image detection algorithm locates, in a given picture containing text of a non-specific font, the area where that text is located, so as to obtain the non-specific-font text area. Optionally, the non-specific-font text area is a rectangular box. For example, the image detection algorithm box-selects the text in an original image containing non-specific-font text, and the areas of the picture that contain no text (such as blank background) are not considered. The image detection algorithm can be the dlib image detection algorithm, the opencv library image detection algorithm, or an image detection algorithm based on deep learning.
Preferably, in this embodiment, an image detection algorithm based on deep learning is used to detect the original image set and extract the text areas of non-specific-font text. Obtaining the non-specific-font text area allows the text to be processed more efficiently afterwards.
Step 103, word segmentation is carried out on the picture identification text information and the material text information by using a word segmentation tool, so as to obtain text word segmentation;
In this embodiment, the system uses the jieba word segmentation tool to segment words. The tool generates a directed acyclic graph of the sentence by comparing it with a dictionary and then, depending on the selected mode, either cuts the sentence along the path found by searching the graph according to the dictionary, or cuts the sentence directly.
Step 104, filtering the sensitive words of the text word according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
In this embodiment, the text word segments are compared against the sensitive word lexicon by traversal. After all text word segments have been traversed, the filtering fails if any text word segment appears in the sensitive word lexicon, and succeeds if none does.
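The traversal comparison described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `filter_sensitive_words`, the sample tokens and the sample lexicon are all hypothetical, and a set lookup stands in for whatever traversal the real system uses.

```python
def filter_sensitive_words(tokens, sensitive_lexicon):
    """Return (passed, hits).

    passed is True (filtering success) when no token is in the lexicon,
    and False (filtering failure) otherwise; hits lists the offending tokens.
    """
    lexicon = set(sensitive_lexicon)            # constant-time membership test
    hits = [tok for tok in tokens if tok in lexicon]
    return (len(hits) == 0, hits)


# Hypothetical data, for illustration only.
sample_tokens = ["guaranteed", "profit", "gold", "savings"]
sample_lexicon = {"guaranteed", "risk-free"}
passed, hits = filter_sensitive_words(sample_tokens, sample_lexicon)
```

Returning the hit list together with the pass/fail flag keeps the information needed for the marking in the next step.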
Step 105, if the filtering result is that the filtering is failed, marking the text word which is identified as the sensitive word in the picture identification text information and the material text information, and returning the marked picture identification text information and the marked material text information to the user terminal;
In this step, a mapping table needs to be built for the text word segments of the picture recognition text information and the material text information. For each text word segment identified as a sensitive word, its position in the material uploaded by the user is recorded in the table. When the filtering fails, the system uses these positions to mark the sensitive words in the picture recognition text information and the material text information submitted by the user and feeds the result back in real time, so that the user can locate and modify them conveniently.
In this embodiment, when the system detects that a text word segment is a sensitive word, it finds the specific position of that word segment in the material text information or the material picture submitted by the user through the mapping table and highlights it. The user can then edit the material text information or the material picture online. When the user modifies the material picture, the system performs text recognition on the modified picture again in real time, segments the picture recognition text information and the material text information, traverses and compares the updated text word segments against the sensitive word lexicon, and thereby performs a second round of filtering.
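One way to picture the mapping table and the highlighting is the sketch below. It assumes the material text is a plain string and that positions are character offsets; the helper names `build_position_map` and `mark_sensitive` and the `[[ ]]` markers are hypothetical choices made for this example.

```python
def build_position_map(text, tokens):
    """Map each distinct token to the character offsets where it occurs in text."""
    position_map = {}
    for tok in set(tokens):
        offsets = []
        start = text.find(tok)
        while start != -1:
            offsets.append(start)
            start = text.find(tok, start + 1)
        position_map[tok] = offsets
    return position_map


def mark_sensitive(text, position_map, sensitive_hits, left="[[", right="]]"):
    """Wrap every recorded occurrence of a sensitive token in highlight markers."""
    spans = sorted(
        (pos, pos + len(tok))
        for tok in sensitive_hits
        for pos in position_map.get(tok, [])
    )
    pieces, cursor = [], 0
    for begin, end in spans:
        if begin < cursor:                      # skip spans overlapping a marked one
            continue
        pieces.append(text[cursor:begin])
        pieces.append(left + text[begin:end] + right)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

After the user edits the material, the same two functions can simply be run again on the updated text, which mirrors the second round of filtering described above.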
Step 106, if the filtering result is that the filtering is successful, screening the big data tag from the big data tag library as a candidate tag corresponding to the text word through the characteristic word and the exclusion word in the preset big data tag library;
In practical application, to improve how well a tag represents the material and to let users search for related material by tag, the system attaches a weight to each match made during tag matching of the text word segments. Each matched tag thus carries a weight value that represents the degree of matching or correlation between the tag and the material.
In this embodiment, each big data tag in the big data tag library needs to be configured with feature words and exclusion words. The feature words refer to a series of words with the same or similar meaning as the tag, the feature words may be the name of the tag, the sub-category under the tag, or other words related to the tag, etc., and typically the tag itself is one of the feature words, for example, the feature words of the tag "gold" include "gold", "gold price", "buy gold", etc.
Step 107, counting the positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate labels in the picture identification text information and the material text information;
In this embodiment, a mapping set is built over the full text. The mapping set contains the position information and the total number of occurrences of each feature word and its corresponding exclusion words, and it provides the statistics needed by the weight formula in the next step.
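A minimal sketch of such a mapping set is shown below. It assumes the material has already been split into named positions (title, first paragraph and so on), each holding a list of text word segments; the position names and the function `count_by_position` are hypothetical.

```python
from collections import Counter


def count_by_position(sections, words):
    """Count, for each word of interest, how often it occurs in each position.

    sections: {position_name: [tokens]}
    returns:  {word: Counter({position_name: occurrences})}
    """
    stats = {w: Counter() for w in words}
    for position, tokens in sections.items():
        for tok in tokens:
            if tok in stats:
                stats[tok][position] += 1
    return stats


# Hypothetical material, for illustration only.
sections = {
    "title": ["gold", "price"],
    "first": ["buy", "gold", "gold"],
    "last": ["golden week"],
}
stats = count_by_position(sections, ["gold", "golden week"])
total_gold = sum(stats["gold"].values())
```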
Step 108, calculating weight scores corresponding to the candidate labels according to a preset weight formula, positions and times;
In this embodiment, several different calculation modes can be preset by analyzing the relationship between historical materials and tags, and the calculation is carried out with weight distribution ratios for different positions. For example, let each occurrence of a feature word in the title score A, each occurrence in a subtitle score B, each occurrence in the first paragraph score C, each occurrence in the last paragraph score D, each occurrence in the remaining paragraphs score E, and each occurrence of an exclusion word corresponding to the feature word score F. One possible weight formula is then simply: feature word total score = A + B + C + D + E - F. Adjusting the weight scores of the different positions changes the total score accordingly, so the same piece of material information may obtain different tags.
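The position-based formula can be read as "sum the per-occurrence scores of the feature word and subtract the per-occurrence penalty of its exclusion words". The sketch below follows that reading; it is an interpretation, and the numeric values chosen for A to F, the position names and the function `total_score` are assumptions made for illustration.

```python
# Hypothetical per-occurrence scores: A (title), B (subtitle), C (first paragraph),
# D (last paragraph), E (remaining paragraphs).
POSITION_SCORES = {"title": 5.0, "subtitle": 4.0, "first": 3.0, "last": 2.0, "middle": 1.0}
EXCLUSION_PENALTY = 4.0  # hypothetical F, charged per exclusion-word occurrence


def total_score(feature_counts, exclusion_count,
                scores=POSITION_SCORES, penalty=EXCLUSION_PENALTY):
    """feature_counts: {position: occurrences of the feature word at that position}."""
    gained = sum(scores[pos] * n for pos, n in feature_counts.items())
    return gained - penalty * exclusion_count


# One title hit, two first-paragraph hits, one exclusion word: 5 + 2*3 - 4
example_score = total_score({"title": 1, "first": 2}, exclusion_count=1)
```

Changing the entries of `POSITION_SCORES` is the code-level counterpart of adjusting the weight scores of the different positions.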
In this step, to improve how well the big data tag characterizes the material and to provide a more flexible tag service, a weight is attached to each tag of the material. That is, each big data tag the system generates for the material carries a weight value representing the degree of matching or correlation between the tag and the material. A specific implementation is as follows. The positions at which feature words appear in the material differ in importance: feature words in the title are the most important, followed in turn by the abstract, the beginning of the body text, and so on. Thus, the title weight may be set to 3, the abstract weight to 2 and the weight of the first sentence of the body to 1, with the weight of each subsequent sentence decreasing linearly down to 0 for the last sentence (since the last sentence of a resource is usually an extension that may be far from the subject of the whole text). The number of times the feature words appear in the material is then counted, and a weight sum is calculated from the positions and numbers of occurrences. The tag weight is obtained by dividing the weight sum of the hit feature words by the product of the total number of feature words of the tag and the weight sum of the resource, that is, tag weight = weight sum of hit feature words / (number of all feature words of the tag × resource weight sum) × 100. The resulting tag weight ranges from 0 to 100. The higher the tag weight, the higher the correlation between the tag and the material.
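The tag weight calculation can be illustrated with the sketch below. Several details are assumptions, since the text leaves them open: the resource weight sum is taken as the sum of the weights of all segments (title, abstract and every body sentence), a feature word is counted at most once per segment using a simple substring test, and the names `body_sentence_weights` and `tag_weight` are hypothetical.

```python
def body_sentence_weights(n):
    """Weights falling linearly from 1 (first body sentence) to 0 (last one)."""
    if n == 1:
        return [1.0]
    return [1.0 - i / (n - 1) for i in range(n)]


def tag_weight(feature_words, title, abstract, body_sentences,
               title_w=3.0, abstract_w=2.0):
    """tag weight = hit weight sum / (number of feature words * resource weight sum) * 100"""
    if not feature_words:
        return 0.0
    segments = [(title, title_w), (abstract, abstract_w)]
    segments += list(zip(body_sentences, body_sentence_weights(len(body_sentences))))
    resource_sum = sum(w for _, w in segments)
    # Each (feature word, segment) pair contributes the segment weight at most once,
    # which keeps the result inside the 0..100 range.
    hit_sum = sum(w for text, w in segments for fw in feature_words if fw in text)
    return hit_sum / (len(feature_words) * resource_sum) * 100


# Hypothetical material, for illustration only.
example_weight = tag_weight(
    ["gold", "gold price"],
    title="gold price today",
    abstract="a note on savings",
    body_sentences=["buy gold now", "rates vary", "see you next time"],
)
```

In this example the body weights are 1, 0.5 and 0, the resource weight sum is 6.5, and the hit weight sum is 3 + 3 + 1 = 7, so the tag weight is 7 / (2 × 6.5) × 100, roughly 53.8.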
Step 109, using the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
In this embodiment, after the total weight score of each candidate tag has been calculated in step 108, the candidate tag with the largest weight score is determined to be the tag of the material.
Step 110, adding the label into the corresponding material information, and sending the material information added with the label to a material system for material online.
In this embodiment, once the tag has been added on the material audit page and the system audit has passed, the audit filtering step is essentially complete. Depending on the organizational structure of the company, the material must still be reviewed and approved by the leaders required by the process before it can be published online, so as to catch wrongly written characters or policy risks.
According to the technical scheme, the material information uploaded to the material system by the user terminal is obtained, wherein the material information at least comprises material pictures and material text information; carrying out text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information; performing word segmentation operation on the picture identification text information and the material text information by using a word segmentation tool to obtain text word segmentation; according to a preset sensitive word library, performing sensitive word filtering on text word segmentation to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure; if the filtering result is that the filtering is failed, marking the text word which is identified as the sensitive word in the picture identification text information and the material text information, and returning the marked picture identification text information and the marked material text information to the user terminal; if the filtering result is that the filtering is successful, screening the big data tag from the big data tag library as a candidate tag corresponding to the text word segmentation through the characteristic word and the exclusion word in the preset big data tag library; counting positions and times of occurrence of feature words and exclusion words corresponding to the candidate labels in the picture identification text information and the material text information; calculating weight scores corresponding to the candidate labels according to a preset weight formula, positions and times; taking the candidate label corresponding to the weight score with the largest value in the all weight scores as the label corresponding to the material; and adding the label into the corresponding material information, and sending the material information added with the label to a 
material system for material online. The method applies optical character recognition (OCR) to the uploaded picture content to obtain its text, tags the text through the tag library and filters sensitive words, so that pictures and text content can be tagged and the legal compliance of the content verified quickly. This shortens the operation flow and brings materials online faster.
Referring to fig. 2, a second embodiment of the material online method in an embodiment of the present invention includes:
Step 201, acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises material pictures and material text information;
Step 202, carrying out text recognition on a material picture through a preset optical character recognition technology to obtain picture recognition text information;
In this embodiment, the material picture is any picture on which text recognition needs to be performed. The resulting picture recognition text information may be blank or may include at least one character recognized from the material picture; when it includes at least one recognized character, step 203 is triggered. The character recognition model described in the embodiment of the invention is based on connectionist temporal classification (CTC) and mainly comprises two parts: a picture feature extraction network (e.g., CNN, VGG, ResNet) and a text classification network (e.g., RNN, LSTM).
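For readers unfamiliar with CTC, the sketch below shows the usual greedy (best-path) decoding rule applied to the per-frame outputs of such a classification network: take the most likely class per frame, collapse consecutive repeats, and drop the blank symbol. The text does not say which decoding strategy the model uses, so this is an illustration of CTC in general; the alphabet and the probabilities are made up.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Greedy CTC decoding.

    frame_probs: one probability list per time step, indexed by class id.
    alphabet:    class id -> character; alphabet[blank] is the CTC blank.
    """
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:   # collapse repeats, skip blanks
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)


# Hypothetical network output over the classes (blank, "a", "b").
alphabet = ["-", "a", "b"]
frames = [
    [0.10, 0.80, 0.10],   # a
    [0.20, 0.70, 0.10],   # a (repeat, collapsed)
    [0.90, 0.05, 0.05],   # blank (separates the two a's)
    [0.10, 0.60, 0.30],   # a
    [0.10, 0.20, 0.70],   # b
    [0.20, 0.10, 0.70],   # b (repeat, collapsed)
]
decoded = ctc_greedy_decode(frames, alphabet)
```

An empty result corresponds to the case in which the picture recognition text information is blank and step 203 is not triggered.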
Step 203, converting each character into a corresponding character vector, and inputting the character vector into the semantic correction model;
In the embodiment of the invention, after the characters in the material picture are recognized, the character recognition device can convert each recognized character into a corresponding character vector through a word-embedding technique, with one character vector per character. For example, when the recognized character set consists of two characters (rendered here as "safe"), the character recognition device may convert it into a 2×512 matrix, that is, each character is replaced by a 512-dimensional character vector.
Step 204, multiplying the character vector corresponding to each character by a predetermined matrix WQ, a predetermined matrix WK and a predetermined matrix WV through the semantic correction model to obtain a query character vector, a key character vector and a value character vector corresponding to the character;
In this step, the matrix WQ is the query matrix, and multiplying the character vector by it yields the query character vector; the matrix WK is the key matrix, and multiplying the character vector by it yields the key character vector; the matrix WV is the value matrix, and multiplying the character vector by it yields the value character vector. Multiplying the input character vector by the three predetermined matrices WQ, WK and WV can thus be understood as extracting three corresponding features of the character.
Specifically, at the beginning of model training, the matrices WQ, WK and WV are initialized randomly. The semantic correction model is then trained on a large number of training samples, and the matrices are adjusted continuously until the predetermined matrices WQ, WK and WV are obtained.
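The projection step can be sketched in plain Python as below. The sketch covers only the embedding lookup and the three matrix products; training is omitted, so the matrices stay at their random initial values. The dimension 8 (instead of 512), the two-character vocabulary and all helper names are assumptions chosen to keep the example small.

```python
import random

D_MODEL = 8  # illustrative; the text uses 512-dimensional character vectors


def random_matrix(rows, cols, rng):
    """Randomly initialised matrix, as WQ, WK and WV are before training."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain-Python product of an (m x k) matrix a and a (k x n) matrix b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


rng = random.Random(0)
vocab = {"a": 0, "b": 1}                         # hypothetical character vocabulary
embedding = random_matrix(len(vocab), D_MODEL, rng)
W_Q = random_matrix(D_MODEL, D_MODEL, rng)       # query matrix
W_K = random_matrix(D_MODEL, D_MODEL, rng)       # key matrix
W_V = random_matrix(D_MODEL, D_MODEL, rng)       # value matrix


def project(chars):
    """Return the query, key and value character vectors for a character sequence."""
    x = [embedding[vocab[c]] for c in chars]     # one row (character vector) per character
    return matmul(x, W_Q), matmul(x, W_K), matmul(x, W_V)


Q, K, V = project("ab")                          # each is a 2 x D_MODEL matrix
```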
Step 205, performing encoding and decoding operations on the query character vector, the key character vector and the value character vector corresponding to the characters to obtain predicted characters corresponding to each character;
In this step, encoding and decoding operations are performed on the resulting query, key and value character vectors. This can be understood as the three matrices extracting the corresponding features from the initial character, after which decoding fuses the feature values of the three vectors to obtain the predicted character corresponding to each character.
In a specific application, existing optical character recognition technology has a certain probability of recognition errors, so a semantic correction model is added to reduce that probability. The semantic correction model can correct a recognized character vector with low recognition accuracy by relating it to the character vectors before and after it. Training the semantic correction module requires the manual labeling results and the detection and recognition results from historical optical character recognition scenarios. For example, the manual labeling result is "2019, 5, 3", and the result obtained by the deep-learning text detection and recognition model is "2019, 5, 3"; the two together form one training sample pair. The semantic correction module learns to guide decoding according to category information and finally obtains text information that conforms to standard semantics. For example, when the picture recognition text information recognized by the pre-trained text recognition model is "splendid achnatherum at risk", the text recognition device inputs the corresponding vectors into the trained semantic correction module for semantic correction, obtains the semantically corrected text "splendid achnatherum at risk", and takes it as the final text recognition result of the image.
Step 206, gathering the predicted characters corresponding to all the characters according to the sequence to obtain the picture identification text information of the material picture;
Step 207, word segmentation is carried out on the picture identification text information and the material text information by using a word segmentation tool to obtain text word segmentation;
Step 208, filtering the sensitive words of the text word according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
Step 209, if the filtering result is that the filtering is failed, marking the text word which is identified as the sensitive word in the picture identification text information and the material text information, and returning the marked picture identification text information and the marked material text information to the user terminal;
Step 210, if the filtering result is that the filtering is successful, screening the big data tag from the big data tag library as a candidate tag corresponding to the text word through the characteristic word and the exclusion word in the preset big data tag library;
Step 211, counting the positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate labels in the picture identification text information and the material text information;
Step 212, calculating weight scores corresponding to the candidate labels according to a preset weight formula, positions and times;
Step 213, using the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
Step 214, adding the label to the corresponding material information, and sending the material information added with the label to a material system for material online.
On the basis of the previous embodiment, this embodiment describes in detail how the characters in the material picture are recognized through a preset optical character recognition technology to obtain the character set corresponding to the material picture; each character in the character set is converted into a corresponding character vector, and the character vectors are input into the semantic correction model of the optical character recognition technology to obtain the predicted character corresponding to each character; the predicted characters corresponding to all characters are then gathered in sequence to obtain the picture recognition text information of the material picture. Using the semantic correction model improves the accuracy of character recognition and reduces the probability of wrongly recognized characters.
Referring to fig. 3, a third embodiment of the material online method in an embodiment of the present invention includes:
Step 301, acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises material pictures and material text information;
Step 302, carrying out text recognition on a material picture through a preset optical character recognition technology to obtain picture recognition text information;
Step 303, calling a preset word segmentation tool, and dividing the picture identification text information and the material text information into N reference word segments with preset character lengths through a dictionary of the word segmentation tool, wherein N is a natural number not smaller than 1;
Preferably, the basic process of jieba word segmentation is as follows. Efficient word-graph scanning is implemented on a Trie (dictionary tree) structure, generating a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence. A trie is first generated from the dictionary file, and while the trie is generated the number of occurrences of each word in the dictionary is converted into a frequency. For a sentence to be segmented, the DAG is generated from this trie; put simply, the sentence is looked up in the given dictionary to produce the several possible segmentations. In the DAG, jieba records the starting position of each word in the sentence, from 0 to n-1 (n being the length of the sentence). Each starting position is a key of a dictionary whose value is a list of the ending positions of the possible words (a word is obtained by dictionary lookup, and its ending position follows from its starting position and its length). Dynamic programming is then used to search for the maximum-probability path and find the best segmentation based on word frequency. For each word already segmented in the sentence, its frequency (occurrences/total) is looked up; if the word is not in the dictionary, the frequency of the least frequent word in the dictionary is used instead, that is, P(word) = FREQ.get(word, min_freq).
In the dynamic programming search, the maximum probability is calculated over the sentence in reverse, from right to left: P(NodeN) = 1.0, P(NodeN-1) = P(NodeN) × max(P(last word)), and so on. This finally yields the maximum-probability path and hence the maximum-probability segmentation.
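The DAG construction and the right-to-left dynamic programming described above can be sketched without the jieba library itself. The sketch below is a simplified illustration, not jieba's source code: a plain dictionary lookup replaces the trie, log-probabilities replace products to avoid underflow, and the toy frequency table `FREQ` (Latin letters standing in for Chinese characters) is entirely hypothetical.

```python
import math

# Hypothetical word -> occurrence count table standing in for the dictionary file.
FREQ = {"a": 5, "b": 5, "c": 5, "d": 5, "ab": 20, "cd": 20, "abc": 2}
TOTAL = sum(FREQ.values())
MIN_FREQ = min(FREQ.values())   # fallback count for words missing from the dictionary


def build_dag(sentence):
    """Map each start index to the (inclusive) end indices of dictionary words."""
    n = len(sentence)
    dag = {}
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in FREQ]
        dag[start] = ends or [start]    # unknown character: single-character word
    return dag


def best_segmentation(sentence):
    """Search the maximum-probability path from right to left, then read it out."""
    n = len(sentence)
    dag = build_dag(sentence)
    route = {n: (0.0, 0)}               # log P(NodeN) = log 1.0 = 0
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], MIN_FREQ) / TOTAL)
             + route[end + 1][0], end)
            for end in dag[start]
        )
    words, i = [], 0
    while i < n:
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words
```

For the toy table, `best_segmentation("abcd")` prefers the two frequent words "ab" and "cd" over "abc" + "d" or four single characters, because that path has the highest combined probability.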
Step 304, obtaining the reference word segmentation divided by the word segmentation tool, searching a corresponding preset dictionary according to the character length of the reference word segmentation, and judging whether the reference word segmentation exists in the preset dictionary or not;
In this embodiment, after the picture recognition text information and the material text information are acquired, the word segmentation tool is called to divide them into a number of reference word segments of preset length. The corresponding preset dictionary is searched according to the length of each reference word segment to judge whether it exists in the dictionary. A reference word segment that is not in the preset dictionary indicates abnormal recognition; in that case, a target word is screened out for it through a fuzzy matching algorithm. Character recognition is thereby completed through fuzzy matching, which improves character recognition efficiency.
Step 305, when the reference word is not stored in the preset dictionary, screening the non-stored reference word by a fuzzy matching algorithm to obtain a text word;
In practical application, the fuzzy matching algorithm is based on edit distance, that is, the minimum number of editing operations required to convert one character string into another. The smaller the edit distance, the more similar the two strings, and an edit distance of 0 means the two strings are equal; character recognition is realized on this basis.
In a specific implementation, the BK-tree algorithm is used to search for words whose edit distance from the query is no greater than the word length. For example, if "middle garden" is not in the dictionary, searching the BK-tree for words within that edit distance may return "China". Here the edit distance from string A to string B is the minimum number of steps needed to change A into B using the three operations of insertion, deletion and substitution. For example, FAME to GATE takes two steps (two substitutions), and GAME to ACM takes three steps (delete G and E, insert C). The screened-out word is displayed as the target word, so text recognition is realized through the fuzzy matching algorithm and its accuracy is improved.
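The edit distance and the BK-tree lookup can be sketched as follows. This is a generic textbook version written for illustration; the class name `BKTree`, its methods and the sample word list are assumptions, and the distance threshold is passed explicitly instead of being tied to the word length.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, {distance_to_child: child_node})."""

    def __init__(self, words):
        words = iter(words)
        self.root = (next(words), {})
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return                                  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, max_dist):
        """Return sorted (distance, word) pairs with distance <= max_dist."""
        results, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = edit_distance(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only these subtrees can contain matches.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree(["GAME", "GATE", "FAME", "ACM"])
matches = tree.search("GAMF", max_dist=1)
```

The pruning in `search` is what makes the BK-tree worthwhile: subtrees whose edge label lies outside the allowed band are skipped without computing any further distances.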
Step 306, filtering the sensitive words of the text word according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
Step 307, if the filtering result is that the filtering is failed, marking the text word which is identified as the sensitive word in the picture identification text information and the material text information, and returning the marked picture identification text information and the marked material text information to the user terminal;
Step 308, if the filtering result is that the filtering is successful, screening the big data tag from the big data tag library as a candidate tag corresponding to the text word through the characteristic word and the exclusion word in the preset big data tag library;
Step 309, counting the positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate labels in the picture identification text information and the material text information;
Step 310, calculating weight scores corresponding to the candidate labels according to a preset weight formula, positions and times;
Step 311, using the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
Step 312, adding the label to the corresponding material information, and sending the material information with the label to a material system for material online.
On the basis of the foregoing embodiment, this embodiment describes in detail how a preset word segmentation tool is called and the picture recognition text information and the material text information are divided into N reference word segments of preset character length through the dictionary of the word segmentation tool, where N is a natural number not smaller than 1; the reference word segments produced by the tool are acquired, the corresponding preset dictionary is searched according to their character length, and it is judged whether each reference word segment exists in the preset dictionary; when a reference word segment is not in the preset dictionary, it is screened through a fuzzy matching algorithm to obtain a text word segment, which is then displayed. With the preset word segmentation tool, the recognized characters form text word segments with practical meaning, and word recognition efficiency is improved.
Referring to fig. 4, a fourth embodiment of the material online method in an embodiment of the present invention includes:
Step 401, acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises material pictures and material text information;
Step 402, carrying out text recognition on a material picture through a preset optical character recognition technology to obtain picture recognition text information;
Step 403, performing word segmentation operation on the picture identification text information and the material text information by using a word segmentation tool to obtain text word segmentation;
Step 404, performing sensitive word filtering on text word segmentation according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
Step 405, if the filtering result is that the filtering is failed, marking the text word which is identified as the sensitive word in the picture identification text information and the material text information, and returning the marked picture identification text information and the marked material text information to the user terminal;
Step 406, traversing the big data tag library based on the characteristic words and the exclusion words, and matching the big data tag corresponding to the text word;
In this embodiment, owing to the flexibility of language and the diversity of word senses, the sense of a word can sometimes be confused. For example, the "gold" tag in a banking scenario refers to business related to buying gold, but a resource in which the word "gold" appears is not necessarily related to that business. Exclusion words are therefore set as a constraint; for example, the exclusion words of "gold" include "golden week", "golden holiday" and so on. When the exclusion word "golden week" appears in the text word segments, the feature word "gold" is known not to apply, so generation of an erroneous "gold" tag is avoided.
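The interplay of feature words and exclusion words during tag matching can be sketched as below. The sketch takes the strict reading that any exclusion word in the text word segments suppresses the tag entirely; the "gold" entries come from the examples in the text, while the "loan" tag, the library layout and the function `candidate_tags` are hypothetical.

```python
# Hypothetical tag library: each tag has feature words and exclusion words.
TAG_LIBRARY = {
    "gold": {
        "features": {"gold", "gold price", "buy gold"},
        "exclusions": {"golden week", "golden holiday"},
    },
    "loan": {
        "features": {"loan", "mortgage"},
        "exclusions": set(),
    },
}


def candidate_tags(tokens, library=TAG_LIBRARY):
    """Return tags with at least one feature word and no exclusion word in tokens."""
    token_set = set(tokens)
    candidates = []
    for tag, config in library.items():
        has_feature = bool(token_set & config["features"])
        has_exclusion = bool(token_set & config["exclusions"])
        if has_feature and not has_exclusion:
            candidates.append(tag)
    return candidates
```

A softer variant would keep the tag as a candidate and let the exclusion word lower its weight score instead, which is closer to the subtractive term F in the position-based weight formula.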
In this embodiment, the big data tag library is mainly obtained by the following steps:
acquiring a history audit text, performing word segmentation processing on the history audit text to obtain history text word segmentation, converting the history text word segmentation into corresponding word vectors, and setting N clusters; randomly selecting N historical text segmentation words as initial clustering centers; cosine distances from all historical text word segmentation to each initial clustering center are calculated respectively; dividing historical text word segmentation with the cosine distance from an initial clustering center within a preset error range into the same label category; calculating the mean value vector of the word vectors in the label category, reselecting the current clustering center according to the mean value vector, and calculating the cosine distance between the current clustering center and the corresponding initial clustering center; if the cosine distance between the current clustering center and the corresponding initial clustering center is smaller than or equal to a preset threshold value, N label categories corresponding to all the historical text segmentation and the clustering center corresponding to the label category are obtained; if the cosine distance between the current clustering center and the corresponding initial clustering center is greater than a preset threshold, re-clustering is carried out by the current clustering center until the cosine distance between the current clustering center and the last clustering center is less than or equal to the preset threshold, and N label categories corresponding to all the historical text segmentation and the clustering centers corresponding to the label categories are obtained; taking the clustering center as a big data tag, and taking the historical text segmentation word in the corresponding tag class as a characteristic word corresponding to the big data tag; and obtaining a big data tag library according to the big data tag and the feature words.
In this step, before the cosine distance is calculated, the labeled speech segment must be converted into word-vector form: the labeled speech segment is input into a word2vec model, which outputs the corresponding word vector. word2vec is a word embedding technique (Chinese name "word vector") that converts words in natural language into dense vectors that a computer can process. word2vec has two main modes, CBOW (Continuous Bag of Words) and Skip-Gram. CBOW predicts the target word from its surrounding context, whereas Skip-Gram does the opposite and predicts the surrounding context from the target word. CBOW is suited to small corpora, while Skip-Gram performs better on large corpora.
In this embodiment, in order to divide the feature words and exclusion words of each big data tag accurately and to improve the accuracy of the big data tags, historical materials are used to train and update the big data tag library. After the clustering centers are obtained through clustering, each clustering center can serve as the standard for the corresponding big data tag category. It can serve as this standard because clustering divides a given sample set into several clusters according to the distance between samples, so that the points within a cluster are as close together as possible; that is, the clustering center is the point with the smallest average distance to the other points in its cluster. The feature word corresponding to the clustering center is therefore semantically closest to the other feature words of the same category, so it can be used as the standard for that category, and a mapping relation between the feature words and the big data tag is established. For example, a database table with two columns can be maintained: keyword holds all the words to be extracted from the texts, and index_name holds the big data tag that stands for all keywords with the same meaning.
Step 407, judging whether the text word segmentation contains an exclusion word corresponding to the big data tag;
In this step, to prevent stop words such as auxiliary words, conjunctions and adverbs, which occur frequently but carry no specific meaning, from being identified, the system also sets a flow for judging whether stop words exist, which effectively improves the accuracy and relevance of the automatically generated labels.
Step 408, if yes, determining that the text segmentation does not have a corresponding candidate label;
step 409, if not, using the big data tag as a candidate tag of text word segmentation;
step 410, counting the positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate labels in the picture identification text information and the material text information;
step 411, calculating the weight scores corresponding to the candidate labels according to a preset weight formula, positions and times;
step 412, using the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
step 413, adding the label to the corresponding material information, and uploading the material information with the label to a material system;
step 414, submitting the labeled material information through the material system to a preset office automation system for process signing and reporting;
step 415, when the office automation system completes the signing and reporting of the labeled material information, releasing the labeled material information and putting it online.
On the basis of the previous embodiment, this embodiment of the invention describes the label matching process in detail: big data tags are screened from the big data tag library through the feature words and the exclusion words as candidate labels corresponding to the text word segmentation; the positions and times of occurrence of the feature words and exclusion words corresponding to the candidate labels in the picture identification text information and the material text information are counted; the weight scores corresponding to the candidate labels are calculated according to a preset weight formula, the positions and the times; and the candidate label corresponding to the largest weight score among all the weight scores is taken as the label corresponding to the material. Specifically, the feature words in the big data tag library are traversed and matched against the text word segmentation to determine the big data tag corresponding to the text word segmentation; whether the text word segmentation contains an exclusion word corresponding to the big data tag is judged; if yes, it is determined that the text word segmentation has no corresponding candidate label; if not, the big data tag is used as a candidate label of the text word segmentation; and these steps are executed for all text word segmentations to obtain the candidate labels corresponding to all of them. By using feature words and exclusion words and adding weights to the screened candidate labels, labels with higher relevance can be obtained more effectively, which improves the efficiency and accuracy of automation.
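The screening and scoring steps summarized above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the text does not disclose the preset weight formula, so the position weights below are hypothetical, as are the names `BigDataTag`, `candidate_tags`, `weight_score` and `pick_label`. For simplicity the exclusion check is applied to the whole token set, and only feature-word hits are counted in the score.

```python
from dataclasses import dataclass, field


@dataclass
class BigDataTag:
    name: str
    feature_words: set
    exclusion_words: set = field(default_factory=set)


# Hypothetical position weights (assumption): a hit in the uploaded material
# text counts more than a hit in text recognized from the picture.
POSITION_WEIGHTS = {"material_text": 2.0, "picture_text": 1.0}


def candidate_tags(tokens, tag_library):
    """Keep tags with at least one feature word and no exclusion word present."""
    token_set = set(tokens)
    candidates = []
    for tag in tag_library:
        if not (tag.feature_words & token_set):
            continue  # no feature word matched this tag
        if tag.exclusion_words & token_set:
            continue  # an exclusion word disables the feature word
        candidates.append(tag)
    return candidates


def weight_score(tag, tokens_by_position):
    """Sum feature-word occurrences, weighted by where they occur."""
    score = 0.0
    for position, tokens in tokens_by_position.items():
        hits = sum(1 for token in tokens if token in tag.feature_words)
        score += POSITION_WEIGHTS.get(position, 1.0) * hits
    return score


def pick_label(tokens_by_position, tag_library):
    """Return the name of the highest-scoring candidate tag, or None."""
    all_tokens = [t for tokens in tokens_by_position.values() for t in tokens]
    candidates = candidate_tags(all_tokens, tag_library)
    if not candidates:
        return None
    best = max(candidates, key=lambda tag: weight_score(tag, tokens_by_position))
    return best.name
```

With a library containing a "gold" tag whose exclusion words include "golden week", material whose tokens contain both "gold" and "golden week" yields no "gold" candidate, which mirrors the example given for step 406.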
The material online method in the embodiments of the present invention is described above; the material online device in the embodiments of the present invention is described below. Referring to Fig. 5, one embodiment of the material online device in the embodiments of the present invention includes:
the acquiring module 501 is configured to acquire material information uploaded to a material system by a user terminal, where the material information at least includes material pictures and material text information;
the recognition module 502 is configured to perform text recognition on the material picture through a preset optical character recognition technology, so as to obtain picture recognition text information;
a word segmentation module 503, configured to perform word segmentation on the picture identification text information and the material text information by using a word segmentation tool, so as to obtain text word segmentation;
the filtering module 504 is configured to filter the text word according to a preset word library of sensitive words, so as to obtain a filtering result, where the filtering result includes filtering success or filtering failure;
the marking module 505 is configured to mark the text word that is identified as the sensitive word in the picture identification text information and the material text information when the filtering result is that the filtering fails, and return the marked picture identification text information and the marked material text information to the user terminal;
the screening module 506 is configured to screen, when the filtering result is that the filtering is successful, a big data tag from the big data tag library through a feature word and an exclusion word in a preset big data tag library, where the big data tag is used as a candidate tag corresponding to the text word;
the statistics module 507 is configured to count positions and times of occurrence of the feature word and the exclusion word corresponding to the candidate tag in the picture identification text information and the material text information;
the calculating module 508 is configured to calculate a weight score corresponding to the candidate tag according to a preset weight formula, the position and the number of times;
the labeling module 509 is configured to use a candidate label corresponding to a weight score with the largest value in all the weight scores as a label corresponding to the material;
and the online module 510 is configured to add the tag to corresponding material information, and send the material information with the tag to the material system for online material.
In the embodiment of the invention, the material online device runs the material online method, and material information uploaded to a material system by a user terminal is obtained, wherein the material information at least comprises material pictures and material text information; carrying out text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information; performing word segmentation operation on the picture identification text information and the material text information by using a word segmentation tool to obtain text word segmentation; filtering the sensitive words of the text word segmentation according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure; if the filtering result is that the filtering is failed, marking the text word which is identified as the sensitive word in the picture identification text information and the material text information, and returning the marked picture identification text information and the marked material text information to the user terminal; if the filtering result is that the filtering is successful, screening the big data tag from the big data tag library as a candidate tag corresponding to the text word segmentation through a characteristic word and an exclusion word in a preset big data tag library; counting the positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate labels in the picture identification text information and the material text information; calculating weight scores corresponding to the candidate labels according to a preset weight formula, the positions and the times; taking a candidate label corresponding to the weight score with the largest value in all the weight scores as a label corresponding to the material; and adding the label into the corresponding material 
information, and sending the material information added with the label to the material system for material online. This solves the technical problem of low labeling accuracy in the existing material online process.
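The sensitive-word branch of the flow summarized above can be sketched as follows. This is a minimal sketch, assuming the lexicon is a plain set of words and that marking means wrapping each hit in visible markers; the function names and the `[[`/`]]` markers are hypothetical and are not taken from the text.

```python
import re


def filter_sensitive(tokens, lexicon):
    """Return (passed, hits); passed is True when no token is in the lexicon."""
    hits = [token for token in tokens if token in lexicon]
    return (not hits, hits)


def mark_sensitive(text, hits, open_mark="[[", close_mark="]]"):
    """Wrap every sensitive word found in the text so the uploader can see it."""
    if not hits:
        return text
    # Longest words first, so a longer hit is not split by a shorter one.
    ordered = sorted(set(hits), key=len, reverse=True)
    pattern = "|".join(re.escape(word) for word in ordered)
    return re.sub(pattern, lambda m: f"{open_mark}{m.group(0)}{close_mark}", text)
```

When `passed` is false, the marked picture identification text and material text would be returned to the user terminal; otherwise the flow continues to candidate label screening.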
Referring to Fig. 6, a second embodiment of the material online device in the embodiments of the present invention includes:
the acquiring module 501 is configured to acquire material information uploaded to a material system by a user terminal, where the material information at least includes material pictures and material text information;
the recognition module 502 is configured to perform text recognition on the material picture through a preset optical character recognition technology, so as to obtain picture recognition text information;
a word segmentation module 503, configured to perform word segmentation on the picture identification text information and the material text information by using a word segmentation tool, so as to obtain text word segmentation;
the filtering module 504 is configured to filter the text word according to a preset word library of sensitive words, so as to obtain a filtering result, where the filtering result includes filtering success or filtering failure;
the marking module 505 is configured to mark the text word that is identified as the sensitive word in the picture identification text information and the material text information when the filtering result is that the filtering fails, and return the marked picture identification text information and the marked material text information to the user terminal;
the screening module 506 is configured to screen, when the filtering result is that the filtering is successful, a big data tag from the big data tag library through a feature word and an exclusion word in a preset big data tag library, where the big data tag is used as a candidate tag corresponding to the text word;
the statistics module 507 is configured to count positions and times of occurrence of the feature word and the exclusion word corresponding to the candidate tag in the picture identification text information and the material text information;
the calculating module 508 is configured to calculate a weight score corresponding to the candidate tag according to a preset weight formula, the position and the number of times;
the labeling module 509 is configured to use a candidate label corresponding to a weight score with the largest value in all the weight scores as a label corresponding to the material;
and the online module 510 is configured to add the tag to corresponding material information, and send the material information with the tag to the material system for online material.
In this embodiment, the recognition module 502 includes:
the character set recognition unit 5021 is used for recognizing characters in the material picture through a preset optical character recognition technology to obtain a character set corresponding to the material picture;
a character correcting unit 5022, configured to convert each character in the character set into a corresponding character vector, and input the character vector into a preset semantic correction model to obtain a predicted character corresponding to each character;
and the character summarizing unit 5023 is used for summarizing the predicted characters corresponding to all the characters in sequence to obtain the picture identification text information of the material picture.
In this embodiment, the character correcting unit 5022 is specifically configured to: converting each character into a corresponding character vector, and inputting the character vector into the semantic correction model; multiplying the character vector corresponding to each character by a predetermined matrix WQ, a predetermined matrix WK and a predetermined matrix WV through the semantic correction model to obtain a query character vector, a key character vector and a value character vector corresponding to the character; and executing encoding and decoding operations on the query character vector, the key character vector and the value character vector corresponding to the characters to obtain predicted characters corresponding to each character.
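The projection performed by the character correcting unit can be illustrated with a small pure-Python sketch. The multiplication by WQ, WK and WV follows the description above; the scaled dot-product attention that combines the resulting vectors is an assumption about what the encoding operation involves (it is the standard Transformer formulation, which the text does not spell out), and the mapping from the output vectors to predicted characters is omitted. All function names are hypothetical.

```python
import math


def matvec(vector, matrix):
    """Multiply a row vector of length d by a d x k matrix."""
    columns = range(len(matrix[0]))
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in columns]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(char_vectors, WQ, WK, WV):
    """Project each character vector to query/key/value vectors and mix the values."""
    queries = [matvec(x, WQ) for x in char_vectors]
    keys = [matvec(x, WK) for x in char_vectors]
    values = [matvec(x, WV) for x in char_vectors]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this character's query to every character's key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors gives the context-aware vector.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs
```

Each output vector blends information from all characters in the set, which is what lets a correction model use context to repair a character that optical character recognition misread.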
In this embodiment, the word segmentation module 503 is configured to: calling a preset word segmentation tool, and dividing the picture identification text information and the material text information into N reference word segmentation with preset character lengths through a dictionary of the word segmentation tool, wherein N is a natural number not smaller than 1; acquiring a reference word segmentation after the word segmentation tool is used for dividing, searching a corresponding preset dictionary according to the character length of the reference word segmentation, and judging whether the reference word segmentation exists in the preset dictionary; when the reference word is not stored in the preset dictionary, screening the non-stored reference word by a fuzzy matching algorithm to obtain the text word.
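The dictionary lookup and fuzzy-matching stage of the word segmentation module can be sketched as follows. The sketch assumes the reference word segmentations have already been produced by some segmentation tool, indexes the preset dictionary by character length as described above, and uses Python's `difflib.get_close_matches` as a stand-in for the unspecified fuzzy matching algorithm. Dropping segments that have no close match, the 0.6 cutoff and the function names are all assumptions made for illustration.

```python
import difflib


def build_length_index(words):
    """Group the preset dictionary by character length."""
    index = {}
    for word in words:
        index.setdefault(len(word), set()).add(word)
    return index


def resolve_segments(reference_segments, length_index, cutoff=0.6):
    """Keep known segments; replace unknown ones with a close same-length word."""
    resolved = []
    for segment in reference_segments:
        same_length = length_index.get(len(segment), set())
        if segment in same_length:
            resolved.append(segment)  # found in the preset dictionary
            continue
        close = difflib.get_close_matches(
            segment, sorted(same_length), n=1, cutoff=cutoff
        )
        if close:
            resolved.append(close[0])  # fuzzy match, e.g. an OCR misread
        # Segments with no close match are dropped (assumption).
    return resolved
```

Looking up only the words of the same character length keeps each fuzzy comparison small, which is one plausible reason for organizing the preset dictionary by length.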
In this embodiment, the screening module 506 is specifically configured to: traversing the big data tag library based on the characteristic words and the exclusion words, and matching big data tags corresponding to the text segmentation words; judging whether the text word segmentation contains an exclusion word corresponding to the big data tag;
if yes, determining that the text segmentation does not have a corresponding candidate label; and if not, taking the big data label as a candidate label of the text word segmentation.
In this embodiment, the material online device further includes a label clustering module 511, where the label clustering module 511 is configured to: acquiring a history audit text, performing word segmentation processing on the history audit text to obtain history text word segmentation, and converting the history text word segmentation into a corresponding word vector; randomly selecting N historical text word segmentation from the word vectors to serve as an initial clustering center, wherein N is a positive integer; respectively calculating cosine distances from all historical text word segmentation to each initial clustering center; dividing historical text word segmentation with the cosine distance from the initial clustering center nearest to the initial clustering center into the same label category; calculating a mean vector of word vectors in the tag class, reselecting a current clustering center for the mean vector, and calculating a cosine distance between the current clustering center and a corresponding initial clustering center; if the cosine distance between the current clustering center and the corresponding initial clustering center is smaller than or equal to a preset threshold value, N label categories corresponding to all the historical text segmentation and the clustering center corresponding to the label category are obtained; if the cosine distance between the current clustering center and the corresponding initial clustering center is greater than a preset threshold, re-clustering is carried out by the current clustering center until the cosine distance between the current clustering center and the last clustering center is less than or equal to the preset threshold, and N label categories corresponding to all historical text segmentation and the clustering centers corresponding to the label categories are obtained; taking the clustering center as a big data tag, and taking the historical text segmentation word in the corresponding tag 
class as a characteristic word corresponding to the big data tag; and obtaining a big data tag library according to the big data tag and the feature words.
In this embodiment, the online module 510 is specifically configured to: adding the label to the corresponding material information, and uploading the material information added with the label to the material system; submitting the labeled material information to a preset office automation system through the material system for process signing and reporting; when the office automation system finishes signing and reporting the material information with the labels, the material information with the labels is released and put on line.
On the basis of the previous embodiment, this embodiment describes in detail the specific functions of each module and the unit composition of some of the modules, and uses the newly added module to train and update the big data tag library with historical materials. After the clustering centers are obtained through clustering, each clustering center can serve as the standard for the corresponding big data tag category. It can serve as this standard because clustering divides a given sample set into several clusters according to the distance between samples, so that the points within a cluster are as close together as possible; that is, the clustering center is the point with the smallest average distance to the other points in its cluster. The feature word corresponding to the clustering center is therefore semantically closest to the other feature words of the same category, so it can be used as the standard for that category, and a mapping relation between the feature words and the big data tag is established. This improves the relevance and accuracy of the labels with respect to the materials.
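The clustering procedure carried out by the label clustering module 511 can be sketched as a k-means loop over cosine distance. This is a minimal sketch rather than the patented implementation: it assumes the word vectors are already available (for example from a word2vec model) and are non-zero, assigns each word to its nearest center, and stops when no center has moved by more than a threshold. The function names, the default threshold, the iteration cap and the seed parameter are hypothetical.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_words(word_vectors, n_clusters, threshold=1e-4, max_iter=100, seed=0):
    """word_vectors maps word -> vector. Returns (centers, clusters of words)."""
    rng = random.Random(seed)
    words = sorted(word_vectors)
    # Randomly pick N words; their vectors are the initial clustering centers.
    centers = [word_vectors[w] for w in rng.sample(words, n_clusters)]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for word in words:
            distances = [cosine_distance(word_vectors[word], c) for c in centers]
            clusters[distances.index(min(distances))].append(word)
        # The mean vector of each cluster becomes the current clustering center.
        new_centers = [
            mean_vector([word_vectors[w] for w in members]) if members else old
            for members, old in zip(clusters, centers)
        ]
        shift = max(cosine_distance(n, o) for n, o in zip(new_centers, centers))
        centers = new_centers
        if shift <= threshold:
            break  # centers have stopped moving
    return centers, clusters
```

Each returned cluster corresponds to one label category, and the words in it are the feature words of the big data tag represented by that cluster's center.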
Fig. 5 and Fig. 6 above describe the material online device in the embodiments of the present invention in detail from the perspective of modular functional entities; the following describes the material online equipment in the embodiments of the present invention in detail from the perspective of hardware processing.
Fig. 7 is a schematic structural diagram of material online equipment according to an embodiment of the present invention. The material online equipment 700 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 710, a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing application programs 733 or data 732. The memory 720 and the storage medium 730 may be transitory or persistent storage. The program stored on the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations for the material online equipment 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and to execute the series of instruction operations in the storage medium 730 on the material online equipment 700, so as to implement the steps of the material online method described above.
The material online equipment 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. Those skilled in the art will appreciate that the structure shown in Fig. 7 does not limit the material online equipment provided herein, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, or may be a volatile computer readable storage medium, where instructions are stored in the computer readable storage medium, which when executed on a computer, cause the computer to perform the steps of the material online method.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system or apparatus and unit described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied, in essence or in whole or in part, in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A material online method, characterized by comprising the following steps:
acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises material pictures and material text information;
carrying out text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information;
performing word segmentation operation on the picture identification text information and the material text information by using a word segmentation tool to obtain text word segmentation;
filtering the sensitive words of the text word segmentation according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
if the filtering result is that the filtering is failed, marking the text word which is identified as the sensitive word in the picture identification text information and the material text information, and returning the marked picture identification text information and the marked material text information to the user terminal;
if the filtering result is that the filtering is successful, screening the big data tag from the big data tag library as a candidate tag corresponding to the text word segmentation through a characteristic word and an exclusion word in a preset big data tag library;
counting the positions and times of occurrence of the feature words and the exclusion words corresponding to the candidate labels in the picture identification text information and the material text information;
calculating weight scores corresponding to the candidate labels according to a preset weight formula, the positions and the times;
taking a candidate label corresponding to the weight score with the largest value in all the weight scores as a label corresponding to the material;
adding the label into the corresponding material information, and sending the material information added with the label to the material system for material online;
the step of performing word segmentation on the picture identification text information and the material text information by using a word segmentation tool to obtain text word segmentation comprises the following steps:
calling a preset word segmentation tool, and dividing the picture identification text information and the material text information into N reference word segmentation with preset character lengths through a dictionary of the word segmentation tool, wherein N is a natural number not smaller than 1;
acquiring a reference word segmentation after the word segmentation tool is used for dividing, searching a corresponding preset dictionary according to the character length of the reference word segmentation, and judging whether the reference word segmentation exists in the preset dictionary;
when the reference word is not stored in the preset dictionary, screening the non-stored reference word by a fuzzy matching algorithm to obtain the text word;
the screening the big data tag from the big data tag library as the candidate tag corresponding to the text word segmentation through the characteristic word and the exclusion word in the preset big data tag library comprises the following steps:
traversing the big data tag library based on the characteristic words and the exclusion words, and matching big data tags corresponding to the text segmentation words;
judging whether the text word segmentation contains an exclusion word corresponding to the big data tag;
if yes, determining that the text segmentation does not have a corresponding candidate label;
and if not, taking the big data label as a candidate label of the text word segmentation.
2. The material online method according to claim 1, wherein the carrying out text recognition on the material picture through a preset optical character recognition technology to obtain picture recognition text information comprises:
identifying characters in the material picture through a preset optical character identification technology to obtain a character set corresponding to the material picture;
converting each character in the character set into a corresponding character vector, and inputting the character vector into a preset semantic correction model to obtain a predicted character corresponding to each character;
and sequentially summarizing the predicted characters corresponding to all the characters to obtain the picture identification text information of the material picture.
3. The material online method according to claim 2, wherein the converting each character in the character set into a corresponding character vector and inputting the character vector into a preset semantic correction model to obtain a predicted character corresponding to each character comprises:
converting each character into a corresponding character vector;
inputting the character vectors into the semantic correction model, and multiplying the character vectors corresponding to each character by a predetermined matrix WQ, a predetermined matrix WK and a predetermined matrix WV through the semantic correction model to obtain query character vectors, key character vectors and value character vectors corresponding to each character;
and executing encoding and decoding operations on the query character vector, the key character vector and the value character vector corresponding to the characters to obtain predicted characters corresponding to each character.
4. The material online method according to claim 1, further comprising, before the acquiring the material information uploaded to the material system by the user terminal:
acquiring a history audit text, performing word segmentation processing on the history audit text to obtain history text word segmentation, and converting the history text word segmentation into a corresponding word vector;
randomly selecting N historical text word segmentation from the word vectors to serve as an initial clustering center, wherein N is a positive integer;
respectively calculating cosine distances from all historical text word segmentation to each initial clustering center;
assigning each historical text word segmentation to the label category of the initial clustering center nearest to it in cosine distance;
calculating the mean vector of the word vectors in each label category, taking the mean vector as the current clustering center, and calculating the cosine distance between the current clustering center and the corresponding initial clustering center;
if the cosine distance between the current clustering center and the corresponding initial clustering center is smaller than or equal to a preset threshold value, N label categories corresponding to all the historical text segmentation and the clustering center corresponding to the label category are obtained;
if the cosine distance between the current clustering center and the corresponding initial clustering center is greater than the preset threshold, re-clustering with the current clustering center until the cosine distance between the current clustering center and the previous clustering center is less than or equal to the preset threshold, and obtaining the N label categories corresponding to all the historical text word segmentation and the clustering centers corresponding to the label categories;
taking the clustering center as a big data tag, and taking the historical text word segmentation in the corresponding label category as the feature words corresponding to the big data tag;
and obtaining a big data tag library according to the big data tag and the feature words.
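The clustering steps of claim 4 can be sketched as a k-means loop that uses cosine distance. This is an illustrative sketch only: the function names, the default threshold, the iteration cap, the toy word vectors and the `tag_0`, `tag_1` naming of the resulting library are assumptions, not values given in the patent.

```python
import math
import random

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cluster_words(word_vectors, n, threshold=1e-6, max_iter=100, seed=0):
    """Cluster words (dict: word -> vector) into n label categories.

    Returns (centers, categories). Iteration stops once every center moves
    by at most `threshold` in cosine distance, or after max_iter rounds."""
    rng = random.Random(seed)
    words = list(word_vectors)
    centers = [word_vectors[w] for w in rng.sample(words, n)]
    categories = [[] for _ in range(n)]
    for _ in range(max_iter):
        categories = [[] for _ in range(n)]
        for w in words:
            dists = [cosine_distance(word_vectors[w], c) for c in centers]
            categories[dists.index(min(dists))].append(w)
        # An empty category keeps its previous center.
        new_centers = [
            mean_vector([word_vectors[w] for w in cat]) if cat else centers[i]
            for i, cat in enumerate(categories)
        ]
        shift = max(cosine_distance(c, nc) for c, nc in zip(centers, new_centers))
        centers = new_centers
        if shift <= threshold:
            break
    return centers, categories

# Toy word vectors (hypothetical values, not real embeddings).
vectors = {
    "loan": [1.0, 0.1], "credit": [0.9, 0.2],
    "fund": [0.1, 1.0], "stock": [0.2, 0.9],
}
centers, categories = cluster_words(vectors, n=2)
# Each cluster becomes one tag whose feature words are the cluster members.
tag_library = {f"tag_{i}": sorted(cat) for i, cat in enumerate(categories)}
```

The claim treats the clustering center itself as the big data tag; the sketch uses placeholder tag names because a mean vector has no readable name without a further naming step.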
5. The material online method according to claim 1, wherein adding the label to the corresponding material information and sending the labeled material information to the material system for material online comprises:
adding the label to the corresponding material information, and uploading the material information added with the label to the material system;
submitting the labeled material information to a preset office automation system through the material system for process signing and reporting;
and when the office automation system completes the signing and reporting of the labeled material information, releasing the labeled material information online.
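The release flow of claim 5 can be pictured as a small state transition. The sketch below is only an illustration under assumed names: `Material`, `Status`, `submit_for_sign_off` and `on_sign_off_complete` are hypothetical, and the real material system and office automation system interfaces are not described in the claims.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    UPLOADED = "uploaded"
    PENDING_SIGN_OFF = "pending_sign_off"
    ONLINE = "online"

@dataclass
class Material:
    text: str
    label: str = ""
    status: Status = Status.UPLOADED

def submit_for_sign_off(material, label):
    """Attach the label and hand the material to the sign-off workflow."""
    material.label = label
    material.status = Status.PENDING_SIGN_OFF
    return material

def on_sign_off_complete(material):
    """Release the material online once sign-off has finished."""
    if material.status is not Status.PENDING_SIGN_OFF:
        raise ValueError("material has not been submitted for sign-off")
    material.status = Status.ONLINE
    return material
```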
6. A material online device, characterized in that the material online device comprises:
an acquisition module, used for acquiring material information uploaded to a material system by a user terminal, wherein the material information at least comprises a material picture and material text information;
the recognition module is used for carrying out text recognition on the material picture through a preset optical character recognition technology to obtain picture identification text information;
the word segmentation module is used for carrying out word segmentation operation on the picture identification text information and the material text information by using a word segmentation tool to obtain text word segmentation;
the filtering module is used for filtering the sensitive words of the text word segmentation according to a preset sensitive word lexicon to obtain a filtering result, wherein the filtering result comprises filtering success or filtering failure;
the marking module is used for marking the text word segmentation of the sensitive word identified in the picture identification text information and the material text information when the filtering result is that the filtering is failed, and returning the marked picture identification text information and the marked material text information to the user terminal;
the screening module is used for screening big data tags from a preset big data tag library through the feature words and the exclusion words in the big data tag library when the filtering result is that the filtering is successful, and using the big data tags as candidate labels corresponding to the text word segmentation;
the statistics module is used for counting the positions and times of the occurrence of the feature words and the exclusion words corresponding to the candidate labels in the picture identification text information and the material text information;
the calculating module is used for calculating the weight scores corresponding to the candidate labels according to a preset weight formula, the positions and the times;
the marking module is used for taking the candidate label corresponding to the weight score with the largest value in all the weight scores as the label corresponding to the material;
the online module is used for adding the label into the corresponding material information and sending the material information added with the label to the material system for material online;
the word segmentation module is specifically used for: calling a preset word segmentation tool, and dividing the picture identification text information and the material text information into N reference word segmentations of preset character lengths through a dictionary of the word segmentation tool, wherein N is a natural number not smaller than 1; acquiring the reference word segmentations produced by the word segmentation tool, searching the preset dictionary corresponding to the character length of each reference word segmentation, and judging whether the reference word segmentation exists in the preset dictionary; and when a reference word segmentation does not exist in the preset dictionary, screening that reference word segmentation by a fuzzy matching algorithm to obtain the text word segmentation;
the screening module is specifically used for: traversing the big data tag library based on the feature words and the exclusion words, and matching the big data tag corresponding to the text word segmentation; judging whether the text word segmentation contains an exclusion word corresponding to the big data tag;
if yes, determining that the text word segmentation has no corresponding candidate label; and if not, taking the big data tag as a candidate label of the text word segmentation.
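The screening, statistics, calculating and marking modules of claim 6 can be sketched together. The preset weight formula is not given in the claims, so the sketch assumes a simple one: each feature-word occurrence is weighted by where it appears (picture text versus material text). The weights, the library layout and all function names are hypothetical.

```python
def screen_candidates(tokens, tag_library):
    """Return tags that match at least one feature word and no exclusion word.

    tag_library maps tag -> {"features": set, "exclusions": set}."""
    token_set = set(tokens)
    return [
        tag for tag, entry in tag_library.items()
        if token_set & entry["features"] and not token_set & entry["exclusions"]
    ]

def weight_score(entry, picture_tokens, material_tokens,
                 picture_weight=1.0, material_weight=2.0):
    """Assumed formula: occurrence count times a per-position weight."""
    score = 0.0
    for word in entry["features"]:
        score += picture_weight * picture_tokens.count(word)
        score += material_weight * material_tokens.count(word)
    return score

def pick_label(picture_tokens, material_tokens, tag_library):
    """Choose the candidate tag with the largest weight score, or None."""
    candidates = screen_candidates(picture_tokens + material_tokens, tag_library)
    if not candidates:
        return None
    return max(candidates,
               key=lambda t: weight_score(tag_library[t], picture_tokens, material_tokens))

# Hypothetical tag library and tokens, for illustration only.
library = {
    "credit card": {"features": {"credit", "card"}, "exclusions": {"debit"}},
    "wealth": {"features": {"fund", "yield"}, "exclusions": set()},
}
label = pick_label(["credit", "card", "offer"],
                   ["fund", "credit", "card", "cashback"], library)
```

Under these assumed weights, "credit card" scores higher than "wealth" for the sample tokens, so it is chosen; adding "debit" to either token list would exclude it at the screening step.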
7. The material online device according to claim 6, wherein the recognition module specifically comprises:
the character set recognition unit is used for recognizing characters in the material picture through a preset optical character recognition technology to obtain a character set corresponding to the material picture;
the character correction unit is used for converting each character in the character set into a corresponding character vector, and inputting the character vector into a preset semantic correction model to obtain a predicted character corresponding to each character;
and the character summarizing unit is used for summarizing the predicted characters corresponding to all the characters in sequence to obtain the picture identification text information of the material picture.
8. The material online device according to claim 7, wherein the character correction unit is specifically used for:
converting each character into a corresponding character vector, and inputting the character vector into the semantic correction model;
multiplying the character vector corresponding to each character by a predetermined matrix WQ, a predetermined matrix WK and a predetermined matrix WV through the semantic correction model to obtain a query character vector, a key character vector and a value character vector corresponding to the character;
and executing encoding and decoding operations on the query character vector, the key character vector and the value character vector corresponding to the characters to obtain predicted characters corresponding to each character.
9. Material online equipment, characterized in that the material online equipment comprises: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the material online equipment to perform the steps of the material online method of any one of claims 1-5.
10. A computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the material online method according to any of claims 1-5.
CN202111273705.7A 2021-10-29 2021-10-29 Material online method, device, equipment and storage medium Active CN114003750B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111273705.7A CN114003750B (en) 2021-10-29 2021-10-29 Material online method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111273705.7A CN114003750B (en) 2021-10-29 2021-10-29 Material online method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114003750A (en) 2022-02-01
CN114003750B (en) 2024-03-26

Family

ID=79925245

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN202111273705.7A Active CN114003750B (en) 2021-10-29 2021-10-29 Material online method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114003750B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994772B (en) * 2023-02-22 2024-03-08 中信联合云科技有限责任公司 Book data processing method and system, book rapid goods laying method and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649818A (en) * 2016-12-29 2017-05-10 北京奇虎科技有限公司 Recognition method and device for application search intentions and application search method and server
CN106709040A (en) * 2016-12-29 2017-05-24 北京奇虎科技有限公司 Application search method and server
CN108334489A (en) * 2017-01-19 2018-07-27 百度在线网络技术(北京)有限公司 Text core word recognition method and device
CN112148852A (en) * 2020-09-10 2020-12-29 深圳康佳电子科技有限公司 Intelligent customer service method and device, storage medium and computer equipment
CN113076748A (en) * 2021-04-16 2021-07-06 平安国际智慧城市科技股份有限公司 Method, device and equipment for processing bullet screen sensitive words and storage medium

Also Published As

Publication number Publication date
CN114003750A (en) 2022-02-01

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
US8787683B1 (en) Image classification
CN112632292A (en) Method, device and equipment for extracting service keywords and storage medium
WO2023060795A1 (en) Automatic keyword extraction method and apparatus, and device and storage medium
CN109190092A (en) The consistency checking method of separate sources file
US20150331936A1 (en) Method and system for extracting a product and classifying text-based electronic documents
US8510312B1 (en) Automatic metadata identification
CN109446885B (en) Text-based component identification method, system, device and storage medium
WO2017091985A1 (en) Method and device for recognizing stop word
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN111563384A (en) Evaluation object identification method and device for E-commerce products and storage medium
CN107526721B (en) Ambiguity elimination method and device for comment vocabularies of e-commerce products
CN114003750B (en) Material online method, device, equipment and storage medium
CN114491034A (en) Text classification method and intelligent device
US11604923B2 (en) High volume message classification and distribution
CN112380848B (en) Text generation method, device, equipment and storage medium
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN116522872A (en) Similarity calculation-based metadata field Chinese name completion method, storage medium and system
CN114511027B (en) Method for extracting English remote data through big data network
US20220107919A1 (en) Computerized systems and methods of data compression
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN111291208B (en) Front-end page element naming method and device and electronic equipment
JP5206296B2 (en) Similar sentence extraction program, method and apparatus
CN117252514B (en) Building material library data processing method based on deep learning and model training
CN115146630B (en) Word segmentation method, device, equipment and storage medium based on professional domain knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant