CN115129899B - Document tag information generation method, apparatus, device, medium, and program product

Info

Publication number: CN115129899B
Application number: CN202211050377.9A
Authority: CN
Inventors: 李�杰; 马岩; 孙玉玲; 赵卫
Original assignee: State Grid Information and Telecommunication Co Ltd; Beijing Guodiantong Network Technology Co Ltd
Current assignee: State Grid Information and Telecommunication Co Ltd; Beijing Guodiantong Network Technology Co Ltd
Priority date: 2022-08-31
Filing date: 2022-08-31
Publication date: 2022-12-23
Anticipated expiration: 2042-08-31
Also published as: CN115129899A

Abstract

Embodiments of the present disclosure disclose a document tag information generation method, apparatus, device, medium, and program product. One embodiment of the method comprises: in response to receiving courseware document information, performing verification processing on the courseware document information to obtain verification state information; storing basic document information corresponding to the courseware document information into a preset courseware archive information table; generating target courseware document information; performing full-text index processing on the target courseware document information to obtain a target index information set; extracting each piece of target index information to obtain a first tag information group; splitting the target courseware document information to obtain a target character information sequence group; generating a second tag information group based on the target character information sequence group; and generating document tag information based on the first tag information group and the second tag information group, and updating the document tag information to the courseware archive information table. This embodiment can improve the accuracy of the document tag information.

Description

Document tag information generation method, apparatus, device, medium, and program product

Technical Field

Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method, an apparatus, a device, a medium, and a program product for generating document tag information.

Background

A document tag information generation method is a technique for generating tags for documents. At present, document tag information is generally generated in one of two ways: keywords are entered manually based on an understanding of the document content, or a search engine builds a full-text index of the whole document and high-frequency keywords are selected from it.

However, the inventor finds that when document tag information generation is performed in the above manner, there are often technical problems as follows:

First, high-frequency keywords are selected solely by full-text indexing of the whole document, so the influence of the document structure on keyword selection is easily ignored, and the accuracy of the document tag information is low;

Second, before a tag is generated for an uploaded document, verification against repeated uploading is easily ignored; if the same document is uploaded repeatedly, the same document content is parsed and stored repeatedly, so the system operating efficiency is low;

Third, if the system lexicon of a search engine is used for full-text indexing, the personalized word segmentation requirements of the service system cannot be met, so the accuracy of the document tag information is low.

The above information disclosed in this background section is only for enhancement of understanding of the background of the inventive concept and, therefore, it may contain information that does not form the prior art that is already known to a person of ordinary skill in the art in this country.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Some embodiments of the present disclosure propose document tag information generation methods, apparatuses, devices, media, and program products to solve one or more of the technical problems mentioned in the background section above.

In a first aspect, some embodiments of the present disclosure provide a document tag information generation method, the method including: in response to receiving courseware document information of a target user, performing verification processing on the courseware document information to obtain verification state information; in response to determining that the verification state information satisfies a first preset state condition, storing basic document information corresponding to the courseware document information into a preset courseware archive information table; generating target courseware document information based on the courseware document information; in response to determining that the target courseware document information satisfies second preset state information, performing full-text index processing on the target courseware document information based on preset courseware lexicon information to obtain a target index information set; extracting each piece of target index information in the target index information set to obtain a first tag information group; splitting the target courseware document information to obtain a target character information sequence group; generating a second tag information group based on the target character information sequence group; and generating document tag information based on the first tag information group and the second tag information group, and updating the document tag information to the courseware archive information table for querying.

In a second aspect, some embodiments of the present disclosure provide a document tag information generation apparatus, the apparatus including: a verification processing unit configured to, in response to receiving courseware document information of a target user, perform verification processing on the courseware document information to obtain verification state information; a storage unit configured to, in response to determining that the verification state information satisfies a first preset state condition, store basic document information corresponding to the courseware document information into a preset courseware archive information table; a first generating unit configured to generate target courseware document information based on the courseware document information; a full-text index processing unit configured to, in response to determining that the target courseware document information satisfies second preset state information, perform full-text index processing on the target courseware document information based on preset courseware lexicon information to obtain a target index information set; an extraction processing unit configured to extract each piece of target index information in the target index information set to obtain a first tag information group; a splitting processing unit configured to split the target courseware document information to obtain a target character information sequence group; a second generating unit configured to generate a second tag information group based on the target character information sequence group; and a generating and updating unit configured to generate document tag information based on the first tag information group and the second tag information group, and to update the document tag information to the courseware archive information table for querying.

In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method described in any of the implementations of the first aspect.

In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in any of the implementations of the first aspect.

In a fifth aspect, some embodiments of the present disclosure provide a computer program product comprising a computer program that, when executed by a processor, implements the method described in any of the implementations of the first aspect above.

The above embodiments of the present disclosure have the following advantages: the document tag information generation method of some embodiments of the present disclosure can improve the accuracy of document tag information. Specifically, the accuracy of document tag information is low because high-frequency keywords are selected solely by full-text indexing of the whole document, so the influence of the document structure on keyword selection is easily ignored. Based on this, in the document tag information generation method of some embodiments of the present disclosure, first, in response to receiving courseware document information of a target user, verification processing is performed on the courseware document information to obtain verification state information. In this way, only courseware documents that meet the document parsing requirements are uploaded, which facilitates subsequent parsing by the search engine. Second, in response to determining that the verification state information satisfies a first preset state condition, basic document information corresponding to the courseware document information is stored into a preset courseware archive information table. In this way, a courseware archive is established for each uploaded courseware document, which facilitates subsequent storage of the document tags. Next, target courseware document information is generated based on the courseware document information. In this way, the uploaded courseware document content is preprocessed into target courseware document information, which facilitates subsequently building a full-text index for the courseware document.
Then, in response to determining that the target courseware document information satisfies second preset state information, full-text index processing is performed on the target courseware document information based on preset courseware lexicon information to obtain a target index information set, and each piece of target index information in the target index information set is extracted to obtain a first tag information group. In this way, a full-text index is built for the courseware document, and high-frequency keywords can be selected as tags of the courseware document based on the full-text index results. Next, the target courseware document information is split to obtain a target character information sequence group, and a second tag information group is generated based on the target character information sequence group. In this way, the courseware document content is divided according to its document structure, and a full-text index operation is performed separately on each part obtained by the division, which makes it easier to index keywords that closely fit the document content and thus improves the accuracy of the document tag information. Finally, document tag information is generated based on the first tag information group and the second tag information group, and the document tag information is updated to the courseware archive information table for querying. In this way, the tag results produced by the two tag generation approaches are fused to select keywords that closely fit the document content, which improves the accuracy of the document tag information.
Therefore, the document tag information generation method of the present disclosure selects high-frequency keywords by full-text indexing of the whole document while also taking full account of the influence of the document structure on keyword selection, so that keywords that better fit the document content are selected as document tags. The accuracy of the document tag information can thus be improved.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and components are not necessarily drawn to scale.

FIG. 1 is a flow diagram of some embodiments of a document tag information generation method according to the present disclosure;

FIG. 2 is a schematic block diagram of some embodiments of a document tag information generation apparatus according to the present disclosure;

FIG. 3 is a schematic block diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be noted that, for convenience of description, only the portions related to the relevant invention are shown in the drawings. The embodiments and the features of the embodiments in the present disclosure may be combined with each other provided there is no conflict.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It should be noted that the modifiers "a", "an", and "the" in the present disclosure are illustrative rather than limiting, and those skilled in the art should understand them as meaning "one or more" unless the context clearly indicates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 illustrates a flow 100 of some embodiments of a document tag information generation method according to the present disclosure. The document tag information generation method comprises the following steps:

Step 101, in response to receiving courseware document information of a target user, performing verification processing on the courseware document information to obtain verification state information.

In some embodiments, an executing entity (e.g., a computing device) of the document tag information generating method may perform, in response to receiving courseware document information of a target user, verification processing on the courseware document information in various ways to obtain verification state information. Wherein, the target user may be a user who has document management requirements. The courseware document information described above may be used to characterize courseware documents. The courseware document may be a file storing course content for learning. The verification state information can be used for representing whether the courseware document is verified successfully or not. The verification status information may be document verification success information or document verification failure information. The document verification success information can be used for representing that the courseware document is verified successfully. The document verification failure information can be used for representing that the courseware document is verified to fail.

In some optional implementations of some embodiments, the courseware document information may include a document type, a type identifier, a document capacity value, and document content information. The execution subject may perform verification processing on the courseware document information to obtain the verification state information. The document type may be the file format corresponding to the extension of the courseware document, and may be represented by that extension. The type identifier may be a sequence of letters and digits arranged in order, and can be used to identify the file type. The type identifier and the file type may be in a one-to-one or one-to-many correspondence. The document capacity value may be the number of bytes occupied by the courseware document. The document content information may be used to characterize the specific content of the courseware document. Specifically, the following steps may be performed:

First, in response to determining that the document type included in the courseware document information satisfies a preset document type condition, matching processing is performed on the type identifier and the document type to obtain first verification information. The preset document type condition may be that the extension of the courseware document is in a preset extension set. Each preset extension in the preset extension set may be a predetermined extension. The first verification information may be used to characterize whether the type identifier matches the document type, and may be type matching success information or type matching failure information. The type matching success information may be used to characterize that the type identifier matches the document type. The type matching failure information may be used to characterize that the type identifier does not match the document type. The matching may proceed as follows. First, the document types corresponding to the type identifier are determined to obtain a target document type set. The target document type set may be a set of document types. Next, it is determined whether the document type is in the target document type set. Finally, if the document type is in the target document type set, the type matching success information is determined as the first verification information; otherwise, the type matching failure information is determined as the first verification information.

As an example, the preset extension set may include, but is not limited to, at least one of: doc (document), ppt (powerpoint, presentation), pdf (portable document format), and the like. The file format may be, but is not limited to, one of the following: word documents, slides, tables, and the like. The type identifier may be "D0CF11E0A1B11AE10" or "504B0304", where "D0CF11E0A1B11AE10" may be used to identify a document type with an extension of doc or ppt, and "504B0304" may be used to identify a document type with an extension of pdf.

Second, in response to determining that the first verification information satisfies a preset document type matching condition, the document capacity value is checked to obtain second verification information. The preset document type matching condition may be that the first verification information is the type matching success information. The second verification information may be used to characterize whether the document capacity value is greater than a preset capacity value. The preset capacity value may be the maximum number of bytes a courseware document may occupy. The second verification information may be document capacity compliance information or document capacity non-compliance information. The document capacity compliance information may be used to characterize that the document capacity value is not greater than the preset capacity value. The document capacity non-compliance information may be used to characterize that the document capacity value is greater than the preset capacity value. The check may proceed as follows. First, the difference between the document capacity value and the preset capacity value is determined as a document excess value. Next, it is determined whether the document excess value is greater than 0. Finally, if the document excess value is not greater than 0, the document capacity compliance information is determined as the second verification information; otherwise, the document capacity non-compliance information is determined as the second verification information.

As an example, the preset capacity value may be 50 megabytes.

Third, in response to determining that the second verification information satisfies a preset document capacity condition, duplicate checking is performed on the document content information to obtain the verification state information. The preset document capacity condition may be that the second verification information is the document capacity compliance information. The verification state information may be used to characterize whether the document content information has been uploaded repeatedly, and may be repeated upload information or non-repeated upload information. The repeated upload information may be used to characterize that the document content information has been uploaded repeatedly. The non-repeated upload information may be used to characterize that the document content information has not been uploaded repeatedly. The duplicate checking may be performed on the document content information in various ways.
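The type matching and capacity checks described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names `PRESET_EXTENSIONS`, `TYPE_ID_TO_DOC_TYPES`, `Check`, and `verify_type_and_capacity` are hypothetical, the identifier strings are copied from the example above and used only as opaque lookup keys, and treating 50 megabytes as 50 × 1024 × 1024 bytes is an assumption.

```python
from enum import Enum

# Hypothetical preset values. The identifier strings are taken from the
# example in the text and are treated purely as opaque lookup keys.
PRESET_EXTENSIONS = {"doc", "ppt", "pdf"}
TYPE_ID_TO_DOC_TYPES = {
    "D0CF11E0A1B11AE10": {"doc", "ppt"},  # one-to-many correspondence
    "504B0304": {"pdf"},                  # one-to-one correspondence
}
# Assumption: "50 megabytes" is read as 50 * 1024 * 1024 bytes.
PRESET_CAPACITY_BYTES = 50 * 1024 * 1024


class Check(Enum):
    TYPE_NOT_ALLOWED = "extension not in preset extension set"
    TYPE_MATCH_FAILED = "type matching failure"
    CAPACITY_NON_COMPLIANT = "document capacity non-compliance"
    PASSED = "type matched and capacity compliant"


def verify_type_and_capacity(doc_type: str, type_id: str, size_bytes: int) -> Check:
    """Run the first two verification steps in order, stopping at the first failure."""
    doc_type = doc_type.lower()
    # Preset document type condition: the extension is in the preset set.
    if doc_type not in PRESET_EXTENSIONS:
        return Check.TYPE_NOT_ALLOWED
    # Step 1: resolve the identifier to its target document type set.
    target_types = TYPE_ID_TO_DOC_TYPES.get(type_id.upper(), set())
    if doc_type not in target_types:
        return Check.TYPE_MATCH_FAILED
    # Step 2: document excess value = capacity value - preset capacity value.
    excess = size_bytes - PRESET_CAPACITY_BYTES
    if excess > 0:
        return Check.CAPACITY_NON_COMPLIANT
    return Check.PASSED
```

Only a result of `Check.PASSED` would allow the duplicate checking of the third step to run.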

In some optional implementations of some embodiments, in response to determining that the second verification information satisfies the preset document capacity condition, the execution subject may perform duplicate checking on the document content information to obtain the verification state information through the following steps:

First, the document content information is converted to obtain document byte array information. The document byte array information may be the specific content of the courseware document expressed as a byte array. The courseware document may be read into a byte array as a file stream to obtain the document byte array information.

Second, a target hash value is generated based on the document byte array information. The target hash value may be a 256-bit sequence composed of letters and digits. The target hash value may be generated from the document byte array information by a preset encryption algorithm.

As an example, the preset encryption algorithm may include, but is not limited to, a hash algorithm from the national cryptographic standards (Guomi).

Third, the target hash value, the document type, and the type identifier are determined as document summary information. The document summary information can be used to characterize the courseware document.

Fourth, the document summary information is matched against each piece of courseware archive information in a preset courseware archive information table to obtain the verification state information. The courseware archive information table may be a collection of courseware archive information. Each piece of courseware archive information in the courseware archive information table can be used to characterize a courseware document, and may include, but is not limited to, at least one of: a summary hash value, a document type, and a type identifier. The summary hash value may be a 256-bit sequence of letters and digits arranged in order. For each piece of courseware archive information in the courseware archive information table, the following sub-steps may be performed:

A first sub-step of determining whether the summary hash value included in the courseware archive information is the same as the target hash value included in the document summary information.

A second sub-step of determining whether the document type included in the courseware archive information is the same as the document type included in the document summary information.

A third sub-step of determining whether the type identifier included in the courseware archive information is the same as the type identifier included in the document summary information.

A fourth sub-step of, in response to determining that at least one of the summary hash value, the document type, and the type identifier included in the courseware archive information does not match the document summary information, determining the non-repeated upload information as second initial state information. Not matching the document summary information may be at least one of the following: the summary hash value included in the courseware archive information differs from the target hash value included in the document summary information; the document type included in the courseware archive information differs from the document type included in the document summary information; or the type identifier included in the courseware archive information differs from the type identifier included in the document summary information.
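The construction of the document summary information (the first three steps above) can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 from the Python standard library stands in for the preset hash algorithm because it also yields a 256-bit digest, whereas the text names a national cryptographic hash; `DocumentSummary`, `read_byte_array`, and `build_summary` are hypothetical names.

```python
import hashlib
import io
from typing import BinaryIO, NamedTuple


class DocumentSummary(NamedTuple):
    target_hash: str  # 256-bit digest, hex-encoded (64 characters)
    doc_type: str
    type_id: str


def read_byte_array(stream: BinaryIO, chunk_size: int = 64 * 1024) -> bytes:
    """Read the courseware document into a byte array as a file stream."""
    buf = bytearray()
    while chunk := stream.read(chunk_size):
        buf.extend(chunk)
    return bytes(buf)


def build_summary(stream: BinaryIO, doc_type: str, type_id: str) -> DocumentSummary:
    """Combine the target hash value, document type, and type identifier."""
    data = read_byte_array(stream)
    # Assumption: SHA-256 is a stand-in for the preset 256-bit hash algorithm.
    digest = hashlib.sha256(data).hexdigest()
    return DocumentSummary(digest, doc_type, type_id)


# Usage: an in-memory stream plays the role of the uploaded file.
example = build_summary(io.BytesIO(b"abc"), "doc", "D0CF11E0A1B11AE10")
```

Because the digest depends only on the bytes of the document, two uploads of identical content with the same type and identifier produce identical summaries, which is what the matching step relies on.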

The above duplicate checking steps for document content information and their related content constitute an inventive point of the embodiments of the present disclosure, and solve the second technical problem mentioned in the background: before tags are generated for an uploaded document, verification against repeated uploading is easily ignored, and if the same document is uploaded multiple times, the same document content is parsed and stored repeatedly, resulting in low system operating efficiency. Solving this problem improves the system operating efficiency. To achieve this effect, the present disclosure may generate document summary information corresponding to the courseware document. The document summary information may include a target hash value, a document type, and a type identifier. The document summary information is then matched against the courseware archive information of each successfully uploaded courseware document to determine whether the courseware document has been uploaded repeatedly. If the courseware document has been uploaded repeatedly, the system does not parse or store it a second time. Thus, the system operating efficiency can be improved.
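The matching of document summary information against the courseware archive information table can be sketched as follows. The text spells out only the non-matching case; treating a record whose three fields all match as a repeated upload is an assumption made here, and `ArchiveRecord`, `REPEATED`, `NON_REPEATED`, and `check_duplicate` are hypothetical names.

```python
from typing import Iterable, NamedTuple


class ArchiveRecord(NamedTuple):
    summary_hash: str
    doc_type: str
    type_id: str


REPEATED = "repeated upload information"
NON_REPEATED = "non-repeated upload information"


def check_duplicate(summary: ArchiveRecord, table: Iterable[ArchiveRecord]) -> str:
    """Compare the summary with every archive record, field by field."""
    for record in table:
        same_hash = record.summary_hash == summary.summary_hash  # sub-step 1
        same_type = record.doc_type == summary.doc_type          # sub-step 2
        same_id = record.type_id == summary.type_id              # sub-step 3
        # Assumption: a record matching on all three fields means the
        # document was already uploaded.
        if same_hash and same_type and same_id:
            return REPEATED
    # Every record differed in at least one field (sub-step 4).
    return NON_REPEATED
```

In a database-backed table the same check would more likely be a single indexed lookup on the three columns than a scan; the loop is kept here only to mirror the sub-steps.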

Step 102, in response to determining that the verification state information satisfies a first preset state condition, storing basic document information corresponding to the courseware document information into a preset courseware archive information table.

In some embodiments, the execution subject may, in response to determining that the verification state information satisfies the first preset state condition, store the basic document information corresponding to the courseware document information into the preset courseware archive information table. The first preset state condition may be that the verification state information is the non-repeated upload information. The basic document information corresponding to the courseware document information may be the document summary information.

Step 103, generating target courseware document information based on the courseware document information.

In some embodiments, the execution subject may generate the target courseware document information based on the courseware document information in various ways. The target courseware document information can be used to characterize the courseware document content, and may be the courseware document content encoded as a character string.

In some optional implementations of some embodiments, the execution subject may generate the target courseware document information based on the courseware document information through the following steps:

First, document processing information is acquired. The document processing information includes document queue information and a processing thread information group. The document queue information may be used to characterize a queue of courseware documents arranged in upload order. Each piece of processing thread information in the processing thread information group may be used to characterize a processing thread. The execution subject may acquire the document processing information through a wired or wireless connection. The document processing information may be obtained from a thread pool through an interface program.

Second, in response to determining that the document queue information satisfies a preset length condition, the document content information is added to a target document queue. The preset length condition may be that the length of the queue characterized by the document queue information is smaller than a preset threshold. The target document queue may be the queue characterized by the document queue information. The preset threshold may be any number greater than 1.

As an example, the preset threshold may be 200.

Third, the document content information is converted based on the document queue information and the processing thread information group to obtain target encoding information. The target encoding information may be the courseware document content encoded in Base64. The conversion may be performed through the following sub-steps:

A first sub-step of determining whether the document content information is at the head of the target document queue. This can be determined by monitoring the courseware document at the head of the target document queue.

A second sub-step of determining each idle processing thread among the processing threads characterized by the respective pieces of processing thread information.

A third sub-step of, if the document content information is at the head of the target document queue, calling any idle processing thread and converting the document content information with a Base64 encoding function to obtain the target encoding information.
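The queue admission and Base64 conversion described above can be sketched as follows. This is a simplified illustration under assumptions: a `deque` stands in for the target document queue, a `ThreadPoolExecutor` stands in for the processing thread information group (the executor itself picks an idle worker, so the idle-thread scan is implicit), and `try_enqueue` and `encode_head` are hypothetical names.

```python
import base64
from collections import deque
from concurrent.futures import ThreadPoolExecutor

PRESET_THRESHOLD = 200  # example threshold given in the text


def try_enqueue(queue: deque, content: bytes, threshold: int = PRESET_THRESHOLD) -> bool:
    """Add content only while the queue length is below the preset threshold."""
    if len(queue) < threshold:
        queue.append(content)  # documents stay in upload order
        return True
    return False


def encode_head(queue: deque, pool: ThreadPoolExecutor) -> str:
    """Take the document at the head of the queue and Base64-encode it on a pool thread."""
    content = queue.popleft()
    # The executor dispatches the task to whichever worker thread is idle.
    future = pool.submit(base64.b64encode, content)
    return future.result().decode("ascii")


# Usage: one document is admitted, then encoded once it reaches the head.
docs: deque = deque()
try_enqueue(docs, b"hello")
with ThreadPoolExecutor(max_workers=2) as workers:
    target_encoding = encode_head(docs, workers)
```

Bounding the queue keeps memory use predictable when many documents arrive at once; a rejected upload would have to be retried or reported by the caller, which the text does not specify.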

Fourth, the target encoding information is preprocessed to obtain the target courseware document information. The preprocessing may be performed by a preset search engine calling a predefined processor. The predefined processor may be a block of program code with a single function.

As an example, the preset search engine may be an Elasticsearch search engine. The predefined processor may be a processor for text extraction.
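As an example, the queueing and Base64 conversion described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `MAX_QUEUE_LENGTH`, `enqueue_document` and `encode_head` are hypothetical, the selection of an idle processing thread is delegated to a standard thread pool, and the subsequent preprocessing by the search engine is not shown.

```python
import base64
from collections import deque
from concurrent.futures import ThreadPoolExecutor

# Example threshold taken from the text ("the preset threshold may be 200").
MAX_QUEUE_LENGTH = 200


def enqueue_document(queue, content):
    """Add document content to the queue only if the length condition holds."""
    if len(queue) < MAX_QUEUE_LENGTH:
        queue.append(content)
        return True
    return False


def encode_head(queue, pool):
    """Base64-encode the document at the first position of the queue.

    Assumption: the thread pool stands in for "calling any idle processing
    thread"; the pool picks a free worker on its own.
    """
    if not queue:
        return None
    head = queue.popleft()
    future = pool.submit(base64.b64encode, head)
    return future.result().decode("ascii")
```

In this sketch a document is encoded only once it has reached the head of the queue, which mirrors the first-position check of the first substep.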

Step 104, in response to determining that the target courseware document information satisfies second preset state information, performing full-text index processing on the target courseware document information based on preset courseware lexicon information to obtain a target index information set.

In some embodiments, in response to determining that the target courseware document information satisfies second preset state information, the execution subject may perform full-text index processing on the target courseware document information based on preset courseware lexicon information in various ways to obtain a target index information set. The second preset state information may be that the target courseware document information is not null. The preset courseware lexicon information can be used to represent a predefined courseware lexicon. The courseware lexicon may be a collection of words associated with courseware and can be obtained by crawling professional hot words from a network. The target index information in the target index information set may be index keyword information corresponding to the same courseware document. The index keyword information may be used to characterize keywords used to query courseware documents. Each piece of target index information may include a keyword and a keyword position information group. The keyword position information in the keyword position information group can be used to characterize a position at which the keyword appears in the courseware document.
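As an example, the structure of one piece of target index information (a keyword together with its keyword position information group) can be sketched in Python as follows. The names `IndexEntry` and `build_index` are hypothetical, and a plain substring scan is used here in place of the search engine's full-text indexing, purely to illustrate the data shape.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """One piece of target index information: a keyword and its positions."""
    keyword: str
    positions: list = field(default_factory=list)


def build_index(text, lexicon):
    """Record, for every lexicon keyword, each offset where it occurs in text."""
    entries = []
    for keyword in lexicon:
        positions = []
        start = text.find(keyword)
        while start != -1:
            positions.append(start)
            start = text.find(keyword, start + 1)
        if positions:  # keywords that never occur produce no index entry
            entries.append(IndexEntry(keyword, positions))
    return entries
```

The length of `positions` is the number of times the keyword occurs, which is the quantity later used as the tag word frequency value.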

In some optional implementations of some embodiments, the preset courseware lexicon information includes an extended word segmentation information set and a stop word information set. The execution subject can perform full-text index processing on the target courseware document information based on the preset courseware lexicon information to obtain a target index information set. The extended word segmentation information in the extended word segmentation information set can be used to represent newly added keywords for indexing. The stop word information in the stop word information set may be used to characterize keywords that are no longer used for indexing. Specifically, the following steps can be performed:

Firstly, obtaining a system word segmentation information set. The system word segmentation information set may be extracted from the preset search engine. The system word segmentation information in the system word segmentation information set can be used to represent keywords used for establishing a full-text index for the courseware document.

And secondly, determining each piece of extended word segmentation information in the extended word segmentation information set and each piece of system word segmentation information in the system word segmentation information set as first segmentation information to obtain a first segmentation information set. The first segmentation information in the first segmentation information set can be used to represent keywords used for establishing a full-text index for the courseware document.

And thirdly, deleting, from the first segmentation information set, the first segmentation information that matches stop word information in the stop word information set to obtain a deleted first segmentation information set. The deleted first segmentation information set may be the first segmentation information set after removing the keyword information that is no longer used for indexing the courseware document. First segmentation information may be regarded as matching stop word information when a preset character string matching algorithm finds that the first segmentation information appears at least once in the stop word information.

As an example, the preset string matching algorithm may include, but is not limited to, at least one of the following: a naive string matching algorithm, a RK (Rabin-Karp, string matching) algorithm, a KMP (Knuth-Morris-Pratt, string lookup) algorithm, and the like.

And fourthly, performing full-text index processing on the target courseware document information based on the deleted first segmentation information set to obtain a target index information set. The deleted first segmentation information set may be used as the input of the preset search engine, which performs full-text index processing on the target courseware document information to obtain the target index information set.
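As an example, the construction of the deleted first segmentation information set can be sketched in Python as follows. The function name `build_segmentation_set` is hypothetical, and Python's built-in substring test is used in place of a specific preset character string matching algorithm; this is an illustrative sketch only.

```python
def build_segmentation_set(system_terms, extended_terms, stop_words):
    """Merge system and extended terms, then drop terms matching a stop word.

    A term is treated as matching when it appears at least once inside any
    stop word, following the matching rule described in the third step.
    """
    first_set = set(system_terms) | set(extended_terms)
    return {
        term
        for term in first_set
        if not any(term in stop_word for stop_word in stop_words)
    }
```

The resulting set is what would be handed to the search engine as its segmentation vocabulary in the fourth step.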

The step of generating the target index information set and its related content serve as an inventive point of the embodiments of the present disclosure and solve the technical problem mentioned in the background art: if only the system lexicon of a search engine is used for full-text indexing, the personalized word segmentation requirements of a service system cannot be met, so the accuracy of the document tag information is low. Solving this problem improves the accuracy of the document tag information. To achieve this effect, the full-text index result of the courseware document can be further constrained by combining predefined newly added keywords and stop keywords with the system lexicon of the search engine, so that the personalized word segmentation requirements of the service system are met. Thus, the accuracy of the document tag information can be improved.

Step 105, extracting each piece of target index information in the target index information set to obtain a first tag information group.

In some embodiments, the executing entity may extract each target index information in the target index information set to obtain the first tag information group through the following steps. The first tag information in the first tag information group may be used to represent keywords used for querying the courseware document. Specifically, the following steps may be performed:

firstly, determining keywords included in each target index information in the target index information set as an initial tag information set. The initial tag information in the initial tag information set may be used to represent a keyword.

The second step, for each initial label information in the initial label information set, executing the following steps:

the first substep, select the target index information matching with the above-mentioned initial label information from the above-mentioned target index information set. The matching with the initial tag information may be that the keywords included in the target index information are the same as the keywords represented by the initial tag information.

And a second substep of determining the length of the keyword position information group included in the selected target index information as a tag word frequency value. The length of the keyword position information group may be the number of the keyword position information in the keyword position information group. The tag word frequency values may be used to characterize the number of times a keyword appears in a courseware document.

As an example, the tag word frequency value may be 8.

And a third substep of determining the initial tag information and the tag word frequency value as target tag information. The target tag information may be information of a keyword and the number of times the keyword appears in the courseware document.

And thirdly, sorting the determined target tag information to obtain a target tag information set. The target tag information set may be a sequence of target tag information arranged in descending order of the number of occurrences of the keyword and the length value of the keyword. The target tag information in the target tag information set may be information of a keyword and the number of times the keyword appears in the courseware document. The determined target tag information may be sorted, through a preset sorting algorithm, according to the tag word frequency value included in each piece of target tag information to obtain the target tag information set.

As an example, the preset sorting algorithm may include, but is not limited to, at least one of the following: bubble sort algorithm, quick sort algorithm, selection sort algorithm, and the like.

And fourthly, selecting target tag information meeting a preset word frequency condition from the target tag information set to obtain a target tag information group. The preset word frequency condition may be that the target tag information is located within the first preset-value positions of the target tag information set. The preset value may be any number greater than 0. The target tag information group may be information of each keyword corresponding to the courseware document and the number of times that each keyword appears in the courseware document.

As an example, the preset value may be 5.

And fifthly, determining the initial tag information included by each piece of target tag information in the target tag information group as first tag information to obtain a first tag information group.
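As an example, the five steps above can be sketched in Python as follows. The function name `first_tag_group` and the dictionary representation of the target index information set (keyword mapped to its list of positions) are assumptions made for illustration; the default of 5 follows the example preset value in the text.

```python
def first_tag_group(index, preset_value=5):
    """Select the most frequent keywords from a keyword -> positions mapping."""
    # Tag word frequency value = length of the keyword position group.
    scored = [(keyword, len(positions)) for keyword, positions in index.items()]
    # Descending order by occurrence count, then by keyword length.
    scored.sort(key=lambda item: (item[1], len(item[0])), reverse=True)
    # Keep only the keywords within the first preset_value positions.
    return [keyword for keyword, _ in scored[:preset_value]]
```

Python's built-in sort is used here in place of the listed sorting algorithms; any of them would yield the same ordering for distinct sort keys.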

Step 106, splitting the target courseware document information to obtain a target character information sequence group.

In some embodiments, the execution subject may split the target courseware document information in various ways to obtain a target character information sequence group. The target character information sequences in the target character information sequence group can be used for representing part of contents of the courseware document.

In some optional implementations of some embodiments, the target courseware document information may include a sequence of character information. The execution subject can split the target courseware document information to obtain a target character information sequence group through the following steps. The character information sequence may be a character string obtained by orderly arranging each character of the corresponding courseware document content. Specifically, the following steps may be performed:

firstly, determining the length value of the target courseware document information as a document length value. The document length value may be the number of characters in a character string corresponding to the target courseware document information.

And secondly, determining the quotient obtained by dividing the document length value by a preset number of division segments as a target length value. The preset number of division segments may be a number set in advance. The target length value may be expressed as a number of characters.

And thirdly, splitting the character information sequence based on the target length value to obtain a target character information sequence group. Specifically, the following steps can be performed:

The first substep, taking the character information sequence as an initial character information sequence. The initial character information sequence can be used to represent part of the content of the courseware document. The initial character information sequence may be a character sequence in Unicode encoding form.

The second substep, based on the initial character information sequence, performing the following interception steps:

A first sub-step of determining the character at the position given by the target length value in the initial character information sequence as a tail character. The tail character may be a Unicode-encoded character.

A second sub-step of determining, according to the regular-expression check method provided for Unicode encoding, whether the tail character represents the first byte of a Chinese code.

A third sub-step of, in response to determining that the tail character represents the first byte of a Chinese code, intercepting the first character through the (target length value + 1)-th character of the initial character information sequence as a target character information sequence.

And a third substep, in response to determining that the initial character information sequence satisfies a preset remaining character condition, taking the part of the initial character information sequence that has not been intercepted as the initial character information sequence, and performing the interception steps again. The preset remaining character condition may be that the number of characters remaining after the interception is greater than 0.
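As an example, the splitting loop can be sketched in Python as follows. Several points are assumptions and not taken from the text: the names `CJK_PATTERN` and `split_sequence` are hypothetical; the check for "the first byte of a Chinese code" is approximated by testing whether the tail character is a CJK ideograph; and when the tail character does not pass the check, the sketch simply intercepts the first target-length characters, a case the description does not specify.

```python
import re

# Assumption: a CJK Unified Ideograph test stands in for the Unicode
# regular-expression check described in the second sub-step.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def split_sequence(text, segment_count):
    """Split text into chunks of roughly len(text) // segment_count characters."""
    if segment_count <= 0:
        raise ValueError("segment_count must be positive")
    target_length = max(1, len(text) // segment_count)
    chunks = []
    remaining = text
    while remaining:  # remaining character condition: more than 0 characters left
        tail = remaining[:target_length][-1]
        # Extend the cut by one character when the tail passes the check.
        cut = target_length + 1 if CJK_PATTERN.match(tail) else target_length
        chunks.append(remaining[:cut])
        remaining = remaining[cut:]
    return chunks
```

Because every iteration consumes a prefix of the remaining text, concatenating the chunks always reproduces the original character information sequence.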

Step 107, generating a second tag information group based on the target character information sequence group.

In some embodiments, the execution main body may generate the second tag information group based on the target character information sequence group in various ways. The second tag information in the second tag information group may be used to represent keywords used for querying the courseware document.

In some optional implementations of some embodiments, the executing body may generate a second tag information group based on the target character information sequence group. Specifically, the following steps can be performed:

the first step, for each target character information sequence in the target character information sequence group, executing the following steps:

A first substep of performing full-text index processing on the target character information sequence based on the preset courseware lexicon information to obtain a courseware index information set. The courseware index information in the courseware index information set can be index keyword information corresponding to the same part of the content of the courseware document. Each piece of courseware index information may include a keyword and a keyword position information group. The preset search engine may perform the full-text index processing on the target character information sequence based on the preset courseware lexicon information to obtain the courseware index information set.

And a second substep of extracting and processing each courseware index information in the courseware index information set to generate second tag information. The second tag information may be used to represent keywords used for querying the courseware document. The following steps may be specifically performed:

and the first sub-step, determining the key words included in each courseware index information in the courseware index information set as a segmented tag information set. The segment tag information in the segment tag information set can be used for representing keywords.

And a second sub-step of, for each piece of segment tag information in the segment tag information set, determining the length of the keyword position information group included in the courseware index information corresponding to the segment tag information as a segment tag word frequency value, and determining the segment tag information and the segment tag word frequency value as target segment tag information. The segment tag word frequency value can be used to represent the number of times the keyword appears in the courseware document. The target segment tag information may be information of a keyword and the number of times the keyword appears in the courseware document.

And a third sub-step, sorting the determined target segment tag information to obtain a target segment tag information set. The target segment tag information set may be a sequence of target segment tag information arranged in descending order of the number of occurrences of the keyword and the length value of the keyword. The determined target segment tag information may be sorted, through the preset sorting algorithm, according to the segment tag word frequency value included in each piece of target segment tag information to obtain the target segment tag information set.

And a fourth sub-step of selecting target segment tag information at a first position from the target segment tag information set.

And a fifth sub-step of determining a keyword included in the selected target segment tag information as second tag information.

And secondly, performing deduplication processing on the generated second tag information to obtain a second tag information group. The deduplication processing may be performed through a preset deduplication method.

As an example, the preset deduplication method may include, but is not limited to, at least one of the following: List deduplication, HashSet deduplication, loop-traversal deduplication, and so on.
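As an example, the generation of the second tag information group can be sketched in Python as follows. The function name `second_tag_group` is hypothetical, each segment's courseware index information set is assumed to be a mapping from keyword to position list, and segments with no indexed keywords are skipped, which is an assumption the description does not address.

```python
def second_tag_group(segment_indexes):
    """Take the top keyword of every segment, then deduplicate in order."""
    tags = []
    for index in segment_indexes:
        if not index:  # assumption: a segment without keywords yields no tag
            continue
        # First position after sorting by occurrence count, then keyword length.
        top_keyword, _ = max(
            index.items(), key=lambda item: (len(item[1]), len(item[0]))
        )
        tags.append(top_keyword)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(tags))
```

Order-preserving deduplication is a design choice of this sketch; a set-based method would serve equally well if tag order does not matter.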

Step 108, generating document tag information based on the first tag information group and the second tag information group, and updating the document tag information to the courseware archive information table for querying.

In some embodiments, the execution subject may generate document tag information based on the first tag information group and the second tag information group in various ways, and update the document tag information to the courseware archive information table for querying. The document tag information may be information indicating that the courseware document has been parsed.

In some optional implementations of some embodiments, the executing body may generate the document tag information based on the first tag information group and the second tag information group by the following steps. Specifically, the following steps can be performed:

Firstly, determining each piece of first tag information in the first tag information group and each piece of second tag information in the second tag information group as initial courseware tag information to obtain an initial courseware tag information set. The initial courseware tag information in the initial courseware tag information set can be used to represent keywords used for querying the courseware document.

And secondly, performing duplicate removal processing on the initial courseware tag information set to obtain the duplicate-removed initial courseware tag information set. The de-duplicated initial courseware tag information in the de-duplicated initial courseware tag information set can represent different keywords corresponding to the same courseware document. The initial courseware tag information set can be deduplicated by the preset deduplication method to obtain the deduplicated initial courseware tag information set.

And thirdly, extracting each piece of deduplicated initial courseware tag information in the deduplicated initial courseware tag information set to obtain a target courseware tag information group. The target courseware tag information in the target courseware tag information group can be different keywords corresponding to the same courseware document. Specifically, the following steps may be performed:

In a first substep, for each piece of deduplicated initial courseware tag information in the deduplicated initial courseware tag information set, the following steps may be performed:

A first sub-step of determining the length value of the keyword represented by the deduplicated initial courseware tag information as a target tag length. The target tag length can be the number of characters in the keyword.

And a second sub-step of determining the deduplicated initial courseware tag information and the target tag length as tag information to be sorted. The tag information to be sorted can be information of a keyword and the number of characters of the keyword.

And a second substep, sorting the determined tag information to be sorted to obtain a target courseware tag information sequence. The target courseware tag information sequence can be a sequence in which the tag information to be sorted is arranged in descending order of target tag length. The determined tag information to be sorted may be sorted through the preset sorting algorithm to obtain the target courseware tag information sequence.

And a third substep, selecting target courseware tag information with a preset number from the target courseware tag information sequence to obtain a target courseware tag information group. The target courseware tag information in the target courseware tag information group can be keywords corresponding to the same courseware document.

And fourthly, in response to determining that the target courseware tag information group satisfies a preset tag condition, generating tag parsing success information. The preset tag condition may be that the target courseware tag information group is not empty. The tag parsing success information can be used to represent that the tags of the courseware document have been parsed successfully.

As an example, the tag parsing success information may be "parsing success".

And fifthly, determining the target courseware tag information group and the tag parsing success information as document tag information.
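As an example, the merging of the two tag information groups can be sketched in Python as follows. The function name `generate_document_tags`, the dictionary layout of the result, and the default preset number of 5 are assumptions for illustration. Returning `None` when no tags remain is likewise an assumption, since the description only covers the case in which the target courseware tag information group is not empty.

```python
def generate_document_tags(first_tags, second_tags, preset_number=5):
    """Merge, deduplicate, sort by keyword length, and keep the longest tags."""
    # Merge both groups and remove duplicates while keeping first-seen order.
    merged = list(dict.fromkeys([*first_tags, *second_tags]))
    # Descending order by target tag length (number of characters).
    merged.sort(key=len, reverse=True)
    selected = merged[:preset_number]
    if not selected:  # preset tag condition not met (assumed handling)
        return None
    return {"tags": selected, "status": "parsing success"}
```

The returned mapping corresponds to the document tag information, which would then be written to the courseware archive information table.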

The above embodiments of the present disclosure have the following advantages: by the document tag information generation method of some embodiments of the present disclosure, the accuracy of the document tag information can be improved. Specifically, the reasons for the low accuracy of the document tag information are: the high-frequency keywords are selected only by performing full-text indexing on the full text of the document, so that the influence of the document structure on the selection of the key keywords is often easily ignored, and the accuracy of the document label information is low. Based on this, in the document tag information generating method according to some embodiments of the present disclosure, first, in response to receiving courseware document information of a target user, the courseware document information is checked to obtain check state information. Therefore, courseware documents meeting the document analysis requirements can be uploaded, and the documents can be conveniently analyzed by a subsequent search engine. And secondly, in response to the fact that the checking state information meets a first preset state condition, storing basic document information corresponding to the courseware document information into a preset courseware archive information table. Therefore, courseware archives can be built for the uploaded courseware documents, and the document tags can be conveniently stored in the follow-up process. And then, generating target courseware document information based on the courseware document information. Therefore, target courseware document information can be obtained by preprocessing uploaded courseware document contents, and accordingly, full-text indexes can be conveniently established for courseware documents subsequently. 
And then, in response to the fact that the target courseware document information meets second preset state information, full-text index processing is carried out on the target courseware document information on the basis of preset courseware word bank information, and a target index information set is obtained. And extracting each target index information in the target index information set to obtain a first tag information group. Therefore, a full-text index can be established for the courseware documents, and high-frequency keywords can be selected as tags of the courseware documents on the basis of full-text index results. And then, splitting the target courseware document information to obtain a target character information sequence group. And generating a second label information group based on the target character information sequence group. Therefore, the document structure of the courseware document contents can be divided, and full-text index operation can be respectively carried out on each part of courseware document contents obtained through division, so that key keywords which are relatively attached to the document contents can be more conveniently indexed, and the accuracy of the document label information can be improved. And finally, generating document tag information based on the first tag information group and the second tag information group, and updating the document tag information to the courseware archive information table for inquiry. Therefore, different label results generated by the two label generation methods can be fused so as to select key keywords which are relatively fit with the document content, and therefore the accuracy of the document label information can be improved. 
Therefore, according to the document tag information generation method of some embodiments of the present disclosure, full-text indexing can be performed on the full text of a document to select high-frequency keywords while the influence of the document structure on keyword selection is fully considered, so that keywords that closely fit the document content are selected as document tags. Thus, the accuracy of the document tag information can be improved.

With further reference to fig. 2, as an implementation of the method illustrated in the above figures, the present disclosure provides some embodiments of a document tag information generation apparatus. These apparatus embodiments correspond to the method embodiments illustrated in fig. 1, and the apparatus may be applied in various electronic devices.

As shown in fig. 2, the document tag information generating apparatus 200 of some embodiments includes: a verification processing unit 201, a storage unit 202, a first generation unit 203, a full-text index processing unit 204, an extraction processing unit 205, a splitting processing unit 206, a second generation unit 207, and a generation and update unit 208. The verification processing unit 201 is configured to, in response to receiving courseware document information of a target user, perform verification processing on the courseware document information to obtain verification state information; the storage unit 202 is configured to store the basic document information corresponding to the courseware document information into a preset courseware archive information table in response to determining that the verification state information satisfies a first preset state condition; the first generation unit 203 is configured to generate target courseware document information based on the courseware document information; the full-text index processing unit 204 is configured to, in response to determining that the target courseware document information satisfies second preset state information, perform full-text index processing on the target courseware document information based on preset courseware lexicon information to obtain a target index information set; the extraction processing unit 205 is configured to perform extraction processing on each piece of target index information in the target index information set to obtain a first tag information group; the splitting processing unit 206 is configured to split the target courseware document information to obtain a target character information sequence group; the second generation unit 207 is configured to generate a second tag information group based on the target character information sequence group; and the generation and update unit 208 is configured to generate document tag information based on the first tag information group and the second tag information group, and update the document tag information to the courseware archive information table for querying.

It will be understood that the units described in the apparatus 200 correspond to the various steps in the method described with reference to fig. 1. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 200 and the units included therein, and are not described herein again.

With further reference to fig. 3, a schematic structural diagram of an electronic device 300 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 3, electronic device 300 may include a processing device (e.g., central processing unit, graphics processor, etc.) 301 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage device 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.

Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication device 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 3 illustrates an electronic device 300 having various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 3 may represent one device or may represent multiple devices, as desired.

In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through the communication device 309, or installed from the storage device 308, or installed from the ROM 302. The computer program, when executed by the processing apparatus 301, performs the above-described functions defined in the methods of some embodiments of the present disclosure.

It should be noted that the computer readable medium described above in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.

The computer readable medium may be included in the electronic device described above, or may exist separately without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: in response to receiving courseware document information of a target user, perform verification processing on the courseware document information to obtain verification state information; in response to determining that the verification state information meets a first preset state condition, store basic document information corresponding to the courseware document information into a preset courseware archive information table; generate target courseware document information based on the courseware document information; in response to determining that the target courseware document information meets second preset state information, perform full-text index processing on the target courseware document information based on preset courseware word bank information to obtain a target index information set; extract each piece of target index information in the target index information set to obtain a first tag information group; split the target courseware document information to obtain a target character information sequence group; generate a second tag information group based on the target character information sequence group; and generate document tag information based on the first tag information group and the second tag information group, and update the document tag information to the courseware archive information table for query.
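As a non-limiting illustration, the final step listed above (combining the first tag information group and the second tag information group into document tag information) could be sketched as follows. This is only a minimal sketch under stated assumptions and is not the disclosed implementation: the function name `merge_tag_groups`, the returned dictionary layout, and the rule that a non-empty result satisfies the preset tag condition are all hypothetical.

```python
from typing import Dict, Iterable, List, Union


def merge_tag_groups(
    first_tags: Iterable[str], second_tags: Iterable[str]
) -> Dict[str, Union[List[str], bool]]:
    """Combine two tag groups into one deduplicated tag list.

    Hypothetical helper: the names and the success rule are assumptions
    made for illustration only.
    """
    # dict.fromkeys keeps the first occurrence of each tag and preserves
    # insertion order, so deduplication is stable.
    merged = list(dict.fromkeys([*first_tags, *second_tags]))
    # Assumption: the "preset tag condition" is "at least one tag remains".
    return {"tags": merged, "analysis_succeeded": len(merged) > 0}
```

In this sketch, tags found by indexing the whole document come first, followed by any additional tags found only in individual segments.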

Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in some embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described, for example, as: a processor comprising a verification processing unit, a storage unit, a first generation unit, a full-text index processing unit, an extraction processing unit, a splitting processing unit, a second generation unit, and a generation and updating unit. In some cases, the names of the units do not limit the units themselves; for example, the verification processing unit may also be described as "a unit that, in response to receiving courseware document information of a target user, performs verification processing on the courseware document information to obtain verification state information".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

Some embodiments of the present disclosure also provide a computer program product comprising a computer program which, when executed by a processor, implements any of the document tag information generation methods described above.

The foregoing description is only a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above-mentioned features, but also encompasses other technical solutions formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept defined above. For example, the scope also covers technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims

1. A document tag information generation method, comprising:

in response to receiving courseware document information of a target user, carrying out verification processing on the courseware document information to obtain verification state information;

in response to determining that the verification state information meets a first preset state condition, storing basic document information corresponding to the courseware document information into a preset courseware archive information table;

generating target courseware document information based on the courseware document information;

in response to determining that the target courseware document information meets second preset state information, performing full-text index processing on the target courseware document information based on preset courseware word bank information to obtain a target index information set, wherein the preset courseware word bank information is information of a predefined courseware word bank, and the courseware word bank is obtained by crawling courseware-related hot words from the network;

extracting each target index information in the target index information set to obtain a first tag information group;

splitting the target courseware document information to obtain a target character information sequence group;

generating a second tag information group based on the target character information sequence group;

generating document tag information based on the first tag information group and the second tag information group, and updating the document tag information to the courseware archive information table for query;

the courseware document information comprises document types, type identifiers, document capacity values and document content information; and

the performing of verification processing on the courseware document information to obtain verification state information comprises:

in response to determining that the document type included in the courseware document information meets a preset document type condition, matching the type identifier against the document type to obtain first verification information;

in response to determining that the first verification information meets a preset document type matching condition, performing verification processing on the document capacity value to obtain second verification information;

in response to determining that the second verification information meets a preset document capacity condition, performing duplicate checking processing on the document content information to obtain the verification state information;

wherein the target courseware document information comprises a sequence of character information; and

the splitting of the target courseware document information to obtain a target character information sequence group comprises:

determining the length value of the target courseware document information as a document length value;

determining a quotient obtained by dividing the document length value by the number of preset division segments as a target length value;

splitting the character information sequence based on the target length value to obtain a target character information sequence group;

wherein generating a second tag information group based on the target character information sequence group comprises:

for each target character information sequence in the target character information sequence group, executing the following steps:

performing full-text index processing on the target character information sequence based on the preset courseware word bank information to obtain a courseware index information set;

extracting each piece of courseware index information in the courseware index information set to generate second tag information;

performing deduplication processing on the generated pieces of second tag information to obtain the second tag information group;

wherein generating document tag information based on the first tag information group and the second tag information group comprises:

determining each first tag information in the first tag information group and each second tag information in the second tag information group as initial courseware tag information to obtain an initial courseware tag information set;

performing deduplication processing on the initial courseware tag information set to obtain a deduplicated initial courseware tag information set;

extracting each piece of deduplicated initial courseware tag information in the deduplicated initial courseware tag information set to obtain a target courseware tag information group;

generating tag analysis success information in response to determining that the target courseware tag information group meets a preset tag condition;

determining the target courseware tag information group and the tag analysis success information as the document tag information;

wherein splitting the character information sequence based on the target length value to obtain a target character information sequence group comprises:

taking the character information sequence as an initial character information sequence;

based on the initial character information sequence, the following intercepting steps are executed:

determining the character located at the position given by the target length value in the initial character information sequence as a tail character;

determining whether the tail character is the first byte of a Chinese character encoding;

in response to determining that the tail character is the first byte of a Chinese character encoding, intercepting the initial character information sequence from its first character through the character at position (target length value + 1) as a target character information sequence;

and in response to the fact that the initial character information sequence meets the preset residual character condition, taking the part which is not intercepted in the initial character information sequence as the initial character information sequence, and executing the intercepting step again.
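As a non-limiting illustration, the splitting and intercepting steps recited above could be sketched as follows. This is a minimal sketch under stated assumptions rather than the claimed implementation: it assumes the character information sequence is a GBK-encoded byte string in which a Chinese character occupies two bytes with a lead byte in the range 0x81 to 0xFE, and the names `split_document`, `tail_is_lead_byte`, and `segment_count` are hypothetical. Because a GBK trail byte can fall in the same range as a lead byte, the sketch scans each segment from its start (a known character boundary) instead of inspecting the tail byte alone.

```python
from typing import List


def tail_is_lead_byte(chunk: bytes) -> bool:
    """Return True if the last byte of chunk opens a two-byte character.

    Assumes GBK: lead bytes lie in 0x81..0xFE and chunk starts on a
    character boundary.
    """
    i = 0
    while i < len(chunk):
        if 0x81 <= chunk[i] <= 0xFE:
            if i == len(chunk) - 1:
                return True  # lead byte with its trail byte cut off
            i += 2  # skip the complete two-byte character
        else:
            i += 1  # single-byte (ASCII) character
    return False


def split_document(data: bytes, segment_count: int) -> List[bytes]:
    """Split data into pieces of roughly len(data) // segment_count bytes."""
    if segment_count <= 0:
        raise ValueError("segment_count must be positive")
    # Target length value: quotient of document length and segment count.
    target_len = max(1, len(data) // segment_count)
    pieces: List[bytes] = []
    rest = data
    while rest:  # assumed "remaining character condition": bytes are left
        cut = min(target_len, len(rest))
        # Extend by one byte when the cut would split a Chinese character.
        if cut < len(rest) and tail_is_lead_byte(rest[:cut]):
            cut += 1
        pieces.append(rest[:cut])
        rest = rest[cut:]
    return pieces
```

Because the quotient is rounded down and some pieces are extended by one byte, the number of pieces produced by this sketch may differ from `segment_count`.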

2. The method of claim 1, wherein generating target courseware document information based on the courseware document information comprises:

acquiring document processing information, wherein the document processing information comprises document queue information and a processing thread information group;

in response to determining that the document queue information meets a preset length condition, adding the document content information to a target document queue;

based on the document queue information and the processing thread information group, converting the document content information to obtain target coding information;

and preprocessing the target coding information to obtain target courseware document information.

3. The method according to claim 1 or 2, wherein the performing of duplicate checking processing on the document content information to obtain verification state information comprises:

converting the document content information to obtain document byte array information;

generating a target hash value based on the document byte array information;

determining the target hash value, the document type and the type identifier as document summary information;

and matching the document summary information with each courseware archive information in a preset courseware archive information table to obtain verification state information.
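As a non-limiting illustration, the duplicate checking steps of claim 3 could be sketched as follows. This is a minimal sketch under stated assumptions rather than the claimed implementation: the choice of UTF-8 for the byte array, SHA-256 for the target hash value, the status strings "duplicate" and "unique", and the names `DocumentSummary`, `summarize`, and `check_duplicate` are all hypothetical.

```python
import hashlib
from typing import Iterable, NamedTuple


class DocumentSummary(NamedTuple):
    """Hypothetical document summary: hash, document type, type identifier."""

    content_hash: str
    document_type: str
    type_identifier: str


def summarize(content: str, document_type: str, type_identifier: str) -> DocumentSummary:
    # Convert the document content to a byte array (UTF-8 is an assumption).
    data = content.encode("utf-8")
    # Generate the target hash value (SHA-256 is an assumption).
    digest = hashlib.sha256(data).hexdigest()
    return DocumentSummary(digest, document_type, type_identifier)


def check_duplicate(summary: DocumentSummary, archive: Iterable[DocumentSummary]) -> str:
    """Match the summary against archived summaries and return a status."""
    return "duplicate" if summary in set(archive) else "unique"
```

In this sketch, two documents count as duplicates only when the hash, the document type, and the type identifier all match.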

4. A document tag information generating apparatus comprising:

a verification processing unit configured to, in response to receiving courseware document information of a target user, perform verification processing on the courseware document information to obtain verification state information;

a storage unit configured to, in response to determining that the verification state information meets a first preset state condition, store basic document information corresponding to the courseware document information into a preset courseware archive information table;

a first generation unit configured to generate target courseware document information based on the courseware document information;

a full-text index processing unit configured to, in response to determining that the target courseware document information meets second preset state information, perform full-text index processing on the target courseware document information based on preset courseware word bank information to obtain a target index information set, wherein the preset courseware word bank information is information of a predefined courseware word bank, and the courseware word bank is obtained by crawling courseware-related hot words from a network;

an extraction processing unit configured to extract each piece of target index information in the target index information set to obtain a first tag information group;

a splitting processing unit configured to split the target courseware document information to obtain a target character information sequence group;

a second generating unit configured to generate a second tag information group based on the target character information sequence group;

a generating and updating unit configured to generate document tag information based on the first tag information group and the second tag information group, and update the document tag information to the courseware archive information table for query;

the courseware document information comprises a document type, a type identifier, a document capacity value and document content information; and

in response to determining that the second verification information meets the preset document capacity condition, performing duplicate checking processing on the document content information to obtain verification state information;

generating a second tag information group based on the target character information sequence group, wherein the generating includes:

performing deduplication processing on the generated pieces of second tag information to obtain the second tag information group;

splitting the character information sequence based on the target length value to obtain a target character information sequence group, including:

determining whether the tail character is a first byte of a Chinese code;

5. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-3.

6. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-3.