CN116340263B

CN116340263B - Word document conversion method and device based on machine identification and storage medium

Info

Publication number: CN116340263B
Application number: CN202310639865.1A
Authority: CN
Inventors: 陈德勇; 李元海
Original assignee: Beijing Wuyou Chuangxiang Information Technology Co ltd
Current assignee: Beijing Wuyou Chuangxiang Information Technology Co ltd
Priority date: 2023-06-01
Filing date: 2023-06-01
Publication date: 2023-08-29
Anticipated expiration: 2043-06-01
Also published as: CN116340263A

Abstract

The invention discloses a word document conversion method, device and storage medium based on machine identification. When a document is converted, the format-converted document is subjected to style correction processing, so that the text style of the original word document is preserved; meanwhile, machine recognition technology is used to recognize the code type of each code text paragraph after the style correction, so that programming language identification can be performed based on those code types. The invention can therefore preserve the text style of the original document and accurately identify code block text and its programming language during conversion, so that a user does not need to repeatedly copy and paste, rewrite code or reset text styles; technical articles can be released quickly, and the invention is suitable for wide application in the field of document conversion.

Description

Word document conversion method and device based on machine identification and storage medium

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a word document conversion method and device based on machine identification and a storage medium.

Background

In the twenty-first century, with the rapid development of the internet, applications such as technical forums, blogs and communities have become popular; these applications provide platforms for people to communicate and promote mutual learning among different people. Meanwhile, various distinctive editors have appeared to let people publish text quickly in these applications; however, existing editors support the quick import of local word documents poorly or not at all, which has the following disadvantages:

existing editors on the market can only convert basic formats such as paragraphs, tables and pictures, and text styles are largely lost during format conversion, so that only the integrity of the content, not the integrity of the text styles, can be ensured; meanwhile, when code blocks exist in the word document, they cannot be identified on import and become ordinary text. Based on this, how to provide a conversion method that converts an existing word document into a release document quickly and with high fidelity has become a problem to be solved.

Disclosure of Invention

The invention aims to provide a word document conversion method, device and storage medium based on machine identification, which solve the problems that the prior art cannot ensure the integrity of text styles and cannot identify code blocks when converting word documents.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

in a first aspect, a word document conversion method based on machine identification is provided, including:

acquiring a target word document, and converting the target word document into an html document;

carrying out style correction processing on the html document to obtain an html document with corrected style;

performing paragraph division processing on character strings in the html document after the style correction to obtain a pre-conversion document;

screening out code text paragraphs from each text paragraph in the pre-converted document, and inputting the code text paragraphs into a code recognition model for code type recognition processing to obtain code types corresponding to the code text paragraphs;

and carrying out programming language identification processing on the code text paragraphs in the pre-conversion document based on the code types corresponding to the code text paragraphs so as to obtain the html conversion document corresponding to the target word document after the programming language identification processing is completed.

Based on the above disclosure, when a document is converted, the invention first performs format conversion, that is, converts the target word document into an html document; this allows the converted document to be recognized by existing editors and reduces the time a user spends re-editing before release. The invention then performs style correction processing on the html document, so that the styles in the converted document are the same as the text styles in the original document. After the style correction is completed, code blocks can be identified, so that the converted document retains the code blocks of the original document. In specific implementation, the method first performs paragraph division processing on the character strings in the style-corrected html document to obtain a plurality of text paragraphs, and then identifies the text paragraphs that belong to code blocks and inputs them into a code recognition model for code type recognition, obtaining the code type of each code text paragraph. Finally, the recognized code types are used to perform programming language identification processing on each code text paragraph, which completes the conversion of the target word document and yields the release document (namely, the html conversion document).

Through the above design, when a document is converted, the format-converted document is subjected to style correction processing, so that the text style of the original word document is preserved; meanwhile, machine recognition technology is used to recognize the code type of each code text paragraph after the style correction, so that programming language identification can be performed based on those code types. The invention can therefore preserve the text style of the original document and accurately identify code block text and its programming language during conversion, so that a user does not need to repeatedly copy and paste, rewrite code or reset text styles; technical articles can be released quickly, and the invention is suitable for wide application in the field of document conversion.

In one possible design, performing style correction processing on the html document to obtain a style-corrected html document, including:

performing tag filtering processing on the html document to filter useless tags in the html document, so as to obtain a preprocessed html document;

performing label replacement processing on each label in the preprocessed html document to obtain a label replacement document after the label replacement processing, wherein the names and the attributes of each label in the label replacement document are the same as the names and the attributes of each label in the target word document;

performing subordinate classification processing on each first specified label in the label replacement document to construct the first specified labels with the same subordinate relation in the label replacement document into an ordered list or an unordered list, and obtaining a label subordinate classification document after the subordinate classification processing;

screening a second designated label from the label dependent classification document, and uploading label content corresponding to the second designated label to a cloud management platform to obtain an access address of the label content corresponding to the second designated label, wherein the second designated label comprises a picture label;

and replacing the SRC content in the second designated label with the access address of the label content corresponding to the second designated label, so as to obtain the html document after the SRC content is replaced.
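By way of a non-limiting illustration, the replacement of the SRC content of picture labels may be sketched as follows. The `upload` callable, the `fake_upload` stand-in and the `cdn.example.com` address are hypothetical and are introduced only for the example; the actual interface of the cloud management platform is not specified here.

```python
import re
from typing import Callable

# Matches the src attribute of an <img> tag and keeps the text around it.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)


def replace_image_sources(html: str, upload: Callable[[str], str]) -> str:
    """Replace every <img> src with the access address returned by `upload`."""

    def _swap(match: re.Match) -> str:
        return match.group(1) + upload(match.group(2)) + match.group(3)

    return IMG_SRC.sub(_swap, html)


def fake_upload(local_path: str) -> str:
    # Hypothetical stand-in for the cloud management platform: no real
    # upload happens, an address is simply derived from the file name.
    return "https://cdn.example.com/" + local_path.rsplit("/", 1)[-1]


html = '<p><img width="10" src="doc.files/image001.png"></p>'
result = replace_image_sources(html, fake_upload)
```

Passing the uploader in as a parameter keeps the tag rewriting independent of any particular storage service.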

In one possible design, performing a label replacement process on each label in the preprocessed html document to obtain a label replacement document after the label replacement process, including:

screening a first target label and a second target label from the preprocessed html document, wherein the first target label comprises a p label, and the second target label comprises a font label, an ins label, an i label and a del label;

changing the name attribute content of the first target label into a first label name; and

and changing the name attribute content of the second target label into a second label name, and adding a label identification character into the second target label so as to obtain the label replacement document after the label identification character is added.

In one possible design, performing subordinate classification processing on each first specified label in the label replacement document, so as to construct the first specified labels having the same subordinate relation in the label replacement document into an ordered list or an unordered list, and obtaining a label subordinate classification document after the subordinate classification processing, includes:

for each first specified label in the label replacement document, acquiring the style attribute of each first specified label, wherein the style attribute of any first specified label comprises the sequence to which that first specified label belongs, the hierarchy of that sequence, and the order of the hierarchy of that sequence;

performing subordinate division processing on each first specified label based on the style attribute of each first specified label so as to divide the first specified labels belonging to the same sequence and the same hierarchy into one class to obtain a plurality of label classes;

for any one of the plurality of label classes, sorting the first specified labels in that label class according to the order of the hierarchy of the sequence to which each first specified label belongs, so as to obtain a sorted label class, and obtaining a plurality of sorted label classes after the labels in all label classes have been sorted;

performing pattern recognition on each sort label class to obtain a list pattern to which each sort label class belongs, wherein the list pattern comprises an ordered list and an unordered list;

based on list styles of all sorting label classes, adding style identification labels for all sorting label classes, constructing a plurality of ordered lists and unordered lists after the style identification labels are added, and obtaining the label dependent classification documents after the ordered lists and unordered lists are constructed.

In one possible design, screening out code text paragraphs from each text paragraph in the pre-conversion document includes:

judging whether characters at the starting position of any text paragraph in the pre-conversion document are of preset types or not, wherein the preset types of characters comprise English characters;

if yes, judging whether any text paragraph contains preset key characters or not;

if yes, acquiring the position of the preset key character in any text paragraph;

judging whether the position of the preset key character in any text paragraph is a preset position or not;

if yes, judging any text paragraph to be a code text paragraph.
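By way of a non-limiting illustration, the screening steps above may be sketched as follows. The set `KEY_CHARS` and the choice of "last character of the paragraph" as the preset position are assumptions made for the example; the preset key characters and preset positions actually used are not fixed by this passage.

```python
import string

# Hypothetical preset key characters; the real set is a design choice.
KEY_CHARS = (";", "{", "}", ":")


def is_code_paragraph(paragraph: str) -> bool:
    """Return True if the paragraph passes all of the screening checks."""
    text = paragraph.strip()
    if not text:
        return False
    # Check 1: the starting character must be of the preset type
    # (an English letter in this sketch).
    if text[0] not in string.ascii_letters:
        return False
    # Check 2: the paragraph must contain a preset key character.
    positions = [i for i, ch in enumerate(text) if ch in KEY_CHARS]
    if not positions:
        return False
    # Checks 3 and 4: the key character must sit at the preset position
    # (assumed here to be the end of the paragraph).
    return positions[-1] == len(text) - 1
```

For example, `is_code_paragraph("int x = 1;")` passes all checks, whereas `is_code_paragraph("Note: this is a remark")` fails the position check because the key character is not at the assumed preset position.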

In one possible design, the method further comprises:

acquiring a data set, wherein the data set comprises code samples corresponding to different programming languages;

performing data preprocessing on each code sample in the data set to obtain a preprocessed data set;

performing feature extraction processing on the preprocessed data set to obtain feature vectors corresponding to each code sample, and forming a feature data set by utilizing the feature vectors corresponding to each code sample;

dividing the characteristic data set into a training set and a testing set, taking each characteristic vector in the training set as input, taking the code type of a code sample corresponding to each characteristic vector in the training set as output, and training a random forest classifier to obtain an initial code recognition model after training is completed;

and carrying out model test on the initial code recognition model by using the test set, and adjusting model parameters of the initial code recognition model in the test process so as to obtain the code recognition model after the model parameters are adjusted.

In one possible design, performing feature extraction processing on the preprocessed data set to obtain the feature vector corresponding to each code sample includes:

counting the occurrence frequency of key characters and reserved characters in any code sample in the preprocessed data set to obtain the vocabulary characteristics of the any code sample;

counting the occurrence frequency of a third target character or a target combined character string in the arbitrary code sample to obtain character distribution characteristics of the arbitrary code sample;

carrying out grammar structure analysis processing on any code sample to obtain grammar structure characteristics of the any code sample;

taking the characters which continuously appear for a plurality of times in any code sample as continuous characters, and counting the occurrence frequency of each continuous character to be taken as the N-gram characteristic of the any code sample;

and forming the feature vector of any code sample by utilizing the vocabulary feature, the character distribution feature, the grammar structure feature and the N-gram feature of any code sample.
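By way of a non-limiting illustration, the four feature groups above may be assembled into one feature vector as sketched below. The lists `KEYWORDS`, `SYMBOLS` and `NGRAMS` are hypothetical; the grammar structure features are approximated by bracket nesting depth and the share of indented lines, and the N-gram features by character bigram frequencies. These are simplifying assumptions of the example rather than the analysis prescribed by this design, and training of the random forest classifier is not shown.

```python
import re
from collections import Counter
from typing import List

KEYWORDS = ("def", "import", "function", "var", "public", "class")  # hypothetical
SYMBOLS = ("{", ";", ":", "=>", "#")                                # hypothetical
NGRAMS = ("()", "::", "//", "</")                                   # hypothetical


def lexical_features(code: str) -> List[float]:
    # Relative frequency of each keyword among identifier-like tokens.
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    total = max(len(tokens), 1)
    return [tokens.count(k) / total for k in KEYWORDS]


def char_features(code: str) -> List[float]:
    # Relative frequency of target characters or character combinations.
    total = max(len(code), 1)
    return [code.count(s) / total for s in SYMBOLS]


def structure_features(code: str) -> List[float]:
    # Crude structural proxies: maximum bracket depth, share of indented lines.
    depth = max_depth = 0
    for ch in code:
        if ch in "({[":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch in ")}]":
            depth = max(depth - 1, 0)
    lines = code.splitlines() or [""]
    indented = sum(1 for line in lines if line.startswith((" ", "\t")))
    return [float(max_depth), indented / len(lines)]


def ngram_features(code: str) -> List[float]:
    # Relative frequency of selected character bigrams.
    grams = Counter(code[i:i + 2] for i in range(len(code) - 1))
    total = max(sum(grams.values()), 1)
    return [grams[g] / total for g in NGRAMS]


def feature_vector(code: str) -> List[float]:
    return (lexical_features(code) + char_features(code)
            + structure_features(code) + ngram_features(code))


sample = "def add(a, b):\n    return a + b\n"
vec = feature_vector(sample)
```

Because every code sample yields a vector of the same length, the vectors can be stacked directly into the feature data set used for training and testing.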

In a second aspect, a word document conversion device based on machine identification is provided, including:

the document format conversion unit is used for acquiring a target word document and converting the target word document into an html document;

the style correction unit is used for performing style correction processing on the html document to obtain an html document with corrected style;

the style correction unit is also used for carrying out paragraph division processing on the character strings in the html document after style correction to obtain a pre-conversion document;

the code identification unit is used for screening out code text paragraphs from all text paragraphs in the pre-converted document, inputting the code text paragraphs into a code identification model for code type identification processing, and obtaining code types corresponding to the code text paragraphs;

the code identification unit is further used for carrying out programming language identification processing on the code text paragraphs in the pre-conversion document based on the code types corresponding to the code text paragraphs so as to obtain the html conversion document corresponding to the target word document after the programming language identification processing is completed.

In a third aspect, another word document conversion device based on machine identification is provided, taking the device as an electronic device, and the device includes a memory, a processor and a transceiver, which are sequentially communicatively connected, where the memory is used to store a computer program, the transceiver is used to send and receive a message, and the processor is used to read the computer program, and execute the word document conversion method based on machine identification as in the first aspect or any one of the first aspects possible designs.

In a fourth aspect, a storage medium is provided, on which instructions are stored which, when run on a computer, perform the machine-recognition-based word document conversion method as in the first aspect or any one of the possible designs of the first aspect.

In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the machine-recognition-based word document conversion method as in the first aspect or any one of the possible designs of the first aspect.

The beneficial effects are that:

(1) When a document is converted, the format-converted document is subjected to style correction processing, so that the text style of the original word document is preserved; meanwhile, machine recognition technology is used to recognize the code type of each code text paragraph after the style correction, so that programming language identification can be performed based on those code types. The invention can therefore preserve the text style of the original document and accurately identify code block text and its programming language during conversion, so that a user does not need to repeatedly copy and paste, rewrite code or reset text styles; technical articles can be released quickly, and the invention is suitable for wide application in the field of document conversion.

Drawings

FIG. 1 is a schematic flow chart of steps of a word document conversion method based on machine identification according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a word document conversion device based on machine identification according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the present invention is briefly described below with reference to the accompanying drawings and the description of the embodiments or the prior art. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort. It should be noted that the description of these embodiments is intended to aid understanding of the present invention and does not limit it.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments of the present invention.

It should be understood that the term "and/or", where it appears herein, merely describes an association between associated objects and indicates that three relationships may exist; e.g., A and/or B may represent: A alone, B alone, or both A and B. The term "/and", where it appears herein, describes another association and indicates that two relationships may exist; e.g., A/and B may represent: A alone, or both A and B. In addition, the character "/", where it appears herein, generally indicates that the associated objects before and after it are in an "or" relationship.

Examples:

Referring to FIG. 1, the word document conversion method based on machine identification provided in this embodiment preserves the text styles of the original word document and accurately identifies code block text and its programming language, so that a user does not need to repeatedly copy and paste, rewrite code or reset styles, and technical articles can be released quickly; the method is therefore suitable for large-scale application in the field of word document conversion. In this embodiment, the method may run on, but is not limited to, a document conversion end, which may optionally be, but is not limited to, a personal computer (PC), a tablet computer or a smart phone; it is understood that the foregoing execution subject does not limit the embodiments of the present application. Accordingly, the steps of the method may be, but are not limited to, those shown in steps S1 to S5 below.

S1, acquiring a target word document, and converting the target word document into an html document; in this embodiment, the format conversion of the word document may be performed by, for example and without limitation, the win32 component of python (a computer programming language), a wps tool or an office tool, to obtain the html document. Taking python's win32 component as an example, the format conversion proceeds as follows: (1) calling DispatchEx in the win32 component to start a word application; (2) calling Documents.Open in the win32 component to open the target word document; (3) calling the SaveAs interface to save the target word document as an html file, thereby obtaining the html document.

Therefore, after the target word document is converted into the html document, the converted document can be identified by the existing editor, so that re-editing is not needed when the document is released, and further the time cost of re-editing when the user releases is reduced.
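By way of a non-limiting illustration, the three calls of step S1 may be sketched as follows, assuming the pywin32 package and a local installation of Word on Windows. The function name `word_to_html` and the `app_factory` parameter are hypothetical; the parameter exists only so that the Word application object can be substituted where Word is not available.

```python
from typing import Any, Callable, Optional

WD_FORMAT_HTML = 8  # value of Word's wdFormatHTML save-format constant


def word_to_html(src_path: str, dst_path: str,
                 app_factory: Optional[Callable[[], Any]] = None) -> str:
    """Open a Word document through COM and save it as an HTML file."""
    if app_factory is None:
        # Imported lazily: pywin32 only works on Windows with Word installed.
        from win32com.client import DispatchEx
        app_factory = lambda: DispatchEx("Word.Application")

    app = app_factory()                        # (1) start a Word instance
    try:
        doc = app.Documents.Open(src_path)     # (2) open the target document
        try:
            # (3) save the document in HTML format
            doc.SaveAs(dst_path, FileFormat=WD_FORMAT_HTML)
        finally:
            doc.Close()
    finally:
        app.Quit()
    return dst_path
```

On a Windows machine this might be called as `word_to_html(r"C:\docs\article.docx", r"C:\docs\article.html")`; absolute paths are generally advisable because the Word process does not share the working directory of the calling script.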

After the format conversion of the target word document is completed, a style correction process may be performed to enable the converted document to retain the text style in the original word document, where the style correction process may be, but is not limited to, as shown in step S2 below.

S2, carrying out style correction processing on the html document to obtain an html document with corrected style; in specific applications, the style correction process mainly includes filtering process of useless labels in html documents, label replacement process, label classification process and access process of picture labels, where the label process may be, but is not limited to, as shown in the following steps S21 to S25.

S21, carrying out tag filtering processing on the html document to filter out useless tags in the html document, so as to obtain a preprocessed html document; in this embodiment, each character string in the html document is represented by a tag, and the useless tags may include, but are not limited to, annotations, blank lines, spaces, blank tags, repeated consecutive tags, comments, useless styles, hidden tags, and the like. For example, a useless tag library (pre-stored on the document conversion end and containing a plurality of useless sample tags) may be used in combination with regular expressions to filter each tag in the html document, and the preprocessed html document is obtained once the useless tags have been filtered out.
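By way of a non-limiting illustration, regular-expression filtering of useless tags may be sketched as follows. The entries of `USELESS_PATTERNS` are a small hypothetical sample standing in for the useless tag library (comments, Word's empty `<o:p>` placeholders, and blank `span`/`p` tags); the contents of the actual library are not enumerated here.

```python
import re

# Hypothetical sample of a "useless tag library", expressed as patterns.
USELESS_PATTERNS = [
    re.compile(r"<!--.*?-->", re.DOTALL),                          # comments
    re.compile(r"<o:p>.*?</o:p>", re.DOTALL | re.IGNORECASE),      # Word placeholders
    re.compile(r"<(span|p)\b[^>]*>\s*</\1>", re.IGNORECASE),       # blank tags
]


def filter_useless_tags(html: str) -> str:
    """Remove the sample useless tags and collapse blank lines."""
    for pattern in USELESS_PATTERNS:
        html = pattern.sub("", html)
    html = re.sub(r"\n\s*\n+", "\n", html)   # blank lines
    return html.strip()


raw = ('<!-- note -->\n<p class="a">Hello</p>\n\n\n'
       '<span style="x"></span>\n<p>World<o:p></o:p></p>')
cleaned = filter_useless_tags(raw)
```

Keeping the patterns in a list mirrors the idea of a pre-stored library: new useless sample tags can be added without changing the filtering function.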

After the filtering of the unnecessary tag is completed, a tag replacement process may be performed, wherein the tag replacement process may be, but is not limited to, as shown in step S22 described below.

S22, performing label replacement processing on each label in the preprocessed html document to obtain a label replacement document after the label replacement processing, wherein the names and the attributes of each label in the label replacement document are the same as the names and the attributes of each label in the target word document; in a specific application, the label replacement is needed because the labels of the original word document (such as sequence labels) do not exist in the html code converted from the word document, so the labels in the html document must be replaced so that the converted html document has the same labels as the original document; optionally, the label replacement process may be, but is not limited to, as shown in steps S22a to S22c below.

S22a, screening a first target label and a second target label from the preprocessed html document, wherein the first target label comprises a p label, and the second target label comprises a font label, an ins label, an i label and a del label; in specific implementation, labels can be distinguished in code, for example, but not limited to, by using the bs library (a function library for parsing, traversing and maintaining a tag tree) to look up the style attribute of every label in the preprocessed html document, wherein if the style attribute of a p label contains 'mso-list:', that label is judged to be a first target label (that is, a p label that needs to be converted into an li label); the second target label is identified in the same way as the first target label, which is not described again.

After the first target label and the second target label are screened out from the preprocessed html document, label replacement can be performed; in this embodiment, the replacement mainly replaces the label name, so that the converted html document has the same label names as the original document while the attributes of each label remain unchanged; optionally, the label replacement process is as shown in steps S22b and S22c below.

S22b, changing the name attribute content of the first target label into a first label name; in a specific application, the first label name may be, for example but not limited to, "li", that is, the p labels in the preprocessed html document are replaced with sequence labels li; only the label name is replaced and no attribute is deleted, so that important data is retained for the subsequent subordinate classification of the labels; the second target label is replaced in the same manner, as shown in step S22c below.

S22c, changing the name attribute content of the second target label into a second label name, and adding a label identification character into the second target label to obtain the label replacement document after the label identification character is added; in this embodiment, the second label name may be, but is not limited to, "span", and the label identification character may be, but is not limited to, a class value set according to the kind of the second target label; for example, if the second target label is a del label, its label identification character may be set to the string 'del' to indicate that the span label is a deletion label. Meanwhile, css styles may be set for the first target label and the second target label to determine the style (such as alignment, size, height, width, color, layout, rounded corners, etc.) of the character strings they contain; of course, the label identification characters of the other second target labels are added on the same principle as in the foregoing example, which is not repeated herein.

Therefore, through steps S22a to S22c, the converted html document has the same label names as the original document and the attributes of each label remain unchanged, so that the labels of the converted html document correspond to those of the original word document.
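By way of a non-limiting illustration, steps S22a to S22c may be sketched as follows. For self-containment the sketch uses regular expressions from the standard library in place of the bs library mentioned above; this substitution, the function name `replace_tags` and the use of the original tag name as the class value for every second target label are assumptions of the example.

```python
import re

# First target labels: <p> tags whose attributes contain "mso-list:".
LIST_PARAGRAPH = re.compile(r"<p\b([^>]*mso-list:[^>]*)>(.*?)</p>",
                            re.DOTALL | re.IGNORECASE)
# Second target labels: font, ins, i and del tags.
SECOND_OPEN = re.compile(r"<(font|ins|i|del)\b([^>]*)>", re.IGNORECASE)
SECOND_CLOSE = re.compile(r"</(font|ins|i|del)>", re.IGNORECASE)


def replace_tags(html: str) -> str:
    # S22b: rename the list paragraphs to <li>, keeping every attribute.
    html = LIST_PARAGRAPH.sub(r"<li\1>\2</li>", html)

    # S22c: rename second target labels to <span> and record the original
    # tag name as a class value (the label identification character).
    def _open(match: re.Match) -> str:
        return '<span class="%s"%s>' % (match.group(1).lower(), match.group(2))

    html = SECOND_OPEN.sub(_open, html)
    return SECOND_CLOSE.sub("</span>", html)


source = ('<p class="MsoListParagraph" style="mso-list:l0 level1 lfo1">'
          'Item <del>old</del></p><p>Text</p>')
converted = replace_tags(source)
```

The sketch does not merge the added class with a class attribute already present on a second target label; a tag-tree parser would be a more robust choice for real documents.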

After completing the replacement processing of each tag in the preprocessed html document, the post-replacement tag may be subjected to a subordinate classification processing, where the subordinate classification processing may be, but is not limited to, as shown in step S23 below.

S23, performing subordinate classification processing on each first specified label in the label replacement document to construct the first specified labels with the same subordinate relation in the label replacement document into an ordered list or an unordered list, and obtaining a label subordinate classification document after the subordinate classification processing; in the present embodiment, the first specified label may be, for example but not limited to, the aforementioned first target label, that is, the sequence label, and the specific procedure of the subordinate classification processing may be, but is not limited to, as shown in steps S23a to S23e below.

S23a, for each first specified label in the label replacement document, acquiring the style attribute of each first specified label, wherein the style attribute of any first specified label comprises the sequence to which that first specified label belongs, the hierarchy of that sequence, and the order of the hierarchy of that sequence; in this embodiment, the style attribute of each first specified label (i.e., sequence label) records its subordinate relation, so the style attribute of each first specified label can be read directly to obtain the sequence to which it belongs, the hierarchy of that sequence, and the order of the hierarchy of that sequence; for example, if the style attribute of any first specified label includes mso-list l1 level1 lfo, then the sequence of that first specified label is l1, the hierarchy is level1, and the order is lfo11; of course, the content included in the style attribute of each of the remaining first specified labels is the same as in the foregoing example, and is not described herein.
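By way of a non-limiting illustration, reading the subordinate relation from a style attribute may be sketched as follows. The sketch assumes that the attribute carries a declaration of the form `mso-list:l1 level1 lfo1`; the names `ListInfo` and `parse_mso_list` and the sample style strings are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ListInfo(NamedTuple):
    sequence: str   # e.g. "l1"
    level: str      # e.g. "level1"
    order: str      # e.g. "lfo1"


MSO_LIST = re.compile(r"mso-list:\s*(l\d+)\s+(level\d+)\s+(lfo\d+)")


def parse_mso_list(style: str) -> Optional[ListInfo]:
    """Extract sequence, level and order from a style attribute, if present."""
    match = MSO_LIST.search(style)
    return ListInfo(*match.groups()) if match else None


info = parse_mso_list("text-indent:-18.0pt;mso-list:l1 level1 lfo1")
```

Returning `None` for labels without an `mso-list` declaration lets the caller skip ordinary paragraphs without raising an error.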

After the style attributes of the first specified labels are obtained, the subordinate division processing may be performed according to those style attributes, where the division of the subordinate relations of the first specified labels may be, but is not limited to, as shown in steps S23b to S23e below.

S23b, performing subordinate division processing on each first specified label based on the style attribute of each first specified label so as to divide the first specified labels belonging to the same sequence and the same level into one class to obtain a plurality of label classes; in the present embodiment, continuing the foregoing example, the first specified labels whose sequence is l1 and whose level is level1 are classified into one class; of course, the remaining first specified labels are divided in the same way, which is not described herein.

After the subordinate division processing is performed on each first specified label according to the style attribute, the ordered list or the unordered list may be constructed as shown in steps S23c to S23e below.

S23c, for any one of the plurality of label classes, sorting the first specified labels in that label class according to the order of the hierarchy of the sequence to which each first specified label belongs, so as to obtain a sorted label class, and obtaining a plurality of sorted label classes after the labels in all label classes have been sorted; in a specific application, step S23c is described with an example: assume that any label class includes three first specified labels A, B and C, where the order of the hierarchy of the sequence to which the first specified label A belongs is 2fo11, that of the first specified label B is 1fo11, and that of the first specified label C is 3fo11; the sorted label class corresponding to that label class is then: first specified labels B, A and C; of course, the first specified labels in each of the remaining label classes are sorted in the same way as in the foregoing example, which is not repeated here.
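By way of a non-limiting illustration, steps S23b and S23c may be sketched as follows. Each label is represented by a hypothetical tuple of name, sequence, level and an integer order; the integers 2, 1 and 3 stand in for the order values of labels A, B and C in the example above, and the fourth label D is added only to show a second class.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Label = Tuple[str, str, str, int]   # (name, sequence, level, order)


def group_and_sort(labels: List[Label]) -> Dict[Tuple[str, str], List[str]]:
    """Group labels by (sequence, level), then sort each group by order."""
    classes: Dict[Tuple[str, str], List[Tuple[int, str]]] = defaultdict(list)
    for name, sequence, level, order in labels:
        classes[(sequence, level)].append((order, name))       # S23b
    return {key: [name for _, name in sorted(members)]         # S23c
            for key, members in classes.items()}


labels = [
    ("A", "l1", "level1", 2),
    ("B", "l1", "level1", 1),
    ("C", "l1", "level1", 3),
    ("D", "l2", "level1", 1),
]
sorted_classes = group_and_sort(labels)
```

With this input, labels A, B and C fall into one class and are ordered B, A, C, which matches the worked example.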

After the ordering of each first designated tag in each tag class is completed, the list style to which each ordered tag class belongs may be determined, so that the ordered list or the unordered list may be constructed according to the list style, as shown in steps S23d and S23e below.

S23d, performing style recognition on each sorted tag class to obtain the list style of each sorted tag class, wherein the list style is either an ordered list or an unordered list; in this embodiment, for any sorted tag class, the style identification character (also recorded in the style attribute) of any first specified tag in that class may be read, and the list style of the class is then determined from the style identification character; the style identification character may be, but is not limited to, the font-family value Wingdings, and based on it one can determine whether the sorted tag class belongs to an ordered list or an unordered list.

After the list style of each sorted tag class is obtained, style identification can be performed so as to obtain the ordered lists and unordered lists, wherein the identification process is as shown in the following step S23e.

S23e, adding a style identification tag to each sorted tag class based on the list style to which it belongs, so as to construct a plurality of ordered lists and unordered lists, and obtaining the tag-dependent classification document once the ordered lists and unordered lists have been constructed; in this embodiment, the style identification tag may be, but is not limited to, an ol tag or a ul tag, and it is wrapped around the outermost layer of the sequence tags (i.e. the li tags) in the sorted tag class, so that the style identification tag acts as the parent tag and forms a complete ordered list or unordered list; in a specific application, after a style identification tag is added to a sorted tag class, an unordered list of the following form is obtained: <ul><li></li></ul>, wherein the inner tags are the sequence tags of the first specified tags in that sorted tag class, and the outermost ul tag is the style identification tag of the class; the ordered list is constructed in the same way, which is not repeated here.

Therefore, steps S23a to S23e complete the subordinate classification processing of each sequence tag in the tag replacement document, so that sequence tags belonging to the same sequence and the same level are built into an ordered list or an unordered list, and the ordered and unordered lists organize the character strings in the document by outline and format.
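A minimal sketch of steps S23d and S23e is given below. It assumes, for illustration only, that a style attribute mentioning the Wingdings bullet font marks an unordered list and that anything else is treated as an ordered list; the function names are hypothetical.

```python
from html import escape

def list_style(style_attr: str) -> str:
    """S23d: return 'ul' when the style mentions the Wingdings font, else 'ol'."""
    return "ul" if "wingdings" in style_attr.lower() else "ol"

def build_list(texts, style_attr):
    """S23e: wrap the li tags of one sorted tag class in the parent ol/ul tag."""
    tag = list_style(style_attr)
    body = "".join(f"<li>{escape(t)}</li>" for t in texts)
    return f"<{tag}>{body}</{tag}>"

unordered = build_list(["B", "A", "C"], "font-family:Wingdings")
ordered = build_list(["one", "two"], "font-family:SimSun")
```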

After the tag-dependent categorizing document is obtained, the picture tag in the tag-dependent categorizing document is further processed to ensure that the converted document can display an image, wherein the processing procedure of the picture tag can be but is not limited to the following step S24.

S24, screening out the second specified tags from the tag-dependent classification document, and uploading the tag content corresponding to each second specified tag to a cloud management platform to obtain an access address for that tag content, wherein the second specified tags include picture tags; in this embodiment, after the word document is converted into html, each picture is represented by an img tag that references a local address, so the picture cannot be displayed if the html is output for use elsewhere; therefore, the pictures of all valid img tags in the html (an img tag is a picture tag, and it is valid when its src content is not empty) need to be uploaded to the cloud management platform through OSS or another image cloud service, and the src content in each img tag is then replaced with the address at which the picture can be accessed on the cloud management platform, which completes the processing of the picture tags; further, the src content replacement process may be, but is not limited to, as shown in step S25 below.

S25, replacing the src content in each second specified tag with the access address of the tag content corresponding to that tag, so as to obtain the style-corrected html document once the src content has been replaced; in a specific application, this is equivalent to replacing the original src content with a CDN address that can be accessed from the public network or from the system local area network.
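Steps S24 and S25 can be illustrated with the following sketch. The `upload_image` function is a hypothetical stand-in for the OSS or cloud upload and simply fabricates an address under `cdn.example.com`; a regular expression is used for brevity, whereas a real implementation would more likely use an HTML parser.

```python
import re

# Captures: (everything up to the opening quote of src)(src value)(closing quote)
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc\s*=\s*")([^"]*)(")', re.IGNORECASE)

def upload_image(local_path: str) -> str:
    """Hypothetical upload stub: pretends the file now lives on a CDN."""
    name = local_path.replace("\\", "/").rsplit("/", 1)[-1]
    return f"https://cdn.example.com/{name}"

def replace_img_sources(html: str, upload=upload_image) -> str:
    def swap(match):
        src = match.group(2)
        if not src.strip():
            # Invalid img tag (empty src): leave it untouched.
            return match.group(0)
        return match.group(1) + upload(src) + match.group(3)
    return IMG_SRC.sub(swap, html)

html = '<p><img width="10" src="file:///C:/tmp/image001.png"><img src=""></p>'
result = replace_img_sources(html)
```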

Therefore, through the steps S21-S25, correction of the text style in the html document can be completed, so that the converted document retains the text style identical to that of the original word document, and the defect of resetting the text style by a user is avoided.

After the correction of the text style in the html document is completed, the code blocks in the html document after the correction of the style can be identified, in this embodiment, the segmentation of the character string is performed first, then the code blocks are identified for the text paragraphs, and finally the code types of the text paragraphs belonging to the code blocks are identified, where the segmentation process of the character string can be, but is not limited to, as shown in the following step S3.

S3, carrying out paragraph division processing on the character strings in the style-corrected html document to obtain a pre-conversion document; in this embodiment, paragraph division may be performed using, but not limited to, punctuation marks, indentation and line feeds, semantic association, paragraph length, and grammar structure. Division by punctuation marks proceeds as follows: in English, punctuation marks such as the period, question mark and exclamation mark separate sentences; in Chinese, punctuation marks such as the period, question mark, exclamation mark and semicolon separate sentences; and the separators between sentences can serve as the basis for dividing paragraphs. For indentation and line feeds, the text may be cut into different paragraphs when consecutive line feeds or a certain number of spaces are detected. Division based on semantic association works similarly: the sentences within a paragraph should remain semantically related, each paragraph should develop a single theme or viewpoint, and if a divided paragraph is found to be semantically incoherent, the division position can be adjusted. In specific applications, the aforementioned division means are common techniques for paragraph division, and their principles are not described in detail.
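As an illustration of the line-feed criterion only, the sketch below cuts text into paragraphs wherever two or more consecutive line feeds occur. The other criteria (punctuation, semantic association, paragraph length, grammar structure) are not covered, and the function name is hypothetical.

```python
import re

def split_paragraphs(text: str):
    """Split on runs of two or more line breaks and drop empty pieces."""
    parts = re.split(r"(?:\r?\n[ \t]*){2,}", text)
    return [p.strip() for p in parts if p.strip()]

sample = "First paragraph.\nStill first.\n\nimport os\nprint(os.name)\n\n\nLast one."
paragraphs = split_paragraphs(sample)
```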

After the paragraph division of the character string in the html document after the style correction is completed, the code blocks of each obtained text paragraph can be identified, wherein the identification process is as shown in the following step S4.

S4, screening out code text paragraphs from each text paragraph in the pre-conversion document, and inputting the code text paragraphs into a code recognition model for code type recognition processing to obtain code types corresponding to the code text paragraphs; in specific implementation, it may be determined whether the text passage is a code text passage according to, but not limited to, the character of the start position of the text passage, the key character of the text passage, and the position of the key character, wherein the recognition process is as shown in the following steps S41 to S45.

S41, for any text paragraph in the pre-conversion document, judging whether the character at the starting position of the text paragraph is a preset type character, wherein the preset type characters include English characters; in this embodiment, this is equivalent to judging whether the first character of the text paragraph is an English character; if so, the next judgment is made, i.e. the following step S42 is executed; otherwise, the text paragraph is judged to be a non-code text paragraph.

S42, if yes, judging whether any text paragraph contains preset key characters or not; in this embodiment, a keyword set is set in the document conversion end, where the keyword set includes preset keywords corresponding to different programming languages, so that it is only necessary to compare characters of any text paragraph according to the keyword set to determine whether the text paragraph includes the preset keywords, if so, a next step of determining (i.e. executing the following step S43) is required, otherwise, it is determined that any text paragraph is a non-code text paragraph.

S43, if yes, the position of the preset key character in any text paragraph is obtained.

S44, judging whether the position of the preset key character in any text paragraph is a preset position or not.

S45, if yes, judging that any text paragraph is a code text paragraph; in this embodiment, if the preset key character is at the preset position, it is determined that any text paragraph is a code text paragraph, otherwise, it is determined that the text paragraph is a non-code text paragraph.
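The screening logic of steps S41 to S45 can be sketched as follows. The keyword set and the choice of "first token of the paragraph" as the preset position are assumptions made for illustration; the embodiment does not fix either.

```python
# Illustrative keyword set; the real set would cover many programming languages.
KEYWORDS = {"def", "class", "import", "public", "function", "return"}

def is_code_paragraph(paragraph: str, max_position: int = 0) -> bool:
    text = paragraph.lstrip()
    # S41: the first character must be an English letter.
    if not text or not (text[0].isascii() and text[0].isalpha()):
        return False
    tokens = text.split()
    # S42/S43: locate preset keywords and record their token positions.
    positions = [i for i, tok in enumerate(tokens) if tok in KEYWORDS]
    if not positions:
        return False
    # S44/S45: the first keyword must sit at the preset position.
    return positions[0] <= max_position
```

For example, `is_code_paragraph("import os")` is accepted, whereas an ordinary sentence such as "The class was cancelled." is rejected because the keyword does not appear at the preset position.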

Therefore, through the foregoing steps S41 to S45, text paragraphs belonging to the code blocks can be screened from each text paragraph in the pre-converted document, and then, the code types corresponding to each text paragraph belonging to the code blocks also need to be identified, so that the identification of the programming language can be performed subsequently, and the converted document contains the code blocks in the original document without the need of user writing again.

Specifically, the present embodiment uses a machine recognition technique to perform the type recognition of the code blocks, wherein the following discloses one of the training processes of the code recognition model, and may be, but not limited to, those shown in the following S01 to S05.

S01, acquiring a data set, wherein the data set comprises code samples corresponding to different programming languages; in this embodiment, the code samples corresponding to different programming languages may be, but are not limited to being, crawled from code hosting platforms such as GitHub and Gitee.

After the data set is obtained, data preprocessing is needed to perform feature extraction better, where the data preprocessing process may be, but is not limited to, as shown in step S02 below.

S02, carrying out data preprocessing on each code sample in the data set to obtain a preprocessed data set; in this embodiment, the data preprocessing process may include, but is not limited to: (1) deleting comments: comments are generally independent of the features of the programming language itself, so they can be deleted from the code samples; in particular, multi-line and single-line comments can be matched and deleted using regular expressions; (2) deleting blank characters: blank characters (e.g., spaces, tabs, line breaks, etc.) may interfere with the feature extraction process, so during the preprocessing stage they may be deleted or replaced with uniform spaces; (3) converting the code samples into lowercase: all characters in the code sample are converted into lowercase, which eliminates differences caused by letter case and simplifies subsequent vocabulary feature extraction; (4) normalizing indentation: because different programming languages and projects may employ different indentation styles, all indentation may be normalized to a uniform number of spaces for better analysis of the grammar structure; (5) deleting numbers and special symbols: since numbers and special symbols may have similar distributions in different programming languages, they are deleted from the code samples in the preprocessing stage, thereby reducing noise and highlighting the key features of the programming language; in addition, the foregoing data preprocessing process may be implemented by, but is not limited to, using Python tools in combination with regular expressions, string processing functions, and the like.
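The preprocessing of step S02 might look like the sketch below. The comment patterns only cover `/* ... */`, `//` and `#` styles and do not protect string literals, and for brevity the sketch collapses all whitespace to single spaces instead of normalizing indentation; both simplifications are assumptions of this example.

```python
import re

def preprocess(code: str) -> str:
    # (1) Remove /* ... */ block comments, then // and # line comments.
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)
    code = re.sub(r"(//|#).*", " ", code)
    # (3) Convert to lowercase.
    code = code.lower()
    # (5) Drop digits and special symbols; keep letters, underscores, whitespace.
    code = re.sub(r"[^a-z_\s]", " ", code)
    # (2) Replace every run of blank characters with one space.
    return re.sub(r"\s+", " ", code).strip()

sample = "/* demo */\nPublic Class A { // body\n    int x = 42; # note\n}"
cleaned = preprocess(sample)
```

Because this version discards indentation and brackets, any features that depend on them would have to be computed from the text before these deletions are applied.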

After the preprocessing of the data set is completed, a feature extraction process may be performed to perform the training of the model with feature vectors later, where the feature extraction process may be, but is not limited to, as shown in step S03 below.

S03, carrying out feature extraction processing on the preprocessed data set to obtain feature vectors corresponding to each code sample, and forming a feature data set by utilizing the feature vectors corresponding to each code sample; in the present embodiment, the feature vector of any code sample may include, but is not limited to, a vocabulary feature, a character distribution feature, a grammar structure feature, and the like of any code sample, where the feature extraction process is described by taking any code sample as an example, and may be, but is not limited to, as shown in the following steps S03a to S03 e.

S03a, counting the occurrence frequencies of key characters and reserved characters in any code sample in the preprocessed data set to obtain the vocabulary features of that code sample; in this embodiment, the distribution of key characters and reserved characters differs significantly between programming languages, so they can be used as important features for identifying the programming language; different key characters can be set for different programming languages: for Python, for example, the key characters can include def, class, import, etc., and for Java, the key characters can include public, class, extends, etc.; the same applies to reserved characters, which are not described in detail.
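A small sketch of the vocabulary feature in step S03a follows. The keyword list is illustrative and the frequency is taken relative to the total number of word tokens, which is one possible normalization rather than one prescribed by the embodiment.

```python
import re
from collections import Counter

KEYWORDS = ["def", "class", "import", "public", "extends"]  # illustrative list

def lexical_features(code: str):
    """Return the relative frequency of each keyword among all word tokens."""
    tokens = re.findall(r"[a-z_]+", code.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[k] / total for k in KEYWORDS]

features = lexical_features("import os\nclass A:\n    def f(self):\n        import sys")
```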

After the statistics of the vocabulary features of any code sample are completed, the character distribution features may be extracted, and the extraction process is as follows in step S03b.

S03b, counting the occurrence frequency of a third target character or a target combined character string in any code sample to obtain the character distribution features of that code sample; in this embodiment, the character distribution of different programming languages is often different, so it may also be used as a feature for distinguishing programming languages; the third target character may include, but is not limited to, symbols such as brackets, parentheses and semicolons; of course, the third target character and the target combined character string may also be set specifically for each programming language, and are not listed here one by one.
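The character distribution feature of step S03b can be sketched as below. The list of target characters and combined strings is an assumption chosen for illustration, and frequencies are normalized by the length of the sample.

```python
# Illustrative targets: single symbols plus a few combined strings.
TARGETS = ["{", "}", "(", ")", ";", ":", "->", "=>", "::"]

def char_distribution(code: str):
    """Return occurrences of each target per character of the sample."""
    length = len(code) or 1  # avoid division by zero on empty input
    return [code.count(t) / length for t in TARGETS]

sample = "int main() { return 0; }"
vector = char_distribution(sample)
```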

After the character distribution feature of any one of the code samples is extracted, a grammar structure analysis is performed, wherein the grammar structure analysis process may be, but is not limited to, as shown in the following step S03c.

S03c, carrying out grammar structure analysis processing on the arbitrary code sample to obtain grammar structure characteristics of the arbitrary code sample; in a specific application, different programming languages have different grammar rules, and thus, the programming languages can be distinguished by analyzing the grammar structure. Wherein the indentation, bracket matching, etc. in the arbitrary code sample can be analyzed to obtain the grammar structure feature.
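As a hedged illustration of step S03c, the sketch below derives three simple grammar structure features: average indentation of non-empty lines, maximum bracket nesting depth, and whether the brackets are balanced. The choice of these three numbers is an assumption of the example; the embodiment only names indentation and bracket matching as things that can be analyzed.

```python
def structure_features(code: str):
    lines = [ln for ln in code.splitlines() if ln.strip()]
    indents = [len(ln) - len(ln.lstrip(" \t")) for ln in lines]
    pairs = {")": "(", "]": "[", "}": "{"}
    stack, max_depth, balanced = [], 0, True
    for ch in code:
        if ch in "([{":
            stack.append(ch)
            max_depth = max(max_depth, len(stack))
        elif ch in pairs:
            if stack and stack[-1] == pairs[ch]:
                stack.pop()
            else:
                balanced = False  # closing bracket without a matching opener
    balanced = balanced and not stack  # leftover openers also mean imbalance
    avg_indent = sum(indents) / len(indents) if indents else 0.0
    return [avg_indent, float(max_depth), 1.0 if balanced else 0.0]

features = structure_features("def f(x):\n    return [x, (x + 1)]")
```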

Similarly, after the grammar structure features are obtained, an N-gram feature extraction process is further provided in order to enrich the features, as shown in step S03d below.

S03d, taking the characters which continuously appear for a plurality of times in any code sample as continuous characters, and counting the occurrence frequency of each continuous character to be taken as the N-gram characteristic of the any code sample; in particular implementations, N-gram features can capture local patterns in programming languages that help distinguish between different programming languages; thus, the frequency of the same character in any code sample appearing multiple times in succession is counted, and the frequency is used as the characteristic of the N-gram.
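Step S03d, as described here, counts characters that repeat consecutively; this differs from the conventional sliding-window N-gram, and the sketch below follows the description in this step. Ignoring whitespace runs and requiring at least two repetitions are assumptions of the example.

```python
from collections import Counter
from itertools import groupby

def run_features(code: str, min_run: int = 2):
    """Count runs in which the same non-space character repeats min_run+ times."""
    runs = Counter()
    for ch, group in groupby(code):
        n = len(list(group))
        if n >= min_run and not ch.isspace():
            runs[ch * n] += 1
    return runs

runs = run_features("a == b && c == d; x >>= 1")
```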

After the feature extraction processing in the steps S03a to S03d, the extracted features can be used to form feature vectors of any one of the code samples, as shown in the following step S03e.

S03e, utilizing the vocabulary features, character distribution features, grammar structure features and N-gram features of any code sample to form the feature vector of that code sample; in this embodiment, the foregoing features may be arranged as a row vector to obtain the feature vector; of course, they may also be arranged as a column vector, which is not particularly limited herein.
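Step S03e amounts to concatenating the four feature groups in a fixed order, as in this minimal sketch (the function name and sample values are hypothetical):

```python
def build_feature_vector(lexical, char_dist, structure, ngram):
    """Concatenate the four feature groups into one flat row vector."""
    groups = (lexical, char_dist, structure, ngram)
    return [float(v) for group in groups for v in group]

vector = build_feature_vector([0.2, 0.0], [0.1], [2.0, 1.0], [3])
```

The order and length of each group must be the same for every code sample so that a given position in the vector always carries the same feature.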

The feature vector corresponding to each code sample can be obtained through the steps S03a to S03e, and then training of the random forest classifier can be performed by using the feature vector, so that a code recognition model is obtained after training is finished.

S04, dividing the feature data set into a training set and a test set, taking each feature vector in the training set as input and the code type of the code sample corresponding to each feature vector in the training set as output, and training a random forest classifier to obtain an initial code recognition model after training is completed; in this embodiment, the random forest is an ensemble learning algorithm that makes predictions by constructing a plurality of decision trees and combining their outputs, and it offers high accuracy and robustness.

Meanwhile, in the embodiment, the random forest model can be trained by using the feature vector of the training set and the programming language label; alternatively, 80% of the data may be used as the training set and 20% of the data as the test set, thereby evaluating the performance of the model during the training process, avoiding model overfitting, wherein the model adjustment process is as shown in step S05 below.

S05, carrying out model test on the initial code recognition model by using the test set, and adjusting model parameters of the initial code recognition model in the test process so as to obtain the code recognition model after the model parameters are adjusted; in the implementation, the generalization capability of the model on unknown data can be known by calculating indexes such as the accuracy, recall rate and F1 score of the model on a test set, meanwhile, model parameters can be adjusted according to the indexes, and measures such as adding training data or improving a feature extraction method can be adopted to improve the performance of the model.
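The data split and the evaluation indexes mentioned in steps S04 and S05 can be sketched with the standard library alone, as below. The classifier itself is not shown; `y_pred` stands for the predictions of whatever model was trained, and the macro averaging over programming languages is an assumption of this example.

```python
import random

def train_test_split(samples, test_ratio=0.2, seed=0):
    """Shuffle the samples and hold out test_ratio of them as the test set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_ratio)
    return shuffled[:cut], shuffled[cut:]

def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 over all labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels) or 1
    return {
        "accuracy": sum(t == p for t, p in pairs) / (len(pairs) or 1),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

train, test = train_test_split(list(range(10)))
scores = macro_scores(["py", "py", "java", "java"], ["py", "java", "java", "java"])
```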

The training of the random forest classifier can be completed through the steps S01 to S05, yielding the code recognition model; in this embodiment, the trained model can be stored in a file for programming language recognition in practical applications; in Python, for example, the model can be saved to a file using the joblib or pickle library and then loaded wherever a programming language needs to be identified, thereby implementing identification of the code type.
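Saving and reloading with the pickle library can be sketched as follows. A plain dictionary stands in for the trained model object, and the file name is hypothetical; a real classifier object would be passed to the same two functions.

```python
import pickle
import tempfile
from pathlib import Path

def save_model(model, path):
    """Serialize the model object to a file."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

def load_model(path):
    """Load a model object previously written by save_model."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Placeholder for a trained model; the contents are illustrative only.
model = {"labels": ["python", "java"], "n_trees": 100}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "code_recognition_model.pkl"
    save_model(model, path)
    restored = load_model(path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.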

Furthermore, when each code text paragraph is identified, feature extraction is required to be performed first to obtain feature vectors corresponding to each code text paragraph, and then the feature vectors corresponding to each code text paragraph are input into a code identification model to obtain code types corresponding to each code text paragraph.

After the code types of the respective code text paragraphs are obtained, a programming language identification process of the code blocks may be performed, wherein the identification process is as shown in step S5 below.

S5, based on the code types corresponding to the code text paragraphs, performing programming language identification processing on the code text paragraphs in the pre-conversion document to obtain an html conversion document corresponding to the target word document after the programming language identification processing is completed; in this embodiment, the programming language identification process may be, but not limited to, highlighting, annotating the corresponding programming language name, and adjusting the format of the code text paragraph according to the code type; therefore, the identification of the code blocks in the html document and the identification of the programming language can be completed through the operation, so that the code blocks in the original document are reserved by the html conversion document.
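One possible form of the programming language identification in step S5 is to wrap each code text paragraph in `pre` and `code` tags and record the language in a class name. The `language-` class prefix is an assumption of this sketch (it is a convention used by common syntax highlighters), not something specified by the embodiment.

```python
from html import escape

def mark_code_paragraph(text: str, language: str) -> str:
    """Wrap a code paragraph and label it with its recognized language."""
    lang = escape(language, quote=True)
    return f'<pre><code class="language-{lang}">{escape(text)}</code></pre>'

marked = mark_code_paragraph("if a < b: pass", "python")
```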

Therefore, through the word document conversion method based on machine recognition described in detail in the steps S1 to S5, the invention can retain the text style of the original word document and can accurately recognize the code block text in the document and the programming language to which it belongs, so that a user does not need to repeatedly copy and paste, rewrite code or reset styles; various technical articles can thus be released quickly, and the method is suitable for large-scale application and popularization in the field of word document conversion.

As shown in fig. 2, a second aspect of the present embodiment provides a hardware device for implementing the word document conversion method based on machine identification in the first aspect of the present embodiment, including:

and the document format conversion unit is used for acquiring the target word document and converting the target word document into an html document.

And the style correction unit is used for performing style correction processing on the html document to obtain the html document with the corrected style.

And the style correction unit is also used for carrying out paragraph division processing on the character strings in the html document after style correction to obtain a pre-conversion document.

And the code identification unit is used for screening out code text paragraphs from all text paragraphs in the pre-converted document, inputting the code text paragraphs into a code identification model for code type identification processing, and obtaining code types corresponding to the code text paragraphs.

The working process, working details and technical effects of the device provided in this embodiment may refer to the first aspect of the embodiment, and are not described herein again.

As shown in fig. 3, a third aspect of the present embodiment provides another word document conversion device based on machine identification, taking the device as an electronic device as an example, including: the device comprises a memory, a processor and a transceiver which are connected in sequence in communication, wherein the memory is used for storing a computer program, the transceiver is used for receiving and transmitting messages, and the processor is used for reading the computer program and executing the word document conversion method based on machine identification according to the first aspect of the embodiment.

By way of specific example, the memory may include, but is not limited to, random access memory (RAM), read-only memory (ROM), flash memory, first-in-first-out memory (First Input First Output, FIFO) and/or first-in-last-out memory (First In Last Out, FILO), etc.; in particular, the processor may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array), and may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state and is also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state.

In some embodiments, the processor may be integrated with a GPU (Graphics Processing Unit) responsible for rendering and drawing the content to be displayed on the display screen; for example, the processor may be, but is not limited to, a microprocessor of the STM32F105 family, a reduced instruction set computer (RISC) microprocessor, a processor of X86 or other architecture, or a processor integrating an embedded neural-network processing unit (NPU); the transceiver may be, but is not limited to, a wireless fidelity (WiFi) wireless transceiver, a Bluetooth wireless transceiver, a General Packet Radio Service (GPRS) wireless transceiver, a ZigBee (a low-power local area network protocol based on the IEEE 802.15.4 standard) wireless transceiver, a 3G transceiver, a 4G transceiver, and/or a 5G transceiver, etc. In addition, the device may include, but is not limited to, a power module, a display screen, and other necessary components.

The working process, working details and technical effects of the electronic device provided in this embodiment may refer to the first aspect of the embodiment, and are not described herein again.

A fourth aspect of the present embodiment provides a storage medium storing instructions for the word document conversion method based on machine identification according to the first aspect of the present embodiment; that is, the storage medium stores instructions which, when executed on a computer, perform the word document conversion method based on machine identification according to the first aspect of the present embodiment.

The storage medium refers to a carrier for storing data, and may include, but is not limited to, a floppy disk, an optical disk, a hard disk, a flash Memory, a flash disk, and/or a Memory Stick (Memory Stick), where the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.

The working process, working details and technical effects of the storage medium provided in this embodiment may refer to the first aspect of the embodiment, and are not described herein again.

A fifth aspect of the present embodiment provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the machine-recognition-based word document conversion method of the first aspect of the embodiment, wherein the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus.

Finally, it should be noted that: the foregoing description is only of the preferred embodiments of the invention and is not intended to limit the scope of the invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A word document conversion method based on machine identification, comprising:

based on the code types corresponding to the code text paragraphs, performing programming language identification processing on the code text paragraphs in the pre-conversion document to obtain an html conversion document corresponding to the target word document after the programming language identification processing is completed;

Performing style correction processing on the html document to obtain a style corrected html document, including:

2. The method according to claim 1, wherein performing a label replacement process on each label in the pre-processed html document to obtain a label replacement document after the label replacement process, includes:

3. The method of claim 1, wherein performing a subordinate categorization process on each first specified tag in the tag replacement document to construct the first specified tags in the tag replacement document having the same subordinate relationship as an ordered list or a unordered list, and obtaining a tag subordinate categorization document after the subordinate categorization process, comprises:

4. The method of claim 1, wherein screening out code text paragraphs from each text paragraph in the pre-converted document comprises:

if yes, judging any text paragraph to be a code text paragraph.

5. The method according to claim 1, wherein the method further comprises:

6. The method of claim 5, wherein performing feature extraction on the preprocessed data set to obtain feature vectors corresponding to each code sample, comprises:

7. A word document conversion device based on machine identification, comprising:

The code identification unit is further used for carrying out programming language identification processing on the code text paragraphs in the pre-conversion document based on the code types corresponding to the code text paragraphs so as to obtain html conversion documents corresponding to the target word documents after the programming language identification processing is completed;

8. A word document conversion device based on machine identification, comprising: a memory, a processor and a transceiver which are communicatively connected in sequence, wherein the memory is used for storing a computer program, the transceiver is used for receiving and transmitting messages, and the processor is used for reading the computer program and executing the word document conversion method based on machine identification according to any one of claims 1-6.

9. A storage medium having instructions stored thereon which, when executed on a computer, perform the machine identification based word document conversion method of any one of claims 1 to 6.