CN117828007A

CN117828007A - Construction land acquisition and resettlement archive management method and system based on natural language processing

Info

Publication number: CN117828007A
Application number: CN202410007820.7A
Authority: CN
Inventors: 江进辉; 李�浩; 王鄂豫; 李鹏; 王玉着; 黄刘芳; 钟磊; 龙章; 胡涛; 谭哲武
Original assignee: Changjiang Institute of Survey Planning Design and Research Co Ltd
Current assignee: Changjiang Institute of Survey Planning Design and Research Co Ltd
Priority date: 2024-01-02
Filing date: 2024-01-02
Publication date: 2024-04-05

Abstract

The invention relates to the technical field of text file processing and provides a method and a system, based on natural language processing, for managing construction land acquisition and resettlement archives. The method can find and correct possible errors and optimize text expression, making archive information clearer and more accurate. Finally, according to the difference between the restated semantic vectors and the original semantic vectors, the scheme can propose a reasonable archive storage strategy, which greatly improves the efficiency and quality of archive management.

Description

Construction land acquisition and resettlement archive management method and system based on natural language processing

Technical Field

The invention relates to the technical field of text file processing, and in particular to a method and a system for managing construction land acquisition and resettlement archives based on natural language processing.

Background

Natural language processing (NLP) is an important branch of artificial intelligence and computer science that studies how to enable computers to understand and generate human language. In recent years, with the rapid development of technologies such as deep learning, natural language processing has achieved remarkable results in fields such as text mining, sentiment analysis, and machine translation.

Managing construction land acquisition and resettlement archives is an important task in public administration and involves processing and analyzing a large amount of text. However, conventional text processing depends on manual operation, which is inefficient and error-prone and cannot meet the requirements of large-scale archive management. In addition, archive content is complex and contains a great deal of unstructured data, so effectively extracting key information and formulating a reasonable archive storage strategy from it remains a major challenge for archive management.

Disclosure of Invention

To address the above problems, the invention provides a method and a system for managing construction land acquisition and resettlement archives based on natural language processing.

In a first aspect of the embodiments of the present invention, a method for managing construction land acquisition and resettlement archives based on natural language processing is provided. The method is applied to a natural language processing system and includes:

performing archive text semantic mining on an original construction land acquisition and resettlement archive to be archived, to obtain a corresponding archive text semantic vector set, wherein each archive text semantic vector represents the archive elements of one text unit in the original archive;

splitting the archive text semantic vector set into a first local text semantic vector group and a second local text semantic vector group;

according to a text semantic commonality index between the first local text semantic vector group and the second local text semantic vector group, restating the error part of the second local text semantic vector group relative to the archive text semantic vector set to obtain a first local text restatement semantic vector group corresponding to the first local text semantic vector group, and restating the error part of the first local text semantic vector group relative to the archive text semantic vector set to obtain a second local text restatement semantic vector group corresponding to the second local text semantic vector group;

obtaining a text restatement semantic vector set according to the first local text restatement semantic vector group and the second local text restatement semantic vector group;

and obtaining a corresponding archive storage suggestion strategy according to the difference between the text restatement semantic vector set and the archive text semantic vector set.

In some preferred embodiments, splitting the archive text semantic vector set into a first local text semantic vector group and a second local text semantic vector group comprises:

constructing a first text scanning frame and a second text scanning frame with the same window size, wherein the two members of the first text scanning frame and the second text scanning frame at any same distribution label are used for semantic extraction and non-semantic extraction respectively;

and processing the archive text semantic vector set of the original construction land acquisition and resettlement archive to be archived with the first text scanning frame and the second text scanning frame respectively, to obtain the first local text semantic vector group and the second local text semantic vector group.

In some preferred embodiments, the text semantic commonality index between the first local text semantic vector group and the second local text semantic vector group includes the relative distribution relationship between the text units represented by the archive text semantic vectors in the first local text semantic vector group and the text units represented by the archive text semantic vectors in the second local text semantic vector group;

the step of restating, according to the text semantic commonality index between the first local text semantic vector group and the second local text semantic vector group, the error part of the second local text semantic vector group relative to the archive text semantic vector set to obtain the first local text restatement semantic vector group corresponding to the first local text semantic vector group, and restating the error part of the first local text semantic vector group relative to the archive text semantic vector set to obtain the second local text restatement semantic vector group corresponding to the second local text semantic vector group, includes:

adjusting the error part of the second local text semantic vector group relative to the archive text semantic vector set to a first semantic extraction knowledge feature;

restating the first semantic extraction knowledge feature according to the second local text semantic vector group and the relative distribution relationship between the text units represented by the archive text semantic vectors in the second local text semantic vector group and the text units represented by the first semantic extraction knowledge feature, to obtain the first local text restatement semantic vector group corresponding to the first local text semantic vector group;

adjusting the error part of the first local text semantic vector group relative to the archive text semantic vector set to a second semantic extraction knowledge feature;

restating the second semantic extraction knowledge feature according to the first local text semantic vector group and the relative distribution relationship between the text units represented by the archive text semantic vectors in the first local text semantic vector group and the text units represented by the second semantic extraction knowledge feature, to obtain the second local text restatement semantic vector group corresponding to the second local text semantic vector group; wherein the relative distribution relationship between the text units represented by the archive text semantic vectors in the second local text semantic vector group and the text units represented by the first semantic extraction knowledge feature, and the relative distribution relationship between the text units represented by the archive text semantic vectors in the first local text semantic vector group and the text units represented by the second semantic extraction knowledge feature, are both determined according to the relative distribution relationship between the text units represented by the archive text semantic vectors in the first local text semantic vector group and the text units represented by the archive text semantic vectors in the second local text semantic vector group.

In some preferred embodiments, the method further comprises:

obtaining the first semantic extraction knowledge feature according to the region positioning feature and the linkage semantic feature of the error part of the second local text semantic vector group relative to the archive text semantic vector set;

and obtaining the second semantic extraction knowledge feature according to the region positioning feature and the linkage semantic feature of the error part of the first local text semantic vector group relative to the archive text semantic vector set.

In some preferred embodiments, obtaining the corresponding archive storage suggestion strategy according to the difference between the text restatement semantic vector set and the archive text semantic vector set includes:

for each text unit in the original construction land acquisition and resettlement archive to be archived, performing the following processing:

obtaining a structured storage viewpoint variable corresponding to the text unit based on the difference between the text restatement semantic vector corresponding to the text unit in the text restatement semantic vector set and the archive text semantic vector corresponding to the text unit in the archive text semantic vector set;

and obtaining the archive storage suggestion strategy corresponding to the original archive to be archived according to the structured storage viewpoint variables corresponding to the text units in that archive.

In some preferred embodiments, obtaining the archive storage suggestion strategy corresponding to the original construction land acquisition and resettlement archive to be archived according to the structured storage viewpoint variables corresponding to the text units in that archive includes:

determining each text unit whose structured storage viewpoint variable is greater than a first threshold to be a structured text unit of the original archive to be archived;

and if the number of structured text units contained in the original archive to be archived is greater than a second threshold, determining the archive storage suggestion strategy of the original archive to be a structured storage strategy.
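The two-threshold rule above can be sketched as follows. This is only an illustrative sketch: the function name `suggest_storage_strategy`, the string label `"structured"`, and the choice to return `None` when the rule does not fire are assumptions, since the text specifies only the case in which the structured storage strategy is selected.

```python
from typing import Optional


def suggest_storage_strategy(
    viewpoint_vars: list[float],
    first_threshold: float,
    second_threshold: int,
) -> Optional[str]:
    """Apply the two-threshold rule to per-text-unit viewpoint variables.

    A text unit counts as "structured" when its structured storage viewpoint
    variable exceeds first_threshold. If the number of such units exceeds
    second_threshold, the structured storage strategy is suggested.
    The source does not say what happens otherwise, so None is returned
    (an assumption made for this sketch).
    """
    structured_units = [i for i, v in enumerate(viewpoint_vars) if v > first_threshold]
    if len(structured_units) > second_threshold:
        return "structured"
    return None


# Hypothetical values: three text units, two of which exceed the first threshold.
strategy = suggest_storage_strategy([0.9, 0.8, 0.1], first_threshold=0.5, second_threshold=1)
```

Both comparisons are strict ("greater than"), matching the wording of the embodiment.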

In some preferred embodiments, the method is performed by a natural language processing algorithm that is debugged as follows:

performing multiple rounds of cyclic debugging on the algorithm to be debugged based on a preset training sample set of construction land acquisition and resettlement archives, wherein each round performs the following processing:

performing archive text semantic mining on an extracted archive training sample to obtain a corresponding archive text semantic vector training sample set, wherein each archive text semantic vector training sample represents the archive element training sample of one text unit training sample in the archive training sample;

splitting the archive text semantic vector training sample set into a first local text semantic vector group training sample and a second local text semantic vector group training sample;

according to the involvement coefficient between the first local text semantic vector group training sample and the second local text semantic vector group training sample, restating the error part of the first local text semantic vector group training sample relative to the archive text semantic vector training sample set, and restating the error part of the second local text semantic vector group training sample relative to the archive text semantic vector training sample set, to obtain a corresponding text restatement semantic vector training sample set;

obtaining a corresponding algorithm training cost index according to the difference between the text restatement semantic vector training sample set and the archive text semantic vector training sample set, and optimizing the algorithm parameters of the algorithm to be debugged according to the algorithm training cost index.

In some preferred embodiments, obtaining the corresponding algorithm training cost index according to the difference between the text restatement semantic vector training sample set and the archive text semantic vector training sample set includes:

for each text unit training sample in the extracted construction land acquisition and resettlement archive training sample, performing the following processing:

obtaining a text unit loss corresponding to the text unit training sample based on the difference between the text restatement semantic vector training sample corresponding to the text unit training sample in the text restatement semantic vector training sample set and the archive text semantic vector training sample corresponding to the text unit training sample in the archive text semantic vector training sample set;

and obtaining the corresponding algorithm training cost index from the text unit losses corresponding to the text unit training samples in the extracted archive training sample.
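As a rough illustration of the cost computation described above, the sketch below measures each text unit loss as a mean squared difference between the restated vector and the original vector, and averages the unit losses into the training cost index. Both choices (squared error and averaging) and the function names are assumptions; the text only states that the loss is based on the difference and that the index is obtained from the unit losses.

```python
def text_unit_loss(restated: list[float], original: list[float]) -> float:
    """Mean squared difference between one restated vector and its original (assumed loss)."""
    if len(restated) != len(original) or not original:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((r - o) ** 2 for r, o in zip(restated, original)) / len(original)


def training_cost_index(
    restated_set: list[list[float]],
    original_set: list[list[float]],
) -> float:
    """Aggregate per-unit losses into one cost value (averaging is an assumption)."""
    if len(restated_set) != len(original_set) or not original_set:
        raise ValueError("sets must be non-empty and of equal size")
    losses = [text_unit_loss(r, o) for r, o in zip(restated_set, original_set)]
    return sum(losses) / len(losses)


# Hypothetical two-unit training sample: the first unit is restated exactly,
# the second is off by 1.0 in every component.
cost = training_cost_index([[1.0, 2.0], [0.0, 0.0]], [[1.0, 2.0], [1.0, 1.0]])
```

In a real training loop this value would drive the parameter update of the algorithm being debugged; the optimizer itself is outside the scope of this sketch.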

In a second aspect of the embodiments of the present invention, a natural language processing system is provided, including: a processor, a memory, and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to call the computer program in the memory to execute the above method for managing construction land acquisition and resettlement archives based on natural language processing.

In a third aspect of the embodiments of the present invention, a computer-readable storage medium is provided, on which a program is stored; when executed by a processor, the program implements the above method for managing construction land acquisition and resettlement archives based on natural language processing.

According to the method and system for managing construction land acquisition and resettlement archives based on natural language processing provided by the embodiments of the invention, the original archive to be archived is mined and analyzed in depth by natural language processing and machine learning methods, key information in the archive is effectively extracted, and that information is converted into computable semantic vectors. Further, the method can find and correct possible errors and optimize text expression, making archive information clearer and more accurate. Finally, according to the difference between the restated semantic vectors and the original semantic vectors, the scheme can propose a reasonable archive storage strategy, which greatly improves the efficiency and quality of archive management. In addition, the method is suitable not only for construction land acquisition and resettlement archives but can also be applied to document management in other fields, and therefore has broad application prospects and practical value.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a method for managing construction land acquisition and resettlement archives based on natural language processing according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a product module of a natural language processing system according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

In order to better understand the above technical solutions, the following detailed description of the technical solutions of the present invention is made by using the accompanying drawings and specific embodiments, and it should be understood that the specific features of the embodiments and the embodiments of the present invention are detailed descriptions of the technical solutions of the present invention, and not limiting the technical solutions of the present invention, and the technical features of the embodiments and the embodiments of the present invention may be combined with each other without conflict.

Referring to Fig. 1, a flowchart of a method for managing construction land acquisition and resettlement archives based on natural language processing according to an embodiment of the present invention is shown. The method is applied to a natural language processing system and includes steps S110 to S150.

S110, performing archive text semantic mining on the original construction land acquisition and resettlement archive to be archived to obtain a corresponding archive text semantic vector set, wherein each archive text semantic vector represents the archive elements of one text unit in the original archive.

The original construction land acquisition and resettlement archive to be archived refers to files that have not yet been processed or managed and need to be archived and stored, for example, resettlement compensation agreements, compensation funds calculation tables, funds approval forms, and funds transfer letters for a particular construction project.

Archive text semantic mining is the process of analyzing the words, phrases, and sentences in a text to understand their meaning and context. For example, by analyzing the "compensation funds calculation table", the system can identify key information that the document may involve, such as the compensation amount, the compensation recipient, and the compensation standard.

Archive text semantic vector set: in NLP, text is typically converted into a numeric form (or vector) so that a computer can process. Each text unit (e.g., word, phrase, sentence) is converted into a vector, and the set of vectors forms the archival text semantic vector set.

The text unit may be a word, a phrase, or a sentence. In the above example, "compensation", "funds", "calculation", "table" in the "compensation funds calculation table" can be regarded as text units.

Archive elements are the key information in an archive, typically including, but not limited to, the archive's name, author, date, location, and subject. For example, in a "resettlement compensation agreement" document, possible archive elements include the date the agreement was signed, the construction project involved, and the number and identities of the resettled people.
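To make the idea of a "semantic vector set" in S110 concrete, the sketch below maps each text unit to a fixed-length vector. The hashing of character bigrams is only a hypothetical stand-in for whatever semantic encoder the system actually uses, which the text does not specify; the names `embed_unit` and `mine_archive` and the dimension of 8 are likewise assumptions.

```python
import hashlib
import math


def embed_unit(unit: str, dim: int = 8) -> list[float]:
    """Toy encoder (assumption): hash character bigrams into a dim-sized, L2-normalised vector."""
    vec = [0.0] * dim
    padded = f" {unit} "  # padding guarantees at least one bigram
    for i in range(len(padded) - 1):
        bigram = padded[i:i + 2]
        digest = hashlib.sha256(bigram.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def mine_archive(units: list[str], dim: int = 8) -> list[list[float]]:
    """Build the archive text semantic vector set: one vector per text unit."""
    return [embed_unit(u, dim) for u in units]


# Text units taken from the "compensation funds calculation table" example.
units = ["compensation", "funds", "calculation", "table"]
vector_set = mine_archive(units)
```

A learned sentence or word encoder would replace `embed_unit` in practice; the point here is only the shape of the output, one vector per text unit.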

S120, splitting the archive text semantic vector set into a first local text semantic vector group and a second local text semantic vector group.

This step can be understood as breaking the semantic representation of the entire archive text into two smaller parts.

The first local text semantic vector group and the second local text semantic vector group are each a collection of text semantic vectors containing the semantic representation of a certain portion or type of information in the original archive. The split may be based on various factors, such as different parts of the archive (e.g., title, body) or different types of information (e.g., date, place).

Take a specific land acquisition and resettlement archive as an example. Assume the archive contains the following information: "In July 2021, a land acquisition activity was carried out in XX city, affecting 1000 households, and the affected residents were properly resettled."

After semantic mining of the archive, a semantic vector set containing several items of information may be obtained, such as "time: July 2021", "location: XX city", "event: land acquisition activity", "number affected: 1000 households", and "handling: properly resettled".

This semantic vector set can then be split into two local semantic vector groups. For example, the first local text semantic vector group contains "time: July 2021" and "location: XX city", while the second local text semantic vector group contains "event: land acquisition activity", "number affected: 1000 households", and "handling: properly resettled". This split helps in understanding and managing the information in the archive and facilitates subsequent steps such as restatement and error correction.
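The split in this example can be sketched as a partition of labelled vectors by field name. The field names, the placeholder vector values, and the helper `split_vector_set` are hypothetical and serve only to illustrate the grouping; the numbers are not outputs of any real encoder.

```python
def split_vector_set(
    vector_set: dict[str, list[float]],
    first_keys: set[str],
) -> tuple[dict[str, list[float]], dict[str, list[float]]]:
    """Partition a labelled vector set into two local groups by field name."""
    first = {k: v for k, v in vector_set.items() if k in first_keys}
    second = {k: v for k, v in vector_set.items() if k not in first_keys}
    return first, second


# Placeholder vectors (assumed values) for the five fields of the example.
archive_vectors = {
    "time": [0.1, 0.9],
    "location": [0.2, 0.8],
    "event": [0.7, 0.3],
    "number_affected": [0.6, 0.4],
    "handling": [0.5, 0.5],
}

first_group, second_group = split_vector_set(archive_vectors, {"time", "location"})
```

Every vector lands in exactly one group, so the two groups together still cover the whole archive text semantic vector set.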

S130, according to the text semantic commonality index between the first local text semantic vector group and the second local text semantic vector group, restating the error part of the second local text semantic vector group relative to the archive text semantic vector set to obtain a first local text restatement semantic vector group corresponding to the first local text semantic vector group, and restating the error part of the first local text semantic vector group relative to the archive text semantic vector set to obtain a second local text restatement semantic vector group corresponding to the second local text semantic vector group.

The text semantic commonality index is a measure of the semantic similarity between two groups of text. Specifically, if two groups of text are very similar in topic, style, viewpoint, and so on, their semantic commonality index is high.

The error part refers to a portion of a semantic vector group that is misinterpreted or misclassified. For example, when processing the file "compensation funds calculation table", if the system wrongly classifies it as a "funds approval form", that classification is an error part.

Restatement means re-expressing the original text while keeping its meaning unchanged. For example, a restatement of "resettlement compensation agreement" may be "compensation and resettlement agreement".

The first local text restatement semantic vector group and the second local text restatement semantic vector group are obtained as follows. In S120, the archive text semantic vector set is split into two local text semantic vector groups. In S130, the error parts of the two groups are restated according to the semantic commonality index to obtain new restatement semantic vector groups. For example, if the "compensation funds calculation table" in the first local text semantic vector group was wrongly classified as a "funds approval form", it is corrected to the right category by the restatement and a corresponding restatement semantic vector is generated. Through these steps, the error parts of the semantic vector groups are corrected, which improves the accuracy and efficiency of archive management.

S140, obtaining a text restatement semantic vector set according to the first local text restatement semantic vector group and the second local text restatement semantic vector group.

The text restatement semantic vector set is the semantic representation obtained after restating the original archive information. This process mainly revises and optimizes erroneous, unclear, or ambiguous parts that may exist in the original archive, to improve its readability and accuracy.

For example, in the first local text semantic vector group, "time: July 2021" is restated as "date: July 2021". Meanwhile, in the second local text semantic vector group, "event: land acquisition activity", "number affected: 1000 households", and "handling: properly resettled" are restated as "action: land acquisition work was carried out", "impact: 1000 households involved", and "solution: appropriate resettlement was carried out".

In this case, combining the two restatement semantic vector groups forms the text restatement semantic vector set. This set includes all the restated information: "date: July 2021", "location: XX city", "action: land acquisition work was carried out", "impact: 1000 households involved", and "solution: appropriate resettlement was carried out". In this way, problems that may exist in the original archive can be corrected, and the information in the archive can be better understood and represented.
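The combination step in S140 can be sketched as a merge of two restated groups back into one ordered set. Keying each vector by the position of its text unit in the original archive, and rejecting overlapping positions, are assumptions made for this sketch; the text states only that the two groups are combined.

```python
def merge_restated(
    first: dict[int, list[float]],
    second: dict[int, list[float]],
) -> list[list[float]]:
    """Merge two restated groups, keyed by text-unit position, into one ordered set."""
    overlap = first.keys() & second.keys()
    if overlap:
        raise ValueError(f"positions present in both groups: {sorted(overlap)}")
    merged = {**first, **second}
    return [merged[i] for i in sorted(merged)]


# Hypothetical restated vectors: positions 0 and 1 from the first group,
# positions 2 to 4 from the second group.
restated_first = {0: [0.1, 0.9], 1: [0.2, 0.8]}
restated_second = {2: [0.7, 0.3], 3: [0.6, 0.4], 4: [0.5, 0.5]}
restated_set = merge_restated(restated_first, restated_second)
```

Keeping the original positions makes the merged set directly comparable, unit by unit, with the archive text semantic vector set in the next step.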

S150, obtaining a corresponding archive storage suggestion strategy according to the difference between the text restatement semantic vector set and the archive text semantic vector set.

The archive storage suggestion strategy is a strategy for storing and managing archives more efficiently; it generally takes into account aspects such as classification, indexing, searching, and confidentiality. Because the strategy is derived from the difference between the text restatement semantic vector set and the original archive text semantic vector set, it reflects the corrections and improvements the system made while processing the archive.

For example, suppose that in S130 the file "compensation funds calculation table" is found to have been wrongly classified as a "funds approval form". After the error is restated and corrected, a new restatement semantic vector is obtained. Then, in S150, the system compares this restatement semantic vector with the original semantic vector. If the difference is large, the system may suggest modifying how archives are classified, for example by using finer classification criteria or more advanced NLP techniques to increase classification accuracy. Likewise, if the system finds that certain archive elements (e.g., dates, places) are often misinterpreted, it may suggest improving how those elements are represented, for example by using a more explicit date format. In this way, the archive storage suggestion strategy helps to optimize and manage archives and improves work efficiency.
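One way to quantify "the difference between the restatement semantic vector and the original semantic vector" is a per-text-unit cosine distance, sketched below. Cosine distance is an assumed choice, not something the text prescribes, and the function names are illustrative.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def unit_differences(
    original: list[list[float]],
    restated: list[list[float]],
) -> list[float]:
    """Difference between each original vector and its restated counterpart."""
    if len(original) != len(restated):
        raise ValueError("both sets must contain one vector per text unit")
    return [cosine_distance(o, r) for o, r in zip(original, restated)]


# Hypothetical two-unit archive: the first unit is unchanged by restatement,
# the second is restated to an orthogonal vector.
diffs = unit_differences([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Units with a large difference are the ones the restatement changed most, which is the signal the strategy step then acts on.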

Therefore, the technical scheme mines and analyzes in depth the original construction land acquisition and resettlement archive to be archived by means of natural language processing and machine learning, effectively extracts key information in the archive, and converts that information into computable semantic vectors. Further, the method can find and correct possible errors and optimize text expression, making archive information clearer and more accurate. Finally, according to the difference between the restated semantic vectors and the original semantic vectors, the scheme can propose a reasonable archive storage strategy, which greatly improves the efficiency and quality of archive management. In addition, the method is suitable not only for construction land acquisition and resettlement archives but can also be applied to document management in other fields, and therefore has broad application prospects and practical value.

In some embodiments, splitting the archive text semantic vector set into a first local text semantic vector group and a second local text semantic vector group includes: constructing a first text scanning frame and a second text scanning frame with the same window size, wherein the two members of the first text scanning frame and the second text scanning frame at any same distribution label are used for semantic extraction and non-semantic extraction respectively; and processing the archive text semantic vector set of the original construction land acquisition and resettlement archive to be archived with the first text scanning frame and the second text scanning frame respectively, to obtain the first local text semantic vector group and the second local text semantic vector group.

In this approach, a first text scanning frame and a second text scanning frame are used to process the archive text semantic vector set of the original construction land acquisition and resettlement archive. The two text scanning frames are tools for extracting text information; they have the same window size but different functions: the first is used mainly for semantic extraction, while the second is used for non-semantic extraction.

For example, if the original archive includes a report on land acquisition, it may contain the following: "In July 2021, XX city carried out land acquisition in the YY area for an infrastructure project, affecting 1000 households; all affected residents were properly resettled and compensated."

In this example, the first text scanning frame may extract the following semantic information: "time: July 2021", "action: land acquisition", "location: YY area, XX city", "impact: 1000 households", and "solution: properly resettled and compensated".

Meanwhile, the second text scanning frame extracts from a non-semantic point of view, for example identifying format errors, grammar errors, or other unstructured information in the file.

After processing by the two text scanning frames, the first local text semantic vector group and the second local text semantic vector group are obtained. The two vector groups represent the semantic and non-semantic information in the archive respectively, providing more comprehensive and accurate data support for subsequent archive management.

In this way, the original archive can be better understood and analyzed, possible problems are identified, and the storage and management of the archive are optimized, thereby improving the efficiency and quality of archive work.
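One possible reading of "two scanning frames with the same window size whose members at the same label play opposite roles" is a pair of complementary boolean masks slid over the vector set. The sketch below follows that reading; the alternating pattern, the tiling of the window over the sequence, and the function names are all assumptions rather than details given in the text.

```python
def build_scan_frames(window_size: int) -> tuple[list[bool], list[bool]]:
    """Build two equal-size frames whose members at each index are opposite."""
    first = [i % 2 == 0 for i in range(window_size)]  # alternating pattern (assumption)
    second = [not flag for flag in first]
    return first, second


def apply_frame(
    vector_set: list[list[float]],
    frame: list[bool],
) -> dict[int, list[float]]:
    """Tile the frame over the vector set and keep the vectors it selects, keyed by position."""
    window = len(frame)
    return {i: v for i, v in enumerate(vector_set) if frame[i % window]}


# Hypothetical five-unit archive with one-dimensional placeholder vectors.
vectors = [[0.0], [1.0], [2.0], [3.0], [4.0]]
frame_a, frame_b = build_scan_frames(4)
first_group = apply_frame(vectors, frame_a)
second_group = apply_frame(vectors, frame_b)
```

Because the two frames are complementary, each vector falls into exactly one group, and each group is missing precisely the positions the other group holds.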

In some embodiments, the text semantic commonality index between the first local text semantic vector group and the second local text semantic vector group includes the relative distribution relationship between the text units represented by the archive text semantic vectors in the first local text semantic vector group and the text units represented by the archive text semantic vectors in the second local text semantic vector group. The step of restating, according to the text semantic commonality index between the first local text semantic vector group and the second local text semantic vector group, the error part of the second local text semantic vector group relative to the archive text semantic vector set to obtain the first local text restatement semantic vector group corresponding to the first local text semantic vector group, and restating the error part of the first local text semantic vector group relative to the archive text semantic vector set to obtain the second local text restatement semantic vector group corresponding to the second local text semantic vector group, includes: adjusting the error part of the second local text semantic vector group relative to the archive text semantic vector set to a first semantic extraction knowledge feature; restating the first semantic extraction knowledge feature according to the second local text semantic vector group and the relative distribution relationship between the text units represented by the archive text semantic vectors in the second local text semantic vector group and the text units represented by the first semantic extraction knowledge feature, to obtain the first local text restatement semantic vector group corresponding to the first local text semantic vector group; adjusting the error part of the first local text semantic vector group relative to the archive text semantic vector set to a second semantic extraction knowledge feature; and restating the second semantic extraction knowledge feature according to the first local text semantic vector group and the relative distribution relationship between the text units represented by the archive text semantic vectors in the first local text semantic vector group and the text units represented by the second semantic extraction knowledge feature, to obtain the second local text restatement semantic vector group corresponding to the second local text semantic vector group. The relative distribution relationship between the text units represented by the archive text semantic vectors in the second local text semantic vector group and the text units represented by the first semantic extraction knowledge feature, and the relative distribution relationship between the text units represented by the archive text semantic vectors in the first local text semantic vector group and the text units represented by the second semantic extraction knowledge feature, are both determined according to the relative distribution relationship between the text units represented by the archive text semantic vectors in the first local text semantic vector group and the text units represented by the archive text semantic vectors in the second local text semantic vector group.

In this scheme, the relative distribution relationship between the first local text semantic vector group and the second local text semantic vector group, i.e., the text semantic commonality index, is considered first. Such a relationship may reflect the similarity and variability of the two sets of semantic vectors as a whole.

For example, suppose an archive contains descriptions of "land acquisition activities", such as "Action: land acquisition work was carried out", "Impact: 1000 residents involved", and "Solution: properly resettled by the CC organization". This information is converted into the first local text semantic vector group. The archive also contains information about "compensation measures", such as "Compensation standard: 100,000 yuan per household" and "Implementation: handled by the local CC organization". This information is converted into the second local text semantic vector group.

Next, possible errors in the two semantic vector groups are located and adjusted to obtain more accurate knowledge features. For example, if "Compensation standard: 100,000 yuan per household" is misclassified as "Total compensation: 100,000 yuan", it must be adjusted back to the correct "Compensation standard: 100,000 yuan per household" according to its relative distribution relationship with the other semantic vectors.

In this way, more accurate and representative semantic vector groups are obtained, namely the first local text repeated semantic vector group and the second local text repeated semantic vector group. The repeated semantic vectors not only reflect the information in the archive better, but also provide more accurate guidance for the subsequent storage of the archive.
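The role of the relative distribution relationship can be illustrated with a minimal sketch. It assumes, purely for illustration, that each text unit has an integer position in the archive, that the relative distribution relationship is the matrix of position offsets between the two groups, and that an error part is rebuilt as a distance-weighted average of the other group's vectors. The function names `relative_offsets` and `repeat_error_unit`, the exponential weighting, and the toy data are hypothetical and are not specified in the source.

```python
import math


def relative_offsets(positions_a, positions_b):
    """Offset matrix: entry [i][j] is the position of unit j in group B
    minus the position of unit i in group A (assumed definition)."""
    return [[pb - pa for pb in positions_b] for pa in positions_a]


def repeat_error_unit(error_position, other_positions, other_vectors):
    """Rebuild the vector of a flagged unit from the other group.

    Units of the other group that lie closer to the flagged unit receive
    a larger weight (exp(-distance)); this weighting is an assumption.
    """
    weights = [math.exp(-abs(p - error_position)) for p in other_positions]
    total = sum(weights)
    dim = len(other_vectors[0])
    return [
        sum(w * vec[k] for w, vec in zip(weights, other_vectors)) / total
        for k in range(dim)
    ]


# Hypothetical toy data: two units per group, two-dimensional vectors.
first_positions = [0, 1]
second_positions = [2, 3]
second_vectors = [[1.0, 0.0], [0.0, 1.0]]

offsets = relative_offsets(first_positions, second_positions)
rebuilt = repeat_error_unit(1, second_positions, second_vectors)
```

In this toy setting the rebuilt vector leans toward the nearer unit of the second group, which is the intuition behind using relative position when correcting an error part.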

Summarizing, the beneficial effects of this approach are mainly manifested in two aspects: firstly, key information in the archive can be effectively mined, and errors which may exist are corrected; secondly, by analyzing and comparing the relative distribution relation among different semantic vector groups, the scheme can provide a finer and more accurate archive management strategy, thereby greatly improving the quality and efficiency of archive processing.

In some aspects, the method further comprises: obtaining the first semantic extraction knowledge feature according to the regional positioning feature and the linkage semantic feature of the second local text semantic vector group relative to the error part of the archive text semantic vector set; and obtaining the second semantic extraction knowledge feature according to the region positioning feature of the first local text semantic vector group relative to the error part of the archive text semantic vector set and the linkage semantic feature.

In this approach, information in the archive is further extracted and understood by using region-locating features and linked semantic features. Specifically, according to the regional positioning features and linkage semantic features of the second local text semantic vector group relative to the error part of the archive text semantic vector set, the first semantic extraction knowledge features are obtained. Similarly, the second semantic extraction knowledge feature can also be obtained according to the region positioning feature and the linkage semantic feature of the first local text semantic vector group relative to the error part of the archive text semantic vector set.

For example, in the land acquisition example above, suppose an error is found: a date is written as "July 2021" but should actually be "July 2022". This error part is marked by the second local text semantic vector group. The location of the error part (i.e., the region positioning feature) and its context information (i.e., the linkage semantic feature) are then input into the model, which yields the first semantic extraction knowledge feature. This feature helps to better understand the semantic content and context of the error part.

Likewise, if an error is found in the first local text semantic vector group, for example "affects 1000 residents" where the correct text is "affects 2000 residents", the second semantic extraction knowledge feature can be obtained by the same method.

By the design, errors in the archive can be found and corrected, semantic content and background information of the error part can be better understood, and accordingly the original archive can be accurately understood and analyzed. In addition, in this way, knowledge of semantic extraction can also be accumulated and updated, thereby constantly optimizing and improving archive management work.
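One way to picture how a region positioning feature and a linkage semantic feature could be combined is sketched below. The source does not define either feature numerically, so the sketch assumes that the region positioning feature is the normalized start, end and width of the flagged span, that the linkage semantic feature is the mean of the neighboring context vectors, and that the knowledge feature is their concatenation. All function names and the `window` parameter are hypothetical.

```python
def region_positioning_feature(start, end, total_units):
    """Normalized start, end and width of the flagged span [start, end)."""
    return [start / total_units, end / total_units, (end - start) / total_units]


def linkage_semantic_feature(vectors, start, end, window=1):
    """Mean of the context vectors directly before and after the span."""
    lo = max(0, start - window)
    hi = min(len(vectors), end + window)
    context = vectors[lo:start] + vectors[end:hi]
    dim = len(vectors[0])
    if not context:
        # No neighbors available: fall back to a zero vector.
        return [0.0] * dim
    return [sum(v[k] for v in context) / len(context) for k in range(dim)]


def semantic_extraction_knowledge_feature(vectors, start, end, window=1):
    """Concatenate the positional and contextual features (assumed design)."""
    region = region_positioning_feature(start, end, len(vectors))
    linkage = linkage_semantic_feature(vectors, start, end, window)
    return region + linkage


# Hypothetical toy archive: unit 1 is the flagged error part.
archive_vectors = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [4.0, 4.0]]
feature = semantic_extraction_knowledge_feature(archive_vectors, start=1, end=2)
```

The flagged vector itself is deliberately excluded from the context, so the resulting feature describes where the error sits and what surrounds it rather than the erroneous content.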

In some aspects, obtaining a corresponding archive storage suggestion policy according to the difference between the text replication semantic vector set and the archive text semantic vector set includes: for each text unit in the original construction land acquisition and resettlement archive to be archived, obtaining a structured storage viewpoint variable corresponding to the text unit based on the difference between the text replication semantic vector corresponding to that text unit in the text replication semantic vector set and the archive text semantic vector corresponding to that text unit in the archive text semantic vector set; and obtaining the archive storage suggestion policy corresponding to the original construction land acquisition and resettlement archive to be archived according to the structured storage viewpoint variables corresponding to the text units in that archive.

In this scheme, the difference between the text replication semantic vector set and the archive text semantic vector set is used to obtain the corresponding archive storage suggestion policy. Specifically, for each text unit in the original construction land acquisition and resettlement archive to be archived, a structured storage viewpoint variable is obtained based on the difference between its text replication semantic vector in the text replication semantic vector set and its archive text semantic vector in the archive text semantic vector set.

For example, assume that the original construction land acquisition and resettlement archive to be archived contains the text unit "Compensation standard: 100,000 yuan per household". If there is a large difference between the repeated semantic vector of this text unit in the text replication semantic vector set and its archive text semantic vector in the archive text semantic vector set, the storage manner or location of this text unit may need to be reconsidered. The difference can be expressed as a structured storage viewpoint variable, such as {"text": "Compensation standard: 100,000 yuan per household", "difference": 0.8}.

Then, the archive storage suggestion policy corresponding to the original construction land acquisition and resettlement archive to be archived is obtained according to the structured storage viewpoint variables of all text units. For example, if the average "difference" of the structured storage viewpoint variables over all text units exceeds a certain threshold, it may be advisable to reorganize the storage structure of the archive to ease future retrieval and use.

In this way, by comparing the text replication semantic vectors with the archive text semantic vectors, possible problems in archive storage can be discovered and corrected. The structured storage viewpoint variable is a quantitative measure that helps to understand and handle the role of each text unit in archive storage. The archive storage suggestion policy can then give specific optimization suggestions according to actual conditions, improving the efficiency and quality of archive management.
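The per-unit comparison described above can be sketched as follows. The source does not say how the "difference" is computed, so cosine distance is used here as an assumption; the function names, the threshold value, and the toy vectors are likewise hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def viewpoint_variables(texts, replication_vectors, archive_vectors):
    """One structured storage viewpoint variable per text unit."""
    return [
        {"text": text, "difference": cosine_distance(rep, arc)}
        for text, rep, arc in zip(texts, replication_vectors, archive_vectors)
    ]


def should_reorganize(variables, threshold):
    """Suggest reorganizing storage if the mean difference exceeds the threshold."""
    mean_difference = sum(v["difference"] for v in variables) / len(variables)
    return mean_difference > threshold


# Hypothetical toy data: the first unit is unchanged, the second differs strongly.
texts = ["Relocation date", "Compensation standard"]
replication = [[1.0, 0.0], [1.0, 0.0]]
archive = [[1.0, 0.0], [0.0, 1.0]]

variables = viewpoint_variables(texts, replication, archive)
reorganize = should_reorganize(variables, threshold=0.4)
```

Any other distance between the two vectors could be substituted for `cosine_distance` without changing the rest of the flow.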

In some aspects, obtaining the archive storage suggestion policy corresponding to the original construction land acquisition and resettlement archive to be archived according to the structured storage viewpoint variables corresponding to the text units in that archive includes: determining each text unit whose structured storage viewpoint variable is larger than a first threshold to be a structured text unit in the original construction land acquisition and resettlement archive to be archived; and, if the number of structured text units contained in that archive is determined to be larger than a second threshold, determining the archive storage suggestion policy of that archive to be a structured storage policy.

In this scheme, the structured storage viewpoint variables are used to decide how to store the original construction land acquisition and resettlement archive to be archived. Specifically, if the structured storage viewpoint variable of a text unit is greater than a first threshold, the text unit is determined to be a structured text unit. Then, if the number of structured text units exceeds a second threshold, the storage suggestion policy for the archive is determined to be a structured storage policy.

For example, assume that an original construction land acquisition and resettlement archive containing 500 text units is being processed. After the structured storage viewpoint variables of the text units are analyzed, 300 text units are found to have a structured storage viewpoint variable greater than the set first threshold, so these 300 text units are determined to be structured text units. Since the number of structured text units (300) exceeds the set second threshold (e.g., 200), the storage suggestion policy for the archive is determined to be the structured storage policy.
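The two-threshold decision in this example can be written as a short sketch. The source only names the structured storage policy, so the fallback label "unstructured" is a hypothetical placeholder, as are the function name and the concrete viewpoint values.

```python
def choose_storage_policy(viewpoint_values, first_threshold, second_threshold):
    """Return (policy, number of structured text units).

    A text unit counts as structured when its viewpoint variable exceeds
    first_threshold; the archive gets the structured storage policy when
    the count of such units exceeds second_threshold.
    """
    structured_count = sum(1 for value in viewpoint_values if value > first_threshold)
    if structured_count > second_threshold:
        policy = "structured"
    else:
        # Hypothetical fallback: the source does not name the alternative policy.
        policy = "unstructured"
    return policy, structured_count


# Mirrors the example: 500 text units, 300 of them above the first threshold.
values = [0.9] * 300 + [0.1] * 200
policy, count = choose_storage_policy(values, first_threshold=0.5, second_threshold=200)
```

Both comparisons are strict ("larger than"), matching the wording of the scheme.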

In this way, files can be effectively classified and stored. By using structured storage policies, information in the archive can be more easily retrieved, queried, and analyzed, thereby greatly improving the efficiency of processing the archive. Furthermore, this approach may also help to better manage and maintain the archive, as its optimal storage may be determined based on the extent of structuring of the archive.

In some aspects, the method is performed by a natural language processing algorithm, whose debugging process is as follows: based on a preset training sample set of construction land acquisition and resettlement archives, the algorithm to be debugged is debugged cyclically a plurality of times, and the following processing is implemented in each debugging cycle:
performing archive text semantic mining on the extracted construction land acquisition and resettlement archive training sample to obtain a corresponding archive text semantic vector training sample set, wherein each archive text semantic vector training sample characterizes the archive element training sample possessed by one text unit training sample in the construction land acquisition and resettlement archive training sample;
disassembling the archive text semantic vector training sample set into a first local text semantic vector group training sample and a second local text semantic vector group training sample;
according to the involvement coefficients between the first local text semantic vector group training sample and the second local text semantic vector group training sample, repeating the error part of the first local text semantic vector group training sample relative to the archive text semantic vector training sample set, and repeating the error part of the second local text semantic vector group training sample relative to the archive text semantic vector training sample set, to obtain a corresponding text replication semantic vector training sample set;
obtaining a corresponding algorithm training cost index according to the difference between the text replication semantic vector training sample set and the archive text semantic vector training sample set, and optimizing the algorithm parameters of the algorithm to be debugged according to the algorithm training cost index.

The scheme is implemented by a natural language processing algorithm, and the algorithm is debugged on a preset training sample set of construction land acquisition and resettlement archives.

For example, assume that there is a preset training sample set of construction land acquisition and resettlement archives, which includes a plurality of text unit training samples, such as "Compensation standard: 100,000 yuan per household" and "Relocation date: July 15, 2021". In one debugging cycle, archive text semantic mining is first performed on the training samples to obtain a corresponding archive text semantic vector training sample set. Each archive text semantic vector training sample characterizes the archive element training sample possessed by one text unit training sample.

Next, the set of archival text semantic vector training samples is disassembled into a first local text semantic vector set training sample and a second local text semantic vector set training sample. This may involve some specific segmentation strategy, e.g. partitioning according to text length or importance etc.

Then, according to the involvement coefficients between the first local text semantic vector group training sample and the second local text semantic vector group training sample, the error parts of the two training samples are repeated to obtain a corresponding text replication semantic vector training sample set.

Finally, a corresponding algorithm training cost index is obtained according to the difference between the text replication semantic vector training sample set and the archive text semantic vector training sample set. The cost index helps to assess how well the current natural language processing algorithm performs on construction land acquisition and resettlement archives, and the algorithm parameters are optimized accordingly.
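The debugging cycle described above (split, repeat the error part, measure the cost, update the parameters) can be sketched under strong simplifying assumptions. Here the "algorithm to be debugged" is a small linear map, the split is a simple first half and second half of the vector list, the error part is one chosen unit of the first group, and the cost is a mean squared error minimized by plain gradient descent. None of these choices comes from the source, and all names are hypothetical.

```python
def split_groups(vectors):
    """Split one sample's vector set into two local groups (first/second half)."""
    mid = len(vectors) // 2
    return vectors[:mid], vectors[mid:]


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def replicate(weights, context):
    """Linear replication model: predicted[i] = sum_j weights[i][j] * context[j]."""
    return [sum(w * c for w, c in zip(row, context)) for row in weights]


def debug_cycle(weights, sample, error_index, learning_rate=0.1):
    """Run one debugging cycle in place and return the cost before the update."""
    first, second = split_groups(sample)
    target = first[error_index]      # archive vector of the assumed error part
    context = mean_vector(second)    # information taken from the other group
    predicted = replicate(weights, context)
    residual = [p - t for p, t in zip(predicted, target)]
    cost = sum(r * r for r in residual) / len(residual)
    # Gradient of the mean squared error with respect to weights[i][j].
    for i, r in enumerate(residual):
        for j, c in enumerate(context):
            weights[i][j] -= learning_rate * 2.0 * r * c / len(residual)
    return cost


# Hypothetical toy sample with four text units and two-dimensional vectors.
sample = [[1.0, 2.0], [0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
weights = [[0.0, 0.0], [0.0, 0.0]]
costs = [debug_cycle(weights, sample, error_index=0) for _ in range(50)]
```

In this toy setting the recorded cost shrinks from one cycle to the next, which is the behavior the cyclic debugging is meant to produce.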

Therefore, through a natural language processing algorithm, important information in the construction land acquisition and resettlement archive can be effectively extracted, analyzed and understood, which improves archive management efficiency and quality. The debugging process allows the natural language processing algorithm to be continually optimized to better suit specific task requirements. By comparing the text replication semantic vectors with the archive text semantic vectors, possible problems of the algorithm can be found and corrected, improving its accuracy and reliability.

In some aspects, obtaining the corresponding algorithm training cost index according to the difference between the text replication semantic vector training sample set and the archive text semantic vector training sample set includes: for each text unit training sample in the extracted construction land acquisition and resettlement archive training sample, obtaining a text unit loss corresponding to the text unit training sample based on the difference between the text replication semantic vector training sample corresponding to that text unit training sample in the text replication semantic vector training sample set and the archive text semantic vector training sample corresponding to that text unit training sample in the archive text semantic vector training sample set; and performing training loss determination on the text unit losses corresponding to the text unit training samples in the extracted construction land acquisition and resettlement archive training sample, to obtain the corresponding algorithm training cost index.

In this scheme, the difference between the text replication semantic vector training sample set and the archive text semantic vector training sample set is used to obtain the algorithm training cost index. Specifically, for each text unit training sample in the extracted construction land acquisition and resettlement archive training sample, a text unit loss is obtained based on the difference between its text replication semantic vector training sample in the text replication semantic vector training sample set and its archive text semantic vector training sample in the archive text semantic vector training sample set.

For example, assume that there is a text unit training sample "Compensation standard: 100,000 yuan per household". If there is a large difference between the repeated semantic vector of this training sample in the text replication semantic vector training sample set and its archive text semantic vector in the archive text semantic vector training sample set, the training sample contributes a large loss during training. This loss can be expressed as a value, such as 0.8.

Then, training loss determination is carried out on the text unit losses of all text unit training samples, that is, the losses of all text units are summed or averaged to obtain the corresponding algorithm training cost index. For example, if the average of the text unit losses over all text unit training samples is 0.6, the algorithm training cost index may be set to 0.6.
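The aggregation step can be sketched as follows. The source states only that text unit losses are summed or averaged; the use of a mean squared difference as the per-unit loss, the function names, and the `reduction` parameter are assumptions made for illustration.

```python
def text_unit_loss(replication_vector, archive_vector):
    """Mean squared difference between the two vectors of one text unit."""
    squared = [(r - a) ** 2 for r, a in zip(replication_vector, archive_vector)]
    return sum(squared) / len(squared)


def training_cost_index(unit_losses, reduction="mean"):
    """Aggregate per-unit losses by summing or averaging them."""
    total = sum(unit_losses)
    if reduction == "sum":
        return total
    if reduction == "mean":
        return total / len(unit_losses)
    raise ValueError("reduction must be 'sum' or 'mean'")


# Hypothetical per-unit losses for two text unit training samples.
unit_losses = [0.8, 0.4]
cost_index = training_cost_index(unit_losses)  # averaged, close to 0.6
```

Averaging keeps the cost index comparable across archives with different numbers of text units, whereas summing makes longer archives weigh more.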

By means of the design, the difficulty and cost of algorithm training can be effectively measured by comparing the difference between text replication semantic vectors and archive text semantic vectors. Text unit loss provides a quantified tool to help better understand and address problems that may occur with individual text units during training. The algorithm training cost index can reflect the overall difficulty of algorithm training, so that the optimization of the training strategy is facilitated, and the training efficiency and effect are improved.

With reference to Fig. 2, an embodiment of the invention further provides a natural language processing system 100, which includes a processor 111, and a memory 112 and a bus 113 connected to the processor 111. The processor 111 and the memory 112 communicate with each other via the bus 113. The processor 111 is configured to call the program instructions in the memory 112 to execute the above construction land acquisition and resettlement archive management method based on natural language processing.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or cloud server that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or cloud server. Without further limitation, an element defined by the statement "comprising one … …" does not exclude that an additional identical element is present in a process, method, article, or cloud server comprising the element.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The foregoing is merely exemplary of the present invention and is not intended to limit the present invention. Various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are to be included in the scope of the claims of the present invention.

Claims

1. A method for managing a construction land acquisition and resettlement archive based on natural language processing, applied to a natural language processing system, the method comprising:

2. The method of claim 1, wherein the disassembling the set of archival text semantic vectors into a first set of local text semantic vectors and a second set of local text semantic vectors comprises:

3. The method of claim 2, wherein the text semantic commonality index between the first local text semantic vector group and the second local text semantic vector group comprises: a relative distribution relationship between the text units represented by the archive text semantic vectors in the first local text semantic vector group and the text units represented by the archive text semantic vectors in the second local text semantic vector group;

4. A method as claimed in claim 3, wherein the method further comprises:

5. A method according to any one of claims 1-4, wherein said obtaining a corresponding archival storage suggestion policy based on a distinction between said text replication semantic vector set and said archival text semantic vector set comprises:

6. The method of claim 5, wherein obtaining the archive storage suggestion policy corresponding to the original construction land acquisition and resettlement archive to be archived according to the structured storage viewpoint variables corresponding to the text units in that archive comprises:

7. The method of claim 1, wherein the method is performed by a natural language processing algorithm, and wherein the debugging process of the natural language processing algorithm is as follows:

8. The method of claim 7, wherein obtaining the corresponding algorithm training cost index according to the difference between the text replication semantic vector training sample set and the archive text semantic vector training sample set comprises:

9. A natural language processing system, comprising a processor, and a memory and a bus coupled to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to invoke the computer program in the memory to perform the construction land acquisition and resettlement archive management method based on natural language processing according to any one of claims 1-8.

10. A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the construction land acquisition and resettlement archive management method based on natural language processing according to any one of claims 1-8.