CN116821940B - Intelligent training assessment data acquisition method - Google Patents

Intelligent training assessment data acquisition method

Info

Publication number
CN116821940B
Authority
CN
China
Prior art keywords
data
piece
checked
standard
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311061095.3A
Other languages
Chinese (zh)
Other versions
CN116821940A (en)
Inventor
赵中元
刘鸿志
刘晓辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Aston Engineering Technology Transfer Co ltd
Original Assignee
Qingdao Aston Engineering Technology Transfer Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Aston Engineering Technology Transfer Co ltd
Priority to CN202311061095.3A
Publication of CN116821940A
Application granted
Publication of CN116821940B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Educational Technology (AREA)
  • Computer Hardware Design (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Educational Administration (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data processing, and in particular to an intelligent training assessment data acquisition method comprising: acquiring the input data of training assessment personnel and the hash value of their registration information; obtaining a second vector to be checked for each piece of segmented data of the input data under each segmentation standard; clustering the second vectors to be checked of all segmented data under the same segmentation standard to obtain a plurality of clusters under that segmentation standard; from these, obtaining the data to be changed in each piece of segmented data; obtaining a first difference degree for each piece of segmented data according to how changing that data affects it; obtaining an optimal segmentation standard from the first difference degrees; verifying the hash values of the segmented data under the optimal segmentation standard against the registration information hash value of the training assessment personnel; and acquiring the training assessment data according to the verification result. The invention reduces the possibility of hash collision and achieves accurate verification.

Description

Intelligent training assessment data acquisition method
Technical Field
The invention relates to the technical field of data processing, and in particular to an intelligent training assessment data acquisition method.
Background
In training assessment, fairness and impartiality are essential. To protect personal privacy and data security, the personal information of the personnel involved in training assessment must be encrypted intelligently when it is acquired. During acquisition and entry, however, improper operation can corrupt the data and so degrade the accuracy of subsequent processing. The entered data therefore needs to be verified during acquisition, both to confirm the identity of the training assessment personnel and to prevent corrupted data from causing calculation errors in later processing.
In the conventional verification method, a hash value of the input data and a hash value of the data stored on the server are obtained, and verification is performed by comparing the two. However, because the length of a hash value is fixed, hash collisions can occur, that is, different input data can produce the same hash value. Since the set of possible inputs is unbounded while the set of output hash values is finite, hash collisions cannot be ruled out; and, as the length of the input data grows, the probability of a collision increases greatly, so that a hash value that should uniquely identify its input loses that property.
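As a minimal sketch of the conventional comparison described above, the following Python fragment hashes the entered text and compares it with a stored hash. The function names, the sample strings, and the use of MD5 (chosen only because the embodiment later names that algorithm) are assumptions for illustration, not part of the patent text.

```python
import hashlib
import hmac


def md5_hex(text: str) -> str:
    """Return the MD5 digest of a UTF-8 string as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def conventional_verify(entered: str, stored_hash: str) -> bool:
    """Compare the hash of the entered data with the hash kept on the server."""
    # compare_digest performs the comparison in constant time.
    return hmac.compare_digest(md5_hex(entered), stored_hash)


# Hypothetical registration record; only its hash would be stored.
stored = md5_hex("Zhang San|secret-password")
print(conventional_verify("Zhang San|secret-password", stored))
print(conventional_verify("Zhang San|secret-passw0rd", stored))
```

The digest always has the same length whatever the input length, which is why distinct inputs can in principle share a digest.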
Disclosure of Invention
To solve the above problems, the invention provides an intelligent training assessment data acquisition method.
The intelligent training assessment data acquisition method of the invention adopts the following technical scheme:
An embodiment of the invention provides an intelligent training assessment data acquisition method comprising the following steps:
acquiring the input data of the training assessment personnel, and acquiring the registration information hash value of the training assessment personnel stored in a server;
setting different segmentation standards, and segmenting the input data with each segmentation standard to obtain a plurality of pieces of segmented data under each segmentation standard; performing word segmentation on each piece of segmented data under each segmentation standard to obtain a first vector to be checked for each word; obtaining a second vector to be checked for each piece of segmented data from the first vectors to be checked of the words in that piece of segmented data;
clustering the second vectors to be checked of all segmented data under the same segmentation standard to obtain a plurality of clusters under that segmentation standard; obtaining the data to be changed in each piece of segmented data under each segmentation standard according to all clusters under that segmentation standard and the distribution of each piece of segmented data;
changing the data to be changed in each piece of segmented data under each segmentation standard, and obtaining a first difference degree for each piece of segmented data under each segmentation standard according to the change result and all clusters under that segmentation standard; obtaining an optimal segmentation standard according to the first difference degrees of all segmented data under each segmentation standard;
verifying the input data of the training assessment personnel according to the hash values of the second vectors to be checked of the segmented data under the optimal segmentation standard and the registration information hash value of the training assessment personnel; and processing the input data and acquiring the training assessment data according to the verification result.
Preferably, obtaining the first vector to be checked of each word comprises the following specific steps:
converting each word contained in each piece of segmented data under each segmentation standard into a vector through a word vector model, and taking that vector as the first vector to be checked of the word.
Preferably, obtaining the second vector to be checked of each piece of segmented data from the first vectors to be checked of the words in that piece of segmented data comprises the following specific steps:
splicing all the first vectors to be checked of each piece of segmented data under the same segmentation standard together in the order of the corresponding words to obtain the second vector to be checked.
Preferably, obtaining the data to be changed in each piece of segmented data under each segmentation standard according to all clusters under that segmentation standard and the distribution of each piece of segmented data comprises the following specific steps:
obtaining the informativity of each word in each piece of segmented data under each segmentation standard according to all clusters under that segmentation standard and the distribution of the word within the piece of segmented data; and taking each word whose informativity is greater than or equal to a preset informativity threshold as data to be changed in the segmented data to which it belongs.
Preferably, obtaining the informativity of each word in each piece of segmented data under each segmentation standard according to all clusters under that segmentation standard and the distribution of the word within the piece of segmented data comprises the following specific steps:
wherein $I_{i,j,k}$ denotes the informativity of the $k$-th word of the $j$-th piece of segmented data under the $i$-th segmentation standard; $n_{i,j,k}$ denotes the number of occurrences of that word in the $j$-th piece of segmented data; $N_{i,j}$ denotes the number of words contained in the $j$-th piece of segmented data under the $i$-th segmentation standard; $A_{i,j}$ denotes the second vector to be checked corresponding to the $j$-th piece of segmented data under the $i$-th segmentation standard; $B_{i,j,m}$ denotes the $m$-th vector in the cluster to which $A_{i,j}$ belongs; $A'_{i,j,k}$ denotes the vector obtained from $A_{i,j}$ by removing the first vector to be checked corresponding to the $k$-th word; $\cos(\cdot,\cdot)$ denotes the cosine similarity function; $M_{i,j}$ denotes the number of second vectors to be checked in the cluster to which $A_{i,j}$ belongs; and $|\cdot|$ is the absolute value sign.
Preferably, changing the data to be changed in each piece of segmented data under each segmentation standard comprises the following specific steps:
obtaining the hash value of the second vector to be checked of each piece of segmented data under each segmentation standard;
obtaining all data to be changed contained in the current segmented data; if the current segmented data contains no data to be changed, or only one, recording the first difference degree of the current segmented data as 0;
if the current segmented data contains two or more pieces of data to be changed, permuting all the data to be changed to obtain a plurality of permutation sequences, and obtaining the changed segmented data corresponding to each permutation sequence; obtaining the second vector to be checked of the changed segmented data corresponding to each permutation sequence of the current segmented data, and obtaining the hash value of that second vector to be checked;
obtaining the first difference degree of the current segmented data according to the second vectors to be checked of the changed segmented data corresponding to each permutation sequence of the current segmented data, each second vector to be checked in the cluster to which the second vector to be checked of the current segmented data belongs, and the hash values of all these second vectors to be checked.
Preferably, permuting all the data to be changed to obtain a plurality of permutation sequences and obtaining the changed segmented data corresponding to each permutation sequence comprises the following specific steps:
permuting all the data to be changed to obtain a plurality of permutations; removing from them every permutation whose order is the same as the order of the data to be changed in the current segmented data, and taking each remaining permutation as a permutation sequence; and, for each permutation sequence, replacing the pieces of data to be changed in the current segmented data, in order, with the items of that permutation sequence to obtain the corresponding changed segmented data.
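The permutation step above can be sketched in Python as follows. This is an illustration under stated assumptions: the segment is represented as a list of words, the data to be changed is given by hypothetical index positions, and permutations that repeat an earlier one (possible when two words are identical) are skipped, which the text does not specify.

```python
from itertools import permutations


def changed_segments(words, change_positions):
    """Return one changed word list per permutation sequence.

    words            -- the words of the current segmented data, in order
    change_positions -- indices of the words selected as data to be changed
    """
    original = tuple(words[p] for p in change_positions)
    seen = set()
    results = []
    for perm in permutations(original):
        # Drop the ordering that equals the current one (and, as an
        # assumption, any ordering already produced).
        if perm == original or perm in seen:
            continue
        seen.add(perm)
        changed = list(words)
        for pos, word in zip(change_positions, perm):
            changed[pos] = word
        results.append(changed)
    return results


# Hypothetical segment with three words marked as data to be changed.
segment_words = ["Li", "1990", "Qingdao", "x"]
variants = changed_segments(segment_words, [0, 1, 2])
print(len(variants))
print(variants[0])
```

With fewer than two positions the function returns an empty list, matching the case in which the first difference degree is recorded as 0.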
Preferably, obtaining the first difference degree of the current segmented data according to the second vectors to be checked of the changed segmented data corresponding to each permutation sequence of the current segmented data, each second vector to be checked in the cluster to which the second vector to be checked of the current segmented data belongs, and the hash values of all these second vectors to be checked comprises the following specific steps:
wherein $D$ is the first difference degree of the current segmented data; $V'_{p}$ is the second vector to be checked of the changed segmented data corresponding to the $p$-th permutation sequence of the current segmented data; $C_{q}$ is the $q$-th second vector to be checked in the cluster to which the second vector to be checked of the current segmented data belongs; $Q$ is the number of second vectors to be checked contained in that cluster; $P$ is the number of permutation sequences of the current segmented data; $h'_{p,q}$ is the Hamming distance between the hash value of $V'_{p}$ and the hash value of $C_{q}$; $h_{q}$ is the Hamming distance between the hash value of the second vector to be checked of the current segmented data and the hash value of $C_{q}$; $L$ is the length of the hash value; and $\cos(\cdot,\cdot)$ denotes the cosine similarity function.
Preferably, obtaining the optimal segmentation standard according to the first difference degrees of all segmented data under each segmentation standard comprises the following specific steps:
taking the sum of the first difference degrees of all segmented data under each segmentation standard as the second difference degree of that segmentation standard; and taking the segmentation standard with the greatest second difference degree as the optimal segmentation standard.
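The selection rule above amounts to a sum followed by an arg-max. A minimal Python sketch, in which the function name and the sample first difference degrees are made up for illustration:

```python
def optimal_standard(first_differences):
    """Pick the segmentation standard with the greatest second difference degree.

    first_differences maps each segmentation standard to the list of first
    difference degrees of its pieces of segmented data.
    """
    second = {std: sum(values) for std, values in first_differences.items()}
    best = max(second, key=second.get)
    return best, second


# Hypothetical first difference degrees for three segmentation standards.
sample = {4: [0.0, 0.5, 0.25], 6: [1.0, 0.5], 8: [0.25]}
best, second = optimal_standard(sample)
print(best, second)
```

If two standards tie, `max` keeps the first one encountered; the text does not say how ties should be broken.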
Preferably, processing the input data and acquiring the training assessment data according to the verification result comprises the following specific steps:
intelligently encrypting the personal sensitive data and passwords in the input data of the training assessment personnel who pass verification, and collecting the training assessment information of those personnel and presenting it to the examination personnel; and prompting the training assessment personnel who do not pass verification to re-enter their data for verification, and collecting the training assessment data that did not pass verification.
The technical scheme of the invention has the following beneficial effects. The invention applies adaptive segmentation to the acquired input data of the training assessment personnel and to their registration information, reducing the possibility of hash collision as far as possible so that verification is accurate. The segmented data are clustered and the corresponding hash values are mapped to the hash space to obtain their data distribution; the data with strong information characterization capability in each piece of segmented data is selected as the data to be changed, and that data is changed in each piece of segmented data under each segmentation standard. This avoids changing highly repetitive data with no information characterization capability, which would introduce errors into the measured influence on the unique characteristic of the hash value. The possibility of hash collision in the conventional verification process is thereby reduced, the unique identification effect of the hash value is preserved, accurate verification is achieved, and a foundation is laid for the subsequent intelligent acquisition and processing of training assessment data.
Drawings
To illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of steps of an intelligent training assessment data acquisition method of the present invention.
Detailed Description
To further describe the technical means adopted by the invention to achieve its intended purpose, and their effects, the specific implementation, structure, features, and effects of the intelligent training assessment data acquisition method according to the invention are described in detail below with reference to the accompanying drawings and the preferred embodiments. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
A specific scheme of the intelligent training assessment data acquisition method provided by the invention is described below with reference to the accompanying drawings.
Referring to FIG. 1, a flow chart of the steps of an intelligent training assessment data acquisition method according to an embodiment of the invention is shown. The method comprises the following steps:
S001, acquiring the input data of the training assessment personnel, and acquiring the registration information hash value of the training assessment personnel.
During training assessment, the input data of the training assessment personnel must be acquired so that the personal sensitive data it contains can be encrypted, thereby ensuring that the training assessment is fair and impartial. However, improper operation during entry may corrupt the data, so the entered data needs to be verified, and verification requires comparing it with the registration information of the training assessment personnel. To protect the personal privacy and data security of the training assessment personnel, sensitive information in their registration information, such as personal basic information and the password set at registration, is stored on the server in the form of hash values. In the embodiment of the invention, the hash value of this sensitive information is referred to as the registration information hash value for short.
In the embodiment of the invention, the acquired input data of the training assessment personnel includes sensitive information such as their personal basic information and passwords. The input data of the training assessment personnel corresponds to their registration information. The input data is in text form and contains Chinese characters, English characters, and digits.
The registration information hash value of the training assessment personnel stored in the server is acquired.
Thus, the input data of the training assessment personnel and their registration information hash value are acquired.
S002, segmenting the input data with different segmentation standards, obtaining the second vectors to be checked of the input data under each segmentation standard, and clustering them to obtain the clustering result of the second vectors to be checked under each segmentation standard and the hash values corresponding to each cluster.
It should be noted that, because hash values are irreversible, comparing the input data of the training assessment personnel with their registration information requires obtaining the hash value of the input data and comparing hash values; this is how the training assessment personnel are verified. Because hash collisions are possible, and become more likely as the data grows longer, a hash value that should uniquely identify its input can lose that property. The embodiment of the invention therefore applies adaptive segmentation to the input data to reduce the possibility of hash collision as far as possible and so verify the training assessment personnel accurately.
It should be further noted that different parts of the input data are related to one another. For example, in an identity card number, the first 6 digits are the address code, the 7th to 14th digits are the date code, the 15th to 17th digits are the sequence code, and the 18th digit is the check code. Different parts of the data also affect the hash value differently. The embodiment of the invention therefore segments the input data with different segmentation standards. To measure how good each segmentation standard is, all segmented data under that standard are clustered, the unique characteristic of the hash value is evaluated through the influence of the distribution of the segmented data on the clustering, and the optimal segmentation standard is then obtained. To analyze the influence of different segmentation standards on the unique characteristic of the hash value, the embodiment of the invention clusters the segmented data and maps each cluster to the hash space to obtain the corresponding data distribution. First, the input data needs to be segmented and each piece of segmented data converted into vectors.
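The identity card example above can be made concrete with a short Python sketch. The function name and the sample number are invented for illustration; the field boundaries are the ones stated in the text, and the check code is not validated.

```python
def split_id_number(id_number: str) -> dict:
    """Split an 18-character identity card number into its four fields."""
    if len(id_number) != 18:
        raise ValueError("expected an 18-character identity card number")
    return {
        "address_code": id_number[0:6],     # positions 1 to 6
        "date_code": id_number[6:14],       # positions 7 to 14
        "sequence_code": id_number[14:17],  # positions 15 to 17
        "check_code": id_number[17],        # position 18
    }


# Made-up number used only to show the field boundaries.
fields = split_id_number("110101199003071234")
print(fields)
```

A fixed segment length that does not line up with these boundaries splits a field across two segments, which is one reason different segmentation standards can behave differently.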
In the embodiment of the invention, a segmentation range and a segmentation step are preset. The values used in the embodiment are given by way of example and not limitation; in other embodiments, the practitioner may set the segmentation range and the segmentation step according to the particular implementation. Starting from the first value in the segmentation range, one value is sampled every segmentation step, and each sampled value is taken as a segmentation standard.
The input data is segmented with each segmentation standard to obtain a plurality of pieces of segmented data under each segmentation standard, where the length of each piece of segmented data under a segmentation standard equals the value of that segmentation standard.
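The sampling and segmentation just described can be sketched in Python. Everything numeric here is hypothetical: the range bounds, the step, and the sample string are placeholders rather than the embodiment's values, the upper bound is assumed to be inclusive, and the final segment is assumed to be allowed to be shorter when the text length is not a multiple of the standard.

```python
def segmentation_standards(lower: int, upper: int, step: int) -> list[int]:
    """Sample one value every `step`, starting from the first value of the range."""
    return list(range(lower, upper + 1, step))


def segment(text: str, length: int) -> list[str]:
    """Cut text into consecutive pieces of the given length."""
    return [text[i:i + length] for i in range(0, len(text), length)]


def segment_all(text: str, standards: list[int]) -> dict[int, list[str]]:
    """Segment the same input once per segmentation standard."""
    return {s: segment(text, s) for s in standards}


standards = segmentation_standards(2, 6, 2)      # hypothetical range and step
segments = segment_all("abcdefghij", standards)  # hypothetical input data
print(standards)
print(segments)
```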
Any one segmentation standard is taken as the target segmentation standard. Each piece of segmented data under the target segmentation standard is split into words with the Jieba word segmentation tool, and each word contained in each piece of segmented data is converted into vector form through a word vector model, with every vector in the word vector model set to the same dimension. The word vector model and the vector dimension are not specifically limited: the embodiment of the invention uses a doc2vec model with the dimension set to 10, and in other embodiments the practitioner may set the word vector model and the vector dimension according to the actual implementation.
At this point, each piece of segmented data under the target segmentation standard has been converted into a number of vectors.
The vector of each word of each piece of segmented data under the target segmentation standard is recorded as a first vector to be checked. All the first vectors to be checked of each piece of segmented data under the target segmentation standard are spliced together in the order of the corresponding words to obtain one vector, which is recorded as the second vector to be checked. Single-Pass text clustering is performed on the second vectors to be checked of all segmented data under the target segmentation standard to obtain a plurality of clusters under the target segmentation standard. Single-Pass text clustering is a well-known technique and is not described in detail in the embodiment of the invention. Single-Pass text clustering is used here only as an example and is not a specific limitation; in other embodiments, the practitioner may choose another clustering algorithm to cluster the second vectors to be checked of all segmented data under the target segmentation standard according to the actual implementation.
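The splicing and clustering steps can be illustrated with a small, dependency-free Python sketch. It is a simplified single-pass variant written under assumptions the text does not state: each new vector is compared with the first member of every existing cluster, the similarity threshold of 0.8 is arbitrary, and vectors of different lengths are zero-padded before the cosine similarity is computed. The toy 2-dimensional word vectors stand in for the output of a real word vector model.

```python
import math


def splice(word_vectors):
    """Concatenate first vectors, in word order, into one second vector."""
    return [x for vec in word_vectors for x in vec]


def cosine(u, v):
    """Cosine similarity; the shorter vector is zero-padded (an assumption)."""
    n = max(len(u), len(v))
    u = list(u) + [0.0] * (n - len(u))
    v = list(v) + [0.0] * (n - len(v))
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass(vectors, threshold=0.8):
    """Assign each vector to the most similar cluster, or open a new one."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c, members in enumerate(clusters):
            sim = cosine(vec, vectors[members[0]])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([idx])
        else:
            clusters[best].append(idx)
    return clusters


second_vectors = [
    splice([[1.0, 0.0], [0.0, 1.0]]),
    splice([[1.0, 0.0], [0.0, 0.9]]),
    splice([[0.0, 1.0], [1.0, 0.0]]),
]
clusters = single_pass(second_vectors)
print(clusters)
```

Because the pass visits each vector once, the result depends on input order, which is a known property of single-pass clustering.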
The hash value of each second vector to be checked in the same cluster under the target segmentation standard is obtained with the MD5 hash algorithm, thereby mapping each second vector to be checked in that cluster to the hash space. The hash values of the second vectors to be checked in the same cluster form a set, which is recorded as a hash value cluster.
Thus, all clusters under the target segmentation standard and the hash value cluster corresponding to each cluster are obtained.
Similarly, all clusters under each segmentation standard and the hash value cluster corresponding to each cluster are obtained.
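Mapping clusters to hash value clusters can be sketched with Python's standard `hashlib`. How a vector is serialized before hashing is not stated in the text; packing the components as little-endian 64-bit floats is an assumption made here, as are the function names. The bitwise Hamming distance helper is included because the disclosure compares hash values by Hamming distance, and treating that distance as bit-level is likewise an assumption.

```python
import hashlib
import struct


def vector_md5(vec) -> str:
    """MD5 of a vector, serialized as little-endian doubles (an assumption)."""
    payload = struct.pack(f"<{len(vec)}d", *vec)
    return hashlib.md5(payload).hexdigest()


def hash_value_clusters(vectors, clusters):
    """Build one set of hash values per cluster of vector indices."""
    return [{vector_md5(vectors[i]) for i in members} for members in clusters]


def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical second vectors and a clustering of their indices.
vectors = [[1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.9], [0.0, 1.0, 1.0, 0.0]]
hash_clusters = hash_value_clusters(vectors, [[0, 1], [2]])
print(hash_clusters)
```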
S003, obtaining the data to be changed in each piece of segmented data under each segmentation standard according to the clusters under each segmentation standard.
It should be noted that, under each segmentation standard, every cluster corresponds to one hash value cluster. When the segmented data under a segmentation standard changes, the clusters change, and the corresponding hash value clusters change with them; the degree to which the hash value clusters change characterizes the influence of the segmentation standard on the unique characteristic of the hash value. The embodiment of the invention therefore changes each piece of segmented data under each segmentation standard and measures the influence of the segmentation standard on the unique characteristic of the hash value by the change in the hash value clusters caused by the change in the segmented data, so as to select the optimal segmentation standard.
It should be further noted that, when the segmented data under each segmentation standard is changed, the data with strong information characterization capability in each piece of segmented data should be selected for changing. This avoids changing highly repetitive data with no information characterization capability, which would introduce errors into the measured influence on the unique characteristic of the hash value. In the embodiment of the invention, the unique characteristic is expressed as the non-repeatability of the hash value. To find the data with strong information characterization capability in the segmented data, the informativity of each word in the segmented data is obtained according to the degree to which that word characterizes the personal information of the training assessment personnel, and the informativity is used to reflect the information characterization capability of the data.
In the embodiment of the invention, the informativity of each word in each piece of segmented data under each segmentation standard is obtained according to all clusters under that segmentation standard and the distribution of each piece of segmented data in the data space:
wherein $I_{i,j,k}$ denotes the informativity of the $k$-th word of the $j$-th piece of segmented data under the $i$-th segmentation standard; $n_{i,j,k}$ denotes the number of occurrences of that word in the $j$-th piece of segmented data; $N_{i,j}$ denotes the number of words contained in the $j$-th piece of segmented data under the $i$-th segmentation standard; $A_{i,j}$ denotes the second vector to be checked corresponding to the $j$-th piece of segmented data under the $i$-th segmentation standard; $B_{i,j,m}$ denotes the $m$-th vector in the cluster to which $A_{i,j}$ belongs; $A'_{i,j,k}$ denotes the vector obtained from $A_{i,j}$ by removing the first vector to be checked corresponding to the $k$-th word; $\cos(\cdot,\cdot)$ denotes the cosine similarity function; $M_{i,j}$ denotes the number of second vectors to be checked in the cluster to which $A_{i,j}$ belongs; and $|\cdot|$ is the absolute value sign.
It should be noted that, whenUnder the segment standardThe first segment of the segment dataThe individual word is at the firstThe more the number of times the segmented data appear, the greater the repeatability thereof, the smaller the capability of characterizing information, and the smaller the corresponding information degree, otherwise, when the data is the firstUnder the segment standardThe first segment of the segment dataThe individual word is at the firstThe fewer the number of times of occurrence of the segmented data, the lower the repeatability of the segmented data, the stronger the information characterization capability and the larger the corresponding information degree;represent the firstUnder the segment standardThe cosine similarity of the second vector to be checked corresponding to the segmented data and each vector in the cluster to which the second vector to be checked belongs is accumulated and summed,the representation is from the firstUnder the segment standardThe second vector to be checked corresponding to the segmented data will be the firstThe cosine similarity accumulation sum of each vector in the cluster to which the vector to be checked is removed and the second vector to be checked corresponding to each word is added,reflecting the difference of two cosine similarity accumulated sums, the larger the difference of the two cosine similarity accumulated sums is, the moreUnder the segment standardThe first segment of the segment dataThe greater the effect of the individual word on the data distribution in the data space, at this pointUnder the segment standardThe first segment of the segment dataThe greater the informativity of the individual word; by taking the difference of the sum of the two cosine similarity sums as the reference value of the informativeness, the first method is utilizedUnder the segment standardThe first segment of the segment dataDistribution of individual segmentations among all segmentationsAdjusting the 
reference value to obtain the firstUnder the segment standardThe first segment of the segment dataInformativity of individual word segmentation.
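The informativity calculation described above can be sketched as follows. This is a minimal illustration only: the exact adjustment applied to the reference value is not reproduced in this text, so the factor `1 - n/N` used here is an assumption chosen to match the stated behaviour (more occurrences, lower informativity). The function names `cosine` and `informativity` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; the shorter vector is zero-padded at the end."""
    size = max(len(u), len(v))
    u = list(u) + [0.0] * (size - len(u))
    v = list(v) + [0.0] * (size - len(v))
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def informativity(words, word_vecs, k, cluster):
    """Informativity of the k-th word of one piece of segmented data.

    words     -- the words of the segment, in order
    word_vecs -- first vector to be checked of each word, same order
    cluster   -- second vectors to be checked in the segment's cluster
    ASSUMED form: (1 - n/N) * |sum cos(V, C_t) - sum cos(V_without_k, C_t)|
    """
    # Second vector to be checked: word vectors concatenated in word order.
    full = [x for vec in word_vecs for x in vec]
    # The same concatenation with the k-th word's vector removed.
    reduced = [x for i, vec in enumerate(word_vecs) if i != k for x in vec]
    # Reference value: difference of the two accumulated cosine sums.
    reference = abs(sum(cosine(full, c) for c in cluster)
                    - sum(cosine(reduced, c) for c in cluster))
    n = words.count(words[k])  # occurrences of the k-th word in the segment
    N = len(words)             # number of words in the segment
    return (1.0 - n / N) * reference  # assumed adjustment factor
```

Under this assumed factor, a word that makes up the whole segment receives an informativity of 0, which agrees with the description that highly repetitive words carry little information.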
Similarly, the informativity of each word in each piece of segmented data under each segmentation standard is acquired.
An informativity threshold is preset. The embodiment of the invention describes one value of this threshold as an example, which is not limiting; the implementer may set the informativity threshold according to the specific implementation conditions.
The informativity of each word in each piece of segmented data under each segmentation standard is judged: if the informativity is greater than or equal to the informativity threshold, the corresponding word is taken as data to be changed in the segmented data to which it belongs.
So far, the data to be changed in each piece of sectional data under each piece of sectional standard is obtained.
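The threshold step can be sketched as below. The function name `select_data_to_change` and the numbers in the usage line are hypothetical; the threshold value itself is left to the implementer, as stated above.

```python
def select_data_to_change(words, informativities, threshold):
    """Return (position, word) pairs whose informativity meets the threshold."""
    return [
        (pos, word)
        for pos, (word, score) in enumerate(zip(words, informativities))
        if score >= threshold  # "greater than or equal to" as in the text
    ]

# Illustrative values only, not taken from the patent.
selected = select_data_to_change(["a", "b", "c"], [0.1, 0.5, 0.9], 0.5)
```

Positions are returned together with the words because the later change step replaces the selected words in place.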
S004, changing the data to be changed in each piece of segmented data under each segmentation standard, obtaining the resulting change in the corresponding hash value cluster, and thereby obtaining the optimal segmentation standard.
In the embodiment of the present invention, all the data to be changed contained in each piece of segmented data under each segmentation standard are changed, and the first degree of difference of each piece of segmented data under each segmentation standard is obtained, specifically as follows:
All the data to be changed contained in the current segmented data are acquired. If the current segmented data contain no data to be changed, or only one item of data to be changed, the current segmented data are not changed and the first degree of difference of the current segmented data is recorded as 0;
If the current segmented data contain two or more items of data to be changed, all the data to be changed are permuted to obtain a plurality of permutations; the permutation whose order is the same as the order of the data to be changed in the current segmented data is removed, and each remaining permutation is taken as an arrangement sequence. The items in each arrangement sequence replace, in order, the items of data to be changed in the current segmented data, giving the changed segmented data corresponding to that arrangement sequence. For example, when the current segmented data are 2, 1, 4, 3 and the data to be changed are 2 and 4, the corresponding arrangement sequence is {4, 2}, and the changed segmented data corresponding to this arrangement sequence are 4, 1, 2, 3.
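The construction of the arrangement sequences and the changed segmented data can be sketched as follows, reproducing the example 2, 1, 4, 3 given above. The function name `changed_segments` is hypothetical, and skipping repeated orders (which can only arise when two items to be changed are equal) is an added assumption.

```python
from itertools import permutations

def changed_segments(segment, change_positions):
    """Build the changed segmented data for every arrangement sequence.

    segment          -- the current segmented data (list of words)
    change_positions -- positions of the data to be changed, in order
    """
    original = tuple(segment[p] for p in change_positions)
    results = []
    seen = set()
    for order in permutations(original):
        # Drop the permutation that keeps the original order; also skip
        # repeated orders (assumption, relevant only for equal items).
        if order == original or order in seen:
            continue
        seen.add(order)
        changed = list(segment)
        for pos, value in zip(change_positions, order):
            changed[pos] = value  # replace the data to be changed in order
        results.append(changed)
    return results

# Data to be changed are 2 and 4, at positions 0 and 2.
print(changed_segments([2, 1, 4, 3], [0, 2]))  # [[4, 1, 2, 3]]
```

With fewer than two items to be changed, the function returns an empty list, which corresponds to the case where the first degree of difference is recorded as 0.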
Using the method in step S002, the second vector to be checked of the changed segmented data corresponding to each arrangement sequence of the current segmented data is obtained, and the hash value of that second vector to be checked is obtained.
The first degree of difference of the current segmented data is obtained according to the second vector to be checked of the changed segmented data corresponding to each arrangement sequence of the current segmented data and the hash values:
wherein $D$ is the first degree of difference of the current segmented data; $V'_m$ is the second vector to be checked of the changed segmented data corresponding to the $m$-th arrangement sequence of the current segmented data; $C_t$ is the $t$-th second vector to be checked in the cluster to which the second vector to be checked of the current segmented data belongs; $T$ is the number of second vectors to be checked contained in that cluster; $M$ is the number of arrangement sequences of the current segmented data; $h'_{m,t}$ is the Hamming distance between the hash value of $V'_m$ and the hash value of $C_t$; $h_t$ is the Hamming distance between the hash value of the second vector to be checked of the current segmented data and the hash value of $C_t$; and $L$ is the length of the hash value.
When the data to be changed are changed, the hash value of the second vector to be checked after the change is compared, in the hash space, with the hash value of the second vector to be checked before the change. If the number of positions holding the same value decreases, that is, the Hamming distance increases, the repetition of the hash value before and after the change is reduced, the unique characteristic of the hash value is enhanced, and the first degree of difference is larger. The greater the cosine similarity, in the data space, between the second vector to be checked of the changed segmented data corresponding to an arrangement sequence and the other second vectors to be checked in the same cluster, the greater the weight given to that vector when the change in hash value repetition is calculated.
Thus, a first degree of difference for each piece of segment data under each segment criterion is obtained.
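A minimal sketch of the first degree of difference follows. The formula itself is not reproduced in this text, so the combination used here is an assumption built from the quantities described: for each arrangement sequence and each cluster member, the normalized increase in Hamming distance is weighted by the cosine similarity and the result is averaged. Hash values are modelled as equal-length bit strings, and the names `hamming` and `first_degree_of_difference` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; the shorter vector is zero-padded at the end."""
    size = max(len(u), len(v))
    u = list(u) + [0.0] * (size - len(u))
    v = list(v) + [0.0] * (size - len(v))
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def hamming(a, b):
    """Number of positions at which two equal-length hash strings differ."""
    if len(a) != len(b):
        raise ValueError("hash values must have the same length")
    return sum(x != y for x, y in zip(a, b))

def first_degree_of_difference(changed, current_hash, cluster):
    """ASSUMED form: mean over m, t of cos(V'_m, C_t) * (h'_mt - h_t) / L.

    changed -- list of (vector, hash) pairs, one per arrangement sequence
    cluster -- list of (vector, hash) pairs in the current segment's cluster
    """
    if not changed or not cluster:
        return 0.0  # fewer than two items to change: recorded as 0
    length = len(current_hash)
    total = 0.0
    for vec_m, hash_m in changed:
        for vec_t, hash_t in cluster:
            weight = cosine(vec_m, vec_t)
            gain = (hamming(hash_m, hash_t) - hamming(current_hash, hash_t)) / length
            total += weight * gain
    return total / (len(changed) * len(cluster))
```

The value grows when the changed hashes move further, in Hamming distance, from the hashes of the cluster members than the unchanged hash was, which is the behaviour the description attributes to the first degree of difference.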
The sum of the first degrees of difference of all the segmented data under each segmentation standard is taken as the second degree of difference of that segmentation standard. The greater this sum, the greater the second degree of difference of the segmentation standard, and the greater the influence of the segmentation standard on the unique characteristic of the hash value.
The segmentation standard with the greatest second degree of difference is taken as the optimal segmentation standard.
So far, the optimal segmentation standard is obtained.
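The selection of the optimal segmentation standard can be sketched as follows. The function name `optimal_standard`, the dictionary layout and the labels "A" and "B" are illustrative assumptions.

```python
def optimal_standard(first_degrees_by_standard):
    """Pick the segmentation standard with the greatest second degree of difference.

    first_degrees_by_standard -- maps a standard's label to the list of first
    degrees of difference of all its pieces of segmented data
    """
    # Second degree of difference: sum of the first degrees of difference.
    second = {name: sum(vals) for name, vals in first_degrees_by_standard.items()}
    best = max(second, key=second.get)
    return best, second

# Illustrative values only.
best, second = optimal_standard({"A": [0.1, 0.2], "B": [0.5, 0.0]})
```

If two standards tie, `max` returns the first one encountered; the text does not say how ties should be handled.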
S005, verifying the input data of the training assessment personnel segment by segment according to the optimal segmentation standard.
The input data are segmented according to the optimal segmentation standard using the method in step S002 to obtain the corresponding segmented data under the optimal segmentation standard; the hash value of the second vector to be checked of each piece of segmented data is obtained and recorded as the hash value of that piece of segmented data.
When a training assessment person registers, the registration information is processed in the same way as the input data, so that the registration information hash values acquired in step S001 are the hash values of each piece of segmented data after segmentation according to the optimal segmentation standard.
In the embodiment of the invention, the hash value of each piece of segmented data corresponding to the input data of the training assessment personnel under the optimal segmentation standard is checked against the hash value of each piece of segmented data corresponding to the registration information of the training assessment personnel under the optimal segmentation standard, giving a verification result. Hash value checking is a well-known technique and is not described in detail in the embodiment of the present invention.
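One simple way to carry out the segment-by-segment check is sketched below. The text leaves the checking technique open, so exact equality of every pair of hash values is an assumption, as is the function name `verify_segments`. `hmac.compare_digest` from the Python standard library is used only to keep the comparison time independent of where the strings differ.

```python
import hmac

def verify_segments(input_hashes, registered_hashes):
    """Return True only if every segment hash matches its registered hash."""
    if len(input_hashes) != len(registered_hashes):
        return False
    # Compare all pairs before combining, so no segment is skipped early.
    results = [
        hmac.compare_digest(a, b)
        for a, b in zip(input_hashes, registered_hashes)
    ]
    return all(results)
```

Both lists are assumed to hold ASCII hash strings produced with the same optimal segmentation standard, in the same segment order.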
The personal sensitive data and passwords in the input data of the training assessment personnel who pass the verification are intelligently encrypted, so as to ensure fairness and impartiality in the training assessment process. The training assessment data of the training assessment personnel who pass the verification are collected and presented to the examination personnel. The training assessment personnel who do not pass the verification are prompted to re-enter data for verification, and the training assessment data that do not pass the verification are collected. The embodiment of the invention uses the DES encryption algorithm for encryption; this is not limiting, and the implementer may choose the encryption algorithm according to actual implementation conditions.
Through the steps, the acquisition and processing of the training assessment data for training assessment personnel are completed.
It should be noted that, in the embodiment of the present invention, when the cosine similarity of two vectors is calculated and the two vectors have different dimensions, zeros are appended to the end of the lower-dimensional vector until it has the same dimension as the higher-dimensional vector, and the cosine similarity of the two vectors is then calculated.
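The zero-padding rule can be sketched as follows. The function name `cosine_padded` is hypothetical, and returning 0 for an all-zero vector is an added assumption, since that case is not covered in the text.

```python
import math

def cosine_padded(u, v):
    """Cosine similarity after zero-padding the shorter vector at its end."""
    size = max(len(u), len(v))
    u = list(u) + [0.0] * (size - len(u))  # append zeros to the shorter vector
    v = list(v) + [0.0] * (size - len(v))
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: similarity with a zero vector is taken as 0
    return dot / (norm_u * norm_v)

# A 2-dimensional vector compared with a 3-dimensional one.
print(cosine_padded([1.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
```

Because the appended components are zero, padding changes neither the dot product nor the norm of the shorter vector.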
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (4)

1. An intelligent training assessment data acquisition method, characterized by comprising the following steps:
acquiring input data of training assessment personnel, and acquiring registration information hash values of the training assessment personnel stored in a server;
setting different segmentation standards, and segmenting the input data with each segmentation standard to obtain a plurality of pieces of segmented data under each segmentation standard; performing word segmentation on each piece of segmented data under each segmentation standard to obtain a first vector to be checked of each word; obtaining a second vector to be checked of each piece of segmented data according to the first vector to be checked of each word in that piece of segmented data;
clustering the second vectors to be checked of all the segmented data under the same segmentation standard to obtain a plurality of clusters under that segmentation standard; acquiring the data to be changed in each piece of segmented data under each segmentation standard according to all clusters under each segmentation standard and the distribution of each piece of segmented data;
changing the data to be changed in each piece of segmented data under each segmentation standard, and acquiring a first degree of difference of each piece of segmented data under each segmentation standard according to the change result and all clusters under each segmentation standard; acquiring an optimal segmentation standard according to the first degrees of difference of all the segmented data under each segmentation standard;
verifying the input data of the training assessment personnel according to the hash value of the second vector to be checked of the segmented data under the optimal segmentation standard and the registration information hash value of the training assessment personnel; processing the input data and acquiring training assessment data according to the verification result;
wherein acquiring the data to be changed in each piece of segmented data under each segmentation standard according to all clusters under each segmentation standard and the distribution of each piece of segmented data comprises the following specific steps:
acquiring the informativity of each word in each piece of segmented data under each segmentation standard according to all clusters under each segmentation standard and the distribution of the words in each piece of segmented data; taking each word whose informativity is greater than or equal to a preset informativity threshold as data to be changed in the segmented data to which the word belongs;
wherein acquiring the informativity of each word in each piece of segmented data under each segmentation standard according to all clusters under each segmentation standard and the distribution of the words in each piece of segmented data comprises the following specific steps:
wherein $I_{i,j,k}$ denotes the informativity of the $k$-th word in the $j$-th piece of segmented data under the $i$-th segmentation standard; $n_{i,j,k}$ denotes the number of times the $k$-th word of the $j$-th piece of segmented data under the $i$-th segmentation standard occurs in that piece of segmented data; $N_{i,j}$ denotes the number of words contained in the $j$-th piece of segmented data under the $i$-th segmentation standard; $V_{i,j}$ denotes the second vector to be checked corresponding to the $j$-th piece of segmented data under the $i$-th segmentation standard; $C_{i,j,t}$ denotes the $t$-th vector in the cluster to which $V_{i,j}$ belongs; $V_{i,j}^{(-k)}$ denotes the vector obtained from $V_{i,j}$ by removing the first vector to be checked corresponding to the $k$-th word of the $j$-th piece of segmented data under the $i$-th segmentation standard; $\cos(\cdot,\cdot)$ denotes the cosine similarity function; $T_{i,j}$ denotes the number of second vectors to be checked in the cluster to which $V_{i,j}$ belongs; and $|\cdot|$ is the absolute value sign;
wherein changing the data to be changed in each piece of segmented data under each segmentation standard, and acquiring the first degree of difference of each piece of segmented data under each segmentation standard according to the change result and all clusters under each segmentation standard, comprises the following specific steps:
obtaining the hash value of the second vector to be checked of each piece of segmented data under each segmentation standard;
acquiring all the data to be changed contained in the current segmented data, and if the current segmented data contain no data to be changed or only one item of data to be changed, recording the first degree of difference of the current segmented data as 0;
if the current segmented data contain two or more items of data to be changed, arranging and combining all the data to be changed to obtain a plurality of arrangement sequences, and obtaining the changed segmented data corresponding to each arrangement sequence; obtaining the second vector to be checked of the changed segmented data corresponding to each arrangement sequence of the current segmented data, and obtaining the hash value of that second vector to be checked;
obtaining the first degree of difference of the current segmented data according to the second vectors to be checked of the changed segmented data corresponding to each arrangement sequence of the current segmented data, each second vector to be checked in the cluster to which the second vector to be checked of the current segmented data belongs, and the hash values of all the second vectors to be checked;
wherein arranging and combining all the data to be changed to obtain a plurality of arrangement sequences, and obtaining the changed segmented data corresponding to each arrangement sequence, comprises the following specific steps:
permuting all the data to be changed to obtain a plurality of permutations, removing the permutation whose order is the same as the order of the data to be changed in the current segmented data, and taking each remaining permutation as an arrangement sequence; replacing, in order, each item of data to be changed in the current segmented data with the items in each arrangement sequence to obtain the changed segmented data corresponding to each arrangement sequence;
wherein obtaining the first degree of difference of the current segmented data according to the second vectors to be checked of the changed segmented data corresponding to each arrangement sequence of the current segmented data, each second vector to be checked in the cluster to which the second vector to be checked of the current segmented data belongs, and the hash values of all the second vectors to be checked comprises the following specific steps:
wherein $D$ is the first degree of difference of the current segmented data; $V'_m$ is the second vector to be checked of the changed segmented data corresponding to the $m$-th arrangement sequence of the current segmented data; $C_t$ is the $t$-th second vector to be checked in the cluster to which the second vector to be checked of the current segmented data belongs; $T$ is the number of second vectors to be checked contained in that cluster; $M$ is the number of arrangement sequences of the current segmented data; $h'_{m,t}$ is the Hamming distance between the hash value of $V'_m$ and the hash value of $C_t$; $h_t$ is the Hamming distance between the hash value of the second vector to be checked of the current segmented data and the hash value of $C_t$; $L$ is the length of the hash value; and $\cos(\cdot,\cdot)$ denotes the cosine similarity function;
wherein acquiring the optimal segmentation standard according to the first degrees of difference of all the segmented data under each segmentation standard comprises the following specific steps:
taking the sum of the first degrees of difference of all the segmented data under each segmentation standard as the second degree of difference of that segmentation standard; and taking the segmentation standard with the greatest second degree of difference as the optimal segmentation standard.
2. The intelligent training assessment data acquisition method according to claim 1, wherein obtaining the first vector to be checked of each word comprises the following specific steps:
converting each word contained in each piece of segmented data under each segmentation standard into a vector through a word vector model, and taking the vector as the first vector to be checked of that word.
3. The intelligent training assessment data acquisition method according to claim 1, wherein obtaining the second vector to be checked of each piece of segmented data according to the first vector to be checked of each word in each piece of segmented data comprises the following specific steps:
concatenating all the first vectors to be checked of each piece of segmented data under the same segmentation standard in the order of the corresponding words to obtain the second vector to be checked.
4. The intelligent training assessment data acquisition method according to claim 1, wherein processing the input data and acquiring the training assessment data according to the verification result comprises the following specific steps:
intelligently encrypting the personal sensitive data and passwords in the input data of the training assessment personnel who pass the verification, collecting the training assessment data of the training assessment personnel who pass the verification and presenting them to the examination personnel; and prompting the training assessment personnel who do not pass the verification to re-enter data for verification, and collecting the training assessment data that do not pass the verification.
CN202311061095.3A 2023-08-23 2023-08-23 Intelligent training assessment data acquisition method Active CN116821940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311061095.3A CN116821940B (en) 2023-08-23 2023-08-23 Intelligent training assessment data acquisition method

Publications (2)

Publication Number Publication Date
CN116821940A CN116821940A (en) 2023-09-29
CN116821940B true CN116821940B (en) 2024-02-13

Family

ID=88113082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311061095.3A Active CN116821940B (en) 2023-08-23 2023-08-23 Intelligent training assessment data acquisition method

Country Status (1)

Country Link
CN (1) CN116821940B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107958020A (en) * 2017-10-24 2018-04-24 中国南方电网有限责任公司超高压输电公司检修试验中心 It is a kind of based on cluster electric network data processing and data visualization method
CN113704569A (en) * 2021-04-13 2021-11-26 腾讯科技(深圳)有限公司 Information processing method and device and electronic equipment
CN113836272A (en) * 2021-09-29 2021-12-24 平安资产管理有限责任公司 Key information display method and system, computer equipment and readable storage medium
CN114757302A (en) * 2022-05-25 2022-07-15 河北经贸大学 Clustering method system for text processing
CN115994137A (en) * 2023-03-23 2023-04-21 无锡弘鼎软件科技有限公司 Data management method based on application service system of Internet of things
CN116383450A (en) * 2023-06-05 2023-07-04 沧州中铁装备制造材料有限公司 Railway and highway logistics transportation information comprehensive management system
CN116433249A (en) * 2021-12-31 2023-07-14 数界(深圳)科技有限公司 Method for proving content using behavior and method for verifying content using behavior
CN116523320A (en) * 2023-07-04 2023-08-01 山东省标准化研究院(Wto/Tbt山东咨询工作站) Intellectual property risk intelligent analysis method based on Internet big data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
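Hardware Acceleration of k-Mer Clustering using Locality-Sensitive Hashing; Javier E. Soto; IEEE; full text *
Locality-sensitive hashing clustering method based on position information entropy; 徐彭娜; 魏静; 林劼; 江育娥; Computer Applications and Software (No. 03); full text *
Research on Chinese text similarity detection technology based on word weight analysis; 陈靖元; Master's Theses Electronic Journal; full text *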

Also Published As

Publication number Publication date
CN116821940A (en) 2023-09-29

Similar Documents

Publication Publication Date Title
Joseph et al. RETRACTED ARTICLE: A multimodal biometric authentication scheme based on feature fusion for improving security in cloud environment
Tuyls et al. Practical biometric authentication with template protection
JP5930056B2 (en) Binary data conversion method, apparatus and program
CN105718502B (en) Method and apparatus for efficient feature matching
KR102388698B1 (en) Method for enrolling data in a base to protect said data
US10083194B2 (en) Process for obtaining candidate data from a remote storage server for comparison to a data to be identified
CN111881991A (en) Method and device for identifying fraud and electronic equipment
CN115269304A (en) Log anomaly detection model training method, device and equipment
CN116319110B (en) Data acquisition and management method for industrial multi-source heterogeneous time sequence data
EP2517150B1 (en) Method and system for generating a representation of a finger print minutiae information
Rathgeb et al. Preventing the cross-matching attack in Bloom filter-based cancelable biometrics
US10733415B1 (en) Transformed representation for fingerprint data with high recognition accuracy
CN116821940B (en) Intelligent training assessment data acquisition method
EP3451233A1 (en) Biological-image processing unit and method and program for processing biological image
JP6343081B1 (en) Recording medium recording code code classification search software
Patel et al. Random forest profiling attack on advanced encryption standard
CN116049905A (en) Tamper-proof system based on detecting system file change
WO2022074840A1 (en) Domain feature extractor learning device, domain prediction device, learning method, learning device, class identification device, and program
KR102168937B1 (en) Bit string transforming method for fingerprint images using normalized regional structure and identifng method for two fingerprint images thereof
Mon et al. Evaluating biometrics fingerprint template protection for an emergency situation
CN112989815A (en) Text similarity recognition method, device, equipment and medium based on information interaction
EP3093793A1 (en) Fingerprint identification method and device using same
CN111914276A (en) Chip information leakage analysis method and device
CN114088400B (en) Rolling bearing fault diagnosis method based on envelope permutation entropy
CN117235137B (en) Professional information query method and device based on vector database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant