CN109299738B - Manuscript gene selection method and device and electronic equipment - Google Patents

Manuscript gene selection method and device and electronic equipment

Info

Publication number
CN109299738B
CN109299738B CN201811096577.1A CN201811096577A CN109299738B CN 109299738 B CN109299738 B CN 109299738B CN 201811096577 A CN201811096577 A CN 201811096577A CN 109299738 B CN109299738 B CN 109299738B
Authority
CN
China
Prior art keywords
manuscript
genome
genes
genomes
maximum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811096577.1A
Other languages
Chinese (zh)
Other versions
CN109299738A (en)
Inventor
张芃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iol Wuhan Information Technology Co ltd
Original Assignee
Iol Wuhan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iol Wuhan Information Technology Co ltd filed Critical Iol Wuhan Information Technology Co ltd
Priority to CN201811096577.1A priority Critical patent/CN109299738B/en
Publication of CN109299738A publication Critical patent/CN109299738A/en
Application granted granted Critical
Publication of CN109299738B publication Critical patent/CN109299738B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06311Scheduling, planning or task assignment for a person or group

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Evolutionary Computation (AREA)
  • Educational Administration (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The embodiment of the invention provides a manuscript gene selection method, a manuscript gene selection device and electronic equipment, wherein the method comprises the following steps: selecting a plurality of manuscript genomes respectively; for each manuscript genome, obtaining a plurality of matching success rate samples, and calculating the mean value and the standard deviation of the matching success rates corresponding to the manuscript genome according to the matching success rate samples; selecting a manuscript genome corresponding to the largest of all the mean values as a largest manuscript genome, and respectively defining the mean value and the standard deviation as a largest mean value and a largest standard deviation; for each manuscript genome except the maximum manuscript genome, calculating a corresponding Z value based on the corresponding mean value and standard deviation, the maximum mean value and the maximum standard deviation; and combining the genes in the manuscript genome meeting the set conditions with the genes in the maximum manuscript genome based on the Z value corresponding to each manuscript genome to obtain the finally selected manuscript genes. The embodiment of the invention can enable the selected manuscript genes to better reflect the difference between manuscripts.

Description

Manuscript gene selection method and device and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a manuscript gene selection method, a manuscript gene selection device and electronic equipment.
Background
The high-speed and massive data of the internet comprise a great variety of complicated documents. Different documents contain different key information, and different documents can be processed in a mode suitable for the documents according to the key information. For example, in the translation industry, for different manuscripts to be translated, the most appropriate translator can be matched for the manuscripts according to the key information contained in the manuscripts, so that the translation efficiency and the translation accuracy are effectively improved.
The gene matching of the manuscript and the translator refers to a process of finding the optimal translator for the manuscript by matching the translator gene and the manuscript gene through a matching model under a set strategy. Compared with other manuscript genes, the chosen manuscript gene for gene matching can better reflect the difference of manuscripts to be matched, so that a more suitable translator can be matched for the manuscripts to be translated.
The manuscript gene mainly refers to a relatively unique characterization formed by extracting a plurality of features from the manuscript and combining them effectively so as to describe the manuscript naturally. It can also be regarded as a unique combination of key information that exists in the manuscript, is obtained by analysing, calculating and quantifying the feature attributes of the manuscript, and distinguishes the manuscript from other manuscripts.
The sources of the manuscript genes are various. The manuscript genes exist in all manuscripts, and different manuscripts have different genes. Due to different specific applications, the existing document gene matching algorithm selects corresponding gene combinations according to experience when selecting genes to be matched of manuscripts for matching calculation.
However, the manuscripts in the high-speed, massive data of the internet are numerous and complex, and this experience-based way of selecting manuscript genes has certain limitations, so the selected manuscript genes cannot reflect the differences among manuscripts well. Therefore, when selecting manuscript genes, it is more important to extract the genes that differ, so that manuscripts can be treated differently.
Disclosure of Invention
In order to overcome the above problems or at least partially solve the above problems, embodiments of the present invention provide a method, an apparatus, and an electronic device for selecting a manuscript gene, so that the selected manuscript gene can better reflect the difference between manuscripts.
In a first aspect, an embodiment of the present invention provides a manuscript gene selection method, including: selecting a plurality of groups of different genes from the alternative manuscript gene list to form a plurality of manuscript genomes; for each manuscript genome, sampling a plurality of matching results to obtain a plurality of matching success rate samples, and calculating a mean value and a standard deviation of matching success rates corresponding to the manuscript genome based on the plurality of matching success rate samples; selecting a manuscript genome corresponding to the largest of all the mean values, defining the manuscript genome as the largest manuscript genome, defining the mean value of the largest manuscript genome as the largest mean value, and defining the standard deviation of the largest manuscript genome as the largest standard deviation; for each manuscript genome except for the maximum manuscript genome in all the manuscript genomes, calculating a Z value corresponding to the manuscript genome based on the mean value and the standard deviation corresponding to the manuscript genome, and the maximum mean value and the maximum standard deviation; selecting a manuscript genome meeting set conditions from all the manuscript genomes based on the Z value corresponding to each manuscript genome except the maximum manuscript genome in all the manuscript genomes, and merging genes in the manuscript genomes meeting the set conditions and genes in the maximum manuscript genome to obtain finally selected manuscript genes; wherein the Z value represents a Z value in large sample differential validation.
In a second aspect, an embodiment of the present invention provides a manuscript gene selecting device, including: an initial gene selection module, configured to select a plurality of groups of different genes from the alternative manuscript gene list to form a plurality of manuscript genomes; a first calculation module, configured to sample the matching result of each manuscript genome multiple times to obtain a plurality of matching success rate samples, and to calculate the mean value and the standard deviation of the matching success rate corresponding to the manuscript genome based on the plurality of matching success rate samples; a maximum genome selecting module, configured to select the manuscript genome corresponding to the largest of all the mean values, define that manuscript genome as the maximum manuscript genome, define the mean value of the maximum manuscript genome as the maximum mean value, and define the standard deviation of the maximum manuscript genome as the maximum standard deviation; a second calculating module, configured to calculate, for each manuscript genome except for the maximum manuscript genome in all the manuscript genomes, a Z value corresponding to the manuscript genome based on the mean value and the standard deviation corresponding to the manuscript genome, and the maximum mean value and the maximum standard deviation; a final gene selection module, configured to select, based on the Z value corresponding to each manuscript genome except for the maximum manuscript genome in all the manuscript genomes, the manuscript genomes meeting a set condition from all the manuscript genomes, and to merge the genes in the manuscript genomes meeting the set condition with the genes in the maximum manuscript genome to obtain the finally selected manuscript genes; wherein the Z value represents the Z value of a large-sample difference test (Z-test).
In a third aspect, an embodiment of the present invention provides an electronic device, including: at least one memory, at least one processor, a communication interface, and a bus; the memory, the processor and the communication interface communicate with one another through the bus, and the communication interface is used for information transmission between the electronic device and manuscript information equipment; the memory stores a computer program operable on the processor, and the processor executes the computer program to implement the manuscript gene selecting method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the manuscript gene selection method according to the first aspect.
According to the manuscript gene selecting method, the manuscript gene selecting device and the electronic equipment, provided by the embodiment of the invention, multiple groups of manuscript genomes are selected from the manuscript gene pools of all manuscripts in advance, and the manuscript genomes with Z values meeting set conditions are selected by calculating the Z values corresponding to the manuscript genomes to serve as final selection results, so that the selected manuscript genes can better reflect the difference among the manuscripts.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a selecting method of manuscript genes provided by the embodiment of the present invention;
FIG. 2 is a schematic view of a process for extracting manuscript genes in the manuscript gene selecting method according to the embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a manuscript gene selecting device provided in the embodiment of the present invention;
FIG. 4 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without any creative efforts belong to the protection scope of the embodiments of the present invention.
The high-speed and massive data of the internet comprise a great variety of complicated documents. Different documents contain different key information. Due to different specific applications, the existing document gene matching algorithm selects corresponding gene combinations according to experience when selecting genes to be matched of manuscripts for matching calculation. However, the traditional method has certain limitations, which causes the problem that the selected manuscript gene can not well reflect the difference of manuscripts and the like.
In order to solve the above problems, in the embodiments of the present invention, multiple sets of manuscript genomes are selected from manuscript gene pools of all manuscripts in advance, and the Z values corresponding to the manuscript genomes are calculated to select the manuscript genomes with the Z values meeting the set conditions, so as to serve as a final selection result, so that the differences between the manuscripts can be better reflected by the selected manuscript genes. Wherein, the Z value represents the Z value in the differential verification of the large sample.
As an aspect of the embodiment of the present invention, this embodiment provides a manuscript gene selecting method, and referring to fig. 1, a flow chart of the manuscript gene selecting method provided by the embodiment of the present invention is schematically illustrated, including:
S101, selecting a plurality of groups of different genes from the alternative manuscript gene list to form a plurality of manuscript genomes.
It can be understood that, before the manuscript gene selection of this embodiment is performed, a candidate manuscript gene list is established in advance according to all attribute information of the manuscript, and the candidate manuscript gene list may include all genes related to the specific attribute of the manuscript. Specifically, the candidate manuscript gene list may be regarded as a gene pool in which manuscript genes, which are genes related to manuscript information extracted from all manuscripts, are stored in units of genes. The manuscript gene mainly refers to a unique key information combination which is obtained by analyzing, calculating and quantifying the characteristic attribute of the manuscript and is different from other manuscripts.
In this step, multiple groups of manuscript genes are respectively selected according to the candidate manuscript gene list, and each group of manuscript genes forms a genome as a manuscript genome, wherein the manuscript genome is the initially selected manuscript genome. It is understood that, when selecting each group of manuscript genes, a plurality of manuscript genes in the list can be randomly selected from the candidate manuscript gene list, and one genome, namely, the manuscript genome can be formed by using the randomly selected manuscript genes.
Of course, the extraction rules may be defined in advance, such as simultaneous extraction or sequential extraction, interlaced extraction or line number-specific extraction, extraction according to different manuscript information represented by the gene, the number of extractions, and so on. Then, when the actual extraction process is carried out, for each group of manuscript genes, extracting a plurality of corresponding genes from the candidate manuscript gene list according to the predefined extraction rule.
For example, 3-5 different genes are randomly selected from the candidate manuscript gene list to form a group of genes, so that a manuscript genome is formed. In the same manner, multiple groups of genes may be selected simultaneously or sequentially to form multiple manuscript genomes, which is not limited in the embodiments of the present invention.
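As a non-limiting illustration, the random selection described above can be sketched in Python as follows. The gene labels, the function name `form_manuscript_genomes` and its parameters are hypothetical and are introduced here only for illustration; the embodiment does not prescribe any particular implementation.

```python
import random

# Hypothetical gene labels, for illustration only; they are not taken from Table 1.
CANDIDATE_GENES = [
    "gene_a", "gene_b", "gene_c", "gene_d",
    "gene_e", "gene_f", "gene_g", "gene_h",
]


def form_manuscript_genomes(candidate_genes, num_genomes,
                            min_size=3, max_size=5, seed=None):
    """Randomly draw `num_genomes` groups of distinct genes from the candidate list.

    Each group holds between `min_size` and `max_size` genes (3 to 5 by default,
    as in the example in the text). Assumes the candidate list has no duplicates.
    """
    if len(candidate_genes) < min_size:
        raise ValueError("not enough candidate genes to form a genome")
    rng = random.Random(seed)
    genomes = []
    for _ in range(num_genomes):
        size = rng.randint(min_size, min(max_size, len(candidate_genes)))
        # random.sample draws without replacement, so genes within a group differ.
        genomes.append(rng.sample(candidate_genes, size))
    return genomes


genomes = form_manuscript_genomes(CANDIDATE_GENES, num_genomes=4, seed=0)
```

This sketch does not prevent two groups from containing the same genes; whether identical groups should be excluded is left open here, as in the text.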
S102, for each manuscript genome, sampling a plurality of matching results to obtain a plurality of matching success rate samples, and calculating a mean value and a standard deviation of the matching success rates corresponding to the manuscript genome based on the plurality of matching success rate samples.
It can be understood that, for each initially selected manuscript genome, its matching effect with translators needs to be determined so that the manuscript genes more suitable for gene matching can be selected. To keep the result general, each manuscript genome can be input into a given matching model, which is used to sample the matching result multiple times; each sampling yields one matching success rate sample.
It can be understood that, for each manuscript genome, when a matching model is used for collecting a matching success rate sample, genes in the manuscript genome are input into the matching model, and the matching model automatically calculates and outputs a matching success rate value of the genes in the manuscript genome and a translator gene according to the translator gene provided by the matching model, so that the matching success rate value output by the matching model can be used as a matching success rate sample. And for the same manuscript genome, performing the matching result sampling process for multiple times to obtain multiple matching success rate samples.
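The sampling loop described above might look like the following Python sketch. The matching model itself is not specified in the text, so `make_toy_model` is a purely hypothetical stand-in that returns a noisy value between 0 and 1; only the repeated-sampling structure is meant to reflect the description.

```python
import random


def make_toy_model(seed=0):
    """Return a hypothetical stand-in for the matching model (an assumption)."""
    rng = random.Random(seed)

    def toy_model(genome):
        # Invented behaviour: a base rate that depends on genome size, plus noise.
        base = 0.5 + 0.05 * len(genome)
        return min(1.0, max(0.0, rng.gauss(base, 0.05)))

    return toy_model


def sample_success_rates(genome, model, n_samples):
    """Query the matching model n_samples times for the same manuscript genome.

    Each call yields one matching success rate sample.
    """
    return [model(genome) for _ in range(n_samples)]


samples = sample_success_rates(["gene_a", "gene_b", "gene_c"],
                               make_toy_model(seed=1), n_samples=40)
```

Forty samples are used here only because the Z-test discussed later is usually applied to samples larger than 30.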
Then, for each preliminarily selected manuscript genome, calculating the comprehensive matching success rate of the manuscript genome according to a plurality of matching success rate samples obtained by sampling the matching results for a plurality of times, namely respectively calculating the mean value and the standard deviation of the matching success rates corresponding to the manuscript genome. It can be understood that each matching success rate sample is actually a matching success rate value obtained by sampling a matching result.
For example, suppose that the matching result is sampled for a certain manuscript genome and n matching success rate samples are obtained, denoted $p_1, p_2, \ldots, p_n$. The mean value of the matching success rate corresponding to the manuscript genome is calculated as follows:
$$E(p) = \frac{1}{n}\sum_{i=1}^{n} p_i$$
wherein $E(p)$ represents the mean value of the matching success rates corresponding to the manuscript genome, $p_i$ represents the i-th matching success rate sample of the manuscript genome, and n represents the total number of matching success rate samples collected for the manuscript genome.
On this basis, the standard deviation of the matching success rate corresponding to the manuscript genome is calculated as follows:
[Equation BDA0001805690480000062: standard deviation S of the matching success rates; the formula is an image in the original publication]
wherein S represents the standard deviation of the matching success rate corresponding to the manuscript genome, $E(p)$ represents the mean value of the matching success rate corresponding to the manuscript genome, $p_i$ represents the i-th matching success rate sample of the manuscript genome, and n represents the total number of matching success rate samples collected for the manuscript genome.
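A minimal Python sketch of the two statistics follows. The formula for S is not reproduced in this text, so the normaliser is an assumption: the sketch defaults to the sample standard deviation (division by n − 1) and allows the population form (division by n) through the hypothetical `ddof` parameter. For the large samples considered here the two differ only slightly.

```python
import math


def mean_success_rate(samples):
    """E(p): arithmetic mean of the matching success rate samples."""
    return sum(samples) / len(samples)


def std_success_rate(samples, ddof=1):
    """S: standard deviation of the samples around E(p).

    ddof=1 divides by n - 1 (sample form); ddof=0 divides by n (population form).
    Which form the original formula uses is an assumption, not stated in the text.
    """
    n = len(samples)
    mean = mean_success_rate(samples)
    squared_deviations = sum((p - mean) ** 2 for p in samples)
    return math.sqrt(squared_deviations / (n - ddof))


example = [0.6, 0.7, 0.8]
e_p = mean_success_rate(example)
s_sample = std_success_rate(example)              # divides by n - 1
s_population = std_success_rate(example, ddof=0)  # divides by n
```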
S103, selecting the manuscript genome corresponding to the largest one of all the mean values, defining the manuscript genome as the largest manuscript genome, defining the mean value of the largest manuscript genome as the largest mean value, and defining the standard deviation of the largest manuscript genome as the largest standard deviation.
It can be understood that, assuming m manuscript genomes have been selected in total, the mean value and the standard deviation corresponding to each of the m manuscript genomes can be calculated according to the above steps. In the embodiment of the invention, the largest of the m mean values is selected, and the manuscript genome corresponding to it is defined as the largest manuscript genome. Correspondingly, the mean value of the largest manuscript genome is defined as the maximum mean value, denoted by the variable $E_{\max}$, and the standard deviation of the largest manuscript genome is defined as the maximum standard deviation, denoted by the variable $S_{\max}$.
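The selection in S103 can be sketched as below, with hypothetical genome identifiers and statistics. The sketch also makes one point of the definition explicit: the maximum standard deviation is the standard deviation of the genome with the largest mean, not the largest standard deviation among all genomes.

```python
def select_maximum_genome(stats):
    """Pick the genome with the largest mean matching success rate.

    `stats` maps a genome identifier to a (mean, standard deviation) pair.
    Returns (identifier, E_max, S_max), where S_max belongs to that same genome.
    """
    best = max(stats, key=lambda genome_id: stats[genome_id][0])
    e_max, s_max = stats[best]
    return best, e_max, s_max


# Hypothetical statistics for three manuscript genomes.
stats = {"A": (0.70, 0.05), "B": (0.82, 0.04), "C": (0.78, 0.06)}
max_id, e_max, s_max = select_maximum_genome(stats)
```

In this example genome C has the largest standard deviation (0.06), yet the value used as the maximum standard deviation is 0.04, the standard deviation of genome B.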
S104, calculating a Z value corresponding to each manuscript genome except for the maximum manuscript genome in all the manuscript genomes based on the mean value and the standard deviation corresponding to the manuscript genome and the maximum mean value and the maximum standard deviation; wherein the Z value represents a Z value in large sample differential validation.
It can be understood that the Z values are calculated on the basis of the mean value and the standard deviation of the matching success rates corresponding to each initially selected manuscript genome other than the maximum manuscript genome. Specifically, for each such manuscript genome, the Z value is calculated from the mean value and the standard deviation of its matching success rate, combined with the maximum mean value $E_{\max}$ and the maximum standard deviation $S_{\max}$ of the maximum manuscript genome.
It can be understood that the Z value here is the Z value of the Z-test, a large-sample test of differences. The Z-test is generally used to test the difference between means for large samples (that is, sample sizes greater than 30). It uses the theory of the standard normal distribution to infer the probability that a difference occurs, and thereby to judge whether the difference between two means is significant. With the standard deviation known, it can also test whether the mean of a set of numbers equals a given expected value. In the embodiment of the invention, the Z-test is used to measure the difference in matching between the initially selected manuscript genomes, so a Z value is calculated for each initially selected manuscript genome.
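The text does not reproduce a formula for the Z value. As a sketch only, and assuming the standard two-sample Z statistic for the difference between two means, the calculation could be written as follows. The function name, the argument order and the use of separate sample counts `n_i` and `n_max` are assumptions made for illustration.

```python
import math


def z_value(mean_i, std_i, n_i, mean_max, std_max, n_max):
    """Two-sample Z statistic comparing genome i with the maximum genome.

    Assumed form (not taken from the text):
        Z = (E_max - E_i) / sqrt(S_max^2 / n_max + S_i^2 / n_i)
    Because E_max is the largest mean, Z is non-negative.
    Raises ZeroDivisionError if both standard deviations are zero.
    """
    standard_error = math.sqrt(std_max ** 2 / n_max + std_i ** 2 / n_i)
    return (mean_max - mean_i) / standard_error


# Hypothetical statistics, with 50 matching success rate samples per genome.
z_a = z_value(0.70, 0.05, 50, 0.82, 0.04, 50)
z_same = z_value(0.82, 0.04, 50, 0.82, 0.04, 50)
```

A larger Z value indicates that the mean of the genome lies further below the maximum mean, relative to the sampling noise.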
And S105, selecting the manuscript genome meeting the set condition from all the manuscript genomes based on the Z value corresponding to each manuscript genome except the maximum manuscript genome in all the manuscript genomes, and merging the gene in the manuscript genome meeting the set condition with the gene in the maximum manuscript genome to obtain the finally selected manuscript gene.
It can be understood that, following the above steps, the Z value of each manuscript genome other than the maximum manuscript genome can be calculated, and how differently each manuscript genome performs in gene matching can be judged from its Z value. Therefore, a preset condition can be applied to the Z value of each manuscript genome to judge whether that manuscript genome meets the set difference requirement. If the Z value does not meet the set difference requirement, the manuscript genome is removed from the initially selected manuscript genomes. The manuscript genomes that remain are those that meet the requirement, namely the manuscript genomes whose Z values meet the set difference requirement together with the maximum manuscript genome. The genes in all remaining manuscript genomes are then taken out and the duplicates among them are removed, forming a new group of genes that serves as the finally selected manuscript genes.
For example, assume that n matching success rate samples are collected for a given manuscript genome and that these samples follow a normal distribution. Assume also that the preset condition for selecting manuscript genes is a confidence level of not less than 95%, which corresponds to a Z value of 1.96. The Z value of each preliminarily selected manuscript genome is then compared with 1.96: if the Z value is greater than 1.96, the corresponding manuscript genome is rejected; otherwise, it is retained.
Suppose that, following this procedure, k of the m initially selected manuscript genomes fail the set condition and are removed, so that the remaining m − k manuscript genomes satisfy it. Among these m − k manuscript genomes, two or more may contain the same manuscript gene. Therefore, all manuscript genes in the m − k manuscript genomes are taken out and put into a new gene pool, and for each manuscript gene that appears more than once in the pool, the redundant copies are removed so that only one is kept. The new gene pool then contains a number of non-repeated manuscript genes, which are taken as the finally selected manuscript genes.
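The rejection and merging steps can be sketched in Python as follows, using the 1.96 threshold from the example. The genome identifiers, gene labels and Z values are hypothetical, and the function name `select_final_genes` is introduced only for illustration.

```python
def select_final_genes(genomes, z_values, max_genome_id, z_threshold=1.96):
    """Keep the maximum genome and every genome whose Z value is not above the
    threshold, then merge their genes with duplicates removed.

    `genomes` maps a genome identifier to its list of genes.
    `z_values` maps every genome identifier except the maximum one to its Z value.
    """
    kept_ids = [max_genome_id]
    kept_ids += [gid for gid, z in z_values.items() if z <= z_threshold]

    final_genes = []
    seen = set()
    for gid in kept_ids:
        for gene in genomes[gid]:
            if gene not in seen:  # keep only one copy of each gene
                seen.add(gene)
                final_genes.append(gene)
    return final_genes


# Hypothetical data: genome B has the largest mean; A is rejected (Z > 1.96).
genomes = {
    "A": ["g1", "g2", "g3"],
    "B": ["g2", "g4", "g5"],
    "C": ["g5", "g6", "g7"],
}
z_values = {"A": 13.25, "C": 1.2}
final_genes = select_final_genes(genomes, z_values, max_genome_id="B")
```

Here gene g5 occurs in both retained genomes and appears only once in the result.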
According to the manuscript gene selecting method provided by the embodiment of the invention, multiple groups of manuscript genomes are selected from the manuscript gene pools of all manuscripts in advance, and the manuscript genomes with Z values meeting set conditions are selected by calculating the Z values corresponding to the manuscript genomes to serve as final selection results, so that the selected manuscript genes can better reflect the difference between the manuscripts. In addition, in the gene matching application, the selected manuscript can be reasonably matched with the existing translator, so that the translation efficiency and the translation accuracy are effectively improved.
In one embodiment, before the step of selecting a plurality of different groups of genes from the candidate manuscript gene list, the method of the embodiment of the present invention further includes:
extracting corresponding genes from all the project related information, manuscript related information and process related information of the manuscript respectively, and correspondingly forming the project related genes, the manuscript related genes and the process related genes of the manuscript;
and forming an alternative manuscript gene list based on the project related gene, the manuscript related gene and the process related gene.
It can be understood that the internet contains a great variety of complicated and intricate documents in high-speed and massive data. Different documents contain different key information. Through the source channel of the manuscript gene, the manuscript gene is extracted from the following aspects to form an alternative manuscript gene list:
project related information, namely the customer's requirements for the project, including the related tools, terms, expert support and other information provided, which is an important source channel of genes;
manuscript related information, namely the document information of the manuscript, which is determined by the document content and includes document size, language information, category information, type information, vocabulary information, term information, syntax information, semantic information and the like;
process related information, namely the states of the manuscript from its creation to the completion of translation, together with fragment manuscripts, for example the new gene information that appears after a large manuscript in a project is split, such as changes in word count, quality requirements, changes of industry and time requirements.
Based on the above information of the manuscript, the corresponding genes corresponding to the manuscript are respectively extracted, and according to the above aspects, the corresponding item related gene, manuscript related gene and process related gene are formed. Then, a candidate manuscript gene list is constructed based on the genes of the above aspects. For example, for the manuscript-related information of the manuscript, a candidate manuscript gene list corresponding to the manuscript-related information can be constructed as shown in table 1, and is a candidate manuscript gene list of the manuscript-related information according to the embodiment of the present invention.
Table 1. Alternative manuscript gene list of manuscript related information according to an embodiment of the present invention
[Table 1 is provided as images (BDA0001805690480000091, BDA0001805690480000101) in the original publication]
Then, when selecting a plurality of manuscript genomes according to Table 1, manuscript genes corresponding to individual data items can be selected at random. For example, if the gene "Simplified Chinese" corresponding to the data item "source language" and the gene "engine" corresponding to the data item "field of the country" are selected, these two genes form a manuscript genome. Other, different manuscript genomes can be selected by the same process.
Similarly, if the extraction rule is set in advance to select genes related to documents of the manuscripts, genes corresponding to "number of characters of the manuscripts", "type of the manuscripts", "format of the manuscripts", and "reference language" in table 1 may be selected to constitute a genome of the manuscripts.
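The random selection of manuscript genomes described above can be sketched as follows. This is only an illustration: the contents of Table 1 are not reproduced in this text, so the dictionary `CANDIDATE_GENES`, its gene values, and the helper names `select_genome` and `select_genomes` are hypothetical.

```python
import random

# Hypothetical candidate list: data item -> possible gene values.
# The real entries come from Table 1, which is not reproduced here.
CANDIDATE_GENES = {
    "source language": ["Simplified Chinese", "English"],
    "target language": ["English", "French"],
    "manuscript type": ["contract", "manual"],
    "manuscript format": ["docx", "pdf"],
}

def select_genome(candidates, k, rng):
    """Pick k distinct data items at random, then one gene value for each."""
    items = rng.sample(sorted(candidates), k)
    return frozenset((item, rng.choice(candidates[item])) for item in items)

def select_genomes(candidates, m, k, rng=None, max_tries=10_000):
    """Collect m different genomes, each holding k genes."""
    rng = rng or random.Random()
    genomes = set()
    for _ in range(max_tries):
        if len(genomes) == m:
            break
        genomes.add(select_genome(candidates, k, rng))
    if len(genomes) < m:
        raise ValueError("could not draw enough distinct genomes")
    return list(genomes)
```

A rule-based selection, such as always taking the document-related data items, would replace `rng.sample` with a fixed list of item names.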
It can be understood that manuscript genes exist in manuscripts and that different manuscripts have different genes. Genes shared between manuscripts are common, but it is more important to extract the genes that differ, so that manuscripts can be treated differently and matched with the best translators.
However, since genes are not features and cannot be identified easily and clearly, their extraction requires several steps. Genes are essentially different from features: a feature is a concept abstracted from the characteristics common to a class of objects. A feature comprises attributes, and the most basic information about the object contained in those attributes is the gene.
Therefore, when extracting the manuscript genes, the present embodiment first extracts the corresponding feature information as manuscript features from the three kinds of manuscript information described in the above embodiments. Then, according to the different manuscript features, the attribute information of the manuscript, namely the manuscript attributes, is extracted, and from these the most basic information of the manuscript is extracted to form the manuscript genes. Fig. 2 is a schematic flowchart of extracting manuscript genes in the manuscript gene selecting method provided by the embodiment of the present invention.
The manuscript gene selecting method provided by the embodiment of the invention extracts the genes of the manuscript from three aspects, namely the project-related information, the manuscript-related information and the process-related information, and forms a candidate manuscript gene list from these genes for subsequent selection and matching. The specific information of the different aspects of the manuscript is thereby considered more comprehensively, which provides a reliable basis for more reasonable gene matching.
Optionally, according to the foregoing embodiments, the step of performing multiple matching result sampling and obtaining multiple matching success rate samples further includes:
for any round of multiple matching result sampling, the following processing flow is executed:
initializing the matching success rate value of every manuscript genome;
randomly selecting a manuscript genome from all manuscript genomes, performing a matching test on the selected manuscript genome, and updating the current matching success rate value of the manuscript genome based on the result of the current matching test and the historical matching success rate results of that manuscript genome;
repeating the steps from random selection to updating until the number of matching tests on any manuscript genome reaches a first set threshold, then stopping the matching test on that manuscript genome and recording its current matching success rate value;
repeating the steps from random selection to recording on the manuscript genomes other than those for which the matching test has been stopped, until the total number of matching tests on all manuscript genomes reaches a second set threshold, recording the current matching success rate value of each manuscript genome, and finishing the current round of multiple matching result sampling; then entering the next round of multiple matching result sampling, until the total number of rounds reaches a third set threshold, whereby a number of matching success rate samples equal to the third set threshold is obtained for each manuscript genome.
Specifically, multiple rounds of multiple matching result sampling may be performed using a given matching model. Suppose m manuscript genomes have been selected according to the above embodiments. The matching success rate of each manuscript genome is then sampled by performing multiple rounds (generally not fewer than 30) of matching experiments on the m genomes. Each round of matching experiments proceeds as follows:
Step 1: initialize the matching success rate value of each manuscript genome, for example to 0.
Step 2: randomly select a manuscript genome and run it through the given matching model to obtain the result of this matching test. Combine this result with the results of the previous matching tests recorded in the current round of multiple matching result sampling, namely the historical matching success rate results, to calculate the current matching success rate value of the selected manuscript genome.
Step 3: repeat step 2 multiple times. Because the manuscript genome is selected at random from all manuscript genomes each time, the number of matching tests performed on each genome may differ. When the number of matching tests on a given manuscript genome reaches the first set threshold, stop the matching test on that manuscript genome and record its current matching success rate value at the time of stopping.
Step 4: continue executing steps 2 and 3 on the remaining manuscript genomes, excluding those that have reached the first set threshold, until the total number of matching tests in this round reaches the second set threshold, and then stop this round of matching tests. At this point each manuscript genome has one matching success rate value, which is the matching success rate sample obtained from the current round of multiple matching result sampling; for m manuscript genomes, m matching success rate samples are obtained.
Then, for all manuscript genomes, multiple rounds (for example, reaching a third set threshold) of the above multiple matching result sampling are performed, so that multiple matching success rate samples of each manuscript genome can be obtained, for example, if the number of rounds is n, the number of matching success rate samples is n (n is generally not less than 50).
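The round-based sampling procedure can be sketched as below. This is a minimal sketch under stated assumptions: `match_once` is a hypothetical callable standing in for the "given matching model" and returns True when a matching test succeeds; the function and parameter names are illustrative and do not come from the source.

```python
import random

def run_round(genomes, match_once, per_genome_cap, total_cap, rng):
    """One round of sampling; returns {genome: matching success rate}."""
    trials = {g: 0 for g in genomes}
    successes = {g: 0 for g in genomes}
    rate = {g: 0.0 for g in genomes}           # step 1: initialise to 0
    active = list(genomes)                      # genomes still being tested
    total = 0
    while active and total < total_cap:         # second set threshold
        g = rng.choice(active)                  # step 2: random selection
        trials[g] += 1
        successes[g] += 1 if match_once(g) else 0
        rate[g] = successes[g] / trials[g]      # combine with history
        total += 1
        if trials[g] >= per_genome_cap:         # step 3: first set threshold
            active.remove(g)
    return rate                                 # step 4: one sample per genome

def sample_rounds(genomes, match_once, per_genome_cap, total_cap, rounds, seed=0):
    """Repeat run_round `rounds` times (third set threshold)."""
    rng = random.Random(seed)
    samples = {g: [] for g in genomes}
    for _ in range(rounds):
        round_rates = run_round(genomes, match_once, per_genome_cap, total_cap, rng)
        for g, r in round_rates.items():
            samples[g].append(r)
    return samples
```

With `rounds = n`, each manuscript genome ends up with n matching success rate samples, which is the input needed for the mean and standard deviation calculation.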
For example, suppose three manuscript genomes a1, a2 and a3 are initially selected, and the first, second and third set thresholds are preset to 3, 8 and 5, respectively. Then, in each round of multiple matching result sampling:
First, a first selection is made: one genome is randomly selected from a1, a2 and a3, for example a1. A matching test is performed on a1, and if the test result is a successful match, the matching success rate value of a1 is 100%.
Next, a second selection is performed. Assuming a2 is selected and its matching test result is an unsuccessful match, the matching success rate value of a2 is 0%.
Then, a third selection is performed. Assuming a1 is selected again and the matching test result is an unsuccessful match, the current matching success rate value of a1, based on its two matching test results in total, is 50%.
Then, a fourth selection is performed. Assuming a3 is selected and the matching test result is a successful match, the matching success rate value of a3 is 100%.
Then, a fifth selection is performed. Assuming a1 is selected again and the matching test result is a successful match, the current matching success rate value of a1, based on its three matching test results in total, is 66.6%. At this point the number of matching tests on a1 has reached the first set threshold of 3, so matching tests on a1 are stopped and its current matching success rate value of 66.6% is output as the matching success rate sample of manuscript genome a1 in this round of multiple matching result sampling.
Then a sixth selection is performed. Since a1 has already undergone 3 matching tests, the selection is made only from a2 and a3; the specific selection and matching test process is similar to the steps above. This continues until the total number of matching tests on a1, a2 and a3 reaches the second set threshold of 8, at which point the current round of multiple matching result sampling ends. One matching success rate sample has then been obtained for each manuscript genome from the matching tests.
Then, the above multiple matching result sampling is repeated for multiple rounds on the three manuscript genomes a1, a2 and a3; each round yields one set of matching success rate samples corresponding to a1, a2 and a3, respectively. When the number of rounds reaches the third set threshold of 5, 5 matching success rate samples are obtained for each of a1, a2 and a3.
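The arithmetic of the first five selections in this example can be checked with a short deterministic replay. The helper `replay` and the `log` list are illustrative names; the log simply encodes the selections and outcomes assumed in the example.

```python
def replay(trial_log, per_genome_cap):
    """Replay (genome, success) pairs; return running rates and stopped genomes."""
    trials, successes, rate, stopped = {}, {}, {}, set()
    for genome, success in trial_log:
        if genome in stopped:
            continue  # a stopped genome is no longer selected
        trials[genome] = trials.get(genome, 0) + 1
        successes[genome] = successes.get(genome, 0) + int(success)
        rate[genome] = successes[genome] / trials[genome]
        if trials[genome] >= per_genome_cap:
            stopped.add(genome)
    return rate, stopped

# Selections 1 to 5 of the example: a1 success, a2 failure, a1 failure,
# a3 success, a1 success.
log = [("a1", True), ("a2", False), ("a1", False), ("a3", True), ("a1", True)]
rate, stopped = replay(log, per_genome_cap=3)
```

After the fifth selection, `rate["a1"]` is 2/3 (reported in the text as 66.6%), `rate["a2"]` is 0, `rate["a3"]` is 1, and a1 is the only genome that has reached the first set threshold.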
According to the manuscript gene selection method provided by the embodiment of the invention, the matching success rate of each manuscript genome is calculated multiple times using the given matching model, and the manuscript genomes with higher matching success rates are selected on that basis, so that the calculation result is more reliable.
Optionally, according to the above embodiments, the step of calculating the Z value corresponding to the manuscript genome based on the mean value and the standard deviation corresponding to the manuscript genome and the maximum mean value and the maximum standard deviation further includes:
calculating the Z value corresponding to each manuscript genome except the maximum manuscript genome in all the manuscript genomes by using the following calculation formula:
Figure BDA0001805690480000131
In the formula, Z_i denotes the Z value corresponding to the i-th manuscript genome, n denotes the number of matching success rate samples corresponding to each manuscript genome, E_i denotes the mean corresponding to the i-th manuscript genome, S_i denotes the standard deviation corresponding to the i-th manuscript genome, E_max denotes the maximum mean, and S_max denotes the maximum standard deviation.
It is understood that, by combining the mean and standard deviation of the matching success rate calculated for each manuscript genome other than the maximum manuscript genome according to the above embodiments with the maximum mean E_max and the maximum standard deviation S_max corresponding to the maximum manuscript genome, the given Z value formula yields the Z value of each initially selected manuscript genome other than the maximum manuscript genome.
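The formula itself appears only as an image in this text, so the following sketch rests on an assumption: that it is the usual large-sample test statistic for the difference of two means with equal sample counts, Z_i = (E_max - E_i) / sqrt((S_max^2 + S_i^2) / n). The use of the sample standard deviation (`statistics.stdev`) is likewise an assumption, and `z_value` is an illustrative name.

```python
import math
import statistics

def z_value(samples_i, samples_max):
    """Assumed form: Z_i = (E_max - E_i) / sqrt((S_max**2 + S_i**2) / n)."""
    n = len(samples_i)
    if n != len(samples_max) or n < 2:
        raise ValueError("both genomes need the same number (>= 2) of samples")
    e_i = statistics.mean(samples_i)
    e_max = statistics.mean(samples_max)
    s_i = statistics.stdev(samples_i)      # sample standard deviation (assumption)
    s_max = statistics.stdev(samples_max)
    denom = math.sqrt((s_max ** 2 + s_i ** 2) / n)
    if denom == 0:
        # No spread in either genome: equal means give 0, otherwise unbounded.
        return 0.0 if e_max == e_i else math.inf
    return (e_max - e_i) / denom
```

Under this assumed form, a small Z_i means the i-th manuscript genome is not clearly worse than the maximum manuscript genome, while a large Z_i indicates a significant gap.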
According to the method for selecting the manuscript genes, provided by the embodiment of the invention, the Z value of each manuscript genome is calculated by utilizing the mean value and the standard deviation which correspond to each preliminarily selected manuscript genome respectively and combining the maximum mean value and the maximum standard deviation, so that the matching success rate condition of each manuscript genome can be represented more accurately, the manuscript genes can be selected more accurately to be matched with a translator gene, and the matching effect is improved.
Optionally, the step of selecting, based on the Z value corresponding to each manuscript genome except for the maximum manuscript genome in all manuscript genomes, the manuscript genome meeting the set condition from all manuscript genomes further includes: and if the plurality of matching success rate samples accord with normal distribution, determining a preset Z value according to a preset confidence level, removing the maximum manuscript genome and the manuscript genome with the Z value larger than the preset Z value, and taking the residual manuscript genomes in the manuscript genomes as the manuscript genomes meeting the set conditions.
It can be understood that after the Z value of each manuscript genome except the maximum manuscript genome is calculated, how significantly each manuscript genome differs in gene matching can be judged from its Z value. Whether each manuscript genome meets the set difference requirement is therefore judged from its corresponding Z value, and a manuscript genome whose Z value does not meet the requirement is removed from the initially selected manuscript genomes. In addition, the maximum manuscript genome is also removed, and all manuscript genomes that remain are the manuscript genomes meeting the requirement. The genes in all remaining manuscript genomes are then taken out and duplicate genes among them are removed, forming a new group of genes that serves as the finally selected manuscript genes.
For the case that the sampled matching success rate sample is in accordance with normal distribution, if 95% confidence is to be obtained, that is, the preset selection criterion is that the confidence satisfies 95%, the Z value calculated for the manuscript genome should not be greater than 1.96. Therefore, the manuscript genome with the Z value larger than 1.96 and the maximum manuscript genome obtained according to the above embodiments are rejected, and the remaining initially selected manuscript genome is reserved as the finally selected manuscript genome.
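The threshold of 1.96 and the final gene selection can be sketched as follows. The sketch assumes a two-sided confidence level, which is what yields 1.96 at 95%; merging the genes of the retained genomes with those of the maximum manuscript genome follows the device and electronic equipment embodiments. The names `critical_z` and `final_genes` and the sample data in the usage note are hypothetical.

```python
from statistics import NormalDist

def critical_z(confidence):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def final_genes(genomes, z_values, max_name, confidence=0.95):
    """genomes: {name: [gene, ...]}; z_values: {name: Z}, maximum genome excluded.

    Returns the names of the retained genomes and the merged, de-duplicated genes.
    """
    z_crit = critical_z(confidence)
    kept = [name for name, z in z_values.items() if z <= z_crit]
    genes = []
    for name in [max_name] + kept:
        for gene in genomes[name]:
            if gene not in genes:   # drop duplicate genes, keep first occurrence
                genes.append(gene)
    return kept, genes
```

For instance, with genomes `{"g_max": ["zh", "engineering"], "g1": ["zh", "docx"], "g2": ["pdf"]}` and Z values `{"g1": 1.2, "g2": 2.5}`, only g1 is retained and the merged gene list contains "zh", "engineering" and "docx".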
According to the manuscript gene selection method provided by the embodiment of the invention, the threshold is preset, and the manuscript genome is selected according to the threshold, so that the precision of the selected manuscript genome can be ensured, and the method has an important significance for more accurately matching a translator.
Further, on the basis of the above embodiment, after the step of selecting a manuscript genome satisfying a set condition from all manuscript genomes, the method of the embodiment of the present invention further comprises: and if the number of the manuscript genomes with the Z values not larger than the preset Z values in all the manuscript genomes except the maximum manuscript genome is smaller than the preset threshold value, reselecting a plurality of groups of genes from the alternative manuscript gene list, and performing multiple matching result sampling until the finally selected manuscript genes are obtained.
It is understood that, after acquiring the Z value corresponding to each manuscript genome except the maximum manuscript genome and selecting the manuscript genomes satisfying the set condition from all the manuscript genomes, the embodiment of the invention may further include the following processing steps: counting the number of manuscript genomes whose Z value is not larger than the preset Z value, that is, the number of finally selected manuscript genomes meeting the set condition, and comparing this number with a preset threshold value; if the number is smaller than the preset threshold value, reselecting multiple groups of genes from the candidate manuscript gene list and performing the selection steps again, from the multiple matching result sampling through to obtaining the finally selected manuscript genes.
For example, according to the Z value corresponding to each manuscript genome, the judgment is performed by using the preset selection criteria. If none of the manuscript genomes has a Z value meeting the selection criterion, the method returns to step S101 to re-select a plurality of different manuscript genomes from the alternative manuscript gene list and re-perform the calculation and selection processes of the embodiment.
For example, in the case that the sampled matching success rate sample conforms to the normal distribution, if a 95% confidence is to be obtained, that is, the preset condition is that the confidence of the manuscript genome satisfies 95%, the Z value calculated for the manuscript genome should not be greater than 1.96. In practical applications, when multiple sets of manuscript genomes are selected from the candidate manuscript gene list, if the Z values cannot meet the above criteria in calculating the Z values of the selected manuscript genomes due to random selection and other reasons, another manuscript genome needs to be selected from the candidate manuscript gene list again, and recalculation and selection are performed.
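The reselection loop can be sketched as below. All names are hypothetical: `draw_genomes` stands for reselecting genomes from the candidate manuscript gene list and `evaluate` for the sampling and Z value calculation. The `max_attempts` guard is an addition for safety and is not part of the described method, which simply repeats until suitable genomes are found.

```python
def select_until_enough(draw_genomes, evaluate, z_crit, min_kept, max_attempts=20):
    """Redraw genomes until at least min_kept of them have Z <= z_crit.

    draw_genomes(): returns a fresh set of genomes.
    evaluate(genomes): returns {genome: Z}, maximum genome excluded.
    """
    for attempt in range(1, max_attempts + 1):
        genomes = draw_genomes()
        z_values = evaluate(genomes)
        kept = [g for g, z in z_values.items() if z <= z_crit]
        if len(kept) >= min_kept:
            return kept, attempt
    raise RuntimeError("no draw met the Z value criterion within max_attempts")
```

Keeping the draw and evaluation steps as callables lets the same loop be reused with any matching model and any selection rule.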
According to the manuscript gene selection method provided by the embodiment of the invention, through judgment of the calculation result and repeated execution of the selection steps, high-quality genes meeting requirements can be selected, and the method has important significance for more accurate matching of translators.
Further, on the basis of the above embodiment, before the step of performing multiple matching result sampling and obtaining multiple matching success rate samples, the method of the embodiment of the present invention further includes: setting a total number threshold for matching result sampling according to the required gene matching precision; correspondingly, for each manuscript genome, the number of extracted matching success rate samples is not less than the total number threshold.
It can be understood that, before the step of performing multiple matching result sampling and obtaining multiple matching success rate samples, the present embodiment sets the total number threshold for matching result sampling according to the required precision of the gene matching calculation with the translator to be matched; accordingly, during actual sampling, the number of collected matching success rate samples is not less than the total number threshold. For example, if at least 50 matching success rate samples must be extracted for each manuscript genome, then 50 is the preset total number threshold.
By setting a suitable total number threshold, the manuscript gene selecting method provided by the embodiment of the invention ensures a sufficient number of samples, so that generality is not lost and the precision is higher.
As another aspect of the embodiments of the present invention, the embodiments of the present invention provide a manuscript gene selecting device according to the above embodiments, which is used for realizing the selection of the final manuscript gene in the above embodiments. Therefore, the descriptions and definitions in the selecting method of the manuscript genes in the embodiments above can be used for understanding the execution modules in the embodiments of the present invention, and specific reference may be made to the embodiments above, which are not repeated herein.
Fig. 3 is a schematic structural diagram of a manuscript gene selecting device provided by an embodiment of the present invention. The device can be used to select manuscript genes as in each of the above method embodiments, and includes: an initial gene selection module 301, a first calculation module 302, a maximum genome selection module 303, a second calculation module 304, and a final gene selection module 305.
The initial gene selection module 301 is configured to select multiple groups of different genes from the candidate manuscript gene list to form multiple manuscript genomes; the first calculation module 302 is configured to perform multiple matching result sampling on each manuscript genome to obtain multiple matching success rate samples, and calculate a mean value and a standard deviation of the matching success rate corresponding to the manuscript genome based on the multiple matching success rate samples; the maximum genome selection module 303 is configured to select the manuscript genome corresponding to the largest of all the mean values, define it as the maximum manuscript genome, define the mean value of the maximum manuscript genome as the maximum mean value, and define the standard deviation of the maximum manuscript genome as the maximum standard deviation; the second calculation module 304 is configured to calculate, for each manuscript genome except the maximum manuscript genome among all the manuscript genomes, a Z value corresponding to the manuscript genome based on the mean value and standard deviation corresponding to the manuscript genome and on the maximum mean value and the maximum standard deviation; the final gene selection module 305 is configured to select the manuscript genomes meeting the set condition from all manuscript genomes based on the Z value corresponding to each manuscript genome except the maximum manuscript genome, and merge the genes in the manuscript genomes meeting the set condition with the genes in the maximum manuscript genome to obtain the finally selected manuscript genes; wherein the Z value represents the Z value in large-sample difference testing.
Specifically, the initial gene selection module 301 may select multiple groups of manuscript genes from a pre-established candidate manuscript gene list, each group of manuscript genes forming one genome, namely an initially selected manuscript genome. For example, when selecting each group of manuscript genes, the initial gene selection module 301 may randomly select a plurality of manuscript genes from the candidate manuscript gene list and form a genome, that is, a manuscript genome, from the randomly selected manuscript genes.
Then, for each initially selected manuscript genome, its matching effect needs to be determined so that the manuscript genes better suited to gene matching can be selected. Without loss of generality, for each manuscript genome, the first calculation module 302 inputs the manuscript genome into a given matching model and performs multiple matching result sampling with that model, each sampling yielding one matching success rate sample. It can be understood that each matching success rate sample is in fact a matching success rate value obtained by one sampling of the matching result.
In addition, for each preliminarily selected manuscript genome, the first calculating module 302 calculates the comprehensive matching success rate of the manuscript genome according to a plurality of matching success rate samples obtained by sampling the matching results for a plurality of times, that is, calculates the mean and standard deviation of the matching success rates corresponding to the manuscript genome respectively.
Then, the maximum genome selection module 303 first selects the largest value from all the calculated mean values, and defines the manuscript genome corresponding to that value as the maximum manuscript genome. Correspondingly, the maximum genome selection module 303 further defines the mean value of the maximum manuscript genome as the maximum mean value, denoted by the variable E_max, and defines the standard deviation of the maximum manuscript genome as the maximum standard deviation, denoted by the variable S_max.
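The work of the first calculation module and the maximum genome selection module can be sketched as follows. The function names are illustrative, and using the sample standard deviation (`statistics.stdev`) rather than the population standard deviation is an assumption, since the text does not say which is meant.

```python
import statistics

def summarize(samples):
    """samples: {genome: [success rate, ...]} -> {genome: (mean, stdev)}."""
    return {g: (statistics.mean(v), statistics.stdev(v)) for g, v in samples.items()}

def pick_maximum(stats):
    """Return (name, E_max, S_max) for the genome with the largest mean.

    On ties, max() keeps the first genome encountered.
    """
    name = max(stats, key=lambda g: stats[g][0])
    e_max, s_max = stats[name]
    return name, e_max, s_max
```

The remaining genomes, together with E_max and S_max, are then passed on to the Z value calculation.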
Then, the second calculation module 304 calculates the Z value of each genome except the largest genome according to all the mean values and standard deviations calculated above. Specifically, for each of the manuscript genomes, the second calculating module 304 calculates the Z value corresponding to the manuscript genome according to the standard deviation and the mean value of the matching success rate corresponding to the manuscript genome, in combination with the maximum mean value and the maximum standard deviation corresponding to the maximum manuscript genome.
Finally, on the basis of this calculation, how significantly each manuscript genome differs in gene matching can be judged from its Z value. Therefore, according to the Z value corresponding to each manuscript genome, the final gene selection module 305 may use the preset condition to determine whether the manuscript genome corresponding to that Z value meets the set difference requirement. If the Z value does not meet the set difference requirement, the corresponding manuscript genome is removed from the initially selected manuscript genomes. All manuscript genomes that remain are the manuscript genomes meeting the requirement; they comprise the manuscript genomes whose Z values meet the set difference requirement and the maximum manuscript genome. Finally, the final gene selection module 305 takes out the genes in all remaining manuscript genomes and removes duplicate genes among them to form a new set of genes, namely the finally selected manuscript genes.
Further, on the basis of the above embodiment, the apparatus according to the embodiment of the present invention further includes a candidate manuscript gene list construction module, configured to: extracting corresponding genes from all the project related information, manuscript related information and process related information of the manuscript respectively, and correspondingly forming the project related genes, the manuscript related genes and the process related genes of the manuscript; and forming an alternative manuscript gene list based on the project related gene, the manuscript related gene and the process related gene.
Optionally, the second calculating module is specifically configured to: calculating the Z value corresponding to each manuscript genome except the maximum manuscript genome in all the manuscript genomes by using the following calculation formula:
Figure BDA0001805690480000181
In the formula, Z_i denotes the Z value corresponding to the i-th manuscript genome, n denotes the number of matching success rate samples corresponding to each manuscript genome, E_i denotes the mean corresponding to the i-th manuscript genome, S_i denotes the standard deviation corresponding to the i-th manuscript genome, E_max denotes the maximum mean, and S_max denotes the maximum standard deviation.
Optionally, the final gene selection module is specifically configured to: and if the plurality of matching success rate samples accord with normal distribution, determining a preset Z value according to a preset confidence level, removing the maximum manuscript genome and the manuscript genome with the Z value larger than the preset Z value, and taking the residual manuscript genomes in the manuscript genomes as the manuscript genomes meeting the set conditions.
Further, on the basis of the above embodiment, the apparatus according to the embodiment of the present invention further includes a determining module, configured to: and if the number of the manuscript genomes with the Z values not larger than the preset Z values in all the manuscript genomes except the maximum manuscript genome is smaller than the preset threshold value, reselecting a plurality of groups of genes from the alternative manuscript gene list, and performing multiple matching result sampling until the finally selected manuscript genes are obtained.
Further, on the basis of the foregoing embodiment, the first calculating module is further configured to: set a total number threshold for matching result sampling according to the required gene matching precision; correspondingly, for each manuscript genome, the number of extracted matching success rate samples is not less than the total number threshold.
It is understood that, in the embodiment of the present invention, each relevant program module in the apparatus of each of the above embodiments may be implemented by a hardware processor. In addition, when the manuscript gene selecting device of the embodiments of the present invention is used for selecting manuscript genes as in the above method embodiments, the beneficial effects produced are the same as those of the corresponding method embodiments; reference may be made to the above method embodiments, and the details are not repeated here.
As another aspect of the embodiment of the present invention, this embodiment provides an electronic device according to the above embodiments. Fig. 4 is a schematic diagram of the physical structure of the electronic device provided in the embodiment of the present invention, which includes: at least one memory 401, at least one processor 402, a communication interface 403, and a bus 404.
The memory 401, the processor 402 and the communication interface 403 complete mutual communication through the bus 404, and the communication interface 403 is used for information transmission between the electronic device and the manuscript information device; the memory 401 stores a computer program that can be executed by the processor 402, and when the processor 402 executes the computer program, the manuscript gene selecting method according to each of the above embodiments is implemented.
It is understood that the electronic device includes at least a memory 401, a processor 402, a communication interface 403 and a bus 404, and that the memory 401, the processor 402 and the communication interface 403 are connected to and communicate with each other through the bus 404; for example, the processor 402 reads the program instructions of the manuscript gene selecting method from the memory 401. In addition, the communication interface 403 may also provide a communication connection between the electronic device and the manuscript information device and carry information between them, for example in support of the selection of manuscript genes.
When the electronic device is running, the processor 402 calls the program instructions in the memory 401 to perform the methods provided by the above-mentioned method embodiments, including for example: selecting a plurality of groups of different genes from the alternative manuscript gene list to form a plurality of manuscript genomes; for each manuscript genome, sampling a plurality of matching results to obtain a plurality of matching success rate samples, and calculating a mean value and a standard deviation of matching success rates corresponding to the manuscript genome based on the plurality of matching success rate samples; selecting a manuscript genome corresponding to the largest of all the mean values, defining the manuscript genome as the largest manuscript genome, defining the mean value of the largest manuscript genome as the largest mean value, and defining the standard deviation of the largest manuscript genome as the largest standard deviation; calculating a Z value corresponding to each manuscript genome except for the maximum manuscript genome in all manuscript genomes based on the mean value and the standard deviation corresponding to the manuscript genome and the maximum mean value and the maximum standard deviation; selecting a manuscript genome meeting set conditions from all manuscript genomes based on a Z value corresponding to each manuscript genome except the maximum manuscript genome in all manuscript genomes, and merging genes in the manuscript genome meeting the set conditions with genes in the maximum manuscript genome to obtain a finally selected manuscript gene; wherein the Z value represents Z value in large sample difference verification, and the like.
The program instructions in the memory 401 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Alternatively, all or part of the steps of the method embodiments may be carried out by hardware under the control of program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium according to the above embodiments, where the non-transitory computer-readable storage medium stores computer instructions that cause a computer to execute the manuscript gene selection method according to the above embodiments, the method including, for example: selecting a plurality of groups of different genes from the alternative manuscript gene list to form a plurality of manuscript genomes; for each manuscript genome, sampling a plurality of matching results to obtain a plurality of matching success rate samples, and calculating a mean value and a standard deviation of the matching success rate corresponding to the manuscript genome based on the plurality of matching success rate samples; selecting the manuscript genome corresponding to the largest of all the mean values, defining that manuscript genome as the maximum manuscript genome, defining the mean value of the maximum manuscript genome as the maximum mean value, and defining the standard deviation of the maximum manuscript genome as the maximum standard deviation; for each manuscript genome other than the maximum manuscript genome, calculating a Z value corresponding to the manuscript genome based on the mean value and the standard deviation corresponding to the manuscript genome and on the maximum mean value and the maximum standard deviation; and selecting, based on those Z values, the manuscript genomes meeting set conditions from all the manuscript genomes, and merging the genes in the manuscript genomes meeting the set conditions with the genes in the maximum manuscript genome to obtain the finally selected manuscript genes; wherein the Z value is the Z statistic used in a large-sample difference test.
According to the electronic device and the non-transitory computer-readable storage medium provided by the embodiments of the present invention, the manuscript gene selection method described in the above embodiments is executed: a plurality of manuscript genomes are first selected from the manuscript gene pools of all manuscripts, the Z value corresponding to each manuscript genome is calculated, and the manuscript genomes whose Z values meet the set conditions are taken as the final selection result. The selected manuscript genes can therefore better reflect the differences between manuscripts, so that the selected manuscripts can be more reasonably matched with the existing manuscripts, thereby effectively improving translation efficiency and translation accuracy.
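The "set conditions" on the Z values are described in the claims as a comparison against a preset Z value derived from a preset confidence level, with a reselection step when too few genomes pass. The sketch below shows one way to compute such a threshold and apply it. The helper names `critical_z`, `genomes_within_threshold` and `needs_reselection` are hypothetical, and whether the test is one-sided or two-sided is not stated in the text, so both are offered and the two-sided default is an assumption.

```python
from statistics import NormalDist
from typing import Dict, List


def critical_z(confidence: float, two_sided: bool = True) -> float:
    """Standard-normal critical value for a given confidence level."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    tail = alpha / 2.0 if two_sided else alpha
    return NormalDist().inv_cdf(1.0 - tail)


def genomes_within_threshold(z_values: Dict[str, float],
                             z_crit: float) -> List[str]:
    """Names of genomes whose Z value does not exceed the preset Z value."""
    return [name for name, z in z_values.items() if z <= z_crit]


def needs_reselection(kept_count: int, min_kept: int) -> bool:
    """True when too few genomes passed and new gene groups should be drawn."""
    return kept_count < min_kept
```

For example, `critical_z(0.95)` is close to the familiar 1.96, and `critical_z(0.95, two_sided=False)` is close to 1.645; which convention applies is a configuration choice the text leaves open.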
It is to be understood that the above-described embodiments of the apparatus, the electronic device and the storage medium are merely illustrative, and that elements described as separate components may or may not be physically separate, may be located in one place, or may be distributed on different network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware. Based on this understanding, the technical solutions described above may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk, and which includes several instructions for causing a computer device (such as a personal computer, a server, or a network device) to execute the methods described in the method embodiments or in some parts of the method embodiments.
In addition, it should be understood by those skilled in the art that in the specification of the embodiments of the present invention, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In the description of the embodiments of the invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description. Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects.
However, this method of disclosure should not be interpreted as reflecting an intention that the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the embodiments of the present invention, and not to limit the same; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A manuscript gene selection method is characterized by comprising the following steps:
selecting a plurality of groups of different genes from the alternative manuscript gene list to form a plurality of manuscript genomes;
for each manuscript genome, performing multiple matching processing on genes in the manuscript genome and translator genes to obtain multiple matching success rate samples, and calculating the mean value and standard deviation of the matching success rates corresponding to the manuscript genome based on the multiple matching success rate samples;
selecting the manuscript genome corresponding to the largest of all the mean values, defining that manuscript genome as the maximum manuscript genome, defining the mean value of the maximum manuscript genome as the maximum mean value, and defining the standard deviation of the maximum manuscript genome as the maximum standard deviation;
for each manuscript genome except for the maximum manuscript genome in all the manuscript genomes, calculating a Z value corresponding to the manuscript genome based on the mean value and the standard deviation corresponding to the manuscript genome, and the maximum mean value and the maximum standard deviation;
selecting a manuscript genome meeting set conditions from all the manuscript genomes based on the Z value corresponding to each manuscript genome except the maximum manuscript genome in all the manuscript genomes, and merging genes in the manuscript genomes meeting the set conditions and genes in the maximum manuscript genome to obtain finally selected manuscript genes;
wherein the Z value is the Z statistic used in a large-sample difference test.
2. The method of claim 1, further comprising, before the step of selecting a plurality of groups of different genes from the alternative manuscript gene list:
extracting corresponding genes from all the project related information, manuscript related information and process related information of the manuscript respectively, and correspondingly forming the project related genes, the manuscript related genes and the process related genes of the manuscript;
and constructing the alternative manuscript gene list based on the project related gene, the manuscript related gene and the process related gene.
3. The method of claim 1, wherein the step of calculating the Z value corresponding to the manuscript genome based on the mean and the standard deviation, and the maximum mean and the maximum standard deviation corresponding to the manuscript genome further comprises:
calculating the Z value corresponding to each manuscript genome except the maximum manuscript genome in all the manuscript genomes by using the following calculation formula:
Figure FDA0003232561140000021
in the formula, Z_i represents the Z value corresponding to the i-th manuscript genome, n represents the number of matching success rate samples corresponding to each manuscript genome, E_i represents the mean value corresponding to the i-th manuscript genome, S_i represents the standard deviation corresponding to the i-th manuscript genome, E_max represents the maximum mean value, and S_max represents the maximum standard deviation.
4. The method according to claim 3, wherein the step of selecting a manuscript genome satisfying a set condition from all the manuscript genomes based on the Z value corresponding to each of the manuscript genomes except the maximum manuscript genome further comprises:
if the matching success rate samples conform to a normal distribution, determining a preset Z value according to a preset confidence level, excluding the maximum manuscript genome and every manuscript genome whose Z value is larger than the preset Z value, and taking the remaining manuscript genomes among all the manuscript genomes as the manuscript genomes meeting the set conditions.
5. The method according to claim 4, further comprising, after the step of selecting a manuscript genome satisfying a set condition from all the manuscript genomes:
and if the number of the manuscript genomes with the Z values not larger than the preset Z values in all the manuscript genomes except the maximum manuscript genome is smaller than a preset threshold value, reselecting multiple groups of genes from the alternative manuscript gene list, and performing multiple matching processing on the genes in the manuscript genomes and the translator genes until the finally selected manuscript genes are obtained.
6. The method of claim 1, wherein before the step of performing multiple matching processes on the genes in the manuscript genome and the translator genes to obtain multiple matching success rate samples, the method further comprises:
setting a total count threshold for the matching processing according to the gene matching precision requirement;
correspondingly, for each manuscript genome, the number of matching success rate samples obtained is not less than the total count threshold.
7. A manuscript gene selecting device is characterized by comprising:
the initial gene selection module is used for respectively selecting a plurality of groups of different genes from the alternative manuscript gene list to form a plurality of manuscript genomes;
the first calculation module is used for carrying out multiple matching processing on the genes in the manuscript genome and the translator genes to obtain a plurality of matching success rate samples for each manuscript genome, and calculating the mean value and the standard deviation of the matching success rates corresponding to the manuscript genome based on the matching success rate samples;
a maximum genome selecting module, configured to select the manuscript genome corresponding to the largest of all the mean values, define that manuscript genome as the maximum manuscript genome, define the mean value of the maximum manuscript genome as the maximum mean value, and define the standard deviation of the maximum manuscript genome as the maximum standard deviation;
a second calculating module, configured to calculate, for each manuscript genome except for the maximum manuscript genome in all the manuscript genomes, a Z value corresponding to the manuscript genome based on the mean value and the standard deviation corresponding to the manuscript genome, and the maximum mean value and the maximum standard deviation;
a final gene selection module, configured to select, based on the Z value corresponding to each manuscript genome except for the maximum manuscript genome in all the manuscript genomes, a manuscript genome meeting a set condition from all the manuscript genomes, and merge a gene in the manuscript genome meeting the set condition with a gene in the maximum manuscript genome to obtain a finally selected manuscript gene;
wherein the Z value is the Z statistic used in a large-sample difference test.
8. An electronic device, comprising: at least one memory, at least one processor, a communication interface, and a bus;
the memory, the processor and the communication interface complete mutual communication through the bus, and the communication interface is used for information transmission between the electronic equipment and manuscript information equipment;
the memory has stored therein a computer program operable on the processor, which when executed by the processor, implements the method of any of claims 1 to 6.
9. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1-6.
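The formula of claim 3 appears in this text only as a figure reference. Given the symbols that claim 3 defines (Z_i, n, E_i, S_i, E_max, S_max) and the statement that the Z value comes from a large-sample difference test, the textbook two-sample statistic with an equal sample count n would take the form below. This is an assumed reconstruction for illustration, not a transcription of the figure, and the numbers in the worked example are made up.

```latex
\[
Z_i = \frac{E_{\max} - E_i}
           {\sqrt{\dfrac{S_{\max}^{2}}{n} + \dfrac{S_i^{2}}{n}}}
\]
% Worked example with illustrative numbers:
% E_max = 0.85, S_max = 0.05, E_i = 0.84, S_i = 0.05, n = 50
\[
Z_i = \frac{0.85 - 0.84}
           {\sqrt{\dfrac{0.05^{2}}{50} + \dfrac{0.05^{2}}{50}}}
    = \frac{0.01}{\sqrt{0.0001}}
    = 1.0
\]
```

Under claim 4, a genome with Z_i = 1.0 would be retained whenever the preset Z value exceeds 1.0, for instance the value of about 1.96 that corresponds to a two-sided 95% confidence level (a common textbook choice, not a figure taken from the patent).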
CN201811096577.1A 2018-09-19 2018-09-19 Manuscript gene selection method and device and electronic equipment Active CN109299738B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811096577.1A CN109299738B (en) 2018-09-19 2018-09-19 Manuscript gene selection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN109299738A CN109299738A (en) 2019-02-01
CN109299738B true CN109299738B (en) 2021-10-26

Family

ID=65163505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811096577.1A Active CN109299738B (en) 2018-09-19 2018-09-19 Manuscript gene selection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN109299738B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101158672A (en) * 2007-11-05 2008-04-09 云南烟草科学研究院 Method for accurately checking and confirming cigarette main stream flue gas index box index value
CN104789686A (en) * 2015-05-06 2015-07-22 安诺优达基因科技(北京)有限公司 Kit and device for detecting aneuploidy of chromosomes
CN105069776A (en) * 2015-07-13 2015-11-18 中国石油大学(北京) Method for selecting training image based on data event difference degree
CN105844116A (en) * 2016-03-18 2016-08-10 广州市锐博生物科技有限公司 Processing method and processing apparatus for sequencing data
CN106778908A (en) * 2017-01-11 2017-05-31 湖南文理学院 A kind of novelty detection method and apparatus

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7238946B2 (en) * 2003-06-27 2007-07-03 Siemens Medical Solutions Usa, Inc. Nuclear imaging system using scintillation bar detectors and method for event position calculation using the same
US7725291B2 (en) * 2006-04-11 2010-05-25 Moresteam.Com Llc Automated hypothesis testing

Also Published As

Publication number Publication date
CN109299738A (en) 2019-02-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant