CN107437001B - Method and system for converting information sequence into vectorized data

Info

Publication number: CN107437001B
Application number: CN201710532274.9A
Authority: CN
Inventors: 王嘉伟
Original assignee: Weimeng Chuangke Network Technology China Co Ltd
Current assignee: Weimeng Chuangke Network Technology China Co Ltd
Priority date: 2017-07-03
Filing date: 2017-07-03
Publication date: 2020-03-27
Anticipated expiration: 2037-07-03
Also published as: CN107437001A

Abstract

The invention relates to the technical field of data mining, in particular to a method and a system for converting an information sequence into vectorized data, which comprises the following steps: sequentially reading each information element in the information sequence; establishing a corresponding sub-vector according to the position of each information element in the information sequence; and arranging each sub-vector according to the position of the corresponding information element in the information sequence to form a vector of the information sequence. The invention can express the sequence information in the information sequence when the information sequence is converted into vectorization data.

Description

Method and system for converting information sequence into vectorized data

Technical Field

The invention relates to the technical field of data mining, in particular to a method and a system for converting an information sequence into vectorized data.

Background

An information sequence is information data having a definite order, such as a bit stream, a DNA sequence, or a protein sequence. An information sequence is characterized by a large amount of information and a fixed order. In general, information sequences containing highly repetitive information are difficult to analyze with conventional methods.

When a large number of information sequences and their corresponding results are known (e.g., the sequences of tens of thousands of DNA fragments of the same product and the expression levels of the corresponding products), building a data mining model for the information sequences is an extremely effective way to analyze the underlying principles. However, a data mining model on a computer requires multi-dimensional vectorized data as input, so how to convert an information sequence into multi-dimensional vectorized data is an important problem.

In the prior art, only simple statistics are computed over the data in an information sequence, and the statistical results are written into a vector. Taking a DNA sequence as an example, the prior-art method is as follows: new vectorized data A1 is created. For a DNA sequence of length l, the proportions a, t, g, and c of A, T, G, and C in the total length of the DNA are counted, and {a, t, g, c, l} is stored in A1; that is, A1: {a, t, g, c, l} is the vectorized data finally obtained. This method only counts the proportion of each component in the information sequence; it makes no use of the order of the sequence, which is an important information-bearing characteristic. The vectorized data generated in this way to represent DNA performs poorly in the subsequent data mining process.
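The limitation of the prior-art statistic can be illustrated with a minimal Python sketch. The function name `composition_vector` and the two sample sequences are hypothetical and chosen only for illustration; the sketch assumes the vector is laid out as the four proportions followed by the length.

```python
from collections import Counter
from typing import List


def composition_vector(sequence: str, alphabet: str = "ATGC") -> List[float]:
    """Return the share of each symbol in `alphabet`, followed by the length."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    length = len(sequence)
    shares = [counts[symbol] / length for symbol in alphabet]
    return shares + [float(length)]


# Two sequences with the same symbols in a different order
# are mapped to exactly the same vector, so the order is lost.
first = composition_vector("AATTGGCC")
second = composition_vector("ATGCATGC")
print(first)
print(first == second)
```

Because both sequences produce the same five numbers, a model trained on such vectors cannot tell them apart.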

Disclosure of Invention

The technical problem to be solved by the present invention is to overcome the deficiencies of the prior art, and to provide a method and a system for converting an information sequence into vectorized data, which can express sequence information in the information sequence when the information sequence is converted into vectorized data.

To achieve the above technical object, in one aspect, the present invention provides a method for converting an information sequence into vectorized data, the method comprising:

sequentially reading each information element in the information sequence;

establishing a corresponding sub-vector according to the position of each information element in the information sequence;

and arranging each sub-vector according to the position of the corresponding information element in the information sequence to form a vector of the information sequence.

In another aspect, the present invention provides a system for converting an information sequence into vectorized data, the system comprising:

a reading unit for reading each information element in the information sequence in sequence;

the sub-vector unit is used for establishing a corresponding sub-vector according to the position of each information element in the information sequence;

and the vector unit is used for arranging each sub-vector according to the position of the corresponding information element in the information sequence to form a vector of the information sequence.

In the technical scheme of the invention, a sub-vector containing the position information of each information element in the information sequence is established, and then the sub-vectors are arranged according to the position of each information element in the information sequence to form a vector of the information sequence, so that the information sequence is converted into vectorized data, and a computer can conveniently establish a data mining model.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a system configuration according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a sub-vector unit according to an embodiment of the present invention;

FIG. 4 is a block diagram of a sub-vector segment module according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, the method for converting an information sequence into vectorized data according to the present invention includes the following steps:

101. sequentially reading each information element in the information sequence;

102. establishing a corresponding sub-vector according to the position of each information element in the information sequence; the method comprises the following specific steps:

1021. respectively selecting a current information element and a plurality of adjacent information elements to form a plurality of subvector segments, which specifically comprises the following steps:

selecting a current information element and an adjacent information element arranged behind the current information element to form a first segment of the current information element;

selecting a current information element and two adjacent information elements which are arranged behind the current information element to form a second segment of the current information element;

by analogy, the current information element and m adjacent information elements which are arranged behind the current information element are taken to form the mth segment of the current information element, and m is the number of the information elements included behind the current information element in the information sequence;

selecting continuous p segments from m segments of the current information element as sub-vector segments of the current information element, wherein p is a natural number and is less than or equal to m; preferably, of the p sub-vector segments of the current information element, the sub-vector segment with the longest length contains twice as many information elements as the sub-vector segment with the shortest length.

1022. matching each sub-vector segment of the current information element against the information sequence, and recording the number of times each sub-vector segment of the current information element occurs in the information sequence; each count is recorded after the corresponding sub-vector segment of the current information element;

1023. combining each sub-vector segment of the current information element with the corresponding occurrence number to form a corresponding sub-vector element;

1024. sequentially arranging the sub-vector elements to form the sub-vector of the current information element, specifically:

the sub-vector elements are arranged in ascending order of the number of information elements contained in the corresponding sub-vector segment to form the sub-vector of the current information element.

103. arranging each sub-vector according to the position of the corresponding information element in the information sequence to form a vector of the information sequence.
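Steps 101 to 103 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names are hypothetical, occurrences are counted over the whole sequence with overlapping matches allowed (the text does not say whether overlaps count), and the p consecutive segments are specified by hypothetical `min_len` and `max_len` parameters giving the shortest and longest segment lengths.

```python
from typing import Dict, List


def count_occurrences(segment: str, sequence: str) -> int:
    """Count occurrences of `segment` in `sequence`, allowing overlaps."""
    width = len(segment)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == segment
    )


def sub_vector(sequence: str, position: int, min_len: int, max_len: int) -> Dict[str, int]:
    """Steps 1021-1024 for one element: segments starting at `position`,
    each paired with its occurrence count, ordered from short to long."""
    result: Dict[str, int] = {}
    for length in range(min_len, max_len + 1):
        segment = sequence[position:position + length]
        if len(segment) < length:  # ran past the end of the sequence
            break
        result[segment] = count_occurrences(segment, sequence)
    return result


def sequence_vector(sequence: str, min_len: int = 3, max_len: int = 6) -> List[Dict[str, int]]:
    """Steps 101 and 103: one sub-vector per element, kept in sequence order."""
    return [sub_vector(sequence, i, min_len, max_len) for i in range(len(sequence))]


vector = sequence_vector("ABCABC", min_len=2, max_len=3)
print(vector[0])  # {'AB': 2, 'ABC': 2}
```

The dictionaries rely on Python preserving insertion order, so each sub-vector lists its segments from shortest to longest, and the outer list keeps the sub-vectors in the order of the elements they belong to.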

As shown in fig. 2 to 4, the system for converting an information sequence into vectorized data according to the present invention includes:

a reading unit 21 for reading each information element in the information sequence in turn;

a sub-vector unit 22, configured to establish a corresponding sub-vector according to a position of each information element in the information sequence;

and the vector unit 23 is configured to arrange each of the sub-vectors according to the position of the corresponding information element in the information sequence to form a vector of the information sequence.

In an embodiment, one possible structure of the sub-vector unit 22, as shown in fig. 3, includes:

a sub-vector segment module 221, configured to select a current information element and a plurality of adjacent information elements to form a plurality of sub-vector segments;

a frequency module 222, configured to match each sub-vector segment of the current information element with the information sequence, and record the occurrence frequency of each sub-vector segment of the current information element in the information sequence;

a sub-vector element module 223, configured to combine each sub-vector segment of the current information element with the corresponding occurrence number to form a corresponding sub-vector element;

and an arranging module 224, configured to sequentially arrange the sub-vector elements in order to form a sub-vector of the current information element.

In an embodiment, one possible structure of the sub-vector segment module 221, as shown in fig. 4, includes a segment sub-module 2211 and a selecting sub-module 2210;

the segment sub-module 2211 is configured to select a current information element and an adjacent information element arranged behind the current information element to form a first segment of the current information element; selecting a current information element and two adjacent information elements which are arranged behind the current information element to form a second segment of the current information element; by analogy, the current information element and m adjacent information elements which are arranged behind the current information element are taken to form the mth segment of the current information element, and m is the number of the information elements included behind the current information element in the information sequence;

the selecting submodule 2210 is configured to select, from m segments of a current information element, p consecutive segments as a subvector segment of the current information element, where p is a natural number and is not greater than m.

In a specific embodiment, the frequency module 222 is specifically configured to record the number of occurrences after each sub-vector segment of the corresponding current information element; the arranging module 224 is specifically configured to arrange the sub-vector elements in ascending order of the number of information elements included in the corresponding sub-vector segment.

In an embodiment, among the p sub-vector segments of the current information element selected by the selecting sub-module 2210, the sub-vector segment with the longest length contains twice as many information elements as the sub-vector segment with the shortest length.
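The behavior of the segment sub-module and the selecting sub-module can be sketched as two small Python functions. The names `all_segments`, `select_segments`, and `satisfies_length_ratio`, and the 1-based `first` parameter, are hypothetical and introduced only for illustration.

```python
from typing import List


def all_segments(sequence: str, position: int) -> List[str]:
    """Return the 1st to m-th segments of the element at `position`.
    The k-th segment is the current element plus the k elements after it."""
    m = len(sequence) - position - 1  # number of elements after the current one
    return [sequence[position:position + k + 1] for k in range(1, m + 1)]


def select_segments(segments: List[str], first: int, p: int) -> List[str]:
    """Pick p consecutive segments, starting at the `first`-th (1-based)."""
    if p < 1 or first < 1 or first + p - 1 > len(segments):
        raise ValueError("requested window does not fit in the m segments")
    return segments[first - 1:first - 1 + p]


def satisfies_length_ratio(selected: List[str]) -> bool:
    """Check the preferred condition: the longest selected segment
    contains twice as many elements as the shortest one."""
    return len(selected[-1]) == 2 * len(selected[0])


segments = all_segments("AGTTCAGTCAGCAGCAGCAGTCAG", 0)
selected = select_segments(segments, first=2, p=4)
print(selected)  # ['AGT', 'AGTT', 'AGTTC', 'AGTTCA']
print(satisfies_length_ratio(selected))  # True
```

Selecting from the second segment onward gives lengths 3 to 6, which meets the preferred two-to-one ratio; starting from the first segment (lengths 2 to 5) would not.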

After the information sequence is converted into the vectorized data X, the vectorized data X and the result y of the information sequence can be expressed in the form (X, y). After a batch of information sequences with known results is obtained, the sequences are input into a computer in the form (X, y), and this batch of data is used to train a logistic regression classifier. The training model may be selected as follows:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}} \tag{1}$$

The function h_θ(x) above is the formula for the estimated value of y when X is known.

As can be seen from formula (1), training the classifier model means finding a set of parameters θ such that the model output h_θ(x) matches the results of the training data as closely as possible.
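As a minimal sketch, and assuming formula (1) is the standard logistic (sigmoid) hypothesis of a logistic regression classifier, h_θ(x) can be evaluated in plain Python as follows. The function name `hypothesis` is hypothetical.

```python
import math
from typing import Sequence


def hypothesis(theta: Sequence[float], x: Sequence[float]) -> float:
    """Logistic hypothesis: the sigmoid of the dot product of theta and x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


# With all parameters at zero the model is undecided.
print(hypothesis([0.0, 0.0], [1.0, 5.0]))  # 0.5
```

The output always lies between 0 and 1, which is why it can be read as an estimate of y. For very large negative dot products `math.exp` can overflow, so a production implementation would need a numerically safer form.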

The logistic regression classifier model is trained using the gradient descent method, i.e.:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_i^{j} \tag{2}$$

In formula (2), y is the result of an information sequence, m is the total number of training data in the training process, and α is called the learning rate; in actual operation, α needs to be adjusted manually and repeatedly for the model to achieve the best effect.

In formula (2), θ and x are vectors with the same number of dimensions. The subscript i of x denotes the i-th piece of data, and the superscript j denotes the j-th component (dimension) of x.

In the gradient descent method, formula (2) is executed repeatedly for each component of θ. It can be shown that, in doing so, θ converges to a globally optimal solution. That is, when training is completed, the parameter set θ that best fits the training set is obtained.
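The training loop can be sketched in plain Python. This is a sketch under stated assumptions: it takes the update rule to be the usual batch gradient descent step for logistic regression, averaged over the m training examples, and it uses hypothetical function names, a hypothetical toy data set, and arbitrarily chosen values for the learning rate and the number of iterations.

```python
import math
from typing import List, Sequence


def hypothesis(theta: Sequence[float], x: Sequence[float]) -> float:
    """Logistic hypothesis: the sigmoid of the dot product of theta and x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(theta: List[float], xs: List[List[float]],
                  ys: List[float], alpha: float) -> List[float]:
    """Apply one update to every component of theta simultaneously."""
    m = len(xs)
    # Errors are computed once with the old theta, then reused for each component j.
    errors = [hypothesis(theta, x) - y for x, y in zip(xs, ys)]
    return [
        theta_j - alpha * sum(e * x[j] for e, x in zip(errors, xs)) / m
        for j, theta_j in enumerate(theta)
    ]


def train(xs: List[List[float]], ys: List[float],
          alpha: float = 0.5, iterations: int = 2000) -> List[float]:
    """Start from zeros and repeat the update a fixed number of times."""
    theta = [0.0] * len(xs[0])
    for _ in range(iterations):
        theta = gradient_step(theta, xs, ys, alpha)
    return theta


# Toy data: the first component is a constant bias term;
# the label is 1 when the second component is large.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
ys = [0.0, 0.0, 1.0, 1.0]
theta = train(xs, ys)
print(hypothesis(theta, [1.0, 0.0]) < 0.5)
print(hypothesis(theta, [1.0, 4.0]) > 0.5)
```

Once `theta` has been trained, calling `hypothesis(theta, x)` on a new vector gives the predicted value, which corresponds to substituting X into the trained model.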

Next, in the prediction process, when the vector X of an information sequence is known, X is substituted into formula (1), which now holds the trained parameter set θ. Computing h_θ(x) gives the predicted value of y for the given vector X of the information sequence, thereby predicting the result y of the information sequence.

Taking the DNA sequence as an example, if the DNA sequence is:

AGTTCAGTCAGCAGCAGCAGTCAG

if the expression level (result) is 0.93, then y for this data is 0.93;

each information element in the DNA sequence is read in left-to-right order. First, the first information element A is read;

the segments of information element A are then:

first segment: AG; second segment: AGT; third segment: AGTT; fourth segment: AGTTC; fifth segment: AGTTCA; and so on.

The 4 segments from the second segment to the fifth segment are selected as the sub-vector segments of information element A, where the fifth segment, AGTTCA, contains twice as many information elements as the second segment, AGT.

The second to fifth segments of information element A are matched against the DNA sequence, the number of times each of them occurs in the DNA sequence is recorded, and each count is then recorded after the corresponding segment;

the sub-vector elements of information element A are: AGT 1, AGTT 1, AGTTC 1, and AGTTCA 1;

arranging the sub-vector elements of information element A in ascending order of the length of the corresponding sub-vector segment gives the sub-vector of information element A: {AGT:1, AGTT:1, AGTTC:1, AGTTCA:1}.

Then the next information element G is read, and its sub-vector is obtained by the same method; arranged after the sub-vector of information element A, this gives: {AGT:1, AGTT:1, AGTTC:1, AGTTCA:1, GTT:1, GTTC:1, GTTCA:1, GTTCAG:1}.

Reading of the information elements in the DNA sequence continues until a sub-vector has been obtained for each information element.

The sub-vectors of the information elements are then arranged according to the positions of the corresponding information elements in the DNA sequence to form the vector X of the DNA sequence:
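The first two reads of the worked example can be reproduced with a short Python sketch. It rests on an interpretation, not on an explicit statement in the text: counts are kept as a running tally that grows by one each time a segment is produced, which is what the intermediate results for A and G suggest. The function name `running_counts` is hypothetical, and skipping elements that are too close to the end to form the longest segment is likewise an assumption.

```python
from collections import Counter
from typing import Iterator


def running_counts(sequence: str, min_len: int = 3, max_len: int = 6) -> Iterator[Counter]:
    """Yield a snapshot of the accumulated segment counts after each element is read.
    Elements too close to the end to form a max_len segment are skipped (an assumption)."""
    tally: Counter = Counter()
    for position in range(len(sequence) - max_len + 1):
        for length in range(min_len, max_len + 1):
            tally[sequence[position:position + length]] += 1
        yield Counter(tally)  # copy, so earlier snapshots stay unchanged


snapshots = list(running_counts("AGTTCAGTCAGCAGCAGCAGTCAG"))
print(dict(snapshots[0]))  # state after reading A
print(dict(snapshots[1]))  # state after reading A and G
```

Under a different counting convention, for example counting every occurrence over the whole sequence at once, the individual counts would differ, so the later snapshots should be treated as illustrative only.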

{"CAG":5,"GTT":1,"TCAG":2,"GCAGC":2,"TCAGT":1,"AGC":3,"AGCA":3,"AGT":2,"GTTCAG":1,"AGTCAG":2,"GTTCA":1,"GTCAGC":1,"CAGC":3,"CAGCAG":3,"AGCAGT":1,"CAGCA":3,"CAGT":2,"TTC":1,"TTCAG":1,"GTTC":1,"GTC":1,"GTCAG":1,"TCAGC":1,"GCAGT":1,"AGCAGC":2,"TTCA":1,"GCAGTC":1,"GCA":3,"AGCAG":3,"GCAG":3,"AGTCA":2,"GCAGCA":2,"TCAGTC":1,"CAGTCA":2,"TCAGCA":1,"GTCA":1,"CAGTC":2,"AGTC":2,"TTCAGT":1,"TCA":2}.

Combined with its expression level, the DNA sequence can be expressed in the form (X, y).

Suppose a large amount of DNA data with known results is available, all of which can be written in the form (X, y);

then, in a Python environment with scikit-learn (an open-source machine learning library) installed, the following code is entered:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X, y)

after the computer finishes, a logistic regression model has been established. When only the sequence of some DNA is known and its result is to be predicted, the sequence is written in vector form (X2) by the same method of the present invention, and the following code is then entered in Python:

predicted = model.predict(X2)

after the computer finishes, the predicted result for the vector X2 is stored in the variable predicted.
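Before vectors in the dictionary form shown above can be passed to `model.fit`, they have to be laid out as numeric rows of equal width, because scikit-learn estimators expect array-like input of shape (n_samples, n_features). The following is a minimal sketch of one way to do that using only the standard library; the function names `to_dense_rows` and `align` and the sample dictionaries are hypothetical.

```python
from typing import Dict, List, Tuple


def to_dense_rows(vectors: List[Dict[str, int]]) -> Tuple[List[str], List[List[int]]]:
    """Map segment-count dictionaries onto a shared, sorted list of segment names."""
    feature_names = sorted({segment for vector in vectors for segment in vector})
    rows = [[vector.get(name, 0) for name in feature_names] for vector in vectors]
    return feature_names, rows


def align(vector: Dict[str, int], feature_names: List[str]) -> List[int]:
    """Lay out a new vector (e.g. X2) using the columns fixed at training time.
    Segments never seen during training are dropped."""
    return [vector.get(name, 0) for name in feature_names]


names, rows = to_dense_rows([{"AGT": 2, "GTT": 1}, {"AGT": 1, "CAG": 5}])
print(names)  # ['AGT', 'CAG', 'GTT']
print(rows)   # [[2, 0, 1], [1, 5, 0]]
print(align({"CAG": 3, "TTT": 7}, names))  # [0, 3, 0]
```

A sequence to be predicted must be aligned to the same column order as the training data, which is what `align` does. Note also that scikit-learn's `LogisticRegression` is a classifier, so y is assumed here to hold discrete class labels rather than continuous values.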

The vectorized data generated by the invention has more dimensions, and the dimensions corresponding to relatively long segments describe the order relationship among the relatively short segments. Vectorized data composed of both long and short segments can therefore describe the order information of the information elements contained in the information sequence itself.

The invention can greatly improve the performance of the vectorized data of an information sequence in the subsequent data mining step, makes it possible to establish a neural network model that better expresses the relationship between an information sequence and its result, further improves the accuracy of result prediction for information sequences, and reduces the error of the classifier model.

It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.

In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".

Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.

The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.

In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general purpose or special purpose computer. For example, such computer-readable media can include, but is not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium, and, thus, is included if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., infrared, radio, and microwave. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method of converting an information sequence into vectorized data, the method comprising:

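sequentially reading each information element in the information sequence;

establishing a corresponding sub-vector according to the position of each information element in the information sequence;

arranging each sub-vector according to the position of the corresponding information element in the information sequence to form a vector of the information sequence;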

wherein, the establishing of the corresponding sub-vector according to the position of each information element in the information sequence specifically includes:

respectively selecting a current information element and a plurality of adjacent information elements to form a plurality of subvector segments;

matching each sub-vector segment of the current information element with the information sequence, and recording the occurrence times of each sub-vector segment of the current information element in the information sequence;

combining each sub-vector segment of the current information element with the corresponding occurrence number to form a corresponding sub-vector element;

and sequentially arranging the sub-vector elements to form the sub-vector of the current information element.

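2. The method according to claim 1, wherein respectively selecting the current information element and a plurality of adjacent information elements to form a plurality of sub-vector segments comprises:

selecting the current information element and one adjacent information element arranged behind the current information element to form a first segment of the current information element;

selecting the current information element and two adjacent information elements arranged behind the current information element to form a second segment of the current information element;

by analogy, selecting the current information element and m adjacent information elements arranged behind the current information element to form an m-th segment of the current information element, wherein m is the number of information elements included behind the current information element in the information sequence;

and selecting p consecutive segments from the m segments of the current information element as the sub-vector segments of the current information element, wherein p is a natural number and is less than or equal to m.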

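3. The method for converting an information sequence into vectorized data according to claim 1 or 2, wherein said number of occurrences is recorded after each sub-vector segment of the corresponding current information element;

and the sub-vector elements are arranged in ascending order of the number of information elements contained in the corresponding sub-vector segment.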

4. The method according to claim 2, wherein, among the selected p sub-vector segments of the current information element, the longest sub-vector segment contains twice as many information elements as the shortest sub-vector segment.

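5. A system for converting an information sequence into vectorized data, the system comprising:

the reading unit is used for sequentially reading each information element in the information sequence;

the sub-vector unit is used for establishing a corresponding sub-vector according to the position of each information element in the information sequence;

the vector unit is used for arranging each sub-vector according to the position of the corresponding information element in the information sequence to form a vector of the information sequence;

wherein the sub-vector unit includes: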

the sub-vector segment module is used for respectively selecting the current information element and a plurality of adjacent information elements to form a plurality of sub-vector segments;

the frequency module is used for matching each sub-vector segment of the current information element with the information sequence and recording the occurrence frequency of each sub-vector segment of the current information element in the information sequence;

the sub-vector element module is used for combining each sub-vector segment of the current information element with the corresponding occurrence number to form a corresponding sub-vector element;

and the arrangement module is used for sequentially arranging the sub-vector elements to form the sub-vector of the current information element.

6. The system for converting an information sequence into vectorized data according to claim 5, wherein said sub-vector fragment module comprises: a fragment submodule and a selection submodule;

the segment submodule is used for selecting a current information element and an adjacent information element which is arranged behind the current information element to form a first segment of the current information element; selecting a current information element and two adjacent information elements which are arranged behind the current information element to form a second segment of the current information element; by analogy, selecting a current information element and m adjacent information elements arranged behind the current information element to form an m-th segment of the current information element, wherein m is the number of the information elements included behind the current information element in the information sequence;

and the selection submodule is used for selecting continuous p segments from the m segments of the current information element as the subvector segments of the current information element, wherein p is a natural number and is less than or equal to m.

7. The system for converting an information sequence into vectorized data according to claim 5 or 6, wherein said frequency module is specifically configured to record said number of occurrences after each sub-vector segment of the corresponding current information element;

the arranging module is specifically configured to arrange the sub-vector elements in sequence from small to large according to the number of information elements included in the corresponding sub-vector segment.

8. The system according to claim 6, wherein, among the p sub-vector segments of the current information element selected by the selecting sub-module, the sub-vector segment with the longest length contains twice as many information elements as the sub-vector segment with the shortest length.