CN113345557A - Data processing method and system

Info

Publication number: CN113345557A
Application number: CN202010140962.2A
Authority: CN
Inventors: 徐忆苏
Original assignee: Beijing Yuexi Xingzhong Technology Co ltd
Current assignee: Beijing Yuexi Xingzhong Technology Co ltd
Priority date: 2020-03-03
Filing date: 2020-03-03
Publication date: 2021-09-03

Abstract

The embodiment of the invention provides a data processing method and a system, wherein the method comprises the following steps: acquiring online information data, wherein the online information data comprises: basic information, self-describing information, questionnaire information, and tongue picture information of the user; filtering the online information data to obtain target information data; extracting features of the target information data to obtain target feature vectors corresponding to the target information data; acquiring target reference information corresponding to the online information data according to a target feature vector corresponding to the target information data and a preset reference information set; the set of reference information includes: historical feature vectors corresponding to historical information data and historical reference information corresponding to the historical information data. Abundant reference information can be automatically provided for doctors, the working efficiency of the doctors is improved, and the working accuracy of the doctors is improved.

Description

Data processing method and system

Technical Field

The present invention relates to the field of internet technologies, and in particular, to a data processing method and system.

Background

Traditional Chinese medicine (TCM) has been an important part of the traditional culture of the Chinese nation for thousands of years. Using the four diagnostic methods of inspection, listening and smelling, inquiry, and palpation (望闻问切), it seeks the cause, nature, and location of a disease, analyzes the pathogenesis, and arrives at a syndrome differentiation on which treatment is based. The development of computer technology has driven the informatization of medical diagnosis and treatment, and the emergence of internet hospitals and online diagnosis and treatment platforms has made communication between patients and doctors convenient and fast, which greatly benefits patients.

However, the prior art offers no solution that can draw on large-scale diagnostic data to obtain useful reference information online.

Disclosure of Invention

The embodiment of the invention provides a data processing method and a data processing system, which can improve the working efficiency of doctors.

The embodiment of the invention provides a data processing method, which comprises the following steps:

acquiring online information data, wherein the online information data comprises: basic information, self-describing information, questionnaire information, and tongue picture information of the user;

filtering the online information data to obtain target information data;

extracting features of the target information data to obtain target feature vectors corresponding to the target information data;

acquiring target reference information corresponding to the online information data according to a target feature vector corresponding to the target information data and a preset reference information set; the set of reference information includes: historical feature vectors corresponding to historical information data and historical reference information corresponding to the historical information data.

An embodiment of the present invention provides a data processing system, including:

an obtaining module, configured to obtain online information data, where the online information data includes: basic information, self-describing information, questionnaire information, and tongue picture information of the user;

the filtering module is used for filtering the online information data to obtain target information data;

the extraction module is used for extracting the characteristics of the target information data to obtain a target characteristic vector corresponding to the target information data;

the retrieval module is used for acquiring target reference information corresponding to the online information data according to a target characteristic vector corresponding to the target information data and a preset reference information set; the set of reference information includes: historical feature vectors corresponding to historical information data and historical reference information corresponding to the historical information data.

The embodiment of the invention has the following advantages:

according to the embodiment of the invention, the online information data is acquired, the target information data is obtained by filtering the online information data, the target information data is subjected to feature extraction to obtain the target feature vector, the target reference information is obtained from the reference information set according to the target feature vector, abundant reference information can be automatically provided for a doctor according to the online information data, and the reference information is used as an intermediate processing result, so that the doctor can be assisted in performing a subsequent data analysis process, and the working efficiency of the doctor is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.

FIG. 1 is a flow chart of a first embodiment of a data processing method of the present invention;

FIG. 2 is a flow chart of a second embodiment of a data processing method of the present invention;

FIG. 3 is a block diagram illustrating the architecture of one embodiment of a data processing system of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Method embodiment one

Referring to fig. 1, a flowchart of a first embodiment of a data processing method according to the present invention is shown, where the method specifically includes:

step 101, obtaining online information data, wherein the online information data comprises: basic information of the user, self-describing information, questionnaire information, and tongue picture information.

In this embodiment, the user may upload information data through an online platform, which is used to establish an online inquiry link between an inquiry party (user) and an inquired party (doctor), such as an internet hospital, an online diagnosis and treatment platform, and a hospital client. The online platform acquires information data of an online user, wherein the information data of the user comprises: basic information of the user, self-describing information, questionnaire information, and tongue picture information.

The basic information of the user may include the user's age, height, weight, disease history, allergy history, and the like, and belongs to the category of "inquiry" (问) in traditional Chinese medicine. The items contained in the basic information can be set in advance on the online platform by a person skilled in the art according to the needs of traditional Chinese medicine, and the present invention does not specifically limit them.

The self-describing information of the user is the user's own description of his or her condition or of the symptoms to be diagnosed. It may include a description of the main symptoms, accompanying symptoms, their degree and nature, and the like, and belongs to the category of "inquiry" (问) in traditional Chinese medicine.

The tongue picture information of the user is the tongue information, such as tongue color, tongue darkness, tooth marks, and cracks, contained in the tongue picture uploaded by the user through the online platform. It belongs to the category of "inspection" (望) in traditional Chinese medicine.

The questionnaire information of the user may come from a questionnaire designed by a doctor: after obtaining the user's basic information, self-describing information, tongue picture information, and the like through the online platform, the doctor comprehensively considers the symptoms and their TCM syndrome-differentiation characteristics and designs specific questions with corresponding options, aimed at the specific condition, for the user to complete. Alternatively, the questionnaire may be generated automatically by the online platform from the user's basic information, self-describing information, tongue picture information, and the like, again with specific questions and corresponding options aimed at the specific condition. For example, after the user describes irregular menstruation, the online platform automatically generates a questionnaire with specific questions about bodily pain, abortion history, menstrual volume, menstrual quality, and the like for the user to fill in; the questionnaire information is the information contained in the completed questionnaire. It belongs to the category of "inquiry" (问) in traditional Chinese medicine.

Optionally, the target information data corresponding to the user's basic information, questionnaire information, and tongue picture information may be represented in a structured form, and the target information data corresponding to the user's self-describing information may be represented as natural language text, so as to facilitate the processing and analysis of the user information data.

Step 102, filtering the online information data to obtain target information data.

Filtering the online information data performs a preliminary processing of the user information and reduces the amount of data to be processed. For example, if the user's self-describing information is irregular menstruation, information such as the user's allergy history can be filtered out.

Step 103, performing feature extraction on the target information data to obtain a target feature vector corresponding to the target information data.

It should be noted that the present invention considers that user information data is generally meaningful within specific intervals. For example, the basic information of the user includes age, height, weight, and the like, all of which are numerical (discrete or continuous). Irregular menstruation generally occurs between the ages of 15 and 45, and symptoms may differ between age groups; for example, the corresponding symptoms rarely appear after the age of 50 or before the age of 10. Ages 1 to 10 may therefore be assigned to a first interval, characterized as undeveloped, and ages 11 to 50 to a second interval, characterized as developed. In addition, height and weight can be combined into a feature recorded as too thin, overweight, and the like, so that the user's body state can be judged against certain indexes; information such as disease history and allergy history can be classified directly into features. Different features therefore need to be delimited, which facilitates the classification and hierarchical analysis of the data.

A person skilled in the art can preset the feature information according to big data such as historical information data. For example, drawing on tongue inspection in TCM syndrome differentiation, 16 features may be set for the tongue picture information, and each feature can be further subdivided according to its degree and/or nature. The criteria for setting the feature information can be adjusted as needed by a person skilled in the art, and the present invention is not limited in this respect.

Feature extraction of the target information data may consist of extracting, from the target information data obtained by filtering, the features of the user's basic information, self-describing information, questionnaire information, and tongue picture information. The extracted features are mapped to a mapping table, and feature extraction is performed according to the mapping relationship between the target information data and the preset information data in the mapping table, to obtain the target feature vector corresponding to the target information data.

Step 104, acquiring target reference information corresponding to the online information data according to a target feature vector corresponding to the target information data and a preset reference information set; the set of reference information includes: historical feature vectors corresponding to historical information data and historical reference information corresponding to the historical information data.

The reference information set includes historical feature vectors corresponding to the historical information data and historical reference information corresponding to the historical information data. The historical feature vectors are obtained by a person skilled in the art vectorizing the historical information data in advance, and are stored on the online platform as a vector set so that the online platform can process the data intelligently.

According to the target feature vector corresponding to the target information data, a preset number of pieces of historical reference information most similar to the online user information data are retrieved from the reference information set and serve as the target reference information. The historical reference information may include historical condition reference information, historical medication reference information, historical recovery state reference information, and the like.
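The retrieval step can be illustrated with a minimal brute-force sketch. It assumes Euclidean distance as the similarity measure and represents the reference information set as a list of (vector, reference) pairs; the function name `top_k_similar` and the sample records are hypothetical and do not come from the source.

```python
import heapq
import math


def top_k_similar(target, reference_set, k=3):
    """Return the k reference records whose historical vectors are closest to target.

    reference_set is a list of (historical_vector, historical_reference_info) pairs.
    """
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Score every historical vector, then keep the k smallest distances.
    scored = ((euclidean(target, vec), info) for vec, info in reference_set)
    return [info for _, info in heapq.nsmallest(k, scored, key=lambda pair: pair[0])]


# Hypothetical reference set with two-dimensional feature vectors.
reference_set = [
    ([0.1, 0.9], "record A"),
    ([0.8, 0.2], "record B"),
    ([0.2, 0.8], "record C"),
]
result = top_k_similar([0.1, 0.85], reference_set, k=2)
print(result)  # ['record A', 'record C']
```

A linear scan like this is only practical for small reference sets; the second embodiment replaces it with a tree-based index.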

According to the embodiment of the invention, the online information data is acquired and filtered to obtain the target information data, feature extraction is performed on the target information data to obtain the target feature vector, and the target reference information is acquired from the reference information set according to the target feature vector. Based on the historical information data in the reference information set and the user information data, abundant reference information can be provided to the doctor online; this reference information serves as an intermediate processing result that assists the doctor in the subsequent data analysis process, thereby improving the doctor's working efficiency and accuracy.

Method embodiment two

Referring to fig. 2, a flowchart of a second embodiment of a data processing method according to the present invention is shown; the method may specifically include:

step 201, obtaining online information data, where the online information data includes: basic information of the user, self-describing information, questionnaire information, and tongue picture information.

Step 202, filtering the online information data to obtain target information data.

Step 203, performing feature extraction on the target information data through a first model to obtain a first feature vector corresponding to the basic information, a second feature vector corresponding to the self-describing information, a third feature vector corresponding to the questionnaire information, and a fourth feature vector corresponding to the tongue picture information.

Optionally, the first model includes: a first mapping table, a keyword list, a second mapping table, and a third mapping table.

The present invention considers that user information data is generally more meaningful within specific intervals. For example, the basic information of the user includes age, height, weight, and the like, all of which are numerical (discrete or continuous). Irregular menstruation generally occurs between the ages of 15 and 45, and symptoms may differ between age groups; for example, the corresponding symptoms rarely appear after the age of 50 or before the age of 10. Ages 1 to 10 may therefore be assigned to a first interval, characterized as undeveloped, and ages 11 to 50 to a second interval, characterized as developed. In addition, height and weight can be combined into a feature recorded as too thin, overweight, and the like, so that the user's body state can be recorded against certain indexes; information such as disease history and allergy history can be classified directly into features. Different features therefore need to be delimited, which facilitates the classification and hierarchical analysis of the data.

The invention constructs the first model in advance according to historical information data and/or the needs of a person skilled in the art. The first model includes a first mapping table, a keyword list, a second mapping table, and a third mapping table.

For example, the first mapping table is constructed from the age in the user basic information of the user information data, with the age entries divided into the intervals 1-10, 11-20, 21-30, 31-40, 41-50, 51-60, and 61-90 years, each labeled with a life stage ranging from children to the elderly. In addition, information such as disease history and allergy history can be classified and unified with the interval-based age, height, and weight information to construct the first mapping table.

Because the questionnaire information consists of specific questions and corresponding options designed for a specific condition for the user to complete, each "question + option" combination is generally treated as an individual feature, and the second mapping table can be constructed from the different "question + option" combinations according to historical information data and/or the needs of a person skilled in the art.

According to the needs of a person skilled in the art, and drawing on tongue inspection in TCM syndrome differentiation, the third mapping table can be constructed after manually selecting the tongue picture features that affect a specific condition. For example, if 16 tongue picture features are manually selected in this way, the third mapping table contains the 16 features set for the tongue picture information; each feature can also be further subdivided into more features according to its degree and/or nature.

Optionally, step 203 includes:

Step 2031, extracting a first feature vector corresponding to the basic information of the user according to a mapping relationship between the basic information of the user in the target information data and the basic information preset in the first mapping table.

Step 2032, extracting a second feature vector corresponding to the self-describing information of the user according to a mapping relationship between the self-describing information of the user in the target information data and a preset keyword in the keyword list.

Step 2033, extracting a third feature vector corresponding to the questionnaire information of the user according to a mapping relationship between the questionnaire information of the user in the target information data and the questionnaire information preset in the second mapping table.

Step 2034, extracting a fourth feature vector corresponding to the tongue picture information of the user according to a mapping relationship between the tongue picture information of the user in the target information data and the tongue picture information preset in the third mapping table.

The basic information of the user is mapped into the first mapping table according to the mapping relationship between the basic information of the user in the target information data and the basic information preset in the first mapping table, each mapped feature is assigned a value according to Boolean weights, and the first feature vector corresponding to the basic information of the user is extracted. For example, suppose the age entries in the first mapping table are the intervals 1-10, 11-20, and 21-30 years, and the age in the current user data is 16. According to the mapping relationship between the age in the user basic information and the preset age entries in the first mapping table, the age feature of the user is extracted as the 11-20 interval, and each mapped feature is assigned a value according to Boolean weights: the 1-10 interval is assigned 0, the 11-20 interval is assigned 1, and the 21-30 interval is assigned 0. The feature vector corresponding to the age in the basic information of the user is therefore [0, 1, 0].
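The Boolean assignment in the age example can be sketched as follows. The interval list reproduces the three entries used in the example; the names `AGE_INTERVALS` and `age_to_boolean_vector` are hypothetical, and a real first mapping table would also cover height, weight, disease history, and allergy history.

```python
# Age entries taken from the three-interval example in the text (inclusive bounds).
AGE_INTERVALS = [(1, 10), (11, 20), (21, 30)]


def age_to_boolean_vector(age, intervals=AGE_INTERVALS):
    """Assign 1 to the interval that contains age and 0 to every other interval."""
    return [1 if low <= age <= high else 0 for low, high in intervals]


vector = age_to_boolean_vector(16)
print(vector)  # [0, 1, 0]
```

An age outside every listed interval yields an all-zero vector in this sketch; how such values should be handled is not specified in the source.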

Similarly, the self-describing information of the user is mapped into the keyword list according to the mapping relationship between the self-describing information of the user in the target information data and the preset keywords in the keyword list, each mapped feature is assigned a value according to Boolean weights, and the second feature vector corresponding to the self-describing information of the user is extracted. The questionnaire information of the user is mapped into the second mapping table according to the mapping relationship between the questionnaire information of the user in the target information data and the questionnaire information preset in the second mapping table, each mapped feature is assigned a value according to Boolean weights, and the third feature vector corresponding to the questionnaire information of the user is extracted. The tongue picture information of the user is mapped into the third mapping table according to the mapping relationship between the tongue picture information of the user in the target information data and the tongue picture information preset in the third mapping table, each mapped feature is assigned a value according to Boolean weights, and the fourth feature vector corresponding to the tongue picture information of the user is extracted.

Step 204, splicing the first feature vector, the second feature vector, the third feature vector and the fourth feature vector to obtain an initial feature vector.
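For the self-describing information, the Boolean mapping against the keyword list might look like the sketch below. It assumes simple case-insensitive substring matching, which the source does not specify; the function name and the sample keywords are hypothetical.

```python
def keyword_boolean_vector(self_description, keyword_list):
    """Return 1 for each keyword that occurs in the free-text description, else 0."""
    text = self_description.lower()
    return [1 if keyword.lower() in text else 0 for keyword in keyword_list]


# Hypothetical keyword list; a real one would be built from historical self-descriptions.
keywords = ["irregular menstruation", "abdominal pain", "insomnia"]
second_vector = keyword_boolean_vector(
    "Irregular menstruation for three months, with insomnia.", keywords
)
print(second_vector)  # [1, 0, 1]
```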

The step of obtaining the initial feature vector may be to splice the first feature vector, the second feature vector, the third feature vector, and the fourth feature vector according to a preset sequence, where the preset sequence may be set by a person skilled in the art.
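The splicing step amounts to concatenating the four vectors in a fixed order. The sketch below assumes the vectors are plain Python lists; the function name `concatenate_features`, the `order` parameter, and the sample vectors are hypothetical.

```python
def concatenate_features(first, second, third, fourth, order=(0, 1, 2, 3)):
    """Concatenate the four Boolean feature vectors in a preset order."""
    parts = [first, second, third, fourth]
    initial = []
    for index in order:
        initial.extend(parts[index])
    return initial


# Hypothetical vectors for basic, self-describing, questionnaire, and tongue information.
initial_vector = concatenate_features([0, 1, 0], [1, 0, 1], [0, 0, 1, 0], [1, 0])
print(initial_vector)       # [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0]
print(len(initial_vector))  # 12
```

With realistic mapping tables the concatenated vector has far more entries, most of them 0.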

Optionally, the keyword list is constructed by the following steps:

Step S1, storing the self-describing information of each historical user in the reference information set, and generating a self-describing document set.

Step S2, counting the frequency of each word in the self-describing document set, the frequency of the document of each word in the self-describing document set, and the frequency of the co-occurrence of word pairs in the sentence range.

The self-describing information of each historical user is stored as one document, so that a self-describing document set is generated from the self-describing information of all historical users in the reference information set. The document frequency is defined as follows: if the word W appears in m documents of the self-describing document set, the document frequency of W in the set is m.

Step S3, selecting the target keywords in the reference information set to construct a keyword list according to the word frequency, the document frequency and the word pair co-occurrence frequency.

The self-describing information of each historical user in the reference information set is stored to generate the self-describing document set $DOC = \{doc_1, doc_2, \ldots, doc_n\}$. For each word $W_i$ in the self-describing document set, the following are counted: its frequency of occurrence $tf_i$, its document frequency $df_i$ in the self-describing document set, and its word-pair co-occurrence frequency $cof_i$ within the sentence range.
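The three statistics can be gathered in a single pass over the document set. The sketch below assumes the documents are already split into sentences and words (word segmentation is outside its scope) and counts each unordered pair of distinct words once per sentence; the function name and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_statistics(documents):
    """Count tf, df, and sentence-level co-occurrence.

    documents is a list of documents; each document is a list of sentences,
    and each sentence is a list of words.
    """
    tf, df, cof = Counter(), Counter(), Counter()
    for doc in documents:
        seen_in_doc = set()
        for sentence in doc:
            tf.update(sentence)           # every occurrence counts toward tf
            seen_in_doc.update(sentence)
            # Each unordered pair of distinct words counts once per sentence.
            for pair in combinations(sorted(set(sentence)), 2):
                cof[pair] += 1
        df.update(seen_in_doc)            # each document counts once toward df
    return tf, df, cof


docs = [
    [["menstruation", "late"], ["abdominal", "pain"]],
    [["menstruation", "pain"]],
]
tf, df, cof = count_statistics(docs)
print(tf["menstruation"], df["late"], cof[("menstruation", "pain")])  # 2 1 1
```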

The step of selecting the target keywords in the reference information set according to the word frequency, the document frequency, and the word-pair co-occurrence frequency to construct the keyword list comprises the following. TF-IDF (term frequency-inverse document frequency), a weighting technique commonly used in information retrieval and data mining, is calculated from $tf$ and $df$. Words with $df \ge 0.75 \times |DOC|$ are removed from the self-describing document set, and the remaining words are arranged in descending order of TF-IDF to form a word list $WL_1$. The N words with the highest frequency are selected as a reference list $WR$, from which collocated words can be removed using PMI (Pointwise Mutual Information); a chi-square statistic $\chi^2$ is calculated for each word from $tf$ and $cof$, and the words are arranged in descending order of $\chi^2$ to form a word list $WL_2$. The words in $WL_1$ and $WL_2$ are numbered in sequence, and for each word in the union of $WL_1$ and $WL_2$ the sum of its numbers, $sum_i$, is calculated. The words are then arranged in ascending order of $sum_i$, and a preset number of words is selected according to their sum values to construct the keyword list.
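The TF-IDF list and the final rank merge can be sketched as below. Several details are assumptions not stated in the source: TF-IDF is taken as tf × ln(|DOC| / df), a word missing from one list is given the number one past the end of that list, a smaller sum is treated as better, and the chi-square list is supplied as a ready-made input rather than computed. All names and sample values are hypothetical.

```python
import math


def tfidf_word_list(tf, df, n_docs, max_df_ratio=0.75):
    """Rank words by TF-IDF after dropping words with df >= max_df_ratio * n_docs."""
    scores = {
        word: tf[word] * math.log(n_docs / df[word])
        for word in tf
        if df[word] < max_df_ratio * n_docs
    }
    # Descending TF-IDF; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda word: (-scores[word], word))


def merge_by_rank_sum(wl1, wl2, top_n):
    """Sum each word's 1-based numbers in both lists and keep the smallest sums."""
    rank1 = {word: i for i, word in enumerate(wl1, start=1)}
    rank2 = {word: i for i, word in enumerate(wl2, start=1)}
    words = set(wl1) | set(wl2)
    total = {
        word: rank1.get(word, len(wl1) + 1) + rank2.get(word, len(wl2) + 1)
        for word in words
    }
    return sorted(words, key=lambda word: (total[word], word))[:top_n]


tf = {"pain": 6, "menstruation": 4, "the": 20, "late": 3}
df = {"pain": 3, "menstruation": 2, "the": 8, "late": 1}
wl1 = tfidf_word_list(tf, df, n_docs=8)     # "the" is dropped because df >= 6
wl2 = ["menstruation", "pain", "cold"]      # hypothetical chi-square ranking
keyword_list = merge_by_rank_sum(wl1, wl2, top_n=2)
print(wl1)           # ['late', 'pain', 'menstruation']
print(keyword_list)  # ['menstruation', 'pain']
```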

Step 205, processing the initial feature vector through a second model to obtain a target feature vector corresponding to the target information data.

In step 204, splicing the first, second, third, and fourth feature vectors yields an initial feature vector with many dimensions. Because the feature vectors are obtained with Boolean weights, the initial feature vector tends to be sparse, which is unfavorable for distance and/or similarity calculations between feature vectors. The initial feature vector therefore needs to be reduced in dimension by the second model, that is, the main information is retained to obtain a dense target feature vector.

Optionally, the second model comprises a multi-layer perceptron deep model or a principal component analysis model.

Optionally, step 205 includes:

Step 2051, when the amount of historical information data in the reference information set is greater than or equal to a preset threshold and the historical information data is labeled, processing the initial feature vector through the multi-layer perceptron deep model to obtain the target feature vector corresponding to the target information data;

Step 2052, when the amount of historical information data in the reference information set is smaller than the preset threshold, or the historical information data in the reference information set is not labeled, processing the initial feature vector through the principal component analysis model to obtain the target feature vector corresponding to the target information data.

The historical information data may be obtained from the online platform or from other big data sources. The reference information set includes historical feature vectors corresponding to the historical information data and historical reference information corresponding to the historical information data. The historical feature vectors can be obtained by a person skilled in the art vectorizing the historical information data $D_0 = \{d_1, d_2, \ldots, d_n\}$ in advance to obtain $V_0 = \{v_1, v_2, \ldots, v_n\}$, and the vector set $V_0$ is stored so that the online platform can process the data intelligently. According to the data quantity $n$ of the historical information data, the online platform determines which of the MLP (multi-layer perceptron) deep model or the PCA (principal component analysis) model in the second model is used to process the initial feature vector and reduce its dimensionality, to obtain the target feature vector corresponding to the target information data.

The preset threshold on the amount of historical information data in the reference information set can be set by a person skilled in the art according to the data conditions and/or technical needs, for example $n \ge 10^5$, and the present invention is not limited in this respect. In addition, when the historical information data in the reference information set is not labeled, supervised model training cannot be performed, so the principal component analysis model is selected to process the initial feature vector and obtain the target feature vector corresponding to the target information data.
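The choice between the two dimension-reduction models reduces to a simple rule. The sketch below only encodes the selection logic of steps 2051 and 2052 and does not implement either model; the function name, the string return values, and the default threshold of 10^5 (taken from the example in the text) are assumptions.

```python
def choose_second_model(n_records, has_labels, threshold=10**5):
    """Pick the dimension-reduction model for the initial feature vector.

    The MLP is used only when there is enough labeled historical data;
    otherwise PCA is used, since it needs no labels.
    """
    if n_records >= threshold and has_labels:
        return "mlp"
    return "pca"


print(choose_second_model(200_000, has_labels=True))   # mlp
print(choose_second_model(200_000, has_labels=False))  # pca
print(choose_second_model(5_000, has_labels=True))     # pca
```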

Step 206, acquiring target reference information corresponding to the online information data according to a target feature vector corresponding to the target information data and a preset reference information set; the set of reference information includes: historical feature vectors corresponding to historical information data and historical reference information corresponding to the historical information data.

Optionally, step 206 includes:

Acquiring, through a third model, the target reference information corresponding to the online information data from the reference information set according to the target feature vector corresponding to the target information data.

Optionally, the third model is constructed by the following steps:

Step Y1, determining the feature vector dimension and the similarity metric of the target feature vector, wherein the similarity metric comprises: the Euclidean distance and the cosine of the included angle.

Step Y2, when the feature vector dimension of the target feature vector is smaller than or equal to a preset dimension and the similarity metric of the target feature vector is the Euclidean distance, selecting a kd tree to construct the third model.

Step Y3, when the feature vector dimension of the target feature vector is larger than the preset dimension, or the similarity metric of the target feature vector is the cosine of the included angle, selecting a ball tree to construct the third model.

The reference information set includes historical feature vectors corresponding to the historical information data and historical reference information corresponding to the historical information data. The historical feature vectors can be obtained by a person skilled in the art vectorizing the historical information data $D_0 = \{d_1, d_2, \ldots, d_n\}$ in advance to obtain $V_0 = \{v_1, v_2, \ldots, v_n\}$, and the vector set $V_0$ is stored so that the online platform can process the data intelligently.

According to the target feature vector corresponding to the target information data, a preset number of the most similar pieces of target reference information corresponding to the online information data are retrieved from the reference information set. If the preset number is greater than one, multiple pieces of target reference information are obtained, that is, multiple pieces of reference information are provided for the doctor's reference. The historical reference information may include historical condition reference information, historical medication reference information, historical recovery status reference information, and the like.
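The retrieval described above can be sketched as a brute-force nearest-neighbour search, shown only to make the data flow concrete. The names `retrieve_top_k` and `reference_set`, the sample vectors and the placeholder reference strings are illustrative assumptions and do not come from the embodiment; in the embodiment a tree index (the third model) takes the place of the linear scan.

```python
import heapq
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retrieve_top_k(target_vector, reference_set, k):
    """Return the k historical reference entries closest to target_vector.

    reference_set is a list of (historical_vector, historical_reference) pairs.
    """
    nearest = heapq.nsmallest(
        k, reference_set, key=lambda item: euclidean(target_vector, item[0])
    )
    return [reference for _, reference in nearest]


# Hypothetical reference set: (historical feature vector, historical reference information).
reference_set = [
    ((0.0, 0.0), "reference A"),
    ((1.0, 1.0), "reference B"),
    ((5.0, 5.0), "reference C"),
]

# The preset number k = 2, so two pieces of reference information are returned,
# ordered from most to least similar.
top_two = retrieve_top_k((0.9, 1.2), reference_set, k=2)
```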

The preset dimension is set in advance by a technician as needed. For example, if the preset dimension is 50, then when the feature vector dimension dim of the target feature vector is less than or equal to 50 and the similarity measurement index of the target feature vector is the Euclidean distance measurement, a kd tree is selected to construct the third model, which obtains the target reference information from the reference information set; when dim is greater than 50, or the similarity measurement index is the included angle cosine value measurement, a ball tree is selected to construct the third model, which obtains the target reference information from the reference information set.
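The selection rule of steps Y2 and Y3 can be written as a small decision function. This is a minimal sketch: the function name `choose_index_structure` and the string labels for the metrics and tree types are assumptions made for illustration, and the value 50 is the example value from the paragraph above.

```python
PRESET_DIM = 50  # example value from the text; set by a technician as needed


def choose_index_structure(dim, metric, preset_dim=PRESET_DIM):
    """Pick the tree type for the third model following steps Y2 and Y3."""
    if metric not in ("euclidean", "cosine"):
        raise ValueError(f"unsupported similarity metric: {metric}")
    # Step Y2: low dimension together with the Euclidean distance -> kd tree.
    if dim <= preset_dim and metric == "euclidean":
        return "kd_tree"
    # Step Y3: dimension above the preset value, or the cosine metric -> ball tree.
    return "ball_tree"
```

For example, `choose_index_structure(50, "euclidean")` selects the kd tree, while both `choose_index_structure(51, "euclidean")` and `choose_index_structure(10, "cosine")` select the ball tree.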

It should be noted that the included angle cosine value does not itself satisfy the definition of a distance required by the ball tree. Therefore, when the similarity measurement index of the target feature vector is the included angle cosine value measurement, the cosine value is first converted into the size of the included angle between the vectors through an inverse trigonometric function (the arccosine), which does satisfy the definition of a distance, and the ball tree is then used to construct the third model and obtain the target reference information from the reference information set. When the feature vector dimension of the target feature vector is greater than the preset dimension and the similarity measurement index is not the included angle cosine value measurement, the ball tree is used directly to construct the third model and obtain the target reference information from the reference information set.
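The need for this conversion can be illustrated with three vectors at 0, 45 and 90 degrees. The sketch below assumes the cosine-based dissimilarity is taken as 1 - cos (one common choice, not stated in the embodiment) and that all vectors are nonzero; the function names are hypothetical. It shows that 1 - cos breaks the triangle inequality on these vectors, while the included angle obtained through the arccosine does not.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the included angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_dissimilarity(u, v):
    """Assumed cosine-based dissimilarity: 1 - cos(angle)."""
    return 1.0 - cosine_similarity(u, v)


def angular_distance(u, v):
    """Included angle in radians, obtained from the cosine via arccos."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_value = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(cos_value)


a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

# 1 - cos: the direct route a -> c is longer than the detour through b.
violates_triangle = cosine_dissimilarity(a, c) > (
    cosine_dissimilarity(a, b) + cosine_dissimilarity(b, c)
)

# Included angle: the direct route is no longer than the detour
# (a small tolerance absorbs floating-point rounding).
angle_holds = angular_distance(a, c) <= (
    angular_distance(a, b) + angular_distance(b, c) + 1e-12
)
```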

In addition, in practical applications, the preset number of pieces of reference information obtained according to the present application can serve as a reference source for the doctor's work; typically, the doctor refers to this reference information when providing the final result information corresponding to the user's information data. Therefore, optionally, the user's information data and the final result information may also be treated as historical information data and stored in the reference information set in key-value form, where the key is the target feature vector corresponding to the user's information data and the value is the final result information corresponding to the user's information data.
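The key-value storage described above might look as follows. This is a sketch under stated assumptions: the function name `store_result`, the in-memory list and the placeholder result string are hypothetical, and the embodiment does not specify a storage backend.

```python
def store_result(reference_set, target_vector, final_result):
    """Append a key-value pair: key = target feature vector, value = final result."""
    # A tuple is immutable, so the stored key cannot be changed by the caller later.
    reference_set.append((tuple(target_vector), final_result))
    return reference_set


reference_set = []
store_result(reference_set, [0.2, 0.7, 0.1], "final result information (placeholder)")
```

A tree index built over the earlier contents of the set would presumably need to be rebuilt or updated before newly stored entries become retrievable.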

In summary, in the embodiment of the present invention, online information data is acquired and filtered to obtain target information data, and feature extraction is then performed on the target information data through the first model and the second model to obtain the target feature vector. Because both the first model and the second model are preset by those skilled in the art, data processing efficiency can be improved. Target reference information is then obtained from the reference information set according to the target feature vector. Drawing on a reference information set of a certain scale, which contains historical information data and user information data, the method provides the doctor with abundant reference information as an intermediate processing result to assist subsequent data analysis, thereby improving the doctor's working efficiency and accuracy.

System embodiment

Referring to FIG. 3, a block diagram of an embodiment of a data processing system of the present invention is shown. The system may specifically include:

An obtaining module 301, configured to obtain online information data, where the online information data includes: basic information, self-describing information, questionnaire information, and tongue picture information of the user.

The filtering module 302 is configured to filter the online information data to obtain target information data.

An extracting module 303, configured to perform feature extraction on the target information data to obtain a target feature vector corresponding to the target information data.

A retrieval module 304, configured to obtain target reference information corresponding to the online information data according to a target feature vector corresponding to the target information data and a preset reference information set; the set of reference information includes: historical feature vectors corresponding to historical information data and historical reference information corresponding to the historical information data.

Optionally, the extracting module includes:

a sub-module for extracting corresponding feature vectors, configured to perform feature extraction on the target information data through a first model to obtain a first feature vector corresponding to the basic information, a second feature vector corresponding to the self-describing information, a third feature vector corresponding to the questionnaire information and a fourth feature vector corresponding to the tongue picture information;

a splicing submodule, configured to splice the first feature vector, the second feature vector, the third feature vector and the fourth feature vector to obtain an initial feature vector;

a processing submodule, configured to process the initial feature vector through a second model to obtain the target feature vector corresponding to the target information data.
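The splicing submodule can be illustrated with a short sketch, assuming that "splicing" means end-to-end concatenation in the fixed order first, second, third, fourth. The function name `splice` and the sample vectors are hypothetical.

```python
def splice(first, second, third, fourth):
    """Concatenate the four per-source feature vectors into the initial feature vector."""
    return list(first) + list(second) + list(third) + list(fourth)


# Hypothetical per-source vectors: basic information, self-describing information,
# questionnaire information, tongue picture information.
initial_vector = splice([1, 0], [0, 1, 1], [2], [0.5, 0.25])
```

The dimension of the initial feature vector is the sum of the four input dimensions (here 2 + 3 + 1 + 2 = 8), which is one reason a second model that reduces or re-encodes the vector is applied afterwards.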

Optionally, the sub-module for extracting the corresponding feature vector includes:

a first extraction unit, configured to extract a first feature vector corresponding to the basic information of the user according to a mapping relationship between the basic information of the user in the target information data and the basic information preset in the first mapping table;

a second extraction unit, configured to extract a second feature vector corresponding to the self-describing information of the user according to a mapping relationship between the self-describing information of the user in the target information data and a preset keyword in the keyword list;

a third extraction unit, configured to extract a third feature vector corresponding to the questionnaire information of the user according to a mapping relationship between the questionnaire information of the user in the target information data and the questionnaire information preset in the second mapping table;

a fourth extraction unit, configured to extract a fourth feature vector corresponding to the tongue picture information of the user according to a mapping relationship between the tongue picture information of the user in the target information data and the tongue picture information preset in the third mapping table.
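The mapping-based extraction performed by these units can be sketched as simple lookups. Everything concrete here is an assumption for illustration: the binary keyword-presence encoding, the substring matching, the function names, the sample keywords and the contents of the hypothetical third mapping table are not taken from the embodiment.

```python
def keyword_vector(self_description, keyword_list):
    """One component per keyword: 1 if the keyword occurs in the text, else 0."""
    text = self_description.lower()
    return [1 if keyword.lower() in text else 0 for keyword in keyword_list]


def table_vector(value, mapping_table):
    """Look up a categorical value in a mapping table of preset vectors."""
    return list(mapping_table[value])


keyword_list = ["headache", "insomnia", "cough"]   # hypothetical keyword list
tongue_table = {"pale": (1, 0), "red": (0, 1)}     # hypothetical third mapping table

second_vector = keyword_vector("Headache and dry cough for three days", keyword_list)
fourth_vector = table_vector("red", tongue_table)
```

Here `second_vector` marks which preset keywords appear in the self-describing text, and `fourth_vector` is the preset vector stored for the tongue picture category "red".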

Optionally, the processing sub-module includes:

a first processing unit, configured to process the initial feature vector through the multi-layer perceptron depth model to obtain the target feature vector corresponding to the target information data when the amount of historical information data in the reference information set is greater than or equal to a preset threshold and the historical information data is labeled;

a second processing unit, configured to process the initial feature vector through the principal component analysis model to obtain the target feature vector corresponding to the target information data when the amount of historical information data in the reference information set is less than the preset threshold or the historical information data in the reference information set is not labeled.
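The choice between the two processing units reduces to a single condition. The sketch below only encodes that condition; the function name `choose_second_model`, the string labels and the threshold value used in the example call are hypothetical.

```python
def choose_second_model(num_records, has_labels, threshold):
    """Select the second model from the size and labelling of the historical data."""
    # Enough historical data AND labels available -> multi-layer perceptron.
    if num_records >= threshold and has_labels:
        return "mlp"
    # Too little data OR no labels -> principal component analysis.
    return "pca"


# Hypothetical threshold of 10000 records.
model = choose_second_model(num_records=2500, has_labels=True, threshold=10000)
```

The split is consistent with how the two techniques are trained: a multi-layer perceptron is fitted with supervision and therefore needs labels and a reasonable amount of data, whereas principal component analysis is unsupervised and can be fitted on small, unlabeled sets.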

Optionally, the retrieving module includes:

a third model retrieval submodule, configured to retrieve the target reference information from the reference information set through the third model according to the target feature vector corresponding to the target information data.

Optionally, the system further includes a third model building module, configured to build the third model; the third model building module comprises:

a determining submodule, configured to determine the feature vector dimension and the similarity measurement index of the target feature vector, wherein the similarity measurement index includes: Euclidean distance measurement and included angle cosine value measurement;

a first retrieval submodule, configured to select a kd tree to construct the third model when the feature vector dimension of the target feature vector is less than or equal to a preset dimension and the similarity measurement index of the target feature vector is the Euclidean distance measurement;

a second retrieval submodule, configured to select a ball tree to construct the third model when the feature vector dimension of the target feature vector is greater than the preset dimension or the similarity measurement index of the target feature vector is the included angle cosine value measurement.

Optionally, the system further includes a keyword list construction module, configured to construct the keyword list; the keyword list building module comprises:

a storage submodule, configured to store the self-describing information of each historical user in the reference information set and generate a self-describing document set;

a statistics submodule, configured to count the word frequency of each word in the self-describing document set, the document frequency of each word in the self-describing document set, and the co-occurrence frequency of word pairs within a sentence;

a keyword selection submodule, configured to select target keywords in the reference information set according to the word frequency, the document frequency and the word pair co-occurrence frequency to construct the keyword list.
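The three statistics gathered by the statistics submodule can be computed in one pass over the document set. The sketch below is illustrative only: it assumes English text split on sentence-ending punctuation (Chinese self-descriptions would need a word segmenter), and the selection rule in `select_keywords`, with its thresholds `min_tf`, `min_df` and `min_pair`, is a hypothetical criterion, since the embodiment does not state how the three frequencies are combined.

```python
import re
from collections import Counter
from itertools import combinations


def corpus_statistics(documents):
    """Return (word frequency, document frequency, sentence-level pair co-occurrence)."""
    term_freq, doc_freq, pair_freq = Counter(), Counter(), Counter()
    for doc in documents:
        doc_words = set()
        for sentence in re.split(r"[.!?]+", doc):
            words = re.findall(r"[a-z]+", sentence.lower())
            term_freq.update(words)
            doc_words.update(words)
            # Count each unordered word pair once per sentence.
            pair_freq.update(combinations(sorted(set(words)), 2))
        # Each word counts once per document it appears in.
        doc_freq.update(doc_words)
    return term_freq, doc_freq, pair_freq


def select_keywords(term_freq, doc_freq, pair_freq, min_tf=2, min_df=2, min_pair=1):
    """Keep words that are frequent, spread across documents, and co-occur with another word."""
    co_occurring = {w for pair, n in pair_freq.items() if n >= min_pair for w in pair}
    return sorted(
        w for w in term_freq
        if term_freq[w] >= min_tf and doc_freq[w] >= min_df and w in co_occurring
    )


# Hypothetical self-describing document set.
documents = [
    "Severe headache. Poor sleep.",
    "Headache and cough. Cough and poor appetite.",
]
tf, df, pairs = corpus_statistics(documents)
keywords = select_keywords(tf, df, pairs)
```

On this toy set, "cough" is frequent but confined to a single document, so the document-frequency threshold excludes it, while "headache" and "poor" appear in both documents and are kept.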

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

With regard to the system in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.

The data processing method and system provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A data processing method, comprising:

filtering the online information data to obtain target information data;

2. The method according to claim 1, wherein the extracting the features of the target information data to obtain a target feature vector corresponding to the target information data comprises:

performing feature extraction on the target information data through a first model to obtain a first feature vector corresponding to the basic information, a second feature vector corresponding to the self-describing information, a third feature vector corresponding to the questionnaire information and a fourth feature vector corresponding to the tongue picture information;

splicing the first feature vector, the second feature vector, the third feature vector and the fourth feature vector to obtain an initial feature vector;

and processing the initial characteristic vector through a second model to obtain a target characteristic vector corresponding to the target information data.

3. The method of claim 2, wherein the first model comprises: a first mapping table, a keyword list, a second mapping table and a third mapping table;

the performing feature extraction on the target information data through the first model to obtain a first feature vector corresponding to the basic information of the user, a second feature vector corresponding to the self-describing information of the user, a third feature vector corresponding to the questionnaire information of the user, and a fourth feature vector corresponding to the tongue picture information of the user includes:

extracting a first feature vector corresponding to the basic information of the user according to a mapping relation between the basic information of the user in the target information data and the basic information preset in the first mapping table;

extracting a second feature vector corresponding to the self-describing information of the user according to the mapping relation between the self-describing information of the user in the target information data and a preset keyword in the keyword list;

extracting and obtaining a third feature vector corresponding to the questionnaire information of the user according to the mapping relation between the questionnaire information of the user in the target information data and the questionnaire information preset in the second mapping table;

and extracting a fourth feature vector corresponding to the tongue picture information of the user according to the mapping relation between the tongue picture information of the user in the target information data and the tongue picture information preset in the third mapping table.

4. The method of claim 2, wherein the second model comprises a multi-layered perceptron depth model or a principal component analysis model; the processing the initial feature vector through the second model to obtain the target feature vector corresponding to the target information data includes:

when the amount of historical information data in the reference information set is greater than or equal to a preset threshold and the historical information data is labeled, processing the initial feature vector through the multi-layer perceptron depth model to obtain a target feature vector corresponding to the target information data;

and when the amount of historical information data in the reference information set is less than the preset threshold or the historical information data in the reference information set is not labeled, processing the initial feature vector through the principal component analysis model to obtain a target feature vector corresponding to the target information data.

5. The method according to claim 1, wherein the obtaining target reference information corresponding to the online information data according to a target feature vector corresponding to the target information data and a preset reference information set includes:

acquiring target reference information corresponding to the online information data from a reference information set through a third model according to a target feature vector corresponding to the target information data;

wherein the third model is constructed by the following steps:

determining a feature vector dimension and a similarity measurement index of the target feature vector; wherein the similarity measurement index comprises: Euclidean distance measurement and included angle cosine value measurement;

when the feature vector dimension of the target feature vector is smaller than or equal to a preset dimension and the similarity measurement index of the target feature vector uses Euclidean distance measurement, selecting a kd tree to construct the third model;

and when the feature vector dimension of the target feature vector is larger than the preset dimension or the similarity measurement index of the target feature vector is measured by using the cosine value of the included angle, selecting a ball tree to construct the third model.

6. The method of claim 3, wherein the keyword list is constructed by:

storing the self-describing information of each historical user in the reference information set, and generating a self-describing document set;

counting the word frequency of each word in the self-describing document set, the document frequency of each word in the self-describing document set and the co-occurrence frequency of word pairs in a sentence range;

and selecting target keywords in the reference information set according to the word frequency, the document frequency and the word pair co-occurrence frequency to construct a keyword list.

7. A data processing system, comprising:

8. The system of claim 7, wherein the extraction module comprises:

a sub-module for extracting corresponding feature vectors, configured to perform feature extraction on the target information data through a first model to obtain a first feature vector corresponding to the basic information, a second feature vector corresponding to the self-describing information, a third feature vector corresponding to the questionnaire information, and a fourth feature vector corresponding to the tongue picture information;

the splicing submodule is used for splicing the first feature vector, the second feature vector, the third feature vector and the fourth feature vector to obtain an initial feature vector;

9. The system of claim 8, wherein the first model comprises: a first mapping table, a keyword list, a second mapping table and a third mapping table;

the sub-module for extracting the corresponding feature vectors comprises:

10. The system of claim 8, wherein the second model comprises a multi-layered perceptron depth model or a principal component analysis model; the processing submodule comprises:

11. The system of claim 7, wherein the retrieval module comprises:

the third model retrieval submodule is used for retrieving target reference information from a reference information set through a third model according to the target characteristic vector corresponding to the target information data;

the system further comprises a third model building module for building the third model; the third model building module comprises:

the determining submodule is used for determining the feature vector dimension and the similarity measurement index of the target feature vector; wherein the similarity measurement index comprises: Euclidean distance measurement and included angle cosine value measurement;

the first retrieval submodule is used for selecting a kd tree to construct the third model when the feature vector dimension of the target feature vector is smaller than or equal to a preset dimension and the similarity measurement index of the target feature vector uses Euclidean distance measurement;

12. The system of claim 9, further comprising a keyword list construction module for constructing the keyword list; the keyword list building module comprises: