WO2020215697A1 - Tongue image extraction method and device, and a computer readable storage medium - Google Patents

Tongue image extraction method and device, and a computer readable storage medium Download PDF

Info

Publication number
WO2020215697A1
Authority: WIPO (PCT)
Prior art keywords: image, tongue, feature, training, matrix
Application number: PCT/CN2019/118413
Other languages: French (fr), Chinese (zh)
Inventors: 曹靖康, 王健宗, 王义文
Original Assignee: 平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Priority to SG11202008404RA
Publication of WO2020215697A1

Classifications

    • G06F18/21322: Pattern recognition; feature extraction based on discrimination criteria, e.g. discriminant analysis; rendering the within-class scatter matrix non-singular
    • G06F18/21328: Rendering the within-class scatter matrix non-singular involving subspace restrictions, e.g. nullspace techniques
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions

Definitions

  • This application relates to a tongue image extraction method, device and computer readable storage medium.
  • Existing tongue image detection methods usually adopt a target detection approach: a sliding window slides across the image horizontally and vertically, a CNN model extracts spatial features of the objects inside the window, and an SVM classifier classifies the extracted features to determine whether the window contains a tongue image.
  • The coordinates of the four corner points of the sliding window are then output, and the position of the tongue image is calibrated with those coordinates.
  • However, because the size and pose of the tongue vary widely between images, the size of the target frame is uncertain, so sliding recognition must be repeated with target frames of various sizes, which makes target detection complex to a certain degree.
  • A tongue image extraction method is provided, which is applied to an electronic device, and the method includes the following steps:
  • convert the training images containing tongues into a matrix V, where all the non-negative gray values of one image correspond to one column of V, and train with the LNMF algorithm to decompose the matrix V into the product of a non-negative feature matrix W and a weight matrix H, that is, V = WH;
  • the dimension of the non-negative feature matrix W is n*r, and its r columns are feature base images; the feature base images refer to the non-negative feature matrix W representing tongue features, and the non-negative feature matrix W forms a non-negative subspace;
  • the dimension of the weight matrix H is r*m, and each of its columns is an encoding;
  • use the EHMM model to identify whether the test image contains a face image; if it does, project the training images and the test image onto the non-negative subspace to obtain feature coefficients for each, use the nearest-neighbor criterion to compute the similarity between the feature coefficients of the training images and the test image, and extract the tongue-representing features whose similarity exceeds the similarity threshold as tongue features;
  • after projection, the feature areas containing tongue features and the non-feature areas without tongue features are identified with different labels; the label set corresponds to the boundary information of the feature area, and the extreme values in the up, down, left, and right directions are extracted from the boundary information to determine the border containing the feature area.
  • This application also provides a tongue image extraction device, including:
  • a matrix decomposition module, used to convert the training images containing tongues into a matrix V, where all the non-negative gray values of one image correspond to one column of V, and to train with the LNMF algorithm to decompose the matrix V into the product of a non-negative feature matrix W and a weight matrix H, that is, V = WH;
  • the dimension of the non-negative feature matrix W is n*r, and its r columns are feature base images; the feature base images refer to the non-negative feature matrix W representing tongue features, and the non-negative feature matrix W constitutes a non-negative subspace;
  • the dimension of the weight matrix H is r*m, and each of its columns is an encoding;
  • a tongue feature extraction module, which uses the EHMM model to identify whether the test image contains a face image and, if it does, projects the training images and the test image onto the non-negative subspace to obtain feature coefficients for each, uses the nearest-neighbor criterion to compute the similarity between the feature coefficients of the training images and the test image, and extracts the tongue-representing features whose similarity exceeds the similarity threshold as tongue features;
  • a tongue image segmentation module, which uses different labels to mark the feature areas containing tongue features and the non-feature areas without them; the label set corresponds to the boundary information of the feature area, and the extreme values in the up, down, left, and right directions are extracted from the boundary information to determine the border containing the feature area.
  • The present application also provides an electronic device, which includes a memory and a processor; a tongue image extraction program is stored in the memory, and when the tongue image extraction program is executed by the processor, the following steps are implemented:
  • convert the training images containing tongues into a matrix V and train with the LNMF algorithm to decompose it into the product of a non-negative feature matrix W and a weight matrix H, that is, V = WH;
  • the dimension of the non-negative feature matrix W is n*r, and its r columns are feature base images; the feature base images refer to the non-negative feature matrix W representing tongue features, and the non-negative feature matrix W forms a non-negative subspace;
  • the dimension of the weight matrix H is r*m, and each of its columns is an encoding;
  • use the EHMM model to identify whether the test image contains a face image and, if so, project the training images and the test image onto the non-negative subspace, compute the similarity of the corresponding feature coefficients with the nearest-neighbor criterion, and extract the features whose similarity exceeds the similarity threshold as tongue features;
  • after projection, the feature areas containing tongue features and the non-feature areas without tongue features are identified with different labels; the label set corresponds to the boundary information of the feature area, and the extreme values in the up, down, left, and right directions are extracted from the boundary information to determine the border containing the feature area.
  • A computer non-volatile readable storage medium is also provided, which stores a computer program; the computer program includes program instructions that, when executed by a processor, implement any of the tongue image extraction methods described above.
  • FIG. 1 is a schematic flowchart of a tongue image extraction method according to an embodiment of the present application;
  • FIG. 2 is a first schematic diagram of the superstates and embedded states of the EHMM corresponding to the slices of an image in an embodiment of the present application;
  • FIG. 3 is a second schematic diagram of the superstates and embedded states of the EHMM corresponding to the slices of an image in an embodiment of the present application;
  • FIG. 4 is a third schematic diagram of the superstates and embedded states of the EHMM corresponding to the slices of an image in an embodiment of the present application;
  • FIG. 5 is a schematic diagram of the hardware architecture of an electronic device according to an embodiment of the present application;
  • FIG. 6 is a block diagram of the modules of a tongue image extraction program according to an embodiment of the present application;
  • FIG. 7 is a schematic diagram of border adjustment by the linear regression model according to an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a tongue image extraction method provided by an embodiment of the application, which is applied to an electronic device, and the method includes the following steps:
  • S110: Use the LNMF (local non-negative matrix factorization) algorithm for training to obtain feature base images of different dimensions. For example, 1000 tongue images (that is, images that contain a tongue and reflect features such as its shape and color) are used as the training set, and the tongue images have been annotated in advance. Preferably, each tongue image may first be compressed, for example to 56*64 pixels, then de-meaned and normalized, before the LNMF algorithm is trained to obtain feature base images of different dimensions; a preprocessing sketch follows below. The feature base images refer to the non-negative feature matrix W representing the characteristics of the tongue, and this non-negative feature matrix W constitutes a non-negative subspace.
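A minimal Python sketch of this preprocessing, as an illustration rather than the patent's implementation: after de-meaning, the values are shifted back to non-negative so the matrix V remains valid LNMF input (`build_training_matrix` and its parameters are hypothetical names).

```python
import numpy as np
from PIL import Image

def build_training_matrix(image_paths, size=(56, 64)):
    """Compress each annotated tongue image, de-mean and normalize it,
    and stack the gray values of each image as one column of V."""
    columns = []
    for path in image_paths:
        img = Image.open(path).convert('L').resize(size)  # compress to 56*64 gray
        v = np.asarray(img, dtype=float).ravel()
        v = v - v.mean()                  # de-mean
        v = v - v.min()                   # shift back so all values stay non-negative
        v = v / (v.max() + 1e-9)          # normalize to [0, 1]
        columns.append(v)
    return np.stack(columns, axis=1)      # V has size n*m (n pixels, m images)
```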
  • LNMF is an improvement upon NMF. The training images are assembled into a matrix V and decomposed as V = WH: the dimension of the feature matrix W is n*r, and its r columns are the base images; the dimension of the weight matrix H is r*m, and each of its columns is an encoding that corresponds one-to-one to a tongue image in V. A training image can therefore be expressed as a linear combination of base images.
  • S120: The non-negative feature matrix W representing the characteristics of the tongue constitutes a non-negative subspace. The training images and the test images are each projected onto the non-negative subspace obtained from the training image set, yielding feature coefficients for each, and the nearest-neighbor criterion is used to compute the similarity between the feature coefficients corresponding to the training images and the test images. The tongue-representative features whose feature-coefficient similarity exceeds the set threshold are extracted as tongue features, so that images with tongue features are screened out of the test images.
  • The characteristics of the tongue include the shape, angle, and color of the tongue, the state of the tongue coating, and the positional relationship between the tongue and the facial organs.
  • S130: The test image is projected onto the non-negative subspace. The projection is equivalent to transforming the test image into the non-negative subspace; the result is still an image, one composed of the learned features. A sketch of the projection and similarity screening follows below.
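A minimal sketch of the projection and screening, assuming non-negative least squares for the projection onto the subspace spanned by W and cosine similarity as the nearest-neighbor measure (both are illustrative choices; `sim_threshold` is a hypothetical parameter, not a value from the patent).

```python
import numpy as np
from scipy.optimize import nnls

def project(W, images):
    """Project each image (one column of `images`) onto the non-negative
    subspace spanned by the columns of W; returns the r*m coefficient matrix."""
    return np.stack([nnls(W, v)[0] for v in images.T], axis=1)

def screen_tongue_images(W, train_imgs, test_imgs, sim_threshold=0.9):
    """Keep the test images whose nearest training neighbor (by cosine
    similarity of the feature coefficients) exceeds the threshold."""
    H_train = project(W, train_imgs)
    H_test = project(W, test_imgs)
    Ht = H_train / (np.linalg.norm(H_train, axis=0, keepdims=True) + 1e-9)
    Hs = H_test / (np.linalg.norm(H_test, axis=0, keepdims=True) + 1e-9)
    similarity = Ht.T @ Hs               # pairwise train-vs-test similarities
    best = similarity.max(axis=0)        # nearest neighbor per test image
    return np.where(best > sim_threshold)[0]   # indices deemed to contain a tongue
```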
  • After projection, the feature areas containing tongue features and the non-feature areas without tongue features are identified with different labels, so that the feature areas containing tongue features can be segmented from the test image.
  • The label set corresponds to the boundary information of the feature area, and the extreme values in the up, down, left, and right directions are extracted from the boundary information to determine the smallest border containing the feature area.
  • The point of using the smallest border is that linear regression will later be used to adjust its position to eliminate or reduce the position error.
  • The non-feature area and the feature area carry different labels; for example, the non-feature area is 0 and the feature area is non-zero, as in the sketch below.
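As a sketch, determining the border from the zero/non-zero labels amounts to taking the extreme coordinates of the non-zero pixels (illustrative code, with the mask convention assumed from the text above).

```python
import numpy as np

def minimal_border(label_mask):
    """Return (left, top, right, bottom) of the smallest border enclosing
    the feature area, where feature pixels are non-zero and others are 0."""
    rows, cols = np.nonzero(label_mask)
    if rows.size == 0:
        return None                      # no tongue feature area found
    return cols.min(), rows.min(), cols.max(), rows.max()
```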
  • Based on the zero and non-zero labels, the image area representing the tongue feature can be segmented from each test image with a border. Further, the method also includes step S140: using SVM classifiers to classify the features extracted from the test image. The extracted features are sent to k SVM classifiers for recognition, where the value of k equals the number of categories. For example, the features may be classified into "tongue" and "non-tongue", or according to the characteristics of the pathological condition of the tongue.
  • Specifically, classification may follow the characteristics of the different tongue images corresponding to a person's physical condition, which can include damp-heat, yin deficiency, normal, excess heat, qi-blood stagnation, and blood stasis.
  • The class with the highest score among the k SVM classifiers is taken as the classification result, as in the sketch below.
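A minimal sketch of the k-classifier scheme, assuming one binary one-vs-rest SVM per category and scikit-learn's `LinearSVC` (an assumed implementation choice; the patent does not name a library).

```python
import numpy as np
from sklearn.svm import LinearSVC

CATEGORIES = ["damp-heat", "yin deficiency", "normal",
              "excess heat", "qi-blood stagnation", "blood stasis"]

def train_k_svms(features, labels):
    """One binary SVM per category, so k equals the number of categories."""
    return {c: LinearSVC().fit(features, (labels == c).astype(int))
            for c in CATEGORIES}

def classify(svms, feature_vec):
    """Send the feature vector to all k SVMs and keep the highest-scoring class."""
    scores = {c: clf.decision_function(feature_vec.reshape(1, -1))[0]
              for c, clf in svms.items()}
    return max(scores, key=scores.get)
```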
  • Step S150: adjust the border position of the tongue image through a linear regression model.
  • A linear regression model is trained separately for each category, for example damp-heat, yin deficiency, normal, excess heat, qi-blood stagnation, and blood stasis.
  • The input is the features of the image within the border, and the output is the translation values (left-right and up-down) and the scaling values of the border.
  • The linear regression model is used to calculate the translation and scaling values of the border, and a loss function is used to constrain the position error of the border, so that the border is continuously adjusted toward a suitable position.
  • Position prediction: let the predicted border be P = (P_x, P_y, P_w, P_h) and the true position be G = (G_x, G_y, G_w, G_h). Bounding-box regression learns the four accurate transformation values d_x(P), d_y(P), d_w(P), d_h(P) such that the adjusted border satisfies G^_x = P_w d_x(P) + P_x, G^_y = P_h d_y(P) + P_y, G^_w = P_w exp(d_w(P)), G^_h = P_h exp(d_h(P)).
  • The regression targets are t_x = (G_x − P_x)/P_w and t_y = (G_y − P_y)/P_h for the translation, and for the width and height (t_w, t_h), t_w = log(G_w/P_w) and t_h = log(G_h/P_h).
  • Construct the objective function w_* = argmin Σ_{i=1..N} (t_*^i − ŵ_*^T K(P^i))², where w_* is the parameter to be learned (* stands for x, y, w, h, that is, one objective function is set for each transformation), d_*(P) = w_*^T K(P) is the predicted transformation value, K(P) is the feature vector corresponding to the feature region, i indexes the i-th sample, and N is the number of samples.
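These transformations can be sketched directly from the formulas above (illustrative code; borders are (x, y, w, h) with (x, y) the center, and fitting w_* is left to any linear regressor).

```python
import numpy as np

def regression_targets(P, G):
    """Targets (t_x, t_y, t_w, t_h) for a proposed border P and ground truth G."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return np.array([(Gx - Px) / Pw, (Gy - Py) / Ph,
                     np.log(Gw / Pw), np.log(Gh / Ph)])

def adjust_border(P, d):
    """Apply predicted d_x, d_y, d_w, d_h to translate and scale the border."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    return (Pw * dx + Px, Ph * dy + Py, Pw * np.exp(dw), Ph * np.exp(dh))
```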
  • NMF is a subspace projection method. Since the features extracted by the NMF algorithm are global, there is no locality restriction on the feature space; LNMF emphasizes the localization of the basic feature components in the process of decomposing the original image. The LNMF algorithm minimizes the divergence D(V‖WH) = Σ_{i,j} [ V_ij ln( V_ij / (WH)_ij ) − V_ij + (WH)_ij ] + α Σ_{i,j} (WᵀW)_ij − β Σ_i (HHᵀ)_ii subject to W, H ≥ 0, where:
  • α and β are positive constants;
  • W_j denotes the j-th column vector of the feature base matrix W, and each column of the feature base matrix W is normalized;
  • V = [V_1, V_2, ..., V_i, ..., V_m] denotes a set of m training images, where V_i is the column vector of the i-th training image and V_ij is the j-th gray value of the i-th image;
  • each training image has n pixels, so the size of V is n*m;
  • W = [W_1, W_2, ..., W_j, ..., W_r] is the feature matrix, of size n*r;
  • H = [H_1, H_2, ..., H_j, ..., H_m] is the weight matrix, H_j being the j-th column vector of H, of size r*m. A sketch of the multiplicative updates follows below.
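A minimal sketch of the LNMF factorization, following the published multiplicative update rules (Li et al.); the iteration count and random initialization here are assumptions.

```python
import numpy as np

def lnmf(V, r, n_iter=200, eps=1e-9):
    """Factor the n*m non-negative matrix V into W (n*r) and H (r*m);
    the columns of W are the localized base images."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        R = V / (W @ H + eps)                         # elementwise V_ij / (WH)_ij
        H = np.sqrt(H * (W.T @ R))                    # localized encoding update
        W = W * (R @ H.T) / (H.sum(axis=1) + eps)     # base image update
        W = W / (W.sum(axis=0, keepdims=True) + eps)  # normalize each column W_j
    return W, H
```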
  • Preferably, both the training images and the test images are first binarized. Binarization means setting the gray value of each pixel to only 0 or 255, that is, rendering the entire image in pure black and white; compared with the three channels of a color image, the resulting single channel is more conducive to model optimization, so the tongue area can be obtained more accurately.
  • Specifically, the gray value of each pixel is set to 0 or 255 according to a set gray threshold, which can be the middle value between 0 and 255: a value below the threshold is set to 0, and a value greater than or equal to the threshold is set to 255, as in the sketch below.
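A one-function sketch of the thresholding just described (the mid-range default of 128 follows the text's suggestion of the middle value between 0 and 255).

```python
import numpy as np

def binarize(gray_image, threshold=128):
    """Pixels below the gray threshold become 0; the rest become 255."""
    return np.where(gray_image < threshold, 0, 255).astype(np.uint8)
```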
  • Preferably, the EHMM (Embedded Hidden Markov Model) algorithm is first used to classify the test images into the two categories "face" and "no face", to optimize recognition accuracy.
  • The specific classification process includes the following steps:
  • The EHMM model scans the test image from top to bottom and from left to right with a moving window. It first scans from left to right; each window position yields a set of feature vectors, which is one feature extraction of the face region at that moment. After the scanning window computes its feature vector, it moves right by a fixed distance and continues the feature extraction; when it reaches the right edge of the image, it moves down to the next row and continues scanning from left to right. When the window reaches the bottom-right of the image, the entire scanning process ends; multiple sets of feature vectors have been obtained, and together they form an observation sequence.
  • The EHMM model contains a set of superstates; the number of superstates in the set is the same as the number of vertical slices of a human face.
  • Each superstate encapsulates a set of embedded states, and the number of embedded states is the same as the number of horizontal slices of the face.
  • The EHMM model scans the image from left to right and top to bottom through a fixed-size window, so the facial features correspond to the superstates from top to bottom and to the embedded states from left to right.
  • The image slices corresponding to the longitudinal superstates are the forehead, eye, nose, mouth, and chin areas; from top to bottom, the positional relationship of these areas is fixed, which is the common feature of human faces. The individuality of the face in the vertical direction is reflected by the characteristics of each superstate (that is, each region) and the relationships between the superstates.
  • From left to right, the face is divided into the left face, left eye, the area between the two eyes, right eye, and right face; this positional relationship is also fixed, and the individuality of the face in the horizontal direction is reflected by each embedded state and the mutual relationships between the embedded states.
  • In recognition, the forward algorithm is used to compute the probability that the observation sequence matches the feature sequence composed of multiple facial feature points; if this similarity probability is greater than the decision threshold, the detected image is considered to contain a face. A sketch of the window scan that produces the observation sequence follows below.
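The window scan can be sketched as follows. The window size, step, and the use of low-frequency 2D-DCT coefficients as the per-window feature vector are assumptions (DCT features are a common choice for EHMM face models, but the patent does not specify them).

```python
import numpy as np
from scipy.fftpack import dct

def observation_sequence(gray_image, window=(16, 16), step=(4, 4), n_coef=6):
    """Slide a fixed-size window left-to-right, then top-to-bottom, and
    extract one feature vector per position; the rows/columns of the grid
    line up with the vertical superstates / horizontal embedded states."""
    wh, ww = window
    sy, sx = step
    grid = []
    for y in range(0, gray_image.shape[0] - wh + 1, sy):
        row = []
        for x in range(0, gray_image.shape[1] - ww + 1, sx):
            block = gray_image[y:y + wh, x:x + ww].astype(float)
            coef = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
            row.append(coef[:n_coef, :n_coef].ravel())  # keep low frequencies
        grid.append(row)
    return np.array(grid)   # shape: (n_rows, n_cols, feature_dim)
```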
  • The training process of the EHMM model is as follows:
  • Transition matrix: A_0 = {a_0,ij}, where a_0,ij is the probability of transitioning from superstate i to superstate j.
  • Because scanning proceeds in one direction, the only allowed transitions are from a state to itself or to the next state, so the probability of transitioning from a state back to a previous state is 0.
  • B_k denotes the observation probability matrix; its entries give the probability that embedded state j of superstate k produces a given observation. The two state indices correspond to the vertical and horizontal dimensions respectively.
  • Image segmentation: the training image is uniformly divided; the observation sequence obtained from the image is evenly divided into N_0 longitudinal slices corresponding to the longitudinal superstates, and each longitudinal slice can be divided from left to right into multiple embedded states.
  • Parameter initialization: after segmentation, the initial values of the model parameters are obtained from the initialization probabilities and the state transition probabilities.
  • Each EHMM state uses K-means clustering to compute the observation probabilities, where K is the number of Gaussian distributions in each state; all the observation vectors extracted for an embedded state can then be described by a Gaussian mixture model as the observation probability density function.
  • The state initialization rule of each superstate is as follows: the initialization probability of the first state of each EHMM is set to 1.0, and the initialization probability of the other states is 0.
  • Embedded Viterbi segmentation: after the first iteration, the doubly embedded Viterbi algorithm is used instead of uniform segmentation, and a new set of initialization and transition probabilities is determined from the new segmentation by event frequency counting.
  • K-means clustering is then used to compute the observation vectors corresponding to the new states and the new observation probability density functions; in the next iteration, these values are used as the initial values for a new round of doubly embedded Viterbi segmentation. A sketch of the uniform segmentation and initialization follows below.
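A sketch of the uniform first-iteration segmentation and the initialization rule described above. The equal self/next transition split is an assumed starting value; the text only fixes which transitions are allowed and that the first state's initial probability is 1.0.

```python
import numpy as np

def uniform_segmentation(n_rows, n_cols, n_super, n_embedded):
    """Assign each observation row to one of N0 = n_super vertical
    superstates and each column to one of n_embedded embedded states."""
    super_of_row = np.minimum(np.arange(n_rows) * n_super // n_rows, n_super - 1)
    embed_of_col = np.minimum(np.arange(n_cols) * n_embedded // n_cols, n_embedded - 1)
    return super_of_row, embed_of_col

def initial_parameters(n_states):
    """First state starts with probability 1.0, others 0; only self and
    next-state transitions are allowed (backward transitions stay at 0)."""
    pi = np.zeros(n_states)
    pi[0] = 1.0
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = A[i, i + 1] = 0.5      # assumed equal split between stay/advance
    A[-1, -1] = 1.0                      # last state can only stay
    return pi, A
```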
  • Preferably, the longitudinal slices of the face also include the hair area.
  • The hair area provides additional facial features, which helps recognize faces more accurately.
  • The calculation process is basically the same as the process above, so it is not repeated here.
  • The above divides the entire face into vertical slices corresponding to the superstates and horizontal slices corresponding to the embedded states.
  • Since the purpose of this application is tongue image extraction, face recognition may use only some of the longitudinal slices as superstates; for example, as shown in Fig. 4, only the chin and mouth areas correspond to superstates. The calculation process is basically the same as the process above, so it is not repeated here.
  • Preferably, the images containing a human face are further classified, including identifying the gender and age of the person in the image.
  • Although a person's age and gender cannot be accurately determined from the state of the tongue, the condition of a person's tongue is related to his or her age and gender. For example, the number of taste buds (papillary protrusions distributed on the tongue) differs across age stages and shows a decreasing trend.
  • There are about 10,000 taste buds in childhood; as a person ages, the cells slowly degenerate, and in old age only about 20% of the childhood taste buds remain. The younger the person, the more tender and redder the tongue; the older the person, the darker the tongue.
  • Female tongues are usually smaller than male tongues.
  • Therefore, the test images can first be classified according to the age and gender of the person, and images of the tongue area can then be extracted from the images within each age-gender category. The images in each category correlate more closely with the tongues that the category should actually contain; that is to say, if the tongue in an image belongs to an elderly person, the image is classified into the older category.
  • The tongue in that image then has the characteristics an elderly tongue should have, such as a small number of taste buds (recognized only roughly, of course) and a dim tongue color, so it can be recognized faster, which effectively reduces the computation of the model.
  • This requires training in advance an LNMF model corresponding to each age group. That is to say, the images in the training set are first classified according to age and gender, and the images in each category are labeled as with or without tongue, which forms age-gender-tongue labels; a separate LNMF model is trained for each category.
  • Specifically, six categories are set according to age group and gender, such as 0-20-male, 20-40-male, 40-70-male, 0-20-female, 20-40-female, and 40-70-female.
  • A CNN (convolutional neural network) is used to recognize the training images and classify them into the above six age-gender categories;
  • the images in these six categories are labeled, the label of each image being its age, gender, and tongue label;
  • LNMF models are trained for these six categories respectively, yielding six LNMF models corresponding to the above six categories.
  • For example, the LNMF model corresponding to 0-20-male tongues is used to recognize 0-20-male tongues, and the LNMF model corresponding to 20-40-female tongues is used to recognize 20-40-female tongues.
  • The tongue image area is then extracted by the LNMF model trained for the corresponding age group, gender, and tongue label, as in the routing sketch below.
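Routing a test image to the right per-category LNMF model amounts to a dictionary lookup; `predict_category` and `extract` are hypothetical interfaces standing in for the CNN classifier and the trained LNMF pipeline.

```python
def extract_tongue(test_image, cnn, lnmf_models):
    """First classify the test image by age group and gender, then apply
    the LNMF model trained for that category to extract the tongue area.
    `cnn.predict_category` and `model.extract` are hypothetical interfaces."""
    category = cnn.predict_category(test_image)   # e.g. "20-40-female"
    model = lnmf_models[category]                 # one trained LNMF model per category
    return model.extract(test_image)              # framed tongue image area, if any

# lnmf_models maps the six categories to their trained models, e.g.
# {"0-20-male": ..., "20-40-male": ..., "40-70-male": ...,
#  "0-20-female": ..., "20-40-female": ..., "40-70-female": ...}
```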
  • In this way, the test images are also divided into the corresponding age and gender categories, and tongues of a given age and gender have corresponding characteristics, which makes them easier for the LNMF models to recognize.
  • Dividing the test images into multiple categories that are recognized simultaneously also speeds up recognition.
  • Moreover, since each tongue image corresponds to an age group and gender, this also helps the accuracy and speed of the later classification into damp-heat, yin deficiency, normal, excess heat, qi-blood stagnation, and blood stasis.
  • the electronic device 2 is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions.
  • the electronic device 2 can be a smart phone, a tablet computer, a notebook computer, a desktop computer, a server, etc.
  • the electronic device 2 at least includes, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can be communicatively connected to each other through a system bus.
  • the memory 21 includes at least one type of computer non-volatile readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, memory, magnetic disk, optical disk, and the like.
  • the memory 21 may be an internal storage unit of the electronic device 2, such as a hard disk or a memory of the electronic device 2.
  • the memory 21 may also be an external storage device of the electronic device 2, for example, a plug-in hard disk equipped on the electronic device 2, a smart media card (SMC), a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
  • the memory 21 is generally used to store the operating system and various application software installed in the electronic device 2, such as the tongue image extraction program code.
  • the processor 22 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 22 is used to run the program code or processing data stored in the memory 21, for example, to run the tongue image extraction program.
  • the network interface 23 may include a wireless network interface or a wired network interface, and the network interface 23 is generally used to establish a communication connection between the electronic device 2 and other electronic devices.
  • the electronic device 2 may also include a display, and the display may also be called a display screen or a display unit.
  • the memory 21 containing a readable storage medium may include an operating system, a tongue image extraction program 50, and the like.
  • When the processor 22 executes the tongue image extraction program 50 in the memory 21, the steps S110 to S150 described above are implemented, which will not be repeated here.
  • The tongue image extraction program 50 stored in the memory 21 can be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete this application.
  • FIG. 6 shows a schematic diagram of the program modules of the tongue image extraction program.
  • In this embodiment, the tongue image extraction program 50 can be divided into a matrix decomposition module 501, a tongue feature extraction module 502, and a tongue image segmentation module 503.
  • the following description will specifically introduce the specific functions of the program modules.
  • The matrix factorization module 501 is used for training with the LNMF (local non-negative matrix factorization) algorithm to obtain feature base images of different dimensions.
  • For example, 1000 tongue images (that is, images that contain a tongue and reflect features such as its shape and color) are used as the training set, and the tongue images have been annotated in advance.
  • LNMF is an improvement upon NMF. The dimension of the feature matrix W is n*r, and its r columns are the base images; the dimension of the weight matrix H is r*m, and each of its columns is an encoding that corresponds one-to-one to a tongue image in V. A training image can therefore be expressed as a linear combination of base images.
  • NMF is a subspace projection method. Since the features extracted by the NMF algorithm are global, there is no locality restriction on the feature space; LNMF emphasizes the localization of the basic feature components in the process of decomposing the original image. The formula of the LNMF algorithm is as described above and will not be detailed here.
  • The tongue feature extraction module 502 uses the EHMM model to identify whether the test image contains a face image, and if it does, performs feature extraction on the test image.
  • The non-negative feature matrix W representing the characteristics of the tongue constitutes a non-negative subspace; the training images and the test images are each projected onto the non-negative subspace obtained from the training image set, yielding feature coefficients for each, and the nearest-neighbor criterion is used to compute the similarity between the feature coefficients corresponding to the training images and the test images, so as to extract the features in the test images.
  • If the similarity of the feature coefficients is higher than the set threshold, the feature base in the test image is considered to be a tongue, so that images with tongue features can be selected from the test images.
  • The tongue image segmentation module 503 is used to project the test image onto the non-negative subspace.
  • The projection is equivalent to transforming the test image into the non-negative subspace; the result is still an image, one composed of the learned features.
  • The non-feature area and the feature area carry different labels, for example 0 for the non-feature area and non-zero for the feature area; based on zero and non-zero, the image area representing the tongue feature can be segmented from each test image with a border.
  • a classification module 504 is further included.
  • The classification module 504 is used to classify the features extracted from the test image using SVM classifiers: the extracted features are sent to k SVM classifiers for recognition, where the value of k equals the number of categories.
  • For example, the features may be classified into "tongue" and "non-tongue", or according to the characteristics of the pathological condition of the tongue, so as to obtain a framed tongue image. Specifically, classification may follow the characteristics of the different tongue images corresponding to a person's physical condition, which can include damp-heat, yin deficiency, normal, excess heat, qi-blood stagnation, and blood stasis; the class with the highest score among the k SVM classifiers is taken as the classification result.
  • a frame adjustment module 505 is further included.
  • The frame adjustment module 505 is used to adjust the frame position of the tongue image through a linear regression model.
  • A linear regression model is trained separately for each category, for example damp-heat, yin deficiency, normal, excess heat, qi-blood stagnation, and blood stasis.
  • The input is the features of the image within the frame, and the output is the translation values (left-right and up-down) and the scaling values of the frame.
  • The linear regression model is used to calculate the translation and scaling values of the frame, and a loss function is used to constrain the position error of the frame, so that the frame is continuously adjusted toward a suitable position.
  • a binarization module 506 is further included.
  • The binarization module 506 is used to binarize both the training images and the test images (that is, to set the gray value of each pixel to 0 or 255, rendering the entire image in pure black and white). Color images (such as RGB images) obtain their many colors by superimposing the three color channels red (R), green (G), and blue (B); the tongue areas obtained from them contain more hollow (missing) regions, while black-and-white images have only a single channel, and a single channel is more conducive to model optimization than three channels, so the tongue area is obtained more accurately.
  • The face recognition module 507 is used to first classify the test images with the EHMM (Embedded Hidden Markov Model) algorithm; specifically, the images are classified into the two categories "face" and "no face" to optimize recognition accuracy.
  • Using the EHMM model for face recognition includes the following steps:
  • The EHMM model scans the test image from top to bottom and from left to right through a moving window. The EHMM model contains a set of superstates, where the number of superstates in the set is the same as the number of vertical slices of a human face; each superstate encapsulates a set of embedded states, and the number of embedded states in the set is the same as the number of horizontal slices of the face.
  • The EHMM model scans the image from left to right and top to bottom through a fixed-size window (facial features correspond to the superstates from top to bottom and to the embedded states from left to right).
  • Each window position yields a set of feature vectors, which is one feature extraction of the face region at that moment.
  • After the scanning window computes its feature vector, it moves right by a fixed distance and continues the feature extraction.
  • When it reaches the right edge of the image, it moves down to the next row and continues scanning from left to right.
  • When the window reaches the bottom-right of the image, the entire scanning process ends; multiple sets of feature vectors have been obtained, and together they form an observation sequence.
  • In recognition, the forward algorithm is used to compute the probability that the observation sequence matches the feature sequence composed of multiple facial feature points; if this similarity probability is greater than the decision threshold, the detected image is considered to contain a face.
  • The training process of the EHMM model is as follows:
  • Transition matrix: A_0 = {a_0,ij}, where a_0,ij is the probability of transitioning from superstate i to superstate j.
  • Because scanning proceeds in one direction, the only allowed transitions are from a state to itself or to the next state, so the probability of transitioning from a state back to a previous state is 0.
  • B_k denotes the observation probability matrix; its entries give the probability that embedded state j of superstate k produces a given observation. The two state indices correspond to the vertical and horizontal dimensions respectively.
  • Image segmentation: the training image is uniformly divided; the observation sequence obtained from the image is evenly divided into N_0 longitudinal slices corresponding to the longitudinal superstates, and each longitudinal slice can be divided from left to right into multiple embedded states.
  • Parameter initialization: after segmentation, the initial values of the model parameters are obtained from the initialization probabilities and the state transition probabilities.
  • Each EHMM state uses K-means clustering to compute the observation probabilities, where K is the number of Gaussian distributions in each state; all the observation vectors extracted for an embedded state can then be described by a Gaussian mixture model as the observation probability density function.
  • The state initialization rule of each superstate is as follows: the initialization probability of the first state of each EHMM is set to 1.0, and the initialization probability of the other states is 0.
  • Embedded Viterbi segmentation: after the first iteration, the doubly embedded Viterbi algorithm is used instead of uniform segmentation, and a new set of initialization and transition probabilities is determined from the new segmentation by event frequency counting.
  • K-means clustering is then used to compute the observation vectors corresponding to the new states and the new observation probability density functions; in the next iteration, these values are used as the initial values for a new round of doubly embedded Viterbi segmentation.
  • a reclassification module 508 is further included.
  • The reclassification module 508 is used, after the embedded hidden Markov algorithm has classified the test images into face and non-face, to further classify the face images, including identifying the gender and age of the person in the image.
  • Specifically, six categories are set according to age group and gender, such as 0-20-male, 20-40-male, 40-70-male, 0-20-female, 20-40-female, and 40-70-female.
  • A CNN (convolutional neural network) is used to recognize the training images and classify them into the above six categories;
  • LNMF models are trained for these six categories respectively, yielding six LNMF models corresponding to the above six categories.
  • For example, the LNMF model corresponding to 0-20-male tongues is used to recognize 0-20-male tongues, and the LNMF model corresponding to 20-40-female tongues is used to recognize 20-40-female tongues;
  • the corresponding LNMF model is then used to extract the tongue image area.
  • In this way, the test images are also divided into the corresponding age and gender categories, and tongues of a given age and gender have corresponding characteristics, which makes them easier for the LNMF models to recognize.
  • Dividing the test images into multiple categories that are recognized simultaneously also speeds up recognition.
  • Moreover, since each tongue image corresponds to an age group and gender, this also helps the accuracy and speed of the later classification into damp-heat, yin deficiency, normal, excess heat, qi-blood stagnation, and blood stasis.
  • an embodiment of the present application also provides a tongue image extraction device, which includes a matrix decomposition module 501, a tongue feature extraction module 502, and a tongue image segmentation module 503.
  • The matrix decomposition module 501 is used to convert the training images containing tongues into a matrix V, where all the non-negative gray values of one image correspond to one column of V, and to train with the LNMF algorithm to decompose the matrix V into the product of the non-negative feature matrix W and the weight matrix H, that is, V = WH. The dimension of the non-negative feature matrix W is n*r, and its r columns are the feature base images; the feature base images refer to the non-negative feature matrix W representing tongue features, and the non-negative feature matrix W constitutes a non-negative subspace. The dimension of the weight matrix H is r*m, and each of its columns is an encoding.
  • The tongue feature extraction module 502 uses the EHMM model to identify whether the test image contains a face image; if it does, the training images and the test image are projected onto the non-negative subspace to obtain feature coefficients for each, the nearest-neighbor criterion is used to compute the similarity between the feature coefficients corresponding to the training images and the test image, and the features in the test image whose similarity exceeds the similarity threshold are extracted as tongue features;
  • the tongue image segmentation module 503 uses different labels to distinguish the feature area containing tongue features from the non-feature area without them, and determines the smallest border containing the feature area by reading the labels, thereby segmenting the feature area representing the tongue features from the test image.
  • a classification module 504 is further included.
  • The classification module 504 is used to classify the features extracted from the test image using SVM classifiers; the extracted features are sent to k SVM classifiers for recognition, where the value of k equals the number of categories.
  • a frame adjustment module 505 is further included.
  • The frame adjustment module 505 is used to adjust the frame position of the tongue image through a linear regression model.
  • A linear regression model is trained separately for each category, for example damp-heat, yin deficiency, normal, excess heat, qi-blood stagnation, and blood stasis.
  • The input is the features of the image within the frame, and the output is the translation values (left-right and up-down) and the scaling values of the frame.
  • The linear regression model is used to calculate the translation and scaling values of the frame, and a loss function is used to constrain the position error of the frame, so that the frame is continuously adjusted toward a suitable position.
  • a binarization module 506 is further included.
  • the binarization module 506 is used to perform binarization on both the training image and the test image.
  • The face recognition module 507 is used to first classify the test images with the EHMM (Embedded Hidden Markov Model) algorithm; specifically, the images are classified into the two categories "face" and "no face" to optimize recognition accuracy.
  • the embodiment of the present application also proposes a computer non-volatile readable storage medium.
  • the computer non-volatile readable storage medium may be a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), USB memory, etc., or any combination of several.
  • The computer non-volatile readable storage medium includes a tongue image extraction program and the like; when the tongue image extraction program 50 is executed by the processor 22, the following operations are implemented:
  • Use the LNMF (local non-negative matrix factorization) algorithm for training to obtain feature base images of different dimensions. For example, 1000 tongue images (that is, images that contain a tongue and reflect features such as its shape and color) are used as the training set, and the tongue images have been annotated in advance.
  • Preferably, each tongue image may first be compressed, for example to 56*64 pixels, then de-meaned and normalized, before the LNMF algorithm is trained to obtain feature base images of different dimensions.
  • The feature base images refer to the non-negative feature matrix W representing the characteristics of the tongue, and this non-negative feature matrix W constitutes a non-negative subspace.
  • LNMF is an improvement upon NMF. The dimension of the feature matrix W is n*r, and its r columns are the base images; the dimension of the weight matrix H is r*m, and each of its columns is an encoding that corresponds one-to-one to a tongue image in V. A training image can therefore be expressed as a linear combination of base images.
  • The EHMM model is used to identify whether the test image contains a face image, and if it does, feature extraction is performed on the test image.
  • The non-negative feature matrix W representing the characteristics of the tongue constitutes a non-negative subspace; the training images and the test images are each projected onto the non-negative subspace obtained from the training image set, yielding feature coefficients for each, and the nearest-neighbor criterion is used to compute the similarity between the feature coefficients corresponding to the training images and the test images, so as to extract the features in the test images.
  • If the similarity of the feature coefficients is higher than the set threshold, the feature base in the test image is considered to be a tongue, so that images with tongue features can be selected from the test images.
  • The test image is projected onto the non-negative subspace; the projection is equivalent to transforming the test image into the non-negative subspace, and the result is still an image, one composed of the learned features.
  • Different labels are used to mark the feature areas containing tongue features and the non-feature areas without them; the smallest borders containing the feature areas are determined by reading the labels, and the feature areas containing tongue features are segmented from the test image.
  • For example, the non-feature area is 0 and the feature area is non-zero; based on zero and non-zero, the image area representing the tongue feature can be segmented from each test image with a border.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A tongue image extraction method and device, and a computer readable storage medium. In the invention, an LNMF algorithm is used to perform training, and the matrix V corresponding to the training images is decomposed into the product of a non-negative feature matrix W and a weight matrix H; the dimensions of the non-negative feature matrix W are n*r, its r columns being feature base images, and the non-negative feature matrix W forming a non-negative subspace; the dimensions of the weight matrix H are r*m. The training images and the test images are projected onto the non-negative subspace to obtain a feature coefficient for each; the nearest-neighbor rule is used to compute the degree of similarity between the feature coefficients corresponding to the training images and the test images; and features in the test images for which the degree of similarity is higher than a threshold value are extracted, so that frames may be used to separate out from each test image the regions representing tongue features.

Description

Tongue image extraction method, device and computer readable storage medium
Cross-reference to related applications:
This application claims priority to the Chinese patent application No. 201910733855.8, filed with the China Patent Office on August 9, 2019 and titled "Tongue Image Extraction Method, Apparatus, and Computer-readable Storage Medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to a tongue image extraction method, device and computer readable storage medium.
Background
Existing tongue image detection methods usually adopt a target detection approach: a sliding window slides across the image horizontally and vertically, a CNN model extracts spatial features of the objects inside the window, and an SVM classifier classifies the extracted features to determine whether the window contains a tongue image. The coordinates of the four corner points of the sliding window are then output, and the position of the tongue image is calibrated with those coordinates. However, because the size and pose of the tongue vary widely between images, the size of the target frame is uncertain, so sliding recognition must be repeated with target frames of various sizes, which makes target detection complex to a certain degree.
Therefore, the inventor realized that how to quickly obtain a correctly posed, complete, and clear tongue image is an urgent problem to be solved.
Summary of the invention
According to various embodiments disclosed in this application, a tongue image extraction method is provided, which is applied to an electronic device, and the method includes the following steps:
S110: Convert the training images containing tongues into a matrix V, where all the non-negative gray values of one image correspond to one column of V, and train with the LNMF algorithm to decompose the matrix V into the product of a non-negative feature matrix W and a weight matrix H, that is, V = WH.
The dimension of the non-negative feature matrix W is n*r, and its r columns are feature base images; the feature base images refer to the non-negative feature matrix W representing tongue features, and the non-negative feature matrix W forms a non-negative subspace.
The dimension of the weight matrix H is r*m, and each of its columns is an encoding.
S120: Use the EHMM model to identify whether the test image contains a face image; if it does, project the training images and the test image onto the non-negative subspace to obtain feature coefficients for each, use the nearest-neighbor criterion to compute the similarity between the feature coefficients corresponding to the training images and the test image, and extract the tongue-representing features in the test image whose similarity exceeds the similarity threshold as tongue features.
S130: After projection, the feature areas containing tongue features and the non-feature areas without tongue features are identified with different labels; the label set corresponds to the boundary information of the feature area, and the extreme values in the up, down, left, and right directions are extracted from the boundary information to determine the border containing the feature area.
This application also provides a tongue image extraction device, including:
a matrix decomposition module, used to convert the training images containing tongues into a matrix V, where all the non-negative gray values of one image correspond to one column of V, and to train with the LNMF algorithm to decompose the matrix V into the product of a non-negative feature matrix W and a weight matrix H, that is, V = WH; the dimension of the non-negative feature matrix W is n*r, its r columns are the feature base images, the feature base images refer to the non-negative feature matrix W representing tongue features, and the non-negative feature matrix W constitutes a non-negative subspace; the dimension of the weight matrix H is r*m, and each of its columns is an encoding;
a tongue feature extraction module, which uses the EHMM model to identify whether the test image contains a face image and, if it does, projects the training images and the test image onto the non-negative subspace to obtain feature coefficients for each, uses the nearest-neighbor criterion to compute the similarity between the feature coefficients corresponding to the training images and the test image, and extracts the tongue-representing features in the test image whose similarity exceeds the similarity threshold as tongue features;
a tongue image segmentation module, which uses different labels to mark the feature areas containing tongue features and the non-feature areas without them; the label set corresponds to the boundary information of the feature area, and the extreme values in the up, down, left, and right directions are extracted from the boundary information to determine the border containing the feature area.
This application also provides an electronic device, which includes a memory and a processor; a tongue image extraction program is stored in the memory, and when the tongue image extraction program is executed by the processor, the following steps are implemented:
S110: Convert the training images containing tongues into a matrix V, where all the non-negative gray values of one image correspond to one column of V, and train with the LNMF algorithm to decompose the matrix V into the product of a non-negative feature matrix W and a weight matrix H, that is, V = WH.
The dimension of the non-negative feature matrix W is n*r, and its r columns are feature base images; the feature base images refer to the non-negative feature matrix W representing tongue features, and the non-negative feature matrix W forms a non-negative subspace.
The dimension of the weight matrix H is r*m, and each of its columns is an encoding.
S120: Use the EHMM model to identify whether the test image contains a face image; if it does, project the training images and the test image onto the non-negative subspace to obtain feature coefficients for each, use the nearest-neighbor criterion to compute the similarity between the feature coefficients corresponding to the training images and the test image, and extract the tongue-representing features in the test image whose similarity exceeds the similarity threshold as tongue features.
S130: After projection, the feature areas containing tongue features and the non-feature areas without tongue features are identified with different labels; the label set corresponds to the boundary information of the feature area, and the extreme values in the up, down, left, and right directions are extracted from the boundary information to determine the border containing the feature area.
In addition, a computer non-volatile readable storage medium is provided, which stores a computer program comprising program instructions that, when executed by a processor, implement any of the tongue image extraction methods described above.

The details of one or more embodiments of the present application are set forth in the drawings and description below. Other features and advantages of the application will become apparent from the description, the drawings, and the claims.
Description of the Drawings
The above features and technical advantages of the present application will become clearer and easier to understand from the following description of its embodiments with reference to the drawings.

Fig. 1 is a schematic flowchart of the tongue image extraction method according to an embodiment of the present application;

Fig. 2 is a first schematic diagram of the superstates and embedded states of the EHMM corresponding to the slices of an image according to an embodiment of the present application;

Fig. 3 is a second schematic diagram of the superstates and embedded states of the EHMM corresponding to the slices of an image according to an embodiment of the present application;

Fig. 4 is a third schematic diagram of the superstates and embedded states of the EHMM corresponding to the slices of an image according to an embodiment of the present application;

Fig. 5 is a schematic diagram of the hardware architecture of the electronic device according to an embodiment of the present application;

Fig. 6 is a block diagram of the tongue image extraction program according to an embodiment of the present application;

Fig. 7 is a schematic diagram of the linear regression model adjusting the frame according to an embodiment of the present application.
Detailed Description of the Embodiments
Embodiments of the tongue image extraction method, device, and computer non-volatile readable storage medium described in this application are described below with reference to the drawings.

In one embodiment, Fig. 1 is a schematic flowchart of the tongue image extraction method provided by an embodiment of the application, applied to an electronic device. The method comprises the following steps:
S110: train with the LNMF (local non-negative matrix factorization) algorithm to obtain feature basis images of different dimensions. For example, 1000 tongue images (images that contain a tongue and show its shape, color, and other characteristics) are used as the training image set, and the tongue images have been annotated in advance. Preferably, each tongue image is first compressed, for example to 56*64 pixels, then de-meaned and normalized, and the LNMF algorithm is trained to obtain feature basis images of different dimensions. A feature basis image is the non-negative feature matrix W representing the features of the tongue; W spans a non-negative subspace.

LNMF is an improvement on NMF. The LNMF algorithm decomposes the matrix V corresponding to the training images into the product of a feature matrix W and a weight matrix H, i.e. V = WH.

Here V is an n*m matrix, V = (V1, V2, ..., Vm); all the non-negative gray values of one image correspond to one column of V, so the data in V are the gray values of the training images.

The dimension of the feature matrix W is n*r, and its r columns are basis images.

The dimension of the weight matrix H is r*m, and each of its columns is an encoding that corresponds one-to-one to a tongue image in V; a training image can therefore be expressed as a linear combination of the basis images.
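For concreteness, the following is a minimal numpy sketch of the decomposition shapes; the sizes n, m, r and the random data are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

# Illustrative shapes only: n pixels per image, m training images, r basis images.
n, m, r = 56 * 64, 1000, 49            # assumed values for illustration

images = np.random.rand(m, 56, 64)     # stand-in for the annotated tongue images
V = images.reshape(m, n).T             # each column of V is one flattened image; V is n*m

W = np.abs(np.random.rand(n, r))       # non-negative feature matrix, n*r (basis images)
H = np.abs(np.random.rand(r, m))       # weight matrix, r*m (one encoding column per image)

assert (W @ H).shape == V.shape        # after training, V is approximated by W H
```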
S120: use an EHMM model to identify whether the test image contains a face image; if so, perform feature extraction on the test image. Specifically, the non-negative feature matrix W representing the tongue features spans a non-negative subspace. The training images and the test image are projected onto the non-negative subspace obtained from the training image set, yielding feature coefficients for each; the nearest-neighbor criterion is used to compute the similarity between the feature coefficients of the training images and those of the test image, and the features representing the tongue whose coefficient similarity exceeds the set threshold are extracted as tongue features, thereby screening out the images with tongue features from the test images. The tongue features include the shape, angle, color, and coating state of the tongue, as well as the positional relationship between the tongue and the facial organs.
S130: project the test image onto the non-negative subspace. Projection amounts to transforming the test image into the non-negative subspace; the result is still an image, now composed of the learned features. After projection, the feature regions containing tongue features and the non-feature regions without tongue features carry different labels, so the feature regions containing tongue features can be segmented out of the test image. The label set carries the boundary information of the feature region, and extracting the extreme values in the up, down, left, and right directions from this boundary information determines the minimal frame enclosing the feature region; a sketch of this rule follows below. A minimal frame is used because linear regression will later adjust its position to eliminate or reduce the position error. Non-feature regions and feature regions have different labels, for example 0 for non-feature regions and non-zero for feature regions; based on zero versus non-zero, the image region representing the tongue features can be segmented out of each test image with a frame. The method further comprises step S140: classify the features extracted from the test image with SVM classifiers, sending the extracted features to k SVM classifiers for recognition, where k equals the number of classes. The classes may be, for example, "tongue" and "non-tongue", or classes based on the pathological condition of the tongue, where a person's bodily condition may include damp-heat, yin deficiency, normal, excess heat, obstructed qi and blood, and blood stasis; the class with the highest score among the k SVM classifiers is taken as the classification result.
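As a sketch of the extreme-value rule for the minimal frame, assuming the projected labels are available as a 2-D integer mask (a representation the patent does not specify):

```python
import numpy as np

def minimal_bbox(label_mask: np.ndarray):
    """Smallest frame enclosing the feature region, where non-zero labels
    mark tongue-feature pixels and 0 marks non-feature pixels."""
    rows, cols = np.nonzero(label_mask)
    if rows.size == 0:
        return None                       # no tongue features found
    top, bottom = rows.min(), rows.max()  # extreme values in the up/down direction
    left, right = cols.min(), cols.max()  # extreme values in the left/right direction
    return left, top, right, bottom
```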
In one embodiment, the method further comprises step S150: adjust the frame position of the tongue image with a linear regression model. For each class, e.g. damp-heat, yin deficiency, normal, excess heat, obstructed qi and blood, and blood stasis, a separate linear regression model is trained; the input is the features of the image inside the frame, and the output is the translation values (horizontal and vertical) and scaling values of the frame. The translation and scaling values of the frame are computed by the linear regression model, and a loss function constrains the position error of the frame, so that the frame is continually adjusted to a suitable position.
As shown in Fig. 7, the linear regression model is given the original position P = (P_x, P_y, P_w, P_h), where P_x and P_y are the coordinates of the frame and P_w and P_h are its width and height. A mapping f is obtained by machine learning such that

f(P_x, P_y, P_w, P_h) = (Ĝ_x, Ĝ_y, Ĝ_w, Ĝ_h)

and the predicted position (Ĝ_x, Ĝ_y, Ĝ_w, Ĝ_h) ≈ the true position (G_x, G_y, G_w, G_h).
Assuming the translation is (Δx, Δy) with Δx = P_w·d_x(P) and Δy = P_h·d_y(P), then

Ĝ_x = P_w·d_x(P) + P_x    (1)

Ĝ_y = P_h·d_y(P) + P_y    (2)

Assuming the scaling is (S_w, S_h) with S_w = exp(d_w(P)) and S_h = exp(d_h(P)), then

Ĝ_w = P_w·exp(d_w(P))    (3)

Ĝ_h = P_h·exp(d_h(P))    (4)

Frame regression is learning accurate values of the four transformations d_x(P), d_y(P), d_w(P), d_h(P).
The input is P = (P_x, P_y, P_w, P_h) and the output is the predicted position (Ĝ_x, Ĝ_y, Ĝ_w, Ĝ_h). Transforming the original position into the true position G requires the true transformation values t_* = (t_x, t_y, t_w, t_h), where the true translation is (t_x, t_y) and the true scaling of the width and height is (t_w, t_h):

t_x = (G_x − P_x)/P_w    (5)

t_y = (G_y − P_y)/P_h    (6)

t_w = log(G_w/P_w)    (7)

t_h = log(G_h/P_h)    (8)
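Equations (5) to (8) can be computed directly; the following sketch assumes frames are given as (x, y, w, h) tuples, a convention not fixed by the patent:

```python
import math

def regression_targets(P, G):
    """True transformation t* = (t_x, t_y, t_w, t_h) from a proposed
    frame P = (Px, Py, Pw, Ph) to the ground truth G, per Eqs. (5)-(8)."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    tx = (Gx - Px) / Pw        # Eq. (5)
    ty = (Gy - Py) / Ph        # Eq. (6)
    tw = math.log(Gw / Pw)     # Eq. (7)
    th = math.log(Gh / Ph)     # Eq. (8)
    return tx, ty, tw, th
```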
The objective function is constructed as

d_*(P) = w_*ᵀ·K(P)

where w_* is the parameter to be learned (* stands for x, y, w, h, i.e. one objective function is set for each transformation), d_*(P) is the resulting transformation prediction, and K(P) is the feature vector of the feature region. To minimize the gap between the predicted transformation and the true transformation t_* = (t_x, t_y, t_w, t_h), the loss function Loss is constructed and minimized:

Loss = Σ_{i=1..N} ( t_*^i − w_*ᵀ·K(P^i) )²

where i indexes the i-th training sample and N is the number of samples.

Training on the samples minimizes the loss function and yields w_*, from which d_*(P), i.e. the values d_x(P), d_y(P), d_w(P), d_h(P), is obtained.
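A minimal sketch of this fit, assuming ordinary least squares (the patent does not name a solver) and feature vectors K(P) stacked row-wise:

```python
import numpy as np

def fit_bbox_regressor(K, t):
    """Learn one weight vector w* by minimizing
    sum_i (t_i - w*^T K(P_i))^2 over N samples.
    K: (N, d) matrix of feature vectors; t: (N,) true transformation values."""
    w, *_ = np.linalg.lstsq(K, t, rcond=None)
    return w

# One regressor per transformation: w_x = fit_bbox_regressor(K, t_x), and
# likewise for t_y, t_w, t_h.
```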
In one embodiment: NMF is a subspace projection method, but since the features extracted by the NMF algorithm are global, it places no constraint on the locality of the feature space. To strengthen the localization of the principal components of the feature matrix W, LNMF emphasizes the localization of the basic feature components in the decomposition of the original image. The LNMF algorithm is formulated as follows:
The objective function is constructed as

D(V‖WH) = Σ_{l=1..n} Σ_{i=1..m} [ V_{li}·ln( V_{li}/(WH)_{li} ) − V_{li} + (WH)_{li} ] + α·Σ_{j,k} (WᵀW)_{jk} − β·Σ_{j} (HHᵀ)_{jj}

where α and β are positive constants;

V, W, H ≥ 0;

‖W_j‖ = 1, where W_j denotes the j-th column vector of the feature basis matrix W, meaning every column of W is normalized;

V = [V_1, V_2, ..., V_i, ..., V_m] denotes the set of m training images, the column vector V_i denotes the i-th training image, and V_{ij} denotes the j-th gray value of the i-th image; each training image has size n, and V has size n*m.

W = [W_1, W_2, ..., W_j, ..., W_r] is the feature matrix, of size n*r;

H = [H_1, H_2, ..., H_j, ..., H_m] is the weight matrix, H_j being the j-th column vector of H, of size r*m.

W and H are updated iteratively by the following rules to minimize the objective function:

H_{ji} ← sqrt( H_{ji} · Σ_l W_{lj}·V_{li}/(WH)_{li} )

W_{lj} ← W_{lj} · ( Σ_i V_{li}·H_{ji}/(WH)_{li} ) / ( Σ_i H_{ji} )

W_{lj} ← W_{lj} / Σ_l W_{lj}

where i = 1, 2, ..., m; j = 1, 2, ..., r; l = 1, 2, ..., n, and W and H remain non-negative throughout the iteration.
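A numpy sketch of one round of these multiplicative updates; the small epsilon is an implementation detail added for numerical stability, not part of the formulas above:

```python
import numpy as np

def lnmf_step(V, W, H, eps=1e-9):
    """One iteration of the LNMF updates: refresh the encodings H,
    then the basis W, then renormalize each column of W."""
    WH = W @ H + eps
    H = np.sqrt(H * (W.T @ (V / WH)))                    # H_ji update
    WH = W @ H + eps
    W = W * ((V / WH) @ H.T) / (H.sum(axis=1) + eps)     # W_lj update
    return W / (W.sum(axis=0, keepdims=True) + eps), H   # column normalization
```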
In one embodiment, both the training images and the test images are first binarized. Binarization means setting the gray value of every pixel in the image to either 0 or 255, i.e. rendering the whole image in plain black and white, which allows the tongue region to be obtained more precisely. Specifically, the gray value of each pixel is set to 0 or 255 according to a set gray-level threshold; the threshold may be the midpoint of 0 to 255, with values below the threshold set to 0 and values greater than or equal to it set to 255.
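A one-line sketch of this thresholding, using 128 as the midpoint of 0 to 255:

```python
import numpy as np

def binarize(gray: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Pixels below the gray-level threshold become 0, the rest 255."""
    return np.where(gray < threshold, 0, 255).astype(np.uint8)
```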
In one embodiment, the test images are first classified with an EHMM (embedded hidden Markov model); specifically, classifying the images into the two categories "face" and "no face" improves recognition accuracy. The classification process comprises the following steps:

Select multiple feature points of the face to form a feature sequence.
Input the test image into the EHMM model. The EHMM model scans the test image from top to bottom and from left to right with a moving window. It first scans from left to right; each window position yields a set of feature vectors, a feature extraction of the face region at that position. After the feature vector of a window is computed, the window moves right by a fixed step and feature extraction continues; on reaching the right edge of the image, it moves to the next row and scans again from left to right. When the window reaches the lower right of the image, the scan ends and multiple sets of feature vectors have been obtained; together they form the observation sequence.
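A sketch of this raster scan; the window size and step are illustrative assumptions, and raw pixels stand in for the per-window features, which the patent leaves open:

```python
import numpy as np

def scan_windows(image, win=(8, 10), step=(4, 4)):
    """Scan the image top-to-bottom, left-to-right with a fixed-size
    window, yielding one feature vector per window position."""
    h, w = image.shape
    wh, ww = win
    rows = []
    for y in range(0, h - wh + 1, step[0]):           # top to bottom
        row = [image[y:y + wh, x:x + ww].ravel()      # left to right
               for x in range(0, w - ww + 1, step[1])]
        rows.append(np.stack(row))
    return rows    # observation sequence: one array of vectors per scan line
```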
The EHMM model contains a set of superstates; the number of superstates equals the number of vertical slices of the face, and each superstate encapsulates a set of embedded states whose number equals the number of horizontal slices of the face. The EHMM model scans the image with a fixed-size window from left to right and top to bottom; the facial features correspond to superstates from top to bottom and to embedded states from left to right. As shown in Fig. 2, the vertical superstates correspond to image slices of the forehead region, eye region, nose region, mouth region, and chin region. Viewed from top to bottom, the positional relationship of these regions is fixed; that is what all faces have in common. The individuality of a face in the vertical direction is reflected by the characteristics of each superstate (each region) and by the relations among the superstates. Viewed from left to right, the face is divided into the left face, left eye, between the eyes, right eye, and right face; this positional relationship is likewise fixed, and the individuality of a face in the horizontal direction is reflected by each embedded state and the relations among the embedded states.
The forward algorithm is used to compute the probability that the observation sequence matches the feature sequence formed by the multiple feature points of the face; if this probability exceeds the decision threshold, the tested image is considered to contain a face.
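For reference, a sketch of the forward algorithm for a flat HMM in the log domain; the EHMM applies the same recursion at two nested levels, which is omitted here for brevity:

```python
import numpy as np

def forward_log_likelihood(pi, A, log_b):
    """Forward algorithm for a flat HMM.
    pi: (S,) initial state probabilities; A: (S, S) transition matrix;
    log_b: (T, S) log-probability of each observation under each state."""
    alpha = np.log(pi + 1e-300) + log_b[0]
    for t in range(1, log_b.shape[0]):
        m = alpha.max()                                   # log-sum-exp guard
        alpha = np.log(np.exp(alpha - m) @ A + 1e-300) + m + log_b[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())            # log P(observations)
```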
In one embodiment, the training process of the EHMM model is as follows:
1) EHMM modeling: an EHMM can be defined as the triplet λ = (P_0, A_0, Λ), where Λ = {λ^(k), 1 ≤ k ≤ N_0} is the set of embedded HMMs of the superstates. The basic elements of the EHMM model include:

(1) the initial superstate probabilities P_0 = {π_{0,i}}, where π_{0,i} is the probability of superstate i at time 0, 1 ≤ i ≤ N_0, and N_0 is the number of superstates;

(2) the superstate transition matrix A_0 = {a_{0,ij}}, where a_{0,ij} is the probability of moving from superstate i to superstate j; in a left-to-right EHMM the only permitted transition is from a state to its successor, so the probability of transitioning back to a previous state is 0;
(3) λ^(k) = (Π_1^(k), A_1^(k), B^(k)), the parameter set of the k-th superstate, 1 ≤ k ≤ N_0;

where Π_1^(k) is the initial probability distribution of the embedded states;

A_1^(k) is the embedded-state transition probability matrix;

B^(k) is the observation probability matrix, in which b_j^(k)(O_{t0,t1}) denotes the probability that embedded state j of superstate k generates the observation O_{t0,t1}, the two indices t0 and t1 corresponding to the vertical and horizontal dimensions respectively, modeled by a mixture of Gaussians:

b_j^(k)(O_{t0,t1}) = Σ_{m=1..M_j^(k)} c_{jm}^(k) · N( O_{t0,t1}; μ_{jm}^(k), Σ_{jm}^(k) )

where M_j^(k) is the number of Gaussian mixtures, c_{jm}^(k) is the mixture coefficient of the m-th mixture component of embedded state j of superstate k, and N(O; μ_{jm}^(k), Σ_{jm}^(k)) is the Gaussian density with mean vector μ_{jm}^(k) and covariance matrix Σ_{jm}^(k).
2) Image segmentation: the training images are segmented uniformly; the observation sequence obtained from an image is divided evenly into N_0 vertical slices corresponding to the vertical superstates, and each vertical slice can in turn be divided from left to right among the embedded states.

3) Parameter initialization: after segmentation, the initial values of the model parameters are obtained from the state initialization and transition probabilities. K-means clustering is used to compute the observation probabilities for each state of the EHMM, where K is the number of Gaussian distributions per state; the observation probability density function of the observation vectors extracted in every embedded state can then be explained by the Gaussian mixture model. The state initialization rule for each superstate is: the initialization probability of the first state of each EHMM is set to 1.0 and that of all other states to 0.

4) Embedded Viterbi segmentation: after the first iteration step, the doubly embedded Viterbi algorithm replaces uniform segmentation; from the new segmentation and the event frequency counts, a new set of initialization and transition probabilities is determined.

5) Segmental K-means clustering: based on the segmentation result of step 4, K-means clustering computes the observation vectors corresponding to the new states and the new observation probability density functions; in the next iteration these values serve as the initial values for a new round of doubly embedded Viterbi segmentation.

6) Steps 4 and 5 are repeated until the change between successive iterations is smaller than the set convergence threshold.
In one embodiment, as shown in Fig. 3, the vertical slices of the face also include a hair region. Although not everyone has hair, the hair region provides an additional facial feature and does help recognize faces more precisely. The computation is essentially the same as above and is not repeated here.
In one embodiment, the description above maps vertical slices of the whole face to superstates and horizontal slices to embedded states. However, since the purpose of this application is tongue image extraction, face recognition may also use only some of the vertical slices for the superstates; for example, as shown in Fig. 4, only the chin and mouth regions correspond to superstates. There is then no need to recognize the other regions of the face; training can still identify whether a face is present, and the amount of computation is reduced. The computation is essentially the same as above and is not repeated here.
In one embodiment, after the embedded hidden Markov model has classified the test images into face and non-face, the face images are classified further, including recognizing the gender and age in the image. Although a person's age and gender cannot be determined accurately from the state of the tongue, the condition of the tongue is correlated with age and gender. For example, the number of taste buds (the papillae distributed over the tongue) differs across age groups and tends to decrease: a child has roughly 10,000 taste buds, the cells age gradually over time, and in old age only about 20% of the childhood taste buds remain. Moreover, the younger the person, the more tender and the redder the tongue; the older the person, the darker the tongue; and a woman's tongue is usually smaller than a man's. The test images can therefore first be classified by age and gender, and the tongue-region images then extracted from the images within each age-gender category. The images in a category are more strongly associated with the tongues that the category should actually contain; for example, if the tongue in an image belongs to an elderly person, the image is classified into the older category, and its tongue shows the characteristics an elderly tongue should have, such as fewer taste buds (recognized only approximately, of course) and a dull color, so it can be recognized faster, which amounts to reducing the model's computation. Of course, this requires training an LNMF model for the corresponding age group in advance: the images in the training set are first classified by age and gender, the images in each category are annotated as containing or not containing a tongue, forming age-group/gender/tongue annotations, and a separate LNMF model is trained for each category.
In detail: first, a CNN (convolutional neural network) model is obtained and trained to recognize the gender and age of a face. Suppose six categories are set by age group and gender, e.g. 0-20 male, 20-40 male, 40-70 male, 0-20 female, 20-40 female, and 40-70 female; the CNN is used to classify the training images into these six age-gender categories.

Then the images in the six categories are annotated; the label of each image is its age group, gender, and tongue annotation.

Then an LNMF model is trained for each of the six categories, yielding six LNMF models, one per category. For example, the LNMF model for 0-20 male tongues is used to recognize 0-20 male tongues, and the LNMF model for 20-40 female tongues is used to recognize 20-40 female tongues.

Then the previously trained CNN model is used to recognize the gender and age of the test images, and the test images are likewise classified by gender and age group into the same categories as the training images.

Then the tongue image region is extracted by the LNMF model trained for the corresponding age group, gender, and tongue.
Since each LNMF model is trained specifically to recognize tongue images of the corresponding age group and gender, and the test images are divided into the corresponding age-gender categories whose tongues share certain characteristics, recognition by the LNMF model is made easier. Moreover, dividing the test images into multiple categories recognized in parallel speeds up recognition. In addition, since each tongue image is annotated with an age group and gender, this helps the accuracy and speed of the later classification into damp-heat, yin deficiency, normal, excess heat, obstructed qi and blood, and blood stasis.
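A sketch of this routing; the classifier interface, the `extract_region` method, and the model container are assumptions for illustration, not interfaces defined by the patent:

```python
CATEGORIES = ["0-20-male", "20-40-male", "40-70-male",
              "0-20-female", "20-40-female", "40-70-female"]

def extract_tongue(image, age_gender_cnn, lnmf_models):
    """Route the test image to the LNMF model trained for its
    age-gender category, then extract the tongue region with it."""
    category = age_gender_cnn(image)                 # returns one of CATEGORIES
    return lnmf_models[category].extract_region(image)
```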
Refer to Fig. 5, a schematic diagram of the hardware architecture of an embodiment of the electronic device of the present application. In this embodiment, the electronic device 2 is a device capable of performing numerical computation and/or information processing automatically according to preset or stored instructions, for example a smartphone, tablet computer, notebook computer, desktop computer, or server. As shown in Fig. 5, the electronic device 2 comprises at least, but is not limited to, a memory 21, a processor 22, and a network interface 23, which can be communicatively connected to one another via a system bus. The memory 21 comprises at least one type of computer non-volatile readable storage medium, including flash memory, hard disks, multimedia cards, memory, magnetic disks, optical disks, and the like. In some embodiments the memory 21 is an internal storage unit of the electronic device 2, such as its hard disk or internal memory; in other embodiments it may be an external storage device of the electronic device 2, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card fitted to the electronic device 2. In this embodiment the memory 21 is generally used to store the operating system and the various application software installed on the electronic device 2, such as the tongue image extraction program code.
The processor 22 may in some embodiments be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. In this embodiment the processor 22 is used to run the program code stored in the memory 21 or to process data, for example to run the tongue image extraction program.

The network interface 23 may comprise a wireless or wired network interface and is generally used to establish communication connections between the electronic device 2 and other electronic devices.

Optionally, the electronic device 2 may further comprise a display, which may also be called a display screen or display unit.
The memory 21 containing the readable storage medium may include an operating system, a tongue image extraction program 50, and so on. When the processor 22 executes the tongue image extraction program 50 in the memory 21, the steps S110 to S130 described above are implemented and are not repeated here. In this embodiment, the tongue image extraction program 50 stored in the memory 21 can be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete the application. For example, Fig. 6 shows a schematic diagram of the program modules of the tongue image extraction program; in this embodiment the tongue image extraction program 50 can be divided into a matrix decomposition module 501, a tongue feature extraction module 502, and a tongue image segmentation module 503. The following description introduces the specific functions of these program modules.
The matrix decomposition module 501 is used to train with the LNMF (local non-negative matrix factorization) algorithm to obtain feature basis images of different dimensions. For example, 1000 tongue images (images that contain a tongue and show its shape, color, and other characteristics) are used as the training image set, and the tongue images have been annotated in advance.

LNMF is an improvement on NMF. The LNMF algorithm decomposes the matrix V corresponding to the training images into the product of a feature matrix W and a weight matrix H, i.e. V = WH.

Here V is an n*m matrix, V = (V1, V2, ..., Vm); all the non-negative gray values of one image correspond to one column of V, so the data in V are the gray values of the training images.

The dimension of the feature matrix W is n*r, and its r columns are basis images.

The dimension of the weight matrix H is r*m, and each of its columns is an encoding that corresponds one-to-one to a tongue image in V; a training image can therefore be expressed as a linear combination of the basis images.

NMF is a subspace projection method, but since the features it extracts are global, it places no constraint on the locality of the feature space. To strengthen the localization of the principal components of the feature matrix W, LNMF emphasizes the localization of the basic feature components in the decomposition of the original image. The LNMF formulas are as given above and are not repeated here.
The tongue feature extraction module 502 uses the EHMM model to identify whether the test image contains a face image and, if so, performs feature extraction on the test image. Specifically, the non-negative feature matrix W representing the tongue features spans a non-negative subspace; the training images and the test image are projected onto the non-negative subspace obtained from the training image set, yielding feature coefficients for each, and the nearest-neighbor criterion is used to compute the similarity between the feature coefficients of the training images and those of the test image, thereby extracting the features in the test image. That is, if the similarity of the feature coefficients is above the set threshold, the feature basis in the test image is a tongue, so the images with tongue features are screened out of the test images.
The tongue image segmentation module 503 is used to project the test image onto the non-negative subspace. Projection amounts to transforming the test image into the non-negative subspace; the result is still an image, now composed of the learned features, in which non-feature regions and feature regions carry different labels, for example 0 for non-feature regions and non-zero for feature regions. Based on zero versus non-zero, the image region representing the tongue features can be segmented out of each test image with a frame.
In one embodiment, a classification module 504 is further included, which classifies the features extracted from the test image with SVM classifiers, sending the extracted features to k SVM classifiers for recognition, where k equals the number of classes. The classes may be, for example, "tongue" and "non-tongue", or classes based on the pathological condition of the tongue, yielding the framed tongue image. Specifically, classification may follow the tongue image characteristics corresponding to different bodily conditions, which may include damp-heat, yin deficiency, normal, excess heat, obstructed qi and blood, and blood stasis; the class with the highest score among the k SVM classifiers is taken as the classification result.
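A sketch of the winner-takes-all decision over the k per-class SVMs; the `decision_function` interface follows the scikit-learn convention and is an assumption here, not an interface named by the patent:

```python
import numpy as np

def classify(feature, svms, classes):
    """Score the feature with one SVM per class and return the class
    whose classifier gives the highest score."""
    scores = [svm.decision_function([feature])[0] for svm in svms]
    return classes[int(np.argmax(scores))]
```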
In one embodiment, a frame adjustment module 505 is further included, which adjusts the frame position of the tongue image with a linear regression model. For each class, e.g. damp-heat, yin deficiency, normal, excess heat, obstructed qi and blood, and blood stasis, a separate linear regression model is trained; the input is the features of the image inside the frame, and the output is the translation values (horizontal and vertical) and scaling values of the frame. The translation and scaling values of the frame are computed by the linear regression model, and a loss function constrains the position error of the frame, so that the frame is continually adjusted to a suitable position.
In one embodiment, a binarization module 506 is further included, which first binarizes both the training images and the test images (i.e. sets the gray value of each pixel to 0 or 255, rendering the whole image in plain black and white). Because a color image (e.g. an RGB image) obtains its range of colors by varying and superimposing the red (R), green (G), and blue (B) channels, the tongue region obtained from it contains many holes (missed areas), whereas a black-and-white image has only a single channel; a single channel is more conducive to optimizing the model than three channels and allows the tongue region to be obtained more precisely.
In one embodiment, a face recognition module 507 is further included, which first classifies the test images with an EHMM (embedded hidden Markov model); specifically, classifying the images into the two categories "face" and "no face" improves recognition accuracy. Face recognition with the EHMM model comprises the following steps:

Select multiple feature points of the face to form a feature sequence.

Input the test image into the EHMM model, which scans the test image from top to bottom and from left to right with a moving window. The EHMM model contains a set of superstates whose number equals the number of vertical slices of the face, and each superstate encapsulates a set of embedded states whose number equals the number of horizontal slices of the face. The EHMM model scans the image with a fixed-size window from left to right and top to bottom (the facial features correspond to superstates from top to bottom and to embedded states from left to right). For example, it first scans from left to right; each window position yields a set of feature vectors, a feature extraction of the face region at that position. After the feature vector of a window is computed, the window moves right by a fixed step and feature extraction continues; on reaching the right edge of the image, it moves to the next row and scans again from left to right. When the window reaches the lower right of the image, the scan ends and multiple sets of feature vectors have been obtained; together they form the observation sequence.

The forward algorithm is used to compute the probability that the observation sequence matches the feature sequence formed by the multiple feature points of the face; if this probability exceeds the decision threshold, the tested image is considered to contain a face.
In one embodiment, the training process of the EHMM model is the same as steps 1) to 6) described above and is not repeated here.
In one embodiment, a reclassification module 508 is further included, which, after the embedded hidden Markov model has classified the test images into face and non-face, further classifies the face images, including recognizing the gender and age in the image.
The specific process is the same as described above: the trained CNN model classifies the training and test images into the six age-gender categories, a separate LNMF model is trained for each category on images labeled with age group, gender, and tongue, and the tongue image region is extracted by the LNMF model corresponding to each category; this is not repeated here.
In addition, an embodiment of the present application also provides a tongue image extraction device comprising a matrix decomposition module 501, a tongue feature extraction module 502, and a tongue image segmentation module 503.

The matrix decomposition module 501 is used to convert training images containing tongues into a matrix V, in which all the non-negative gray values of one image correspond to one column of V, and to train with the LNMF algorithm, decomposing V into the product of a non-negative feature matrix W and a weight matrix H, i.e. V = WH. The dimension of the non-negative feature matrix W is n*r, and its r columns are feature basis images; a feature basis image is the non-negative feature matrix W representing tongue features, and W spans a non-negative subspace. The dimension of the weight matrix H is r*m, and each of its columns is an encoding.

The tongue feature extraction module 502 uses the EHMM model to identify whether the test image contains a face image and, if so, projects the training images and the test image onto the non-negative subspace to obtain their feature coefficients, uses the nearest-neighbor criterion to compute the similarity between the feature coefficients of the training images and those of the test image, and extracts the features from test images whose similarity exceeds the similarity threshold as tongue features.

The tongue image segmentation module 503 distinguishes the feature regions containing tongue features from the non-feature regions without tongue features by their different labels, and determines the minimal frame enclosing the feature region by reading the labels, thereby segmenting the feature region representing the tongue features out of the test image.
在其中一个实施例中,还包括分类模块504,分类模块504用于使用SVM分类器对测试图像中提取的特征进行分类,将提取到的特征送到k个svm分类器中识别,k的取值与类别数相等。In one of the embodiments, a classification module 504 is further included. The classification module 504 is used to classify the features extracted from the test image using an SVM classifier, and send the extracted features to k svm classifiers for identification, and the value of k The value is equal to the number of categories.
在其中一个实施例中,还包括边框调整模块505,边框调整模块505用于通过线性回归模型调整舌头图像的边框位置,对于每一个类,例如,湿热、阴虚、正常、热盛、气血不通、血瘀分别训练一个线性回归模型,输入为边框中的图像的特征,而输出为边框的平移(左右平移和上下平移)值、缩放值。通过线性回归模型计算得到边框的平移和缩放值,并利用损失函数约束边框的位置误差,从而不断调整边框移动到合适的位置。In one of the embodiments, a frame adjustment module 505 is further included. The frame adjustment module 505 is used to adjust the frame position of the tongue image through a linear regression model. For each category, for example, damp heat, yin deficiency, normal, heat and blood, Train a linear regression model for the unreasonable and blood stasis respectively. The input is the characteristics of the image in the frame, and the output is the translation (left-right translation and up-down translation) value and zoom value of the border. The linear regression model is used to calculate the translation and zoom values of the border, and the loss function is used to constrain the position error of the border, so as to continuously adjust the border to move to a suitable position.
在其中一个实施例中,还包括二值化模块506,二值化模块506用于对训练图像和测试图像都先进行二值化。In one of the embodiments, a binarization module 506 is further included. The binarization module 506 is used to perform binarization on both the training image and the test image.
In one embodiment, the device further includes a face recognition module 507, which first classifies the test images with an EHMM (embedded hidden Markov model); specifically, the images are sorted into two categories, "contains a face" and "contains no face", which improves the recognition accuracy.
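The likelihood score behind this face/no-face decision comes from the HMM forward algorithm. The sketch below shows only the flat forward pass for a discrete-emission HMM; an EHMM nests a second such recursion inside each super-state, which is omitted here for brevity:

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """log P(obs | model) for a discrete HMM.
    pi: (S,) initial state probabilities; A: (S, S) transition matrix;
    B: (S, K) emission probabilities over K symbols; obs: symbol indices."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t in obs[1:]:
        alpha = (alpha @ A) * B[:, t]
        scale = alpha.sum()           # rescale to avoid numeric underflow
        log_p += np.log(scale)
        alpha /= scale
    return log_p + np.log(alpha.sum())

# Decision rule of the description: the image is taken to contain a face
# when this likelihood exceeds the decision threshold.
```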
In addition, an embodiment of the present application further provides a computer non-volatile readable storage medium, which may be any one of, or any combination of, a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer non-volatile readable storage medium stores a tongue image extraction program 50 and the like; when executed by the processor 22, the tongue image extraction program 50 implements the following operations:
S110: Train with the LNMF (local non-negative matrix factorization) algorithm to obtain feature basis images of different dimensions. For example, 1000 tongue images (images that contain a tongue and show its shape, color, and other characteristics) are used as the training image set, and the tongue images have been annotated in advance. Preferably, each tongue image may first be compressed, for example to 56*64 pixels, and de-meaned and normalized; the LNMF algorithm is then trained to obtain feature basis images of different dimensions, where a feature basis image refers to the non-negative feature matrix W representing the tongue's features, and this non-negative feature matrix W spans a non-negative subspace.
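The compression, de-meaning, and normalization of a single training image might look as follows; Pillow is an assumed imaging library here, and applying the column-wise scheme per image mirrors the fact that each image is one column of V:

```python
import numpy as np
from PIL import Image

def preprocess(path, size=(56, 64)):
    """Compress a tongue image to 56*64 pixels, de-mean it, and min-max
    normalize it (one column of V, per the description and claim 5)."""
    gray = np.asarray(Image.open(path).convert('L').resize(size), float)
    x = gray.ravel()
    x = x - x.mean()                              # subtract the column mean
    return (x - x.min()) / (x.max() - x.min())    # (x - min) / (max - min)
```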
LNMF is an improvement on NMF; the LNMF algorithm factorizes the matrix V corresponding to the training images into the product of a feature matrix W and a weight matrix H, i.e., V=WH.
V is an n*m matrix, V=(V1, V2, ..., Vm); all of the non-negative gray values of one image correspond to one column of V, so the data in V are the gray values of the training images.
The feature matrix W has dimensions n*r, and its r columns are the basis images.
The weight matrix H has dimensions r*m; each of its columns is an encoding that corresponds one-to-one with a tongue image in V, so a training image can be expressed as a linear combination of the basis images.
S120: Use the EHMM model to determine whether the test image contains a face image; if it does, perform feature extraction on the test image. Specifically, the non-negative feature matrix W representing the tongue's features spans a non-negative subspace; the training images and the test image are each projected onto the non-negative subspace obtained from the training image set to obtain their respective feature coefficients, and the nearest-neighbor criterion measures the similarity between the feature coefficients of the training images and those of the test image, from which the features in the test image are extracted. That is, if the similarity of the feature coefficients exceeds the set threshold, the feature basis present in the test image is a tongue, so images with tongue features can be selected from among the test images.
S130: Project the test image onto the non-negative subspace; the projection transforms the test image into the non-negative subspace, where it is still an image, namely an image composed of the learned features. Different labels mark the feature regions containing tongue features and the non-feature regions without them, and reading the labels determines the minimum bounding box enclosing a feature region, so the feature region containing tongue features is segmented from the test image. For example, if non-feature regions are labeled 0 and feature regions non-zero, the image region representing the tongue features can be cut out of each test image with a bounding box according to the 0/non-zero labels.
The specific implementation of the computer non-volatile readable storage medium of the present application is substantially the same as that of the tongue image extraction method and the electronic device 2 described above, and is not repeated here.
The above are merely preferred embodiments of the present application and are not intended to limit it; for those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within its scope of protection.

Claims (20)

  1. A tongue image extraction method, applied to an electronic device, characterized in that the method comprises the following steps:
    S110: converting training images containing tongues into a matrix V, where all of the non-negative gray values of one image correspond to one column of V, and training with the LNMF algorithm to factorize the matrix V into the product of a non-negative feature matrix W and a weight matrix H, i.e., V=WH;
    the non-negative feature matrix W having dimensions n*r, its r columns being feature basis images, where a feature basis image refers to the non-negative feature matrix W representing tongue features, and the non-negative feature matrix W spans a non-negative subspace;
    the weight matrix H having dimensions r*m, each of its columns being an encoding;
    S120: using an EHMM model to determine whether a test image contains a face image and, if so, projecting the training images and the test image onto the non-negative subspace to obtain their respective feature coefficients, measuring the similarity between the feature coefficients of the training images and those of the test image with the nearest-neighbor criterion, and extracting the tongue-representing features of test images whose similarity exceeds a similarity threshold as tongue features;
    S130: after the projection, marking the feature regions containing tongue features and the non-feature regions without them with different labels, where the label set corresponds to the boundary information of the feature regions, and extracting the extreme values in the up, down, left, and right directions from the boundary information to determine the bounding box enclosing the feature region.
  2. The tongue image extraction method according to claim 1, characterized in that
    the method further comprises step S150: computing the translation and scaling values of the bounding box with a linear regression model, constraining the position error of the box with a loss function, and adjusting the box until it moves to a suitable position.
  3. The tongue image extraction method according to claim 1, characterized in that
    before step S110, both the training images and the test images are first binarized, the gray value of every pixel being set to only 0 or 255 according to a set gray-value threshold.
  4. The tongue image extraction method according to claim 1, characterized in that
    using the EHMM model to recognize the images containing faces among the test images comprises the following steps:
    selecting a plurality of feature points of a face to form a feature sequence;
    inputting a test image into the EHMM model, which scans it from top to bottom and from left to right with a moving window to obtain multiple groups of feature vectors, the multiple groups of feature vectors forming an observation sequence;
    using the forward algorithm to compute the probability that the observation sequence is similar to the feature sequence formed by the plurality of feature points of the face, the detected image being considered to contain a face when the similarity probability exceeds a decision threshold, wherein the EHMM model comprises a set of super-states whose number equals the number of vertical slices of the face, each super-state encapsulating a corresponding set of embedded states whose number equals the number of horizontal slices of the face.
  5. The tongue image extraction method according to claim 1, characterized in that, in S110, the training images are first compressed, de-meaned, and normalized, and the LNMF algorithm is then trained to obtain the feature basis images, wherein
    de-meaning subtracts from every element of each column of the matrix V the mean of that column; and
    normalization sets every element of a column to the ratio of the difference between that element and the column minimum to the difference between the column maximum and the column minimum.
  6. The tongue image extraction method according to claim 4, characterized in that
    after the images containing faces are recognized, the test images are classified into corresponding age-group/gender categories, and tongue images are extracted within each age-group/gender category with an LNMF model trained for that category, comprising the following steps:
    obtaining a CNN model, the CNN model having been trained to determine gender and age group and classifying the training images into the age-group/gender categories;
    annotating the training images in every age-group/gender category, each training image obtaining age-group, gender, and tongue labels;
    training LNMF models separately according to the age-group, gender, and tongue labels to obtain the corresponding trained LNMF models;
    recognizing the gender and age group of the test images with the trained CNN model and classifying the test images by gender and age group; and
    extracting the tongue images with the LNMF model trained for the corresponding age group, gender, and tongue.
  7. A tongue image extraction device, characterized by comprising:
    a matrix decomposition module configured to convert training images containing tongues into a matrix V, where all of the non-negative gray values of one image correspond to one column of V, and to train with the LNMF algorithm to factorize the matrix V into the product of a non-negative feature matrix W and a weight matrix H, i.e., V=WH, the non-negative feature matrix W having dimensions n*r with its r columns being feature basis images, a feature basis image referring to the non-negative feature matrix W representing tongue features, the non-negative feature matrix W spanning a non-negative subspace, and the weight matrix H having dimensions r*m with each of its columns being an encoding;
    a tongue feature extraction module configured to use an EHMM model to determine whether a test image contains a face image and, if so, to project the training images and the test image onto the non-negative subspace to obtain their respective feature coefficients, to measure the similarity between the feature coefficients of the training images and those of the test image with the nearest-neighbor criterion, and to extract the tongue-representing features of test images whose similarity exceeds a similarity threshold as tongue features; and
    a tongue image segmentation module configured to mark the feature regions containing tongue features and the non-feature regions without them with different labels, where the label set corresponds to the boundary information of the feature regions, and to extract the extreme values in the up, down, left, and right directions from the boundary information to determine the bounding box enclosing the feature region.
  8. The tongue image extraction device according to claim 7, characterized in that
    the device further comprises a bounding-box adjustment module configured to compute the translation and scaling values of the bounding box with a linear regression model, to constrain the position error of the box with a loss function, and to adjust the box until it moves to a suitable position.
  9. The tongue image extraction device according to claim 7, characterized in that
    the device further comprises a face recognition module configured to use the EHMM model to recognize the images containing faces among the test images, comprising the following steps:
    selecting a plurality of feature points of a face to form a feature sequence;
    inputting a test image into the EHMM model, which scans it from top to bottom and from left to right with a moving window to obtain multiple groups of feature vectors, the multiple groups of feature vectors forming an observation sequence;
    using the forward algorithm to compute the probability that the observation sequence is similar to the feature sequence formed by the plurality of feature points of the face, the detected image being considered to contain a face when the similarity probability exceeds a decision threshold, wherein the EHMM model comprises a set of super-states whose number equals the number of vertical slices of the face, each super-state encapsulating a corresponding set of embedded states whose number equals the number of horizontal slices of the face.
  10. The tongue image extraction device according to claim 7, characterized in that
    the device further comprises a binarization module configured to first binarize both the training images and the test images, the gray value of every pixel being set to only 0 or 255 according to a set gray-value threshold.
  11. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing a tongue image extraction program which, when executed by the processor, implements the following steps:
    S110: converting training images containing tongues into a matrix V, where all of the non-negative gray values of one image correspond to one column of V, and training with the LNMF algorithm to factorize the matrix V into the product of a non-negative feature matrix W and a weight matrix H, i.e., V=WH;
    the non-negative feature matrix W having dimensions n*r, its r columns being feature basis images, where a feature basis image refers to the non-negative feature matrix W representing tongue features, and the non-negative feature matrix W spans a non-negative subspace;
    the weight matrix H having dimensions r*m, each of its columns being an encoding;
    S120: using an EHMM model to determine whether a test image contains a face image and, if so, projecting the training images and the test image onto the non-negative subspace to obtain their respective feature coefficients, measuring the similarity between the feature coefficients of the training images and those of the test image with the nearest-neighbor criterion, and extracting the tongue-representing features of test images whose similarity exceeds a similarity threshold as tongue features;
    S130: after the projection, marking the feature regions containing tongue features and the non-feature regions without them with different labels, where the label set corresponds to the boundary information of the feature regions, and extracting the extreme values in the up, down, left, and right directions from the boundary information to determine the bounding box enclosing the feature region.
  12. The electronic device according to claim 11, characterized in that the tongue image extraction program, when executed by the processor, further implements:
    before step S110, binarizing both the training images and the test images, the gray value of every pixel being set to only 0 or 255 according to a set gray-value threshold.
  13. The electronic device according to claim 11, characterized in that the tongue image extraction program, when executed by the processor, further implements:
    step S150: computing the translation and scaling values of the bounding box with a linear regression model, constraining the position error of the box with a loss function, and adjusting the box until it moves to a suitable position.
  14. The electronic device according to claim 12, characterized in that the tongue image extraction program, when executed by the processor, further implements:
    using the EHMM model to recognize the images containing faces among the test images, comprising the following steps:
    selecting a plurality of feature points of a face to form a feature sequence;
    inputting a test image into the EHMM model, which scans it from top to bottom and from left to right with a moving window to obtain multiple groups of feature vectors, the multiple groups of feature vectors forming an observation sequence;
    using the forward algorithm to compute the probability that the observation sequence is similar to the feature sequence formed by the plurality of feature points of the face, the detected image being considered to contain a face when the similarity probability exceeds a decision threshold, wherein the EHMM model comprises a set of super-states whose number equals the number of vertical slices of the face, each super-state encapsulating a corresponding set of embedded states whose number equals the number of horizontal slices of the face.
  15. The electronic device according to claim 12, characterized in that the tongue image extraction program, when executed by the processor, further implements:
    in S110, first compressing the training images, de-meaning and normalizing them, and then training with the LNMF algorithm to obtain the feature basis images, wherein
    de-meaning subtracts from every element of each column of the matrix V the mean of that column; and
    normalization sets every element of a column to the ratio of the difference between that element and the column minimum to the difference between the column maximum and the column minimum.
  16. The electronic device according to claim 12, characterized in that the tongue image extraction program, when executed by the processor, further implements:
    after the images containing faces are recognized, classifying the test images into corresponding age-group/gender categories and extracting tongue images within each age-group/gender category with an LNMF model trained for that category, comprising the following steps:
    obtaining a CNN model, the CNN model having been trained to determine gender and age group and classifying the training images into the age-group/gender categories;
    annotating the training images in every age-group/gender category, each training image obtaining age-group, gender, and tongue labels;
    training LNMF models separately according to the age-group, gender, and tongue labels to obtain the corresponding trained LNMF models;
    recognizing the gender and age group of the test images with the trained CNN model and classifying the test images by gender and age group; and
    extracting the tongue images with the LNMF model trained for the corresponding age group, gender, and tongue.
  17. A computer non-volatile readable storage medium, characterized in that the computer non-volatile readable storage medium stores a computer program, the computer program comprising program instructions which, when executed by a processor, implement the tongue image extraction method according to claim 1.
  18. The computer non-volatile readable storage medium according to claim 17, characterized in that the program instructions, when executed by the processor, further implement:
    using the EHMM model to recognize the images containing faces among the test images, comprising the following steps:
    selecting a plurality of feature points of a face to form a feature sequence;
    inputting a test image into the EHMM model, which scans it from top to bottom and from left to right with a moving window to obtain multiple groups of feature vectors, the multiple groups of feature vectors forming an observation sequence;
    using the forward algorithm to compute the probability that the observation sequence is similar to the feature sequence formed by the plurality of feature points of the face, the detected image being considered to contain a face when the similarity probability exceeds a decision threshold, wherein the EHMM model comprises a set of super-states whose number equals the number of vertical slices of the face, each super-state encapsulating a corresponding set of embedded states whose number equals the number of horizontal slices of the face.
  19. The computer non-volatile readable storage medium according to claim 17, characterized in that the program instructions, when executed by the processor, further implement:
    in S110, first compressing the training images, de-meaning and normalizing them, and then training with the LNMF algorithm to obtain the feature basis images, wherein
    de-meaning subtracts from every element of each column of the matrix V the mean of that column; and
    normalization sets every element of a column to the ratio of the difference between that element and the column minimum to the difference between the column maximum and the column minimum.
  20. The computer non-volatile readable storage medium according to claim 17, characterized in that the program instructions, when executed by the processor, further implement:
    after the images containing faces are recognized, classifying the test images into corresponding age-group/gender categories and extracting tongue images within each age-group/gender category with an LNMF model trained for that category, comprising the following steps:
    obtaining a CNN model, the CNN model having been trained to determine gender and age group and classifying the training images into the age-group/gender categories;
    annotating the training images in every age-group/gender category, each training image obtaining age-group, gender, and tongue labels;
    training LNMF models separately according to the age-group, gender, and tongue labels to obtain the corresponding trained LNMF models;
    recognizing the gender and age group of the test images with the trained CNN model and classifying the test images by gender and age group; and
    extracting the tongue images with the LNMF model trained for the corresponding age group, gender, and tongue.
PCT/CN2019/118413 2019-08-09 2019-11-14 Tongue image extraction method and device, and a computer readable storage medium WO2020215697A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
SG11202008404RA SG11202008404RA (en) 2019-08-09 2019-11-14 Method and device for tongue image extraction and computer readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910733855.8 2019-08-09
CN201910733855.8A CN110569879B (en) 2019-08-09 2019-08-09 Tongue image extraction method, tongue image extraction device and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2020215697A1 true WO2020215697A1 (en) 2020-10-29

Family

ID=68774935

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118413 WO2020215697A1 (en) 2019-08-09 2019-11-14 Tongue image extraction method and device, and a computer readable storage medium

Country Status (3)

Country Link
CN (1) CN110569879B (en)
SG (1) SG11202008404RA (en)
WO (1) WO2020215697A1 (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8805653B2 (en) * 2010-08-11 2014-08-12 Seiko Epson Corporation Supervised nonnegative matrix factorization
CN102393910B (en) * 2011-06-29 2013-04-24 浙江工业大学 Human behavior identification method based on non-negative matrix decomposition and hidden Markov model
CN102592148A (en) * 2011-12-29 2012-07-18 华南师范大学 Face identification method based on non-negative matrix factorization and a plurality of distance functions
CN105335732B (en) * 2015-11-17 2018-08-21 西安电子科技大学 Based on piecemeal and differentiate that Non-negative Matrix Factorization blocks face identification method
CN105893954B (en) * 2016-03-30 2019-04-23 深圳大学 A kind of Non-negative Matrix Factorization face identification method and system based on nuclear machine learning
CN107451545B (en) * 2017-07-15 2019-11-15 西安电子科技大学 The face identification method of Non-negative Matrix Factorization is differentiated based on multichannel under soft label
CN108268872B (en) * 2018-02-28 2021-06-08 电子科技大学 Robust nonnegative matrix factorization method based on incremental learning
CN109829481B (en) * 2019-01-04 2020-10-30 北京邮电大学 Image classification method and device, electronic equipment and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060215854A1 (en) * 2005-03-23 2006-09-28 Kaoru Suzuki Apparatus, method and program for processing acoustic signal, and recording medium in which acoustic signal, processing program is recorded
CN105335719A (en) * 2015-10-29 2016-02-17 北京汉王智远科技有限公司 Living body detection method and device
CN108198576A (en) * 2018-02-11 2018-06-22 华南理工大学 A kind of Alzheimer's disease prescreening method based on phonetic feature Non-negative Matrix Factorization
CN108415883A (en) * 2018-02-13 2018-08-17 中国科学院西安光学精密机械研究所 Convex non-negative matrix factorization method based on subspace clustering
CN109657611A (en) * 2018-12-19 2019-04-19 河南科技大学 A kind of adaptive figure regularization non-negative matrix factorization method for recognition of face

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536986A (en) * 2021-06-29 2021-10-22 南京逸智网络空间技术创新研究院有限公司 Representative feature-based dense target detection method in remote sensing image
CN113808075A (en) * 2021-08-04 2021-12-17 上海大学 Two-stage tongue picture identification method based on deep learning
CN113947140A (en) * 2021-10-13 2022-01-18 北京百度网讯科技有限公司 Training method of face feature extraction model and face feature extraction method
CN114943735A (en) * 2022-07-22 2022-08-26 浙江省肿瘤医院 Tongue picture image-based tumor prediction system and method and application thereof
CN114972354A (en) * 2022-08-02 2022-08-30 济宁金筑新型建材科技有限公司 Image processing-based autoclaved aerated concrete block production control method and system

Also Published As

Publication number Publication date
SG11202008404RA (en) 2020-10-29
CN110569879B (en) 2024-03-15
CN110569879A (en) 2019-12-13

Similar Documents

Publication Publication Date Title
WO2020215697A1 (en) Tongue image extraction method and device, and a computer readable storage medium
Dai et al. A 3d morphable model of craniofacial shape and texture variation
US7254256B2 (en) Method and computer program product for locating facial features
Ding et al. Features versus context: An approach for precise and detailed detection and delineation of faces and facial features
WO2020001082A1 (en) Face attribute analysis method based on transfer learning
Wan et al. An accurate active shape model for facial feature extraction
Kotsia et al. Texture and shape information fusion for facial expression and facial action unit recognition
CN100410963C (en) Two-dimensional linear discrimination human face analysis identificating method based on interblock correlation
Sahbi et al. A Hierarchy of Support Vector Machines for Pattern Detection.
US20140307063A1 (en) Method and apparatus for generating viewer face-tracing information, recording medium for same, and three-dimensional display apparatus
CN113705371B (en) Water visual scene segmentation method and device
CN110991258B (en) Face fusion feature extraction method and system
JP2021193610A (en) Information processing method, information processing device, electronic apparatus and storage medium
Zhou et al. Shape-appearance-correlated active appearance model
Barbu An automatic face detection system for RGB images
Tran et al. Disentangling geometry and appearance with regularised geometry-aware generative adversarial networks
Chen et al. 3D shape constraint for facial feature localization using probabilistic-like output
Xu et al. Face recognition using spatially constrained earth mover's distance
RU2768797C1 (en) Method and system for determining synthetically modified face images on video
CN110968735B (en) Unsupervised pedestrian re-identification method based on spherical similarity hierarchical clustering
Hong et al. Efficient facial landmark localization using spatial–contextual AdaBoost algorithm
Liu et al. Multi-view face alignment guided by several facial feature points
Yang Hand gesture recognition and face detection in images
Nie et al. The facial features analysis method based on human star-structured model
CN111444860A (en) Expression recognition method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19926126

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19926126

Country of ref document: EP

Kind code of ref document: A1