WO2024098763A1 - Text-operation-diagram mutual retrieval method and model training method, apparatus, device, and medium (文本操作图互检方法及模型训练方法、装置、设备、介质) - Google Patents

Text-operation-diagram mutual retrieval method and model training method, apparatus, device, and medium

Info

Publication number
WO2024098763A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
feature
recipe
current
component
Application number
PCT/CN2023/101222
Other languages
English (en)
French (fr)
Inventor
李仁刚
王立
范宝余
郭振华
Original Assignee
苏州元脑智能科技有限公司
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024098763A1


Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F16/50 of still image data
              • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
                • G06F16/583 using metadata automatically derived from the content
                  • G06F16/5846 using extracted text
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
              • G06N3/04 Architecture, e.g. interconnection topology
                • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
              • G06N3/08 Learning methods
                • G06N3/084 Backpropagation, e.g. using gradient descent
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V20/00 Scenes; Scene-specific elements
            • G06V20/60 Type of objects
              • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
                • G06V20/63 Scene text, e.g. street names
          • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
            • G06V30/10 Character recognition
              • G06V30/19 Recognition using electronic means
                • G06V30/19007 Matching; Proximity measures
                • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
                  • G06V30/19127 Extracting features by transforming the feature space, e.g. multidimensional scaling; Mappings, e.g. subspace methods
                  • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
                  • G06V30/19173 Classification techniques
                  • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
          • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
            • Y02P90/30 Computing systems specially adapted for manufacturing

Definitions

  • the present application relates to the field of information retrieval technology, and in particular to a method and device for mutual checking of text operation diagrams, a method and device for training a mutual checking model of text operation diagrams, an electronic device, and a non-volatile readable storage medium.
  • multimedia data has shown explosive growth: text and other multi-modal data such as news reports, Weibo and Taobao comments, and WeChat chat records; image data such as emoticons, article illustrations, mobile phone photos, and medical images; video data such as Douyin and Kuaishou media and city camera footage; and accompanying audio information such as WeChat voice messages and video dubbing.
  • Resnet: Residual Network;
  • Bert: Bidirectional Encoder Representations from Transformers.
  • the Resnet-Bert network model uses the Resnet model for classified retrieval of image data, video data and audio data, and uses the Bert model for classified retrieval of text data.
  • the present application provides a text operation diagram mutual checking method and device, a method and device for training a text operation diagram mutual checking model, an electronic device, and a non-volatile readable storage medium, which can realize high-precision mutual retrieval between recipe text and recipe step diagram.
  • a first aspect of an embodiment of the present application provides a method for training a text operation diagram mutual inspection model, comprising:
  • Pre-constructing a text operation diagram mutual inspection model including a text information feature encoder and a step diagram feature encoder, and generating recipe ingredient information by analyzing recipe ingredients contained in each recipe sample in a target recipe text sample set;
  • for each group of training samples in the training sample set, the text information feature encoder is used to extract the principal component features and the recipe mean features of the current text sample, and the virtual ingredient labels of the principal component features are actively learned based on the recipe ingredient information; the recipe mean features are determined from all text features of the current text sample extracted by the text information feature encoder;
  • based on the virtual ingredient labels and the ingredient prediction confidence threshold, it is determined whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature; the step diagram feature encoder is used to extract the current recipe image features of the current operation diagram sample corresponding to the current text sample;
  • the current recipe text features and the current recipe image features are input into the text operation graph mutual inspection model for model training.
  • determining whether the current recipe text feature of the current text sample is a main component feature or a recipe mean feature includes:
  • Each element in the virtual ingredient label is used to indicate the confidence level of the principal component corresponding to the recipe ingredient information contained in the current text sample;
  • a target ingredient whose confidence is greater than or equal to the ingredient confidence threshold is determined from the virtual ingredient label, and the principal component probability prediction confidence is determined from the confidences of the target ingredients; based on the numerical relationship between the principal component probability prediction confidence and the ingredient prediction confidence threshold, it is determined whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature.
  • determining whether the current recipe text feature of the current text sample is a principal component feature or a recipe mean feature includes:
  • If the current output control mode is the binary switching mode, determine whether the principal component probability prediction confidence is greater than the ingredient prediction confidence threshold;
  • if it is greater, the current recipe text feature of the current text sample is the principal component feature;
  • otherwise, the current recipe text feature of the current text sample is the recipe mean feature.
  • determining whether the current recipe text feature of the current text sample is a principal component feature or a recipe mean feature includes:
  • If the current output control mode is the hybrid switching mode, compare the principal component probability prediction confidence with the ingredient prediction confidence threshold and the preset confidence limit threshold;
  • if the principal component probability prediction confidence is greater than the ingredient prediction confidence threshold, the current recipe text feature of the current text sample is the principal component feature;
  • if it lies between the confidence limit threshold and the ingredient prediction confidence threshold, the current recipe text feature of the current text sample is the feature sum of the recipe mean feature and the principal component feature;
  • if it is less than the confidence limit threshold, the current recipe text feature of the current text sample is the recipe mean feature.
  • Optionally, when the current output control mode is the hybrid switching mode and the principal component probability prediction confidence lies between the confidence limit threshold and the ingredient prediction confidence threshold, the current recipe text feature of the current text sample may instead be the output feature obtained by cascading the recipe mean feature and the principal component feature and processing the result through a fully connected layer.
  • the text information feature encoder includes an input layer, a text feature extraction layer, and an output data processing layer;
  • the input layer includes a text data input unit and an ingredient identification flag input unit;
  • the text data input unit includes a dish name input unit, a recipe step input unit and an ingredient input unit, which are used to sequentially input different types of data of each text sample of the training sample set;
  • the ingredient identification flag input unit is used to input a flag bit that identifies the task of actively learning ingredient information;
  • the text feature extraction layer is a bidirectional encoder based on a transformer, which is used to extract features from the output information of the input layer;
  • the output data processing layer is used to actively learn the virtual component labels corresponding to the main component features extracted by the text feature extraction layer based on the flag bits, and determine the current recipe text features of the current text sample based on the virtual component labels and the component prediction confidence threshold.
  • the output data processing layer includes a feature selection controller, a principal component output unit, and a recipe mean feature output unit;
  • the recipe mean feature output unit includes a dish name feature output unit, a recipe step feature output unit and an ingredient feature output unit, which is used to output the feature averages of the dish name feature, the recipe step feature and the ingredient feature;
  • a principal component output unit used to output principal component features and obtain virtual component labels by performing active learning tasks
  • a feature selection controller is used to determine the current recipe text features based on the virtual ingredient labels and the ingredient prediction confidence threshold, and switch the principal component output unit and the recipe mean feature output unit to output the current recipe text features.
  • the principal component output unit includes a first fully connected layer, a mapping layer, a second fully connected layer, and a loss calculation layer;
  • the first fully connected layer is used to receive the feature information output at the position corresponding to the ingredient identification flag input unit;
  • a mapping layer is used to perform nonlinear mapping processing on feature information
  • the second fully connected layer is used to map the features obtained after the mapping process to the principal components, so as to obtain the principal component features with the same dimension as the recipe ingredient information;
  • the loss calculation layer is used to actively learn virtual ingredient labels of principal component features based on recipe ingredient information.
  • actively learning the virtual ingredient labels of the principal component features based on the recipe ingredient information includes:
  • the vector data corresponding to the virtual ingredient label has the same dimension as the vector data corresponding to the main component feature;
  • the loss calculation formula is:
  • loss_cla is the loss information;
  • M is the dimension of the vector data corresponding to the principal component feature;
  • sigmoid() is the sigmoid function;
  • label_m is the element at the m-th position of the vector data corresponding to the virtual ingredient label;
  • cla_m is the element at the m-th position of the vector data corresponding to the principal component feature.
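  • A plausible reconstruction of this formula, assuming the standard multi-label binary cross-entropy implied by the symbol definitions above (an assumption, not the patent's verbatim equation):

$$\mathrm{loss}_{\mathrm{cla}} = -\frac{1}{M}\sum_{m=1}^{M}\Big[\mathrm{label}_m \log\big(\mathrm{sigmoid}(\mathrm{cla}_m)\big) + (1-\mathrm{label}_m)\log\big(1-\mathrm{sigmoid}(\mathrm{cla}_m)\big)\Big]$$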
  • recipe ingredient information is generated by analyzing recipe ingredients contained in each recipe sample in the target recipe text sample set, including:
  • a virtual ingredient label is generated based on the comparison result between the current text sample and the recipe ingredient information: for each sample ingredient in the principal component table, if the sample ingredient is also an ingredient of the current text sample, the position element corresponding to that sample ingredient is set to the first preset identification value;
  • otherwise, the position element corresponding to that sample ingredient is set to the second preset identification value.
  • before using the text information feature encoder to extract the principal component features and the recipe mean features of the current text sample and actively learning the virtual ingredient labels of the principal component features based on the recipe ingredient information, the method further includes:
  • each word of the flag information is mapped to a corresponding high-dimensional vector for input into the text information feature encoder.
  • before using the text information feature encoder to extract the principal component features and the recipe mean features of the current text sample, the method further includes:
  • a text vector is generated for inputting into a text information feature encoder.
  • a step graph feature encoder is used to extract a current recipe image feature of a current operation graph sample corresponding to a current text sample, including:
  • the step graph feature encoder includes a feature extraction network and a feature fusion network;
  • the image features of each step diagram are input into the feature fusion network to obtain the current recipe image features of the current operation diagram sample.
  • the feature fusion network is a long short-term memory neural network
  • the image features of each step diagram are input into the feature fusion network to obtain the current recipe image features of the current operation diagram sample, including:
  • where LSTM_i is the i-th LSTM unit;
  • σ() is the output of the feature extraction network;
  • I is the total number of step images contained in the current operation diagram sample.
  • a second aspect of an embodiment of the present application provides a device for training a text operation diagram mutual inspection model, comprising:
  • a model building module used to build a text operation graph mutual inspection model including a text information feature encoder and a step graph feature encoder;
  • an identification information generation module, used to generate recipe ingredient information by analyzing the recipe ingredients contained in each recipe sample in the training sample set;
  • the text data processing module is used to extract the principal component features and the recipe mean features of the current text sample from each group of training samples in the training sample set by using the text information feature encoder, and actively learn the virtual component labels of the principal component features based on the recipe component information; the recipe mean features are determined according to all text features of the current text sample extracted by the text information feature encoder; based on the virtual component labels and the component prediction confidence threshold, determine whether the current recipe text features of the current text sample are principal component features or recipe mean features;
  • An image feature extraction module for extracting a current recipe image feature of a current operation diagram sample corresponding to a current text sample using a step diagram feature encoder
  • the training module is used to input the current recipe text features and the current recipe image features into the text operation graph mutual inspection model for model training.
  • a third aspect of the embodiments of the present application provides a text operation diagram mutual inspection method, including:
  • a text operation graph mutual checking model is trained in advance using any of the above methods for training a text operation graph mutual checking model; text features to be matched of the text to be retrieved and image features to be matched of the operation diagram to be retrieved are acquired;
  • the text features to be matched and the image features to be matched are input into the text operation graph mutual inspection model to obtain the text operation graph mutual inspection results.
  • a fourth aspect of the embodiments of the present application provides a text operation diagram mutual inspection device, including:
  • a model training module used to train a text operation graph mutual checking model in advance using any of the above methods for training a text operation graph mutual checking model
  • a feature acquisition module is used to acquire text features to be matched of the text to be retrieved; and acquire image features to be matched of the operation image to be retrieved;
  • the mutual inspection result generation module is used to input the to-be-matched text features and the to-be-matched image features into the text operation graph mutual inspection model to obtain the text operation graph mutual inspection results.
  • An embodiment of the present application also provides an electronic device including a processor and a memory, wherein the processor is used to implement any of the above methods for training a text operation diagram mutual checking model and/or the steps of the above text operation diagram mutual checking method when executing a computer program stored in the memory.
  • an embodiment of the present application further provides a non-volatile readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method for training a text operation diagram mutual checking model as described above and/or the steps of the text operation diagram mutual checking method as described above are implemented.
  • the advantage of the technical solution provided by the present application is that a function that can actively learn the recipe ingredients contained in the recipe text data based on the recipe ingredient information is set in the text operation graph mutual inspection model;
  • in this way, the text feature extraction accuracy of the text operation graph mutual inspection model can be well verified, and the recipe text features used for image-text matching can be adjusted in time, so that the high-level semantic information of the recipe text can be well extracted, high-reliability classification can be achieved, and redundant noise can be removed, thereby effectively improving the accuracy of mutual retrieval between recipe text and recipe operation diagram.
  • the embodiments of the present application also provide, for the method of training a text operation diagram mutual checking model, a corresponding text operation diagram mutual checking method, an implementation device, an electronic device, and a non-volatile readable storage medium, further making the method more practical; these have corresponding advantages.
  • FIG. 1 is a flow chart of a method for training a text operation diagram mutual inspection model provided in an embodiment of the present application;
  • FIG. 2 is a schematic diagram of the structural framework of a text information feature encoder provided in an embodiment of the present application;
  • FIG. 3 is a schematic flow chart of a text operation diagram mutual inspection method provided in an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a framework of an exemplary application scenario provided in an embodiment of the present application;
  • FIG. 5 is a schematic diagram of a framework of a text operation diagram mutual inspection model in an exemplary application scenario provided in an embodiment of the present application;
  • FIG. 6 is a structural diagram of a specific implementation of a device for training a text operation diagram mutual inspection model provided in an embodiment of the present application;
  • FIG. 7 is a structural diagram of a specific implementation of a text operation diagram mutual inspection device provided in an embodiment of the present application;
  • FIG. 8 is a structural diagram of a specific implementation of an electronic device provided in an embodiment of the present application.
  • FIG. 1 is a flow chart of a method for training a text operation diagram mutual inspection model provided by an embodiment of the present application.
  • the embodiment of the present application may include the following contents:
  • S101 Pre-construct a text operation graph mutual inspection model including a text information feature encoder and a step graph feature encoder.
  • the text operation diagram mutual inspection model in this step is used to perform the mutual retrieval task between recipe text and recipe operation diagrams: the text data to be retrieved or the operation diagram data to be retrieved is input into the trained model, which reads the corresponding data from the specified database to be retrieved for matching, and outputs the target recipe operation diagram that matches the text to be retrieved, or the target recipe text that matches the operation diagram to be retrieved. For example, if the task is to retrieve the operation diagram corresponding to the text to be retrieved from an image database, the text to be retrieved is input into the text operation diagram mutual inspection model.
  • the text operation diagram mutual inspection model matches the recipe text features of the text to be retrieved with the recipe image features of each operation diagram in the image database, determines the recipe operation diagram with the highest similarity as the target recipe operation diagram and outputs it.
  • the text information feature encoder is used to encode the input recipe text data and output the final recipe text features; the step diagram feature encoder is used to encode the input recipe operation diagram data and output the final recipe operation diagram features.
  • S102 Generate recipe ingredient information by analyzing recipe ingredients included in each recipe sample in the target recipe text sample set in advance.
  • the target recipe text sample set may be composed of all or part of the recipe text samples of the training sample set for training the text operation diagram mutual inspection model, or may be composed of recipe texts selected from other data sets, which does not affect the implementation of the present application.
  • the training sample set referred to in this embodiment is sample data for training the text operation diagram mutual inspection model, and the training sample set includes multiple groups of training samples, each group of training samples includes corresponding text samples and operation diagram samples, that is, the text samples and the operation diagram samples are a set of sample data that matches each other.
  • the text samples of this embodiment and the subsequent texts to be retrieved are all recipe texts, and the recipe texts include three types of data: dish names, cooking steps and ingredients.
  • the operation diagram samples and the subsequent operation diagrams to be retrieved are all recipe operation diagrams.
  • the operation diagram or operation diagram sample of the present application includes a group of sub-images with a sequential operation sequence, and each sub-image of the group of images corresponds to an operation step in the text data or text sample, that is, a cooking step.
  • the recipe ingredient information refers to recipe ingredient statistical information generated by reading the recipe ingredients contained in each recipe sample; it is used to identify the ingredient data contained in a text sample or in a sample to be retrieved.
  • S103 Using the text information feature encoder, extract the principal component features and the recipe mean features of the current text sample, and actively learn the virtual ingredient labels of the principal component features based on the recipe ingredient information; the recipe mean features are determined based on all text features of the current text sample extracted by the text information feature encoder. Based on the virtual ingredient labels and the ingredient prediction confidence threshold, determine whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature.
  • each set of training samples includes a text sample and an operation diagram sample corresponding to each other, for the text sample, the text sample is input into the text information feature encoder, which includes a text input function, a feature extraction function and a text output function with an active learning function.
  • the text information feature encoder first extracts the text features of the input text sample based on the feature extraction function.
  • the text sample of this embodiment includes three types of text data: dish name, cooking steps and recipe ingredients. Each type of text data will extract corresponding text features, and this embodiment also has an input bit for indicating an ingredient identification flag for an active learning function.
  • the ingredient identification flag and the text sample or the sample to be retrieved are together used as the model input, and each input position corresponds to an output: the output corresponding to the input position of the ingredient identification flag is the principal component feature, the output corresponding to the input position of the dish name is the dish name feature, the output corresponding to the input position of the cooking steps is the cooking step feature, and the output corresponding to the input position of the recipe ingredients is the recipe ingredient feature.
  • the recipe mean feature of this embodiment is the feature generated after the recipe ingredient feature, the cooking step feature and the dish name feature are combined, that is, the recipe mean feature is determined according to all the text features of the current text sample extracted by the text information feature encoder.
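  • As a minimal sketch of this combination (assuming PyTorch; the 768 feature dimension is an illustrative assumption), the recipe mean feature can be computed as the element-wise mean of the per-type text features:

```python
import torch

# Hedged sketch: the recipe mean feature as the element-wise mean of the
# encoder's per-type text features. The dimension 768 is illustrative.
dish_name_feat = torch.randn(768)    # dish name feature
cook_step_feat = torch.randn(768)    # cooking step feature
ingredient_feat = torch.randn(768)   # recipe ingredient feature

recipe_mean_feature = torch.stack(
    [dish_name_feat, cook_step_feat, ingredient_feat]
).mean(dim=0)                        # recipe mean feature
```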
  • the feature extraction function of the text information feature encoder can be based on any existing text feature extraction model, such as the vector space model, the word frequency method, the document frequency method, etc., which does not affect the implementation of this application.
  • the virtual component label is used to label the principal component feature obtained by learning the principal component feature through the active learning function.
  • the text feature of the current text sample finally output by the text information feature encoder is called the current recipe text feature.
  • this feature is either the principal component feature or the recipe mean feature; which of the two it is can be determined based on the learned virtual ingredient label and the preset ingredient prediction confidence threshold.
  • the ingredient prediction confidence threshold identifies the lower limit at which the extracted principal component feature is usable. If the virtual ingredient label and the ingredient prediction confidence threshold indicate that the currently extracted principal component feature is a high-precision feature, the principal component feature is directly used as the feature for matching against the image features of the operation diagram. If they indicate that it is a low-precision feature, the principal component feature is not used directly; instead, the principal component feature and the recipe mean feature jointly determine the final output feature.
  • S104 Using a step diagram feature encoder, extract the current recipe image features of the current operation diagram sample corresponding to the current text sample.
  • this step extracts corresponding image features of the operation diagram samples corresponding to the text samples. Since the operation diagram samples contain a set of step diagrams, the image features of the operation diagram samples are a collection of image features of this set of step diagrams. For ease of description, the operation diagram samples corresponding to the current text samples are referred to as current operation diagram samples, and the image features of the current operation diagram samples are referred to as current recipe image features.
  • the present application may use any network structure that can extract image features to build a step diagram feature encoder, such as an artificial neural network, VGG (Visual Geometry Group Network), etc., and the present application does not impose any limitation on this.
  • S105 Input the current recipe text features and the current recipe image features into the text operation graph mutual inspection model to perform model training.
  • the text feature information of the text samples of the group of training samples and the image features of the corresponding operation diagram samples are input into the text operation diagram mutual inspection model built in step S101.
  • a loss function is used to guide the training of the model, and the network parameters of the text operation diagram mutual inspection model are updated by methods such as gradient backpropagation until a termination condition is met, for example reaching the maximum number of iterations or achieving good convergence.
  • the training process of the text operation diagram mutual inspection model may include a forward propagation stage and a backpropagation stage.
  • the forward propagation stage is the stage in which the data is propagated from a low level to a high level
  • the backpropagation stage is the stage in which the error is propagated from a high level to a low level when the result obtained by the forward propagation does not match the expectation.
  • all network layer weights of the text operation diagram mutual inspection model are randomly initialized; then, the text features and image features carrying the data type information are input, and the output value is obtained through the forward propagation of each layer of the model; the loss value between the output value of the model and the target value is then calculated based on the loss function.
  • the error is backpropagated back to the text operation diagram mutual inspection model, and the backpropagation error of each layer of the text operation diagram mutual inspection model is calculated in turn.
  • Each layer of the text operation graph mutual inspection model adjusts all weight coefficients of the text operation graph mutual inspection model based on the corresponding back propagation error to achieve weight update.
  • a new batch, i.e., the image features and the text features carrying data type information of the next set of training samples, is then randomly selected, and the above process is repeated until the error between the calculated model output value and the target value is less than the preset threshold, at which point training terminates and the current parameters of each model layer are used as the network parameters of the trained text operation graph mutual inspection model. A minimal sketch of such a training loop is given below.
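  • The following PyTorch sketch illustrates the forward propagation, loss computation, backpropagation, and weight-update cycle described above; the shared-space projections, the contrastive-style matching loss, and all dimensions and names are illustrative assumptions rather than the patent's exact architecture:

```python
import torch
from torch import nn

# Hedged sketch of one training loop. All names and dimensions are illustrative.
text_feats = torch.randn(8, 1024)   # current recipe text features (one batch)
img_feats = torch.randn(8, 1024)    # current recipe image features (one batch)

proj_t = nn.Linear(1024, 512)       # project both modalities into a shared space
proj_v = nn.Linear(1024, 512)
optimizer = torch.optim.Adam(
    list(proj_t.parameters()) + list(proj_v.parameters()), lr=1e-4)

for step in range(1000):            # iterate until a termination condition is met
    optimizer.zero_grad()
    t = nn.functional.normalize(proj_t(text_feats), dim=1)
    v = nn.functional.normalize(proj_v(img_feats), dim=1)
    logits = t @ v.T                # pairwise text-image similarity matrix
    labels = torch.arange(8)        # matched pairs lie on the diagonal
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()                 # backpropagation stage
    optimizer.step()                # weight update
    if loss.item() < 0.01:          # error below preset threshold: stop
        break
```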
  • a function is set in the text operation diagram mutual check model that can actively learn the recipe ingredients contained in the recipe text data based on the recipe ingredient information.
  • the text feature extraction accuracy of the text operation diagram mutual check model can be well verified, and the recipe text features used for image-text matching can be adjusted in time, so that the high-level semantic information of the recipe text can be well extracted, high-reliability classification can be achieved, redundant noise can be removed, and the accuracy of mutual retrieval between the recipe text and the recipe operation diagram can be effectively improved.
  • the present application also provides an optional implementation method, which may include the following contents:
  • Each element in the virtual component label of the present embodiment is used to represent the confidence of the principal component corresponding to the recipe ingredient information contained in the current text sample; a target component greater than or equal to the component confidence threshold is determined from the virtual component label, and the principal component probability prediction confidence is determined according to the confidence corresponding to each target component; based on the numerical relationship between the principal component probability prediction confidence and the component prediction confidence threshold, it is determined whether the current recipe text feature of the current text sample is a principal component feature or a recipe mean feature.
  • active learning such as self-supervised learning can obtain the classification probability corresponding to the principal component feature, and the classification probability represents the principal component probability prediction value of the input sample output by the active learning network such as the principal component self-supervised classification network, for example: [0.001, 0.02, ..., 0.91, ..., 0.006]. Based on this, when determining the final output feature type, this embodiment can switch the input feature according to the principal component probability prediction confidence value of the input sample.
  • the switching method is as follows: Calculate the principal component probability prediction confidence, and the calculation method is as follows: Obtain the active learning classification probability in the virtual component label such as [0.001, 0.02, ..., 0.91, ..., 0.006], each number represents the confidence that the sample contains the corresponding principal component in the principal component information table, and the component confidence threshold can be, for example, 0.5. According to the threshold, all values greater than the threshold 0.5 in the classification probability are extracted to construct a credible principal component information table; calculate the mean of all probability values in the credible principal component information table, and record it as the principal component probability prediction confidence. Then, the final output feature can be determined according to the principal component probability prediction confidence and the preset component prediction confidence threshold value such as 0.9.
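  • A sketch of this confidence computation (assuming Python; the 0.5 ingredient confidence threshold and the 0.9 ingredient prediction confidence threshold are the example values mentioned above):

```python
# Hedged sketch of the principal component probability prediction confidence.
def principal_component_confidence(probs, component_conf_threshold=0.5):
    """probs: classification probabilities from the virtual ingredient label,
    e.g. [0.001, 0.02, ..., 0.91, ..., 0.006]."""
    credible = [p for p in probs if p > component_conf_threshold]
    if not credible:                       # no credible principal components
        return 0.0
    return sum(credible) / len(credible)   # mean over the credible table

conf = principal_component_confidence([0.001, 0.02, 0.91, 0.88, 0.006])
use_principal = conf > 0.9  # compare against the ingredient prediction confidence threshold
```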
  • in this embodiment, the output control mode of the output text feature can be set in advance and switched flexibly according to different needs.
  • the output control mode of this embodiment includes a hybrid switching mode and a binary switching mode; the corresponding feature output is selected based on the current output control mode.
  • the process may include:
  • obtain the current output control mode and determine whether it is the binary switching mode or the hybrid switching mode;
  • if the current output control mode is the binary switching mode, determine whether the principal component probability prediction confidence is greater than the ingredient prediction confidence threshold; if it is greater, the current recipe text feature of the current text sample is the principal component feature, otherwise it is the recipe mean feature;
  • if the current output control mode is the hybrid switching mode, compare the principal component probability prediction confidence with the ingredient prediction confidence threshold and the preset confidence limit threshold; the confidence limit threshold can be flexibly determined according to actual needs, and the present application does not impose any restrictions on the values of the two thresholds;
  • if the principal component probability prediction confidence is greater than the ingredient prediction confidence threshold, the current recipe text feature of the current text sample is the principal component feature; if it is less than or equal to the ingredient prediction confidence threshold and greater than or equal to the confidence limit threshold, the current recipe text feature is the feature sum of the recipe mean feature and the principal component feature; if it is less than the confidence limit threshold, the current recipe text feature is the recipe mean feature.
  • the current recipe text feature of the current text sample may also be the output feature after feature cascading of the recipe mean feature and the principal component feature and processing through a fully connected layer.
  • the recipe mean feature is the mean of the bidirectional encoder output features corresponding to the dish name, ingredients, and step text. If the principal component probability prediction confidence > ingredient prediction confidence threshold, the confidence is high, indicating that the text feature extraction function of the text information feature encoder and the principal component active learning classification network can extract the high-level semantic information of the recipe text well, achieve high-reliability classification, and remove redundant noise; this feature expresses the recipe well, so the ingredient active learning classification feature, i.e., the principal component feature, is output. If the principal component probability prediction confidence ≤ ingredient prediction confidence threshold, the mean of the bidirectional encoder outputs for the dish name, ingredients, and step text is output instead;
  • in this case, the text feature extraction function and the principal component active learning classification network cannot confirm the principal components of the recipe, and the principal component feature still contains considerable noise.
  • this embodiment can take the mean of all output features extracted from the features corresponding to the input recipe text as the final feature of the entire recipe text.
  • when the confidence limit threshold ≤ principal component probability prediction confidence ≤ ingredient prediction confidence threshold, the feature obtained by adding the recipe mean feature and the principal component feature can also be output as the final current recipe text feature of the entire recipe text; alternatively, the recipe mean feature and the principal component feature can be feature-cascaded and passed through a fully connected layer, whose output is used as the final current recipe text feature, as in the sketch below.
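  • This sketch combines the binary and hybrid switching behaviour described above (assuming PyTorch; the function name, the thresholds, and the dimensions are illustrative):

```python
import torch
from torch import nn

# Hedged sketch of the feature selection controller described above.
def select_text_feature(principal_feat, mean_feat, conf, mode,
                        pred_threshold=0.9, limit_threshold=0.5, fc=None):
    if mode == "binary":                        # binary switching mode
        return principal_feat if conf > pred_threshold else mean_feat
    # hybrid switching mode
    if conf > pred_threshold:
        return principal_feat                   # high-confidence principal feature
    if conf >= limit_threshold:
        if fc is not None:                      # cascade variant: concat + FC
            return fc(torch.cat([mean_feat, principal_feat], dim=-1))
        return mean_feat + principal_feat       # feature-sum variant
    return mean_feat                            # low confidence: recipe mean feature

# Example: hybrid mode with the cascade variant.
fc = nn.Linear(2 * 768, 768)
out = select_text_feature(torch.randn(768), torch.randn(768),
                          conf=0.7, mode="hybrid", fc=fc)
```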
  • there is no limitation on how to perform step S102.
  • an optional method for generating recipe ingredient information is provided, which may include the following steps:
  • the process of generating a virtual component label may include: comparing the existing components contained in the current text sample with the sample components in the principal component table one by one; for each existing component, if the current sample component in the principal component table is the same as the current existing component, then set the position element corresponding to the current sample component to the first preset identification value; if the current sample component in the principal component table is different from the current existing component, then set the position element corresponding to the current sample component to the second preset identification value; generate a virtual component label according to the value of the position element corresponding to each sample component in the principal component table.
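  • A sketch of this label construction (assuming Python; the ingredient names and the 1/0 identification values are illustrative):

```python
# Hedged sketch of virtual ingredient label generation from the principal
# component table; identification values 1 and 0 follow the description.
def make_virtual_label(sample_ingredients, principal_table,
                       first_value=1, second_value=0):
    present = set(sample_ingredients)
    # One element per row of the principal component table.
    return [first_value if ingredient in present else second_value
            for ingredient in principal_table]

label = make_virtual_label(["egg", "flour"],
                           ["egg", "sugar", "flour", "milk"])
# label == [1, 0, 1, 0]
```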
  • the text sample includes multiple types of data, that is, the recipe text may include three types of data: ingredients, cooking steps, and dish names; for ease of description, the ingredient data read from the recipe sample is called the original ingredient, and the ingredients selected from these original ingredients through data merging and data deletion operations may be called sample ingredients.
  • the data selection method listed in this embodiment can remove unimportant data from the original ingredients, thereby improving the overall data processing efficiency.
  • the recipe ingredient information can be presented in the form of a table, that is, a principal component table is generated based on each sample ingredient.
  • This embodiment does not impose any limitation on the structure of the text information feature encoder.
  • This embodiment also provides an optional structure of the text information feature encoder, which may include the following contents:
  • the text information feature encoder may include an input layer, a text feature extraction layer, and an output data processing layer; the input layer includes a text data input unit and an ingredient identification flag input unit; the text data input unit includes a dish name input unit, a recipe step input unit, and an ingredient input unit, which are used to sequentially input different types of data of each text sample of the training sample set; the ingredient identification flag input unit is used to input a flag bit used to identify the task of actively learning ingredient information.
  • the text feature extraction layer is a transformer-based bidirectional encoder, which is used to extract features from the output information of the input layer; the output data processing layer is used to actively learn the virtual ingredient labels corresponding to the principal component features extracted by the text feature extraction layer based on the flag bit, and determine the current recipe text features of the current text sample based on the virtual ingredient labels and the ingredient prediction confidence threshold.
  • multiple input bits can be set, and different input bits correspond to different input data.
  • the text data input unit includes multiple input bits, and different input bits correspond to data of different data types.
  • the recipe text includes cooking step data, ingredient data and dish name data.
  • the text data input unit may include input bits for cooking step data, ingredient data and dish name data, as shown in the bottom layer of Figure 2.
  • the flag bit used to identify the execution of the active learning ingredient information task can be flexibly selected according to actual needs. For example, CLS can be used as the flag bit.
  • the ingredient identification flag input unit is used to input the flag bit.
  • if the current task requires active learning, the ingredient identification flag input unit inputs the corresponding flag bit; if the current task does not require active learning, the unit inputs no flag bit, or inputs a flag indicating that the active learning task is not to be performed.
  • as another way of specifying the flag, a column vector can be input directly into the input layer of the model, where the element at the starting position of the vector is the flag vector element, followed by the text feature vector elements.
  • the transformer-based bidirectional encoder adopts the transformer model structure.
  • it may include a Masked Multihead Attention layer, a first Add+Normalization layer, a Feed Forward layer, a second Add+Normalization layer and a bidirectional attention module connected in sequence.
  • the upper and lower attention modules input information to the Masked Multihead Attention layer.
  • the output data processing layer includes a feature selection controller, a principal component output unit and a recipe mean feature output unit;
  • the recipe mean feature output unit includes a dish name feature output unit, a recipe step feature output unit and an ingredient feature output unit, which is used to output the feature averages of the dish name feature, the recipe step feature and the ingredient feature;
  • the principal component output unit is used to output the principal component feature and obtain a virtual component label by performing an active learning task;
  • the feature selection controller is used to determine the current recipe text feature based on the virtual component label and the ingredient prediction confidence threshold, and switch the principal component output unit and the recipe mean feature output unit to output the current recipe text feature.
  • the feature selection controller is used to switch the output control mode.
  • the confidence limit threshold can be set manually at the beginning of training.
  • the switching mode of the feature selection controller that is, the binary switching mode or the mixed switching mode, can also be set manually during training.
  • the output data processing layer processes the features output by the text feature extraction layer, that is, the output data processing layer may first identify whether there is a flag bit, and if there is a flag bit, determine whether the flag bit is used to perform an active learning task, and if the flag bit is used to perform an active learning task, then the main component features output by the main component output unit are actively learned based on the recipe component information. If the flag bit is not used to perform an active learning task, no active learning is required.
  • the principal component output unit may include a first fully connected layer, a mapping layer, a second fully connected layer, and a loss calculation layer; the first fully connected layer is used to receive the feature information output at the position corresponding to the ingredient identification flag input unit; the mapping layer is used to perform nonlinear mapping processing on the feature information based on a mapping function, such as ReLU (rectified linear unit) or Leaky ReLU (leaky rectified linear unit).
  • the second fully connected layer is used to map the features obtained after the mapping processing to the main component to obtain the main component features with the same dimension as the recipe component information; the loss calculation layer is used to actively learn the virtual component labels of the main component features based on the recipe component information.
  • in the principal component output unit, the output corresponding to the ingredient identification flag input unit passes through the first fully connected layer (FC), then through the ReLU layer for nonlinear mapping, and finally through the second fully connected layer (FC), which maps the features to the principal component data of the current text sample; a sketch follows.
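  • A hedged PyTorch sketch of this unit; the class name and dimensions are illustrative assumptions, and the loss layer is realized here as the element-wise sigmoid cross-entropy consistent with the loss formula discussed below:

```python
import torch
from torch import nn

# Hedged sketch: FC -> ReLU -> FC, plus a multi-label sigmoid cross-entropy
# loss against the virtual ingredient label. Dimensions are illustrative.
class PrincipalComponentOutputUnit(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=768, num_ingredients=512):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, hidden_dim)          # first FC layer
        self.act = nn.ReLU()                                # nonlinear mapping layer
        self.fc2 = nn.Linear(hidden_dim, num_ingredients)   # map to principal components

    def forward(self, flag_feature, virtual_label=None):
        cla = self.fc2(self.act(self.fc1(flag_feature)))    # principal component feature
        if virtual_label is None:
            return cla, None
        # Loss calculation layer: sigmoid + binary cross-entropy per element.
        loss = nn.functional.binary_cross_entropy_with_logits(cla, virtual_label)
        return cla, loss

unit = PrincipalComponentOutputUnit()
cla, loss = unit(torch.randn(2, 768),
                 virtual_label=torch.randint(0, 2, (2, 512)).float())
```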
  • This embodiment also provides an optional implementation of how to actively learn virtual component labels of main component features based on recipe component information, which may include the following content:
  • the loss calculation formula is as given above:
  • loss_cla is the loss information;
  • M is the dimension of the vector data corresponding to the principal component feature;
  • sigmoid() is the sigmoid function;
  • label_m is the element at the m-th position of the vector data corresponding to the virtual ingredient label;
  • cla_m is the element at the m-th position of the vector data corresponding to the principal component feature.
  • each ingredient of a text sample may correspond to one or more ingredients in the recipe ingredient information, or may not exist in the recipe ingredient information at all.
  • a virtual component label is generated through data comparison or feature comparison.
  • the vector data corresponding to the virtual component label has the same dimension as the vector data corresponding to the principal component feature.
  • the recipe component information can be a principal component table.
  • the principal component feature includes the principal component data of the text sample. If the component of the principal component table exists in the principal component feature of the recipe text, the corresponding position variable of the principal component table can be set to 1, otherwise, it is set to 0.
  • the processed principal component table can be used as a label, that is, a virtual component label, and the vector dimension corresponding to the label is the same as the number of rows in the principal component table.
  • this embodiment provides an optional model structure of the text information feature encoder, which is conducive to extracting more accurate text features.
  • before using the text information feature encoder to extract the current recipe text features of the current text sample, the method can also include:
  • the text type identifier can be flexibly selected in advance according to actual application needs.
  • the recipe text sample includes three types of text information: cooking steps, ingredient information and dish name.
  • the text type identifier of the dish name can be set to 1;
  • the text type identifier of the ingredient information can be set to 2;
  • the text type identifier of the operation steps can be set to 3. All text information is packaged into one long input sequence.
  • the wordToembedding method can be used to map each word of the dish name into a high-dimensional vector.
  • the position information increases sequentially according to the order of the words.
  • each ingredient information can be separated by a comma, and then all ingredient information can be mapped into a high-dimensional column vector through the wordToembedding method.
  • the text type of the ingredient information is defined as 2 in this application.
  • the position information of the ingredient information increases in sequence according to the input order of the ingredients, as shown in Figure 2.
  • each step can be encoded in sequence, such as the first step is encoded as sequence number 1, and the second step can be encoded as sequence number 2; then each word of all the operation steps is mapped into a high-dimensional column vector through the wordToembedding method.
  • the text type identifier and position information can also be mapped through the wordToembedding method to obtain the embedding features of the text type identifier and position information, that is, a method of using a low-dimensional vector to represent an object.
  • the embedding features of the text information, text type identifier, and position information can be added and input into the text information feature encoder.
  • a flag for identifying the execution of the active learning ingredient information task may be obtained, and a text type identifier value and a position information value may be set for the flag to generate flag information;
  • each word of the flag information is mapped to a corresponding high-dimensional vector for input into the text information feature encoder;
  • the flag is pre-defined as the CLS flag, its position information is defined as 0, and its text type identifier is defined as 0;
  • the flag, its position information, and the text type identifier together form the flag information, which is mapped through the wordToembedding method to obtain the embedding feature of the flag; this is added together with the embedding features of the text information, text type identifier, and position information.
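  • A hedged sketch of this input construction (assuming PyTorch; the vocabulary size, feature dimension, and token ids are illustrative assumptions):

```python
import torch
from torch import nn

# Hedged sketch: token, text-type, and position embeddings are computed
# separately ("wordToembedding") and summed. Type ids follow the text:
# 0 = CLS flag, 1 = dish name, 2 = ingredients, 3 = operation steps.
vocab_size, max_pos, num_types, dim = 30000, 512, 4, 768
tok_emb = nn.Embedding(vocab_size, dim)
type_emb = nn.Embedding(num_types, dim)
pos_emb = nn.Embedding(max_pos, dim)

token_ids = torch.tensor([[101, 2054, 2003, 3185]])  # CLS flag + dish-name words (illustrative ids)
type_ids = torch.tensor([[0, 1, 1, 1]])              # flag, then dish-name text type
pos_ids = torch.tensor([[0, 1, 2, 3]])               # flag at position 0, words in order

encoder_input = tok_emb(token_ids) + type_emb(type_ids) + pos_emb(pos_ids)
```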
  • This embodiment does not impose any limitation on the structure of the step graph feature encoder.
  • This embodiment also provides an optional model structure of the step graph feature encoder, which may include the following contents:
  • a step diagram feature encoder is pre-trained for extracting image features of an operation diagram, which may include a feature extraction network and a feature fusion network; the feature extraction network is used to extract image features of each step diagram of the input operation diagram, and the feature fusion network is used to integrate the image features of each operation diagram extracted by the feature extraction network into one image feature to serve as the image feature of the input operation diagram.
  • the trained step diagram feature encoder after extracting text features of the text sample, since each group of training samples includes a pair of matching text samples and operation diagram samples, for ease of description, the text sample from which the text features have been extracted is called the current text sample, and the operation diagram sample corresponding to the current text sample is called the current operation diagram sample.
  • the current operation diagram sample is input into the step diagram feature encoder, and the step diagram feature encoder uses the feature extraction network to extract features of the current operation diagram sample to obtain image features of all step diagrams contained in the current operation diagram sample.
  • the step diagram feature encoder inputs the image features of each step diagram into the feature fusion network to obtain the current recipe image features of the current operation diagram sample.
  • the feature fusion network may be a long short-term memory neural network. Accordingly, the process of inputting the image features of each step diagram into the feature fusion network to obtain the current recipe image features of the current operation diagram sample may include:
  • where LSTM_i is the i-th LSTM (Long Short-Term Memory) unit, σ() is the output of the feature extraction network, and I is the total number of step images contained in the current operation diagram sample.
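  • A hedged sketch of this fusion (assuming PyTorch; the dimensions are illustrative, and taking the last hidden state as the fused feature is an assumption):

```python
import torch
from torch import nn

# Hedged sketch of the feature fusion network: per-step image features
# sigma(img_1)..sigma(img_I) are fed in step order into an LSTM, and the
# final hidden state serves as the current recipe image feature.
I, feat_dim, fused_dim = 6, 2048, 1024          # I step images (illustrative)
step_features = torch.randn(1, I, feat_dim)     # outputs of the feature extraction network

lstm = nn.LSTM(input_size=feat_dim, hidden_size=fused_dim, batch_first=True)
h_all, (h_last, _) = lstm(step_features)        # h_i = LSTM(sigma(img_i))
recipe_image_feature = h_last[0]                # shape (1, fused_dim)
```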
  • This embodiment generates image features of the operation diagram samples by separately extracting features and fusion features, which is beneficial to improving the accuracy of image feature extraction.
  • this embodiment also provides a text operation diagram mutual inspection method, please refer to FIG. 3, which may include the following contents:
  • This embodiment may train the text operation diagram mutual checking model by using the method described in any of the above embodiments of the method for training the text operation diagram mutual checking model.
  • the text features to be matched are the current recipe text features of the current sample text in the above embodiments; this step can be performed in the same way as the text feature extraction for text samples in the above embodiments and is not described in detail here.
  • this step can be performed in the same way as the image feature extraction for operation diagram samples in the above-mentioned embodiments and is not described in detail here.
  • the weight coefficients trained in S301 can be preloaded; feature extraction is performed on the operation diagrams or texts to be retrieved, and the resulting features are stored in the text data set to be retrieved or the image database to be retrieved.
  • the user gives any data to be retrieved, which can be an operation diagram to be retrieved or a text to be retrieved.
  • the text feature information or image features of the data to be retrieved are extracted and input into the text operation diagram mutual inspection model.
  • the features of the data to be retrieved are distance matched with all sample features in the corresponding data set to be retrieved.
  • for example, if the data to be retrieved is text data,
  • the corresponding data set to be retrieved is the image data set to be retrieved;
  • the Mahalanobis distance is calculated between the features of the text to be retrieved and all operation diagram features in that data set;
  • the sample with the smallest distance is the operation diagram that best matches the text to be retrieved, and that operation diagram is output as the retrieval result (see the sketch below).
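A small sketch (Python/NumPy) of this distance matching step; the inverse covariance used by the Mahalanobis distance is estimated from the database features here, which is an assumption, since the text does not specify how it is obtained:

    import numpy as np

    def retrieve(query_feat, db_feats):
        # db_feats: (num_samples, dim) features of all operation diagrams to be retrieved
        vi = np.linalg.pinv(np.cov(db_feats, rowvar=False))  # inverse covariance estimate
        diff = db_feats - query_feat
        d2 = np.einsum('nd,de,ne->n', diff, vi, diff)        # squared Mahalanobis distances
        return int(np.argmin(d2))                            # index of the best-matching operation diagram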
  • this embodiment can achieve high-precision mutual retrieval between recipe text and recipe step diagram.
  • this embodiment also takes the mutual retrieval of the recipe text operation diagram as an illustrative example to illustrate the process of implementing the mutual retrieval of the text operation diagram provided by the present application.
  • the execution process of the mutual retrieval task of the recipe text and the recipe operation diagram shown in this embodiment may include:
  • this embodiment may include a recipe retrieval terminal device and a cloud server.
  • a user may perform operations on the recipe retrieval terminal device.
  • the recipe retrieval terminal device interacts with the cloud server through a network.
  • the cloud server may deploy a text operation diagram mutual inspection model.
  • as shown in FIG. 5, in order to enable the text operation diagram mutual inspection model to realize the function of mutual retrieval between recipe text and recipe operation diagrams, the text operation diagram mutual inspection model needs to be trained.
  • the recipe retrieval terminal device may transmit a training sample set to the cloud server.
  • the training sample set may be pre-written into a USB flash drive, and the USB flash drive is inserted into the input interface of the recipe retrieval terminal device.
  • the training sample set may include multiple groups of training samples; each group includes a corresponding recipe text sample and recipe operation diagram sample, and each recipe text sample may include operation steps (instruction list), ingredient information (ingredients), and a dish name (Title). Instructions are the cooking steps, uniformly referred to as steps below; ingredients are the components of a dish, uniformly referred to as ingredients below.
  • the ingredient data of all recipe text samples can be obtained to generate an ingredient information list.
  • after the ingredient information list is generated, entries for the same ingredient are merged into one entry, and the count of each merged ingredient is recorded, for example [78 flour], [56 eggs], [67 tomatoes], [81 water], ..., [5 shepherd's purse], [3 bird's nest], and [2 shark fin].
  • in the merged ingredient information list, if the count of an ingredient is too small, for example less than 5, that ingredient is deleted from the table.
  • the filtered ingredient information is: [78 flour], [56 eggs], [67 tomatoes], [81 water], ..., [5 shepherd's purse].
  • the filtered ingredient information table is used as the final main component table, which can be defined as the variable Main-ing.
  • the main component table is a vector, and the vector length is equal to the number of rows of the filtered ingredient information.
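A sketch (Python) of building the main component table Main-ing as described above; the threshold of 5 follows the example, while the data layout (one ingredient list per recipe sample) is an assumption:

    from collections import Counter

    def build_main_ing(recipes, min_count=5):
        counts = Counter()
        for ingredients in recipes:          # one ingredient list per recipe text sample
            counts.update(set(ingredients))  # merge duplicate entries within a recipe
        # keep only ingredients whose merged count reaches the threshold
        return [ing for ing, c in counts.items() if c >= min_count]  # length = rows of the filtered table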
  • a text information feature encoder is built.
  • the wordToembedding method can be used to map each word into a high-dimensional vector, and the high-dimensional vector is used as the word's embedding feature.
  • the embedding features (word, text type, and position) are added, and the results are combined into one long input sequence that serves as the input of the text information feature encoder.
  • the CLS flag information used to identify active-learning classification is added at the first position of each recipe text, that is, the embedding feature of the CLS flag information is added at the starting position of the long input sequence.
  • the embedding feature of the CLS flag information is obtained by mapping the flag bit, its all-zero position information, and its text type identifier through the wordToembedding method.
  • at the output position of the basic transformer corresponding to CLS, the output feature is extracted to execute the active-learning classification task and, during model training, to compute the loss with the corresponding recipe step diagram features.
  • an optional implementation of the active-learning classification task: extract the output feature corresponding to CLS of the basic transformer model, as shown in FIG. 2.
  • the feature first passes through a fully connected layer FC, is then nonlinearly mapped through ReLU, and finally passes through another fully connected layer FC that maps the feature to the principal components, yielding the same dimension as Main-ing.
  • the feature is called cla, and cla will calculate the classification loss: extract the ingredient information of each recipe text, and compare the ingredient information of the recipe text with the generated principal component table Main-ing. If the component of the principal component table exists in the ingredient information of the recipe text, the corresponding position variable of the principal component table is set to 1, otherwise it is set to 0.
  • a vector called label will be obtained, and its dimension is the same as the number of rows of Main-ing.
  • the loss calculation relation of the above embodiment is then used to compute, from cla and its corresponding label, the BCELoss for multi-target classification (see the sketch below).
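A sketch (Python/PyTorch) of this classification head and loss; the layer widths and the Main-ing dimension are illustrative assumptions:

    import torch
    import torch.nn as nn

    class ClsHead(nn.Module):
        def __init__(self, dim=768, num_main_ing=1000):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, dim),           # first fully connected layer FC
                nn.ReLU(),                     # nonlinear mapping
                nn.Linear(dim, num_main_ing),  # map to the principal components (dimension of Main-ing)
            )

        def forward(self, cls_feat):
            return self.net(cls_feat)          # cla

    cla = ClsHead()(torch.randn(4, 768))                # CLS output features of a batch
    label = torch.randint(0, 2, (4, 1000)).float()      # 0/1 virtual component labels
    loss_cla = nn.BCELoss()(torch.sigmoid(cla), label)  # multi-target classification loss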
  • the ResNet backbone network can be used to extract the features of each recipe step diagram in the operation diagram, and the features of the ResNet network before the classification layer are obtained as the features of each image. Then the recipe step diagram features are input into the LSTM network to obtain the overall features of the entire recipe step image group, and the feature encoding output of the last LSTM unit is taken as the image features of the recipe operation diagram.
  • any loss function in the prior art, such as the L1 norm loss, mean square error loss, or cross entropy loss, can be used to guide model training until convergence.
  • the output feature of the basic transformer corresponding to CLS can be taken as the text information feature and, together with the feature encoding output of the last LSTM unit, the loss can be computed based on the following relation; the parameters of the above transformer, LSTM, and ResNet networks are then updated by gradient backpropagation:
  • the retrieval loss (a bidirectional triplet ranking loss, reconstructed with illustrative symbols since the original formula rendering is not reproduced here; d(·,·) denotes a feature distance) is:

        loss = Σ_{a=1}^{N} max(0, d(f_a^img, f_p^txt) − d(f_a^img, f_n^txt) + ∇)
             + Σ_{a=1}^{N} max(0, d(f_a^txt, f_p^img) − d(f_a^txt, f_n^img) + ∇)

    where N is the number of training sample groups, i.e., the total number of paired samples in this batch, and ∇ is a hyperparameter fixed during training, for example 0.3.
  • the image group features are traversed (N in total); the target selected by the traversal is called a and represents the anchor (anchor sample).
  • the text feature encoding paired with the anchor sample is recorded as p, representing positive.
  • the text features not paired with the anchor in this batch are recorded as n (negative); similarly, the same traversal is performed on the text features, where the selected target sample has its corresponding positive image group feature sample and its non-corresponding negative samples recorded in the same way.
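A sketch (Python/PyTorch) of the bidirectional triplet traversal just described; the Euclidean distance and the rolled negative selection are assumptions consistent with, but not dictated by, the text:

    import torch
    import torch.nn.functional as F

    def triplet_retrieval_loss(img_feats, txt_feats, margin=0.3):
        # img_feats, txt_feats: (N, dim); row n of each forms a matched pair
        N = img_feats.size(0)
        neg = torch.roll(torch.arange(N), 1)   # one simple choice of unpaired samples
        loss = 0.0
        for a, p, n in ((img_feats, txt_feats, txt_feats[neg]),    # image anchors
                        (txt_feats, img_feats, img_feats[neg])):   # text anchors
            d_ap = F.pairwise_distance(a, p)
            d_an = F.pairwise_distance(a, n)
            loss = loss + torch.clamp(d_ap - d_an + margin, min=0).mean()
        return loss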
  • the recipe retrieval terminal device may include a human-computer interaction module such as a display screen, an input interface, an input keyboard, etc., and also includes a wireless transmission module.
  • the input keyboard may be a soft keyboard presented on the display screen.
  • the input interface can be used to achieve connection with an external device such as a USB flash drive. There may be multiple input interfaces.
  • the user can input a retrieval request to the recipe retrieval terminal device through the input keyboard.
  • the retrieval request carries the information to be retrieved, such as a recipe text or a recipe operation diagram.
  • the recipe retrieval terminal can send the retrieval request to the cloud server through the wireless transmission module.
  • the cloud server retrieves the corresponding database based on the trained text operation diagram mutual inspection model and can feed back the final mutual retrieval results to the recipe retrieval terminal device.
  • the recipe retrieval terminal device can display the retrieved target recipe text or target recipe operation diagram to the user through the display screen.
  • the embodiment of the present application also provides a corresponding device for the method of training a text operation diagram mutual inspection model and the text operation diagram mutual inspection method, which further makes the method more practical.
  • the device can be described from the perspective of functional modules and hardware.
  • the device for training a text operation diagram mutual inspection model and the text operation diagram mutual inspection device provided in the embodiment of the present application are introduced below.
  • the device for training a text operation diagram mutual inspection model and the text operation diagram mutual inspection device described below may be referred to in correspondence with the method for training a text operation diagram mutual inspection model and the text operation diagram mutual inspection method described above.
  • FIG. 6 is a structural diagram of a device for training a text operation diagram mutual inspection model provided in an embodiment of the present application in a specific implementation manner, and the device may include:
  • a model building module 601 is used to build a text operation diagram mutual inspection model including a text information feature encoder and a step diagram feature encoder;
  • An identification information generating module 602 is used to generate recipe ingredient information by analyzing all recipe samples containing recipe ingredients in the training sample set;
  • the text data processing module 603 is used to extract the principal component features and the recipe mean features of the current text sample for each group of training samples in the training sample set by using the text information feature encoder, and actively learn the virtual component labels of the principal component features based on the recipe component information; the recipe mean features are determined according to all text features of the current text sample extracted by the text information feature encoder; based on the virtual component labels and the component prediction confidence threshold, determine whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature;
  • An image feature extraction module 604 is used to extract the current recipe image features of the current operation diagram sample corresponding to the current text sample using a step diagram feature encoder;
  • the training module 605 is used to input the current recipe text features and the current recipe image features into the text operation graph mutual inspection model to perform model training.
  • the text data processing module 603 may be used to: determine, from the virtual component label, target components greater than or equal to the component confidence threshold, and determine the principal component probability prediction confidence according to the confidence corresponding to each target component;
  • determine, based on the numerical relationship between the principal component probability prediction confidence and the component prediction confidence threshold, whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature.
  • each element in the virtual component label indicates the confidence that the current text sample contains the principal component corresponding to the recipe component information.
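A sketch (Python) of the confidence computation implied here: probabilities at or above the component confidence threshold form the trusted set, whose mean is the principal component probability prediction confidence; the 0.5 threshold follows the example given in the description:

    def prediction_confidence(virtual_label, comp_thresh=0.5):
        trusted = [p for p in virtual_label if p >= comp_thresh]  # target components
        return sum(trusted) / len(trusted) if trusted else 0.0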
  • the above text data processing module 603 can also be used to: obtain the current output control mode; if the current output control mode is a binary switching mode, determine whether the principal component probability prediction confidence is greater than the component prediction confidence threshold; if the principal component probability prediction confidence is greater than the component prediction confidence threshold, then the current recipe text feature of the current text sample is the principal component feature; if the principal component probability prediction confidence is less than or equal to the component prediction confidence threshold, then the current recipe text feature of the current text sample is the recipe mean feature.
  • the above text data processing module 603 can be further used to: obtain the current output control mode; if the current output control mode is a mixed switching mode, compare the numerical relationship between the principal component probability prediction confidence and the component prediction confidence threshold and the preset confidence limit threshold; if the principal component probability prediction confidence is greater than the component prediction confidence threshold, the current recipe text feature of the current text sample is the principal component feature; if the principal component probability prediction confidence is less than or equal to the component prediction confidence threshold and greater than or equal to the confidence limit threshold, the current recipe text feature of the current text sample is the feature sum of the recipe mean feature and the principal component feature; if the principal component probability prediction confidence is less than the confidence limit threshold, the current recipe text feature of the current text sample is the recipe mean feature.
  • the above text data processing module 603 can be further used for: if the principal component probability prediction confidence is less than or equal to the component prediction confidence threshold, and greater than or equal to the confidence limit threshold, then the current recipe text feature of the current text sample is the output feature after feature cascading of the recipe mean feature and the principal component feature and processing through the fully connected layer.
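A sketch (Python/PyTorch) of the feature selection controller in the mixed switching mode described in the two items above; the threshold values are illustrative, and fc is an assumed fully connected layer that maps the cascaded feature back to the working dimension:

    import torch

    def select_text_feature(conf, pc_feat, mean_feat, fc,
                            conf_thresh=0.9, limit_thresh=0.5):
        if conf > conf_thresh:                  # high confidence: principal component feature
            return pc_feat
        if conf >= limit_thresh:                # intermediate: cascade, then fully connected layer
            return fc(torch.cat([mean_feat, pc_feat], dim=-1))
        return mean_feat                        # low confidence: recipe mean feature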
  • the above-mentioned identification information generation module 602 can also be used to: obtain all original components contained in each recipe sample of the target recipe text sample set; perform data merging processing on each original component to merge the data of the same components together; count the merged original components to determine the total number corresponding to each type of component; delete the original components whose total number is less than a preset number threshold to obtain sample components; and generate a principal component table based on each sample component.
  • the above text data processing module 603 can also be further used to: compare the existing components contained in the current text sample with the sample components in the principal component table one by one; for each existing component, if the current sample component in the principal component table is the same as the current existing component, then the position element corresponding to the current sample component is set to a first preset identification value; if the current sample component in the principal component table is different from the current existing component, then the position element corresponding to the current sample component is set to a second preset identification value; generate a virtual component label according to the value of the position element corresponding to each sample component in the principal component table.
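A sketch (Python) of the virtual component label generation just described, with 1 and 0 standing in for the first and second preset identification values:

    def make_virtual_label(sample_ingredients, main_ing):
        present = set(sample_ingredients)
        # one position element per sample component (row) of the principal component table
        return [1 if ing in present else 0 for ing in main_ing]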
  • the above-mentioned text information feature encoder may include an input layer, a text feature extraction layer, and an output data processing layer; the input layer includes a text data input unit and an ingredient identification flag input unit; the text data input unit includes a dish name input unit, a recipe step input unit, and an ingredient input unit, which are used to sequentially input the different types of data of each text sample in the training sample set; the ingredient identification flag input unit is used to input a flag bit identifying execution of the active-learning ingredient information task; the text feature extraction layer is a transformer-based bidirectional encoder used to extract features from the output information of the input layer; the output data processing layer is used to actively learn, based on the flag bit, the virtual component label corresponding to the principal component feature extracted by the text feature extraction layer, and to determine the current recipe text feature of the current text sample based on the virtual component label and the ingredient prediction confidence threshold.
  • the output data processing layer includes a feature selection controller, a principal component output unit, and a recipe mean feature output unit;
  • the recipe mean feature output unit includes a dish name feature output unit, a recipe step feature output unit, and an ingredient feature output unit, and outputs the feature average of the dish name feature, the recipe step feature, and the ingredient feature;
  • the principal component output unit is used to output the principal component feature and to obtain the virtual component label by performing the active learning task;
  • the feature selection controller is used to determine the current recipe text feature based on the virtual component label and the ingredient prediction confidence threshold, and to switch between the principal component output unit and the recipe mean feature output unit to output the current recipe text feature.
  • the above-mentioned output data processing layer may include a first fully connected layer, a mapping layer, a second fully connected layer and a loss calculation layer; the first fully connected layer is used to receive the feature information corresponding to the output of the component identification mark input unit; the mapping layer is used to perform nonlinear mapping processing on the feature information; the second fully connected layer is used to map the features obtained after the mapping processing to the principal components, so as to obtain principal component features with the same dimension as the recipe ingredient information; the loss calculation layer is used to actively learn the virtual component labels of the principal component features based on the recipe ingredient information.
  • the above loss calculation layer can also be used to: generate a virtual component label according to the comparison result of the current text sample and the recipe ingredient information; the vector data corresponding to the virtual component label has the same dimension as the vector data corresponding to the principal component feature; call the loss calculation relationship to calculate the loss information of the virtual component label and the principal component feature, and the loss calculation relationship is:
  • (a standard binary cross-entropy form, reconstructed here since the original formula rendering is not reproduced)

        loss_cla = −(1/M) · Σ_{m=1}^{M} [ label_m · log(sigmoid(cla_m)) + (1 − label_m) · log(1 − sigmoid(cla_m)) ]

    where loss_cla is the loss information, M is the dimension of the vector data corresponding to the principal component feature, sigmoid() is the sigmoid function, label_m is the element at the m-th position of the vector data corresponding to the virtual component label, and cla_m is the element at the m-th position of the vector data corresponding to the principal component feature.
  • the above-mentioned device may also include a text processing module, used, for example, to obtain a flag identifying execution of the active-learning component information task and to set a text type identifier value and a position information value for the flag to generate flag information, and to map each word of the flag information to a corresponding high-dimensional flag vector for input into the text information feature encoder.
  • the above text processing module can also be used to: map each word of the dish name, cooking steps and ingredients of the current text sample into a corresponding high-dimensional text vector, and at the same time map the position information of each word in the corresponding text data and the text type identifier that identifies the data type to which the text data belongs into a corresponding high-dimensional auxiliary vector; based on each high-dimensional text vector and its corresponding high-dimensional auxiliary vector, generate a text vector for input into a text information feature encoder.
  • the above-mentioned image feature extraction module 604 can also be used to: pre-train a step diagram feature encoder; the step diagram feature encoder includes a feature extraction network and a feature fusion network; input the current operation diagram sample corresponding to the current text sample into the feature extraction network to obtain the image features of all step diagrams contained in the current operation diagram sample; input the image features of each step diagram into the feature fusion network to obtain the current recipe image features of the current operation diagram sample.
  • the above image feature extraction module 604 can be further used to: the feature fusion network is a long short-term memory neural network, and the image feature fusion relationship is called to process the image features of each step diagram; the image feature fusion relationship is:
  • h_i = LSTM_i(φ(v_i), h_{i−1}), i = 1, 2, …, I (symbols reconstructed as above), where h_i is the output of the i-th LSTM unit, LSTM_i is the i-th LSTM unit, φ() is the output of the feature extraction network, v_i is the i-th step image of the current operation diagram sample, h_{i−1} is the output of the (i−1)-th LSTM unit, and I is the total number of step images contained in the current operation diagram sample.
  • FIG. 7 is a structural diagram of a text operation diagram mutual inspection device provided in an embodiment of the present application in a specific implementation manner, and the device may include:
  • a model training module 701 is used to train a text operation graph mutual checking model in advance using any of the embodiments of the method for training a text operation graph mutual checking model as described above;
  • the feature acquisition module 702 is used to acquire the text features to be matched of the text to be retrieved; and acquire the image features to be matched of the operation image to be retrieved;
  • the mutual inspection result generating module 703 is used to input the to-be-matched text features and the to-be-matched image features into the text operation graph mutual inspection model to obtain the text operation graph mutual inspection result.
  • the functions of the functional modules of the cross-media retrieval device in the embodiment of the present application can be specifically implemented according to the method in the above method embodiment.
  • the specific implementation process can refer to the relevant description of the above method embodiment, which will not be repeated here.
  • FIG. 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application under one implementation.
  • the electronic device includes a memory 80 for storing a computer program; a processor 81 for implementing the method for training the text operation diagram mutual-checking model and/or the steps of the text operation diagram mutual-checking method as mentioned in any of the above embodiments when executing the computer program.
  • the processor 81 may include one or more processing cores, such as a 4-core processor or an 8-core processor.
  • the processor 81 may also be a controller, a microcontroller, a microprocessor or other data processing chip.
  • the processor 81 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 81 may also include a main processor and a coprocessor.
  • the main processor is a processor for processing data in the wake-up state, also known as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state.
  • the processor 81 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen.
  • the processor 81 may also include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.
  • the memory 80 may include one or more computer non-volatile readable storage media, which may be non-transitory.
  • the memory 80 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices and flash memory storage devices.
  • the memory 80 may be an internal storage unit of an electronic device, such as a hard disk of a server.
  • the memory 80 may also be an external storage device of an electronic device, such as a plug-in hard disk equipped on a server, a smart memory card (Smart Media Card, SMC), Secure Digital (SD) card, Flash Card, etc.
  • the memory 80 may also include both an internal storage unit of the electronic device and an external storage device.
  • the memory 80 may not only be used to store application software and various data installed in the electronic device, such as: the code of the program used and generated in the process of executing the method for training the text operation diagram mutual inspection model and/or the text operation diagram mutual inspection method, but also be used to temporarily store the data that has been output or will be output.
  • the memory 80 is at least used to store the following computer program 801, wherein, after the computer program is loaded and executed by the processor 81, the method for training the text operation diagram mutual inspection model and/or the related steps of the text operation diagram mutual inspection method disclosed in any of the aforementioned embodiments can be implemented.
  • the resources stored in the memory 80 may also include an operating system 802 and data 803, etc., and the storage method may be a temporary storage or a permanent storage.
  • the operating system 802 may include Windows, Unix, Linux, etc.
  • the data 803 may include but is not limited to the data generated in the process of training the text operation diagram mutual inspection model and the result data obtained by training and/or the data corresponding to the text operation diagram mutual inspection result, etc.
  • the electronic device may further include a display screen 82, an input/output interface 83, a communication interface 84 or a network interface, a power supply 85 and a communication bus 86.
  • the display screen 82 and the input/output interface 83, such as a keyboard, belong to the user interface; the optional user interface may also include a standard wired interface, a wireless interface, etc.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, and an OLED (Organic Light-Emitting Diode) touch device, etc.
  • the display may also be appropriately referred to as a display screen or a display unit, which is used to display information processed in the electronic device and to display a visual user interface.
  • the communication interface 84 may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface, a Bluetooth interface, etc., which is usually used to establish a communication connection between the electronic device and other electronic devices.
  • the communication bus 86 may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus may be divided into an address bus, a data bus, a control bus, etc.
  • for ease of representation, FIG. 8 shows only one thick line, but this does not mean that there is only one bus or one type of bus.
  • the structure shown in FIG. 8 does not constitute a limitation on the electronic device, which may include more or fewer components than shown in the figure, for example a sensor 87 for realizing various functions.
  • if the method for training the text operation diagram mutual inspection model and/or the text operation diagram mutual inspection method in the above-mentioned embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and executes all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code: USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), electrically erasable programmable ROM, registers, hard disk, multimedia card, card-type memory (such as SD or DX memory), magnetic memory, removable disk, CD-ROM, magnetic disk, optical disk, etc.
  • an embodiment of the present application also provides a non-volatile readable storage medium storing a computer program.
  • when the computer program is executed by a processor, the steps of the method for training a text operation diagram mutual inspection model and/or of the text operation diagram mutual inspection method in any of the above embodiments are performed.
  • each embodiment is described in a progressive manner, and each embodiment focuses on the differences from other embodiments.
  • the same or similar parts between the embodiments can be referred to each other.
  • since the disclosed hardware, including the devices and the electronic equipment, corresponds to the methods disclosed in the embodiments, its description is relatively simple, and the relevant parts can refer to the description of the method part.


Abstract

The present application discloses a text operation diagram mutual inspection method and apparatus, a method and apparatus for training a text operation diagram mutual inspection model, an electronic device, and a non-volatile readable storage medium, applied to information retrieval technology. The method includes: generating recipe component information by analyzing the recipe components contained in all recipe samples; extracting the principal component feature and the recipe mean feature of the current text sample using a text information feature encoder, and actively learning the virtual component label of the principal component feature based on the recipe component information; determining, based on the virtual component label and a component prediction confidence threshold, whether the current recipe text feature is the principal component feature or the recipe mean feature; extracting, using a step diagram feature encoder, the current recipe image feature of the current operation diagram sample corresponding to the current text sample; and inputting the current recipe text feature and the current recipe image feature into the text operation diagram mutual inspection model for model training, whereby high-precision mutual retrieval between recipe text and recipe step diagrams can be achieved.

Description

Text operation diagram mutual inspection method, model training method, apparatus, device, and medium
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 08, 2022, with application No. 202211388902.8 and entitled "Text operation diagram mutual inspection method, model training method, apparatus, device, medium", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present application relates to the technical field of information retrieval, and in particular to a text operation diagram mutual inspection method and apparatus, a method and apparatus for training a text operation diagram mutual inspection model, an electronic device, and a non-volatile readable storage medium.
BACKGROUND
As computer technology and network technology are widely applied in daily work and life, multimedia data has grown explosively: multi-modal data such as news reports, comment data from Weibo and Taobao, and WeChat chat records; image data such as memes, article illustrations, mobile phone photos, and medical images; video data such as Douyin and Kuaishou media and city camera footage; together with accompanying audio information such as WeChat voice and video dubbing. Data in these different multimedia forms are often used jointly to describe the same object or the same scene. To facilitate the management of diverse multimedia content, methods for flexible retrieval across different media have emerged.
Related technologies usually implement mutual retrieval with models built from simple machine learning algorithms, such as a Resnet (Residual Network)-Bert (Bidirectional Encoder Representations from Transformers) network model, which performs classification retrieval on at least one of image data, text data, video data, and audio data and returns the corresponding classification result; when classification retrieval is performed on at least two of image, text, video, and audio data, the retrieved data share the same semantic category; the Resnet model is used for classifying image, video, and audio data, and the Bert model is used for text data. Although the highly effective Resnet convolutional neural network and the Bert model, which currently leads in 11 natural language processing tasks, can obtain higher-level, more abstract, and richer feature expressions, recipe text contains many data types with certain relationships among the different text data; whether these existing models are used to retrieve the corresponding recipe operation diagram based on recipe text, or to obtain the corresponding recipe text based on a recipe operation diagram, the retrieval accuracy between recipe text and recipe step diagrams cannot meet practical requirements.
SUMMARY
The present application provides a text operation diagram mutual inspection method and apparatus, a method and apparatus for training a text operation diagram mutual inspection model, an electronic device, and a non-volatile readable storage medium, which can achieve high-precision mutual retrieval between recipe text and recipe step diagrams.
To solve the above technical problem, the embodiments of the present application provide the following technical solutions:
A first aspect of the embodiments of the present application provides a method for training a text operation diagram mutual inspection model, including:
pre-building a text operation diagram mutual inspection model including a text information feature encoder and a step diagram feature encoder, and generating recipe component information by analyzing the recipe components contained in each recipe sample of a target recipe text sample set;
for each group of training samples in a training sample set, extracting the principal component feature and the recipe mean feature of the current text sample using the text information feature encoder, and actively learning the virtual component label of the principal component feature based on the recipe component information, where the recipe mean feature is determined from all text features of the current text sample extracted by the text information feature encoder;
determining, based on the virtual component label and a component prediction confidence threshold, whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature;
extracting, using the step diagram feature encoder, the current recipe image feature of the current operation diagram sample corresponding to the current text sample; and
inputting the current recipe text feature and the current recipe image feature into the text operation diagram mutual inspection model for model training.
Optionally, determining, based on the virtual component label and the component prediction confidence threshold, whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature includes:
each element in the virtual component label being used to indicate the confidence that the current text sample contains the principal component corresponding to the recipe component information;
determining, from the virtual component label, target components greater than or equal to a component confidence threshold, and determining a principal component probability prediction confidence according to the confidence corresponding to each target component; and
determining, according to the numerical relationship between the principal component probability prediction confidence and the component prediction confidence threshold, whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature.
Optionally, determining, according to the numerical relationship between the principal component probability prediction confidence and the component prediction confidence threshold, whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature includes:
obtaining the current output control mode;
if the current output control mode is a binary switching mode, judging whether the principal component probability prediction confidence is greater than the component prediction confidence threshold;
if the principal component probability prediction confidence is greater than the component prediction confidence threshold, the current recipe text feature of the current text sample is the principal component feature;
if the principal component probability prediction confidence is less than or equal to the component prediction confidence threshold, the current recipe text feature of the current text sample is the recipe mean feature.
Optionally, determining, according to the numerical relationship between the principal component probability prediction confidence and the component prediction confidence threshold, whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature includes:
obtaining the current output control mode;
if the current output control mode is a mixed switching mode, comparing the numerical relationship between the principal component probability prediction confidence and both the component prediction confidence threshold and a preset confidence limit threshold;
if the principal component probability prediction confidence is greater than the component prediction confidence threshold, the current recipe text feature of the current text sample is the principal component feature;
if the principal component probability prediction confidence is less than or equal to the component prediction confidence threshold and greater than or equal to the confidence limit threshold, the current recipe text feature of the current text sample is the feature sum of the recipe mean feature and the principal component feature;
if the principal component probability prediction confidence is less than the confidence limit threshold, the current recipe text feature of the current text sample is the recipe mean feature.
Optionally, if the current output control mode is the mixed switching mode, comparing the numerical relationship between the principal component probability prediction confidence and both the component prediction confidence threshold and the confidence limit threshold includes:
if the principal component probability prediction confidence is less than or equal to the component prediction confidence threshold and greater than or equal to the confidence limit threshold, the current recipe text feature of the current text sample is the output feature obtained by cascading the recipe mean feature with the principal component feature and processing the result through a fully connected layer.
Optionally, the text information feature encoder includes an input layer, a text feature extraction layer, and an output data processing layer;
the input layer includes a text data input unit and a component identification flag input unit; the text data input unit includes a dish name input unit, a recipe step input unit, and a component input unit, which are used to sequentially input the different types of data of each text sample in the training sample set; the component identification flag input unit is used to input a flag bit identifying execution of the active-learning component information task;
the text feature extraction layer is a transformer-based bidirectional encoder used to extract features from the output information of the input layer; and
the output data processing layer is used to actively learn, based on the flag bit, the virtual component label corresponding to the principal component feature extracted by the text feature extraction layer, and to determine the current recipe text feature of the current text sample based on the virtual component label and the component prediction confidence threshold.
Optionally, the output data processing layer includes a feature selection controller, a principal component output unit, and a recipe mean feature output unit;
the recipe mean feature output unit includes a dish name feature output unit, a recipe step feature output unit, and a component feature output unit, and outputs the feature average of the dish name feature, the recipe step feature, and the component feature;
the principal component output unit is used to output the principal component feature and to obtain the virtual component label by performing the active learning task; and
the feature selection controller is used to determine the current recipe text feature based on the virtual component label and the component prediction confidence threshold, and to switch between the principal component output unit and the recipe mean feature output unit to output the current recipe text feature.
Optionally, the principal component output unit includes a first fully connected layer, a mapping layer, a second fully connected layer, and a loss calculation layer;
the first fully connected layer is used to receive the feature information correspondingly output for the component identification flag input unit;
the mapping layer is used to perform nonlinear mapping processing on the feature information;
the second fully connected layer is used to map the features obtained after the mapping processing to the principal components, obtaining a principal component feature with the same dimension as the recipe component information; and
the loss calculation layer is used to actively learn the virtual component label of the principal component feature based on the recipe component information.
Optionally, actively learning the virtual component label of the principal component feature based on the recipe component information includes:
generating the virtual component label according to the comparison result between the current text sample and the recipe component information, where the vector data corresponding to the virtual component label has the same dimension as the vector data corresponding to the principal component feature; and
calling a loss calculation relation to calculate the loss information between the virtual component label and the principal component feature, the loss calculation relation (a standard binary cross-entropy form, reconstructed here since the original formula image is not reproduced) being:

    loss_cla = −(1/M) · Σ_{m=1}^{M} [ label_m · log(sigmoid(cla_m)) + (1 − label_m) · log(1 − sigmoid(cla_m)) ]

where loss_cla is the loss information, M is the dimension of the vector data corresponding to the principal component feature, sigmoid() is the sigmoid function, label_m is the element at the m-th position of the vector data corresponding to the virtual component label, and cla_m is the element at the m-th position of the vector data corresponding to the principal component feature.
Optionally, generating the recipe component information by analyzing the recipe components contained in each recipe sample of the target recipe text sample set includes:
obtaining all original components contained in every recipe sample of the target recipe text sample set;
performing data merging processing on the original components so that data of identical components are merged together;
counting the merged original components to determine the total number corresponding to each type of component;
deleting original components whose total number is smaller than a preset number threshold to obtain sample components; and
generating a principal component table based on the sample components.
Optionally, generating the virtual component label according to the comparison result between the current text sample and the recipe component information includes:
comparing the existing components contained in the current text sample one by one with the sample components of the principal component table;
for each existing component, if the current sample component in the principal component table is the same as the current existing component, setting the position element corresponding to the current sample component to a first preset identification value;
if the current sample component in the principal component table is different from the current existing component, setting the position element corresponding to the current sample component to a second preset identification value; and
generating the virtual component label according to the values of the position elements corresponding to the sample components of the principal component table.
Optionally, before extracting the principal component feature and the recipe mean feature of the current text sample using the text information feature encoder and actively learning the virtual component label of the principal component feature based on the recipe component information, the method further includes:
obtaining a flag used to identify execution of the active-learning component information task, and setting a text type identifier value and a position information value for the flag to generate flag information; and
mapping each word of the flag information to a corresponding high-dimensional flag vector for input into the text information feature encoder.
Optionally, before extracting the principal component feature and the recipe mean feature of the current text sample using the text information feature encoder, the method further includes:
mapping each word of the dish name, cooking steps, and components of the current text sample to a corresponding high-dimensional text vector, while mapping the position information of each word in the corresponding text data and the text type identifier identifying the data type to which the text data belongs to corresponding high-dimensional auxiliary vectors; and
generating, based on each high-dimensional text vector and its corresponding high-dimensional auxiliary vectors, a text vector for input into the text information feature encoder.
Optionally, extracting, using the step diagram feature encoder, the current recipe image feature of the current operation diagram sample corresponding to the current text sample includes:
pre-training the step diagram feature encoder, which includes a feature extraction network and a feature fusion network;
inputting the current operation diagram sample corresponding to the current text sample into the feature extraction network to obtain the image features of all step diagrams contained in the current operation diagram sample; and
inputting the image features of each step diagram into the feature fusion network to obtain the current recipe image feature of the current operation diagram sample.
Optionally, the feature fusion network is a long short-term memory neural network, and inputting the image features of each step diagram into the feature fusion network to obtain the current recipe image feature of the current operation diagram sample includes:
calling an image feature fusion relation to process the image features of each step diagram, the image feature fusion relation (reconstructed with illustrative symbols, since the original formula image is not reproduced) being:

    h_i = LSTM_i(φ(v_i), h_{i−1}),  i = 1, 2, …, I

where h_i is the output of the i-th LSTM unit of the long short-term memory network, LSTM_i is the i-th LSTM unit, φ() is the output of the feature extraction network, v_i is the i-th step image of the current operation diagram sample, h_{i−1} is the output of the (i−1)-th LSTM unit, and I is the total number of step images contained in the current operation diagram sample.
A second aspect of the embodiments of the present application provides an apparatus for training a text operation diagram mutual inspection model, including:
a model building module, used to build a text operation diagram mutual inspection model including a text information feature encoder and a step diagram feature encoder;
an identification information generating module, used to generate recipe component information by analyzing all recipe samples containing recipe components in the training sample set;
a text data processing module, used to, for each group of training samples in the training sample set, extract the principal component feature and the recipe mean feature of the current text sample using the text information feature encoder and actively learn the virtual component label of the principal component feature based on the recipe component information, the recipe mean feature being determined from all text features of the current text sample extracted by the text information feature encoder, and to determine, based on the virtual component label and the component prediction confidence threshold, whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature;
an image feature extraction module, used to extract, using the step diagram feature encoder, the current recipe image feature of the current operation diagram sample corresponding to the current text sample; and
a training module, used to input the current recipe text feature and the current recipe image feature into the text operation diagram mutual inspection model for model training.
A third aspect of the embodiments of the present application provides a text operation diagram mutual inspection method, including:
training a text operation diagram mutual inspection model in advance using the method for training a text operation diagram mutual inspection model according to any one of the preceding items;
acquiring the text features to be matched of the text to be retrieved;
acquiring the image features to be matched of the operation diagram to be retrieved; and
inputting the text features to be matched and the image features to be matched into the text operation diagram mutual inspection model to obtain the text operation diagram mutual inspection result.
A fourth aspect of the embodiments of the present application provides a text operation diagram mutual inspection apparatus, including:
a model training module, used to train a text operation diagram mutual inspection model in advance using the method for training a text operation diagram mutual inspection model according to any one of the preceding items;
a feature acquisition module, used to acquire the text features to be matched of the text to be retrieved and the image features to be matched of the operation diagram to be retrieved; and
a mutual inspection result generating module, used to input the text features to be matched and the image features to be matched into the text operation diagram mutual inspection model to obtain the text operation diagram mutual inspection result.
An embodiment of the present application further provides an electronic device, including a processor and a memory, where the processor is used to implement, when executing a computer program stored in the memory, the steps of the method for training a text operation diagram mutual inspection model according to any one of the preceding items and/or of the foregoing text operation diagram mutual inspection method.
Finally, an embodiment of the present application further provides a non-volatile readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for training a text operation diagram mutual inspection model according to any one of the preceding items and/or of the foregoing text operation diagram mutual inspection method.
The advantage of the technical solution provided by the present application is that the text operation diagram mutual inspection model is provided with a function of actively learning, based on the recipe component information, the recipe components contained in the recipe text data. By checking the active-learning effect on the extracted principal component feature, the text feature extraction accuracy of the text operation diagram mutual inspection model can be well verified, and the recipe text feature used for image-text matching can be adjusted in time, so that high-level semantic information of the recipe text can be well extracted, highly reliable classification is achieved, redundant noise is removed, and the accuracy of mutual retrieval between recipe text and recipe operation diagrams is effectively improved.
In addition, the embodiments of the present application further provide, for the method for training a text operation diagram mutual inspection model, a corresponding text operation diagram mutual inspection method, implementing apparatuses, an electronic device, and a non-volatile readable storage medium, further making the method more practical; the text operation diagram mutual inspection method, implementing apparatuses, electronic device, and non-volatile readable storage medium have corresponding advantages.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions of the embodiments of the present application or of the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for training a text operation diagram mutual inspection model provided by an embodiment of the present application;
FIG. 2 is a schematic structural framework diagram of a text information feature encoder provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a text operation diagram mutual inspection method provided by an embodiment of the present application;
FIG. 4 is a schematic framework diagram of an exemplary application scenario provided by an embodiment of the present application;
FIG. 5 is a schematic framework diagram of a text operation diagram mutual inspection model in an exemplary application scenario provided by an embodiment of the present application;
FIG. 6 is a structural diagram of a specific implementation of an apparatus for training a text operation diagram mutual inspection model provided by an embodiment of the present application;
FIG. 7 is a structural diagram of a specific implementation of a text operation diagram mutual inspection apparatus provided by an embodiment of the present application;
FIG. 8 is a structural diagram of a specific implementation of an electronic device provided by an embodiment of the present application.
DETAILED DESCRIPTION
To enable those skilled in the art to better understand the solutions of the present application, the present application is further described in detail below with reference to the drawings and specific implementations. Obviously, the described embodiments are merely a part rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The terms "first", "second", "third", "fourth", and the like in the specification, the claims, and the above drawings are used to distinguish different objects rather than to describe a specific order. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device including a series of steps or units is not limited to the listed steps or units, but may include steps or units that are not listed.
Having introduced the technical solutions of the embodiments of the present application, various non-limiting implementations of the present application are described in detail below.
Referring first to FIG. 1, which is a schematic flowchart of a method for training a text operation diagram mutual inspection model provided by an embodiment of the present application, this embodiment may include the following contents:
S101: pre-build a text operation diagram mutual inspection model including a text information feature encoder and a step diagram feature encoder.
The text operation diagram mutual inspection model of this step is used to perform the mutual retrieval task between recipe text and recipe operation diagrams: text data to be retrieved or operation diagram data to be retrieved is input into the trained model, which reads corresponding data from a specified database to be retrieved for matching and outputs the target recipe operation diagram or target recipe text that matches the text or operation diagram to be retrieved. For example, if the retrieval task is to retrieve from an image database the operation diagram corresponding to the text to be retrieved, the text to be retrieved is input into the model, which matches the recipe text features of the text to be retrieved against the recipe image features of every operation diagram in the image database and determines the recipe operation diagram with the highest similarity as the target recipe operation diagram to output. The text information feature encoder encodes the input recipe text data and outputs the final recipe text features; the step diagram feature encoder encodes the input recipe operation diagram data and outputs the final recipe operation diagram features.
S102: generate recipe component information in advance by analyzing the recipe components contained in each recipe sample of the target recipe text sample set.
In some embodiments, the target recipe text sample set may consist of all or part of the recipe text samples of the training sample set used to train the text operation diagram mutual inspection model, or of recipe texts selected from other data sets; neither affects the implementation of the present application. The training sample set referred to in this embodiment is the sample data used to train the model and includes multiple groups of training samples; each group includes a corresponding text sample and operation diagram sample, i.e., a matched pair of sample data. The text samples of this embodiment and the subsequent texts to be retrieved are all recipe texts, each including three types of data: dish name, cooking steps, and components; the operation diagram samples and the subsequent operation diagrams to be retrieved are all recipe operation diagrams. The number of training sample groups can be determined according to actual training needs and the database adopted, which the present application does not limit. An operation diagram, or operation diagram sample, of the present application includes a group of sub-images with a sequential operation order, each sub-image of the group corresponding to one operation step, i.e., cooking step, in the text data or text sample. The recipe component information refers to recipe component statistical information generated by reading the recipe components contained in each recipe sample, i.e., it identifies the component data contained in text samples or samples to be retrieved.
S103: for each group of training samples in the training sample set, extract the principal component feature and the recipe mean feature of the current text sample using the text information feature encoder, and actively learn the virtual component label of the principal component feature based on the recipe component information; the recipe mean feature is determined from all text features of the current text sample extracted by the text information feature encoder. Based on the virtual component label and the component prediction confidence threshold, determine whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature.
Since each group of training samples includes one text sample and one operation diagram sample corresponding to each other, the text sample is input into the text information feature encoder, which includes a text input function, a feature extraction function, and a text output function with active learning. The encoder first extracts the text features of the input text sample based on the feature extraction function; the text sample of this embodiment includes three types of text data (dish name, cooking steps, and recipe components), and corresponding text features are extracted for each type. This embodiment also sets an input position for the component identification flag indicating the active learning function; this flag is input into the model together with the text sample or the sample to be retrieved, and each input corresponds to one output: the output corresponding to the input position of the component identification flag is the principal component feature, that of the dish name is the dish name feature, that of the cooking steps is the cooking step feature, and that of the recipe components is the recipe component feature. The recipe mean feature of this embodiment is the feature generated by combining the recipe component feature, the cooking step feature, and the dish name feature, i.e., it is determined from all text features of the current text sample extracted by the encoder. The feature extraction function of the text information feature encoder may be based on any existing text feature extraction model, such as the vector space model, term frequency methods, or document frequency methods, without affecting the implementation of the present application. The virtual component label is the label of the principal component feature obtained by learning the principal component feature through the active learning function. The text feature finally output by the encoder for the current text sample is called the current recipe text feature; whether it is the principal component feature or the recipe mean feature can be determined based on the learned virtual component label and the preset component prediction confidence threshold. In other words, the component prediction confidence threshold identifies the minimum requirement for the extracted principal component feature to be usable. If the virtual component label and the component prediction confidence threshold indicate that the currently extracted principal component feature is a high-precision feature, that principal component feature is directly used for matching against the image features of the operation diagram; if they indicate that it is a low-precision feature, it is not directly used, and the final output feature can instead be jointly determined by the principal component feature and the recipe mean feature.
S104: extract, using the step diagram feature encoder, the current recipe image feature of the current operation diagram sample corresponding to the current text sample.
After the text sample of a group of training samples has been processed in the previous step, this step performs the corresponding image feature extraction on the operation diagram sample of that text sample. Since an operation diagram sample contains a group of step diagrams, its image feature is the collection of the image features of this group of step diagrams; for ease of description, the operation diagram sample corresponding to the current text sample is called the current operation diagram sample, and its image feature is called the current recipe image feature. Any network structure capable of extracting image features, such as an artificial neural network or VGG (Visual Geometry Group Network), may be used to build the step diagram feature encoder, which the present application does not limit.
S105: input the current recipe text feature and the current recipe image feature into the text operation diagram mutual inspection model for model training.
For each group of training samples, the text feature information of its text sample and the image feature of the corresponding operation diagram sample are input into the text operation diagram mutual inspection model built in step S101. During model training, a loss function is used to guide the training, and the network parameters of the model are updated through, for example, gradient backpropagation until a condition is met, such as reaching the number of iterations or achieving good convergence. For example, the training process of the model may include a forward propagation stage, in which data propagates from lower to higher levels, and a backward propagation stage, in which, when the result of forward propagation does not match expectations, the error is propagated from higher to lower levels for training. Specifically, all network layer weights of the model are first randomly initialized; then the text features carrying data type information and the image features are forward-propagated through the layers of the model to obtain output values; the output values of the model are computed, and the loss values of these outputs are calculated based on the loss function. The error is propagated back into the model, and the backpropagation errors of each layer are computed in turn; each layer adjusts all weight coefficients of the model based on its backpropagation error, updating the weights. A new batch, i.e., the image features and text features carrying data type information of the next group of training samples, is randomly selected, and the above process is iterated until the error between the computed model output value and the target value is smaller than a preset threshold; training then ends, and the current parameters of each layer are taken as the network parameters of the trained text operation diagram mutual inspection model.
In the technical solution provided by the embodiments of the present application, the text operation diagram mutual inspection model is provided with the function of actively learning, based on the recipe component information, the recipe components contained in the recipe text data. By checking the active-learning effect on the extracted principal component feature, the text feature extraction accuracy of the model can be well verified and the recipe text feature used for image-text matching can be adjusted in time, so that high-level semantic information of the recipe text is well extracted, highly reliable classification is achieved, redundant noise is removed, and the accuracy of mutual retrieval between recipe text and recipe operation diagrams is effectively improved.
The above embodiment does not limit the feature finally output by the text information feature encoder; based on the above embodiment, the present application further provides an optional implementation, which may include the following contents:
Each element in the virtual component label of this embodiment is used to indicate the confidence that the current text sample contains the principal component corresponding to the recipe component information; target components greater than or equal to the component confidence threshold are determined from the virtual component label, and the principal component probability prediction confidence is determined according to the confidence corresponding to each target component; according to the numerical relationship between the principal component probability prediction confidence and the component prediction confidence threshold, it is determined whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature.
In some embodiments, active learning, such as self-supervised learning, can yield the classification probabilities corresponding to the principal component feature; these probabilities represent the principal component probability predictions output for the input sample by the active learning network, e.g., a principal component self-supervised classification network, for example: [0.001, 0.02, …, 0.91, …, 0.006]. Based on this, when determining the type of the final output feature, this embodiment can switch the input feature according to the principal component probability prediction confidence of the input sample, as follows. The principal component probability prediction confidence is computed as follows: obtain the active-learning classification probabilities in the virtual component label, such as [0.001, 0.02, …, 0.91, …, 0.006], where each number represents the confidence that the sample contains the corresponding principal component of the principal component information table; with a component confidence threshold of, for example, 0.5, extract all probabilities in the classification probabilities greater than the threshold 0.5 to build a trusted principal component information table; compute the mean of all probability values in the trusted table and record it as the principal component probability prediction confidence. The final output feature can then be determined according to the principal component probability prediction confidence and the preset component prediction confidence threshold, such as 0.9. As an optional implementation, the choice of whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature can be switched flexibly according to different needs, with the output control mode of the output text feature set in advance; the output control modes of this embodiment include a mixed switching mode and a binary switching mode, and the corresponding feature output is selected based on the different output control modes. This process may include:
obtaining the current output control mode and judging whether the current output control mode is the binary switching mode or the mixed switching mode. As an optional implementation, if the current output control mode is the binary switching mode, it is judged whether the principal component probability prediction confidence is greater than the component prediction confidence threshold; if it is greater, the current recipe text feature of the current text sample is the principal component feature; if it is less than or equal to the component prediction confidence threshold, the current recipe text feature is the recipe mean feature.
If the current output control mode is the mixed switching mode, the numerical relationship between the principal component probability prediction confidence and both the component prediction confidence threshold and the preset confidence limit threshold is compared; the confidence limit threshold can be determined flexibly according to actual needs, and the present application does not limit the values of the component prediction confidence threshold and the preset confidence limit threshold. As another optional implementation, if the principal component probability prediction confidence is greater than the component prediction confidence threshold, the current recipe text feature of the current text sample is the principal component feature; if it is less than or equal to the component prediction confidence threshold and greater than or equal to the confidence limit threshold, the current recipe text feature is the feature sum of the recipe mean feature and the principal component feature; if it is less than the confidence limit threshold, the current recipe text feature is the recipe mean feature. As a further optional implementation, if the principal component probability prediction confidence is less than or equal to the component prediction confidence threshold and greater than or equal to the confidence limit threshold, the current recipe text feature may also be the output feature obtained by cascading the recipe mean feature with the principal component feature and processing the result through a fully connected layer.
Here, the recipe mean feature is the mean of the output features of the bidirectional encoder corresponding to the dish name, component, and step texts. If the principal component probability prediction confidence is greater than the component prediction confidence threshold, the confidence is high, indicating that the text feature extraction function of the text information feature encoder and the principal component active-learning classification network can well extract the high-level semantic information of the recipe text, achieve highly reliable classification, and remove redundant noise; the feature has good expressive power, so the classification feature of component active learning, i.e., the principal component feature, is output. If the principal component probability prediction confidence is below the component prediction confidence threshold, the mean of the bidirectional encoder outputs for the dish name, component, and step texts is output: the text feature extraction function of the encoder and the principal component active-learning classification network cannot confirm the principal components of this recipe, and the principal component feature still contains a great deal of noise; to obtain a good retrieval effect, this embodiment may take the mean of all output features of the feature extraction corresponding to the input recipe text as the final feature of the entire recipe text. In addition, if the principal component probability prediction confidence is below the component prediction confidence threshold, the feature obtained by adding the recipe mean feature and the principal component feature may also be output as the final current recipe text feature of the entire recipe text; or the recipe mean feature and the principal component feature may be cascaded and passed through one fully connected layer, with the output feature taken as the final current recipe text feature of the entire recipe text.
The above embodiments do not limit how step S102 is performed; this embodiment gives an optional way of generating the recipe component information, which may include the following steps:
obtaining all original components contained in every recipe sample of the target recipe text sample set; performing data merging on the original components so that data of identical components are merged together; counting the merged original components to determine the total number for each type of component; deleting original components whose total number is smaller than a preset number threshold to obtain sample components; and generating a principal component table based on the sample components. Correspondingly, the generation of the virtual component label may include: comparing the existing components contained in the current text sample one by one with the sample components of the principal component table; for each existing component, if the current sample component in the table is the same as the current existing component, setting the position element corresponding to the current sample component to a first preset identification value; if different, setting it to a second preset identification value; and generating the virtual component label according to the values of the position elements corresponding to the sample components of the table.
In some embodiments, a text sample includes multiple types of data, i.e., recipe text may include three types of data: components, cooking steps, and dish name. For ease of description, the component data read from a recipe sample are called original components, and the components selected from the original components through the data merging and data deletion operations may be called sample components. The data selection method enumerated in this embodiment removes unimportant data from the original components and improves overall data processing efficiency. For ease of storage and retrieval, the recipe component information may be represented in the form of a table, i.e., a principal component table generated based on the sample components.
The above embodiments do not limit the structure of the text information feature encoder; this embodiment further gives an optional structure, which may include the following contents:
The text information feature encoder may include an input layer, a text feature extraction layer, and an output data processing layer; the input layer includes a text data input unit and a component identification flag input unit; the text data input unit includes a dish name input unit, a recipe step input unit, and a component input unit for sequentially inputting the different types of data of each text sample in the training sample set; the component identification flag input unit is used to input the flag bit identifying execution of the active-learning component information task. The text feature extraction layer is a transformer-based bidirectional encoder for extracting features from the output information of the input layer; the output data processing layer actively learns, based on the flag bit, the virtual component label corresponding to the principal component feature extracted by the text feature extraction layer, and determines the current recipe text feature of the current text sample based on the virtual component label and the component prediction confidence threshold.
In some embodiments, the input layer may have multiple input positions, different positions corresponding to different input data; if there are multiple types of text data, the text data input unit correspondingly includes multiple input positions for data of different types. Taking recipe text as an example, which includes cooking step data, component data, and dish name data, the text data input unit may correspondingly include an input position for the cooking step data, an input position for the component data, and an input position for the dish name data, as in the bottom part of FIG. 2. The flag bit identifying execution of the active-learning component information task can be chosen flexibly according to actual needs; for example, CLS can be used as this flag bit. The component identification flag input unit inputs this flag bit: if the currently executed task requires the active learning task, i.e., active-learning-style classification, the unit inputs the corresponding flag bit; if not, the unit does not input the flag bit, or inputs another designated flag bit identifying that the active learning task is not executed. For the input layer of the model, a column vector can be input directly, with the flag bit vector element at the start of the vector followed by the text feature vector elements.
The transformer-based bidirectional encoder adopts the transformer model structure; optionally, as in the middle part of FIG. 2, it may include a Masked Multihead Attention layer, a first Add+Normalization (residual connection plus normalization) layer, a Feed Forward layer, and a second Add+Normalization layer connected in sequence, together with a bidirectional attention module, whose upper and lower attention modules input information to the Masked Multihead Attention layer.
In some embodiments, the output data processing layer includes a feature selection controller, a principal component output unit, and a recipe mean feature output unit; the recipe mean feature output unit includes a dish name feature output unit, a recipe step feature output unit, and a component feature output unit, and outputs the feature average of the dish name feature, the recipe step feature, and the component feature; the principal component output unit outputs the principal component feature and obtains the virtual component label by performing the active learning task; the feature selection controller determines the current recipe text feature based on the virtual component label and the component prediction confidence threshold, and switches between the principal component output unit and the recipe mean feature output unit to output the current recipe text feature.
In some embodiments, the feature selection controller performs the switching of the output control mode, of which there are two. The first is defined as the binary switching mode: when the principal component probability prediction confidence is greater than the component prediction confidence threshold, the feature of the principal component output unit is output; when it is less than or equal to the component prediction confidence threshold, the feature of the recipe mean feature output unit is output. The threshold can be set manually at the start of training. The second is defined as the mixed switching mode: when the principal component probability prediction confidence is greater than the component prediction confidence threshold, the feature of the principal component output unit is output; when it is smaller than the confidence limit threshold, the feature of the recipe mean feature output unit is output; when it lies between the confidence limit threshold and the component prediction confidence threshold, the sum of the features of the principal component output unit and the recipe mean feature output unit is output, or the features of the two units are cascaded and passed through a fully connected layer, whose output feature is then output. The confidence limit threshold can be set manually at the start of training, and the switching mode of the feature selection controller, i.e., the binary switching mode or the mixed switching mode, can also be set manually during training.
In some embodiments, the output data processing layer processes the features output by the text feature extraction layer; that is, it may first identify whether a flag bit exists; if so, it judges whether the flag bit is for executing the active learning task; if it is, the principal component feature output by the principal component output unit is actively learned based on the recipe component information; if not, active learning is not needed. Optionally, the principal component output unit may include a first fully connected layer, a mapping layer, a second fully connected layer, and a loss calculation layer; the first fully connected layer receives the feature information correspondingly output for the component identification flag input unit; the mapping layer performs nonlinear mapping on the feature information based on a mapping function, such as a nonlinear or linear mapping function, e.g., ReLU (linear rectification function) or Leaky ReLU; the second fully connected layer maps the features obtained after the mapping onto the principal components, yielding a principal component feature with the same dimension as the recipe component information; the loss calculation layer actively learns the virtual component label of the principal component feature based on the recipe component information. Taking FIG. 2 as an example, the output corresponding to the principal component output unit, i.e., to the component identification flag input unit, passes through the first fully connected layer FC, is then nonlinearly mapped by the ReLU layer, and finally passes through the second fully connected layer FC, which maps the feature onto the principal component data in the current text sample.
This embodiment also provides an optional implementation of how to actively learn the virtual component label of the principal component feature based on the recipe component information, which may include the following contents:
generating the virtual component label according to the comparison result between the current text sample and the recipe component information, where the vector data corresponding to the virtual component label has the same dimension as the vector data corresponding to the principal component feature; and calling the loss calculation relation to calculate the loss information between the virtual component label and the principal component feature, the loss calculation relation being the binary cross-entropy relation

    loss_cla = −(1/M) · Σ_{m=1}^{M} [ label_m · log(sigmoid(cla_m)) + (1 − label_m) · log(1 − sigmoid(cla_m)) ]

reconstructed above, where loss_cla is the loss information, M is the dimension of the vector data corresponding to the principal component feature, sigmoid() is the sigmoid function, label_m is the element at the m-th position of the vector data corresponding to the virtual component label, and cla_m is the element at the m-th position of the vector data corresponding to the principal component feature.
In some embodiments, since the principal component feature includes multiple component features, and each component may correspond to one or more component features of the recipe identification information or may not exist in the recipe component information, the virtual component label is generated through data comparison, i.e., feature comparison, to identify the correspondence between the principal component feature and the recipe component information; the vector data corresponding to this virtual component label has the same dimension as the vector data corresponding to the principal component feature. Taking recipe text as an example, the recipe component information may be a principal component table, and the principal component feature includes the principal component data of the text sample; if a component of the principal component table exists in the principal component feature of the recipe text, the variable at the corresponding position of the table can be set to 1, otherwise to 0. Through this operation, the processed principal component table serves as the label, i.e., the virtual component label, whose corresponding vector dimension equals the number of rows of the principal component table.
As can be seen from the above, this embodiment provides an optional model structure of the text information feature encoder, which helps extract more accurate text features. To facilitate text feature extraction, before the text information feature encoder is used to extract the current recipe text feature of the current text sample, the method may further include:
mapping each word of the text data of the different data types of the current text sample, such as the dish name, cooking steps, and components, to a corresponding high-dimensional text vector, while mapping the position information of each word in the corresponding text data and the text type identifier identifying the data type to which the text data belongs to corresponding high-dimensional auxiliary vectors; and generating, based on each high-dimensional text vector and its corresponding high-dimensional auxiliary vectors, a text vector for input into the text information feature encoder. The text type identifier can be chosen flexibly in advance according to the actual application.
Taking a recipe text sample as an example, it includes three types of text information: cooking steps, component information, and dish name; the text type identifier of the dish may be set to 1, that of the component information to 2, and that of the operation steps to 3. All text information is packed into one long input sequence. For the dish name, the wordToembedding (word mapping) method can be used to map each word of the dish name into a high-dimensional vector; the position information increases successively in the order of the words. For the component information, commas can first be used to separate the individual components, and all component information is then mapped into high-dimensional column vectors through the wordToembedding method; the text type of the component information is defined as 2 in the present application, and the position information of the component information increases successively in the input order of the components, as shown in FIG. 2. Similarly, for the operation steps, each step can be encoded in turn, e.g., the first step with sequence number 1 and the second step with sequence number 2; each word of all operation steps is then mapped into a high-dimensional column vector through the wordToembedding method. Likewise, the text type identifier and position information can also be mapped through the wordToembedding method to obtain their embedding features, i.e., an object is represented by a low-dimensional vector. Finally, the embedding features of the text information, the text type identifier, and the position information can be added and input into the text information feature encoder.
Further, for the flag bit, before input into the text information feature encoder, a flag identifying execution of the active-learning component information task can first be obtained, and a text type identifier value and a position information value can be set for it to generate flag information; each word of the flag information is mapped to a corresponding high-dimensional flag vector for input into the text information feature encoder.
For example, the flag bit is predefined as the CLS flag, and its position information is defined as 0 and its text type identifier as 0; the flag bit together with its position information and text type identifier is taken as one piece of flag information, which is mapped through the wordToembedding method to obtain the embedding feature of the flag bit; the embedding features of the text information, text type information, and position information are added together.
The above embodiments do not limit the structure of the step diagram feature encoder; this embodiment further provides an optional model structure of the step diagram feature encoder, which may include the following contents:
A step diagram feature encoder used to extract the image features of operation diagrams is pre-trained and may include a feature extraction network and a feature fusion network; the feature extraction network extracts the image features of each step diagram of the input operation diagram, and the feature fusion network integrates the image features of each step diagram extracted by the feature extraction network into one image feature serving as the image feature of the input operation diagram. For the trained step diagram feature encoder, after the text features of a text sample have been extracted, since each group of training samples includes a matched pair of a text sample and an operation diagram sample, for ease of description the text sample whose text features have been extracted is called the current text sample and the operation diagram sample corresponding to it is called the current operation diagram sample; the current operation diagram sample is input into the step diagram feature encoder, which uses the feature extraction network to extract features from the current operation diagram sample to obtain the image features of all step diagrams it contains; the encoder then inputs the image features of each step diagram into the feature fusion network to obtain the current recipe image feature of the current operation diagram sample.
Optionally, the feature fusion network may be a long short-term memory neural network; correspondingly, the process of inputting the image features of each step diagram into the feature fusion network to obtain the current recipe image feature of the current operation diagram sample may include: calling the image feature fusion relation to process the image features of each step diagram, the image feature fusion relation (reconstructed with illustrative symbols, since the original formula image is not reproduced) being:

    h_i = LSTM_i(φ(v_i), h_{i−1}),  i = 1, 2, …, I

where h_i is the output of the i-th LSTM (Long Short-Term Memory) unit of the long short-term memory network, LSTM_i is the i-th LSTM unit, φ() is the output of the feature extraction network, v_i is the i-th step image of the current operation diagram sample, h_{i−1} is the output of the (i−1)-th LSTM unit, and I is the total number of step images contained in the current operation diagram sample.
By separating feature extraction from feature fusion when generating the image features of the operation diagram samples, this embodiment helps improve the accuracy of image feature extraction.
In addition, this embodiment also provides a text operation diagram mutual inspection method; referring to FIG. 3, it may include the following contents:
S301: train a text operation diagram mutual inspection model in advance. This embodiment may train the text operation diagram mutual inspection model in the manner described in any one of the above embodiments of the method for training a text operation diagram mutual inspection model.
S302: acquire the text features to be matched of the text to be retrieved. The text features to be matched are the current recipe text features of the current sample text in the above embodiments; this step may follow the text feature extraction for text samples in the above embodiments and is not described again here.
S303: acquire the image features to be matched of the operation diagram to be retrieved. This step may follow the image feature extraction for operation diagram samples in the above embodiments and is not described again here.
S304: input the text features to be matched and the image features to be matched into the text operation diagram mutual inspection model to obtain the text operation diagram mutual inspection result.
During inference, the weight coefficients trained in S301 can be preloaded. Feature extraction is performed on the operation diagrams or texts to be retrieved, and the results are stored in the text data set to be retrieved or the image database to be retrieved. The user provides arbitrary data to be retrieved, which may be an operation diagram to be retrieved or a text to be retrieved; the text feature information or image features of the data to be retrieved are extracted and input into the text operation diagram mutual inspection model, and the features of the data to be retrieved are distance-matched against all sample features in the corresponding data set to be retrieved. For example, if the data to be retrieved is text data, the corresponding data set to be retrieved is the image data set to be retrieved; the Mahalanobis distance is computed between the text to be retrieved and all operation diagram features in that data set; the sample with the smallest distance is the operation diagram that best matches the text to be retrieved and is output as the retrieval result.
As can be seen from the above, this embodiment can achieve high-precision mutual retrieval between recipe text and recipe step diagrams.
It should be noted that there is no strict order of execution among the steps in the present application; as long as they conform to a logical order, the steps may be executed simultaneously or in a certain preset order; FIG. 1 and FIG. 3 are only schematic and do not mean that only this execution order is possible.
Finally, to make the implementation of the present application clearer to those skilled in the art, this embodiment also takes recipe text operation diagram mutual retrieval as an illustrative example to explain the process of implementing text operation diagram mutual retrieval provided by the present application; the execution process of the mutual retrieval task between recipe text and recipe operation diagrams shown in this embodiment may include:
As shown in FIG. 4, this embodiment may include a recipe retrieval terminal device and a cloud server; the user may perform operations on the recipe retrieval terminal device, which interacts with the cloud server through a network; the cloud server may deploy a text operation diagram mutual inspection model, as shown in FIG. 5. To enable the text operation diagram mutual inspection model to realize the function of mutual retrieval between recipe text and recipe operation diagrams, the model needs to be trained. During training, the recipe retrieval terminal device may transmit the training sample set to the cloud server; the training sample set may be written to a USB flash drive in advance, and the USB flash drive inserted into the input interface of the recipe retrieval terminal device. The training sample set may contain multiple groups of training samples, each including a corresponding recipe text sample and recipe operation diagram sample; each recipe text sample may include operation steps (instruction list), component information (ingredients), and a dish name (Title). Instructions are the cooking steps and are uniformly referred to as steps below; ingredients are the components of a dish and are uniformly referred to as components below.
Before training starts, the component data of all recipe text samples can be obtained to generate a component information list. After the component information list is generated, data of the same component are merged into one entry, and the number of each merged component is counted, for example [78 flour], [56 eggs], [67 tomatoes], [81 water], …, [5 shepherd's purse], [3 bird's nest], and [2 shark fin]. In the merged component information list, if the count of a piece of component information is too small, e.g., less than 5, that component information is deleted from the table. The filtered component information is: [78 flour], [56 eggs], [67 tomatoes], [81 water], …, [5 shepherd's purse]. The filtered component information table is taken as the finally generated principal component table, which can be defined as the variable Main-ing; the principal component table is a vector whose length equals the number of rows of the filtered component information.
The text information feature encoder is built on the basic transformer model. For the text data, text type identifiers, and position information of the operation steps, component information, and dish names in the text samples, the wordToembedding method can be used to map each word into a high-dimensional vector taken as its embedding feature, and the embedding features are added and combined into one long input sequence as the input of the text information feature encoder. At the same time, at the first position of each recipe text information, CLS flag information identifying active-learning classification is added, i.e., the embedding feature of the CLS flag information is attached at the starting position of the long input sequence; this embedding feature is obtained by mapping the flag bit, its all-zero position information, and its text type identifier through the wordToembedding method. At the output position of the basic transformer corresponding to CLS, the output feature is extracted to execute the active-learning classification task and, during model training, to compute the loss with the corresponding recipe step diagram features.
An optional implementation of the active-learning classification task: extract the output feature corresponding to CLS of the basic transformer model, as shown in FIG. 2; the feature first passes through a fully connected layer FC, is then nonlinearly mapped through ReLU, and finally passes through another fully connected layer FC that maps the feature to the principal components, yielding the same dimension as Main-ing; for ease of description, this feature is called cla. cla is used to compute the classification loss: the component information of each recipe text is extracted and compared with the generated principal component table Main-ing; if a component of the principal component table exists in the component information of the recipe text, the variable at the corresponding position of the table is set to 1, otherwise to 0. This yields a vector named label whose dimension equals the number of rows of Main-ing. Finally, the loss calculation relation of the above embodiment is used to compute, from cla and its corresponding label, the BCELoss for multi-target classification.
As shown in FIG. 4, a ResNet backbone network can be used to extract the features of each recipe step diagram of the operation diagram, taking the features of the ResNet network in the layer before the classification layer as the features of each image. The recipe step diagram features are then input into the LSTM network to obtain the overall features of the entire recipe step image group, and the feature encoding output of the last LSTM unit is taken as the image feature of the recipe operation diagram.
After the image features of the recipe operation diagrams and the recipe text feature information of each group of training samples of the training sample set are obtained, any loss function in the prior art, such as the L1 norm loss, mean square error loss, or cross entropy loss, can be used to guide model training until convergence. Optionally, to realize retrieval between recipe text and recipe step diagrams, the output feature of the basic transformer corresponding to CLS can be taken as the text information feature and, together with the feature encoding output of the last LSTM unit, the loss can be computed based on the following relation, after which the parameters of the above transformer, LSTM, and ResNet networks are updated by gradient backpropagation:

    loss = Σ_{a=1}^{N} max(0, d(f_a^img, f_p^txt) − d(f_a^img, f_n^txt) + ∇)
         + Σ_{a=1}^{N} max(0, d(f_a^txt, f_p^img) − d(f_a^txt, f_n^img) + ∇)

(a bidirectional triplet ranking loss, reconstructed with illustrative symbols since the original formula image is not reproduced; d(·,·) denotes a feature distance). In the relation, loss is the loss function, N is the number of training sample groups, and ∇ is a hyperparameter fixed during training, which may be set, for example, to 0.3. During training, N traversals can be made; N represents that there are in total N paired samples in this batch. The image group features are first traversed (N in total), and the target selected by the traversal is called a, representing the anchor (anchor sample); the text feature encoding paired with the anchor sample is recorded as p, representing positive; the text features not paired with the anchor in this batch are recorded as n (negative). Similarly, the same traversal is performed on the text features: the target sample selected in the traversal has its corresponding positive image group feature sample and its non-corresponding negative samples recorded in the same way.
Further, the recipe retrieval terminal device may include human-computer interaction modules such as a display screen, an input interface, and an input keyboard, as well as a wireless transmission module. When the display screen is a touch screen, the input keyboard may be a soft keyboard presented on the display screen. The input interface can be used to connect external devices such as a USB flash drive, and there may be multiple input interfaces. In practical applications, the user can input a retrieval request to the recipe retrieval terminal device through the input keyboard; the retrieval request carries the information to be retrieved, such as a recipe text or a recipe operation diagram; the recipe retrieval terminal can send the retrieval request to the cloud server through the wireless transmission module; the cloud server retrieves the corresponding database based on the trained text operation diagram mutual inspection model and can feed the final mutual retrieval result back to the recipe retrieval terminal device, which can display the retrieved target recipe text or target recipe operation diagram to the user through the display screen.
The embodiments of the present application also provide corresponding apparatuses for the method for training a text operation diagram mutual inspection model and for the text operation diagram mutual inspection method, which further makes the methods more practical. The apparatuses can be described from the perspective of functional modules and from the perspective of hardware. The apparatus for training a text operation diagram mutual inspection model and the text operation diagram mutual inspection apparatus provided in the embodiments of the present application are introduced below; the apparatuses described below and the methods described above may be referred to in correspondence with each other.
From the perspective of functional modules, referring first to FIG. 6, a structural diagram of a specific implementation of the apparatus for training a text operation diagram mutual inspection model provided in an embodiment of the present application, the apparatus may include:
a model building module 601, used to build a text operation diagram mutual inspection model including a text information feature encoder and a step diagram feature encoder;
an identification information generating module 602, used to generate recipe component information by analyzing all recipe samples containing recipe components in the training sample set;
a text data processing module 603, used to, for each group of training samples in the training sample set, extract the principal component feature and the recipe mean feature of the current text sample using the text information feature encoder and actively learn the virtual component label of the principal component feature based on the recipe component information, the recipe mean feature being determined from all text features of the current text sample extracted by the text information feature encoder, and to determine, based on the virtual component label and the component prediction confidence threshold, whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature;
an image feature extraction module 604, used to extract, using the step diagram feature encoder, the current recipe image feature of the current operation diagram sample corresponding to the current text sample; and
a training module 605, used to input the current recipe text feature and the current recipe image feature into the text operation diagram mutual inspection model for model training.
Optionally, in some implementations of this embodiment, the text data processing module 603 may be used to: determine, from the virtual component label, target components greater than or equal to the component confidence threshold, and determine the principal component probability prediction confidence according to the confidence corresponding to each target component; and determine, according to the numerical relationship between the principal component probability prediction confidence and the component prediction confidence threshold, whether the current recipe text feature of the current text sample is the principal component feature or the recipe mean feature. Each element in the virtual component label is used to indicate the confidence that the current text sample contains the principal component corresponding to the recipe component information.
As an optional implementation of the above embodiment, the text data processing module 603 may also be used to: obtain the current output control mode; if the current output control mode is the binary switching mode, judge whether the principal component probability prediction confidence is greater than the component prediction confidence threshold; if it is greater, the current recipe text feature of the current text sample is the principal component feature; if it is less than or equal to the component prediction confidence threshold, the current recipe text feature is the recipe mean feature.
As another optional implementation of the above embodiment, the text data processing module 603 may further be used to: obtain the current output control mode; if the current output control mode is the mixed switching mode, compare the numerical relationship between the principal component probability prediction confidence and both the component prediction confidence threshold and the preset confidence limit threshold; if the principal component probability prediction confidence is greater than the component prediction confidence threshold, the current recipe text feature is the principal component feature; if it is less than or equal to the component prediction confidence threshold and greater than or equal to the confidence limit threshold, the current recipe text feature is the feature sum of the recipe mean feature and the principal component feature; if it is less than the confidence limit threshold, the current recipe text feature is the recipe mean feature.
As a further optional implementation of the above embodiment, the text data processing module 603 may further be used to: if the principal component probability prediction confidence is less than or equal to the component prediction confidence threshold and greater than or equal to the confidence limit threshold, take as the current recipe text feature the output feature obtained by cascading the recipe mean feature with the principal component feature and processing the result through a fully connected layer.
Optionally, in some implementations of this embodiment, the identification information generating module 602 may also be used to: obtain all original components contained in every recipe sample of the target recipe text sample set; perform data merging on the original components so that data of identical components are merged together; count the merged original components to determine the total number corresponding to each type of component; delete original components whose total number is smaller than a preset number threshold to obtain sample components; and generate a principal component table based on the sample components.
As an optional implementation of the above embodiment, the text data processing module 603 may further be used to: compare the existing components contained in the current text sample one by one with the sample components of the principal component table; for each existing component, if the current sample component in the principal component table is the same as the current existing component, set the position element corresponding to the current sample component to a first preset identification value; if different, set it to a second preset identification value; and generate the virtual component label according to the values of the position elements corresponding to the sample components of the principal component table.
Optionally, in other implementations of this embodiment, the text information feature encoder may include an input layer, a text feature extraction layer, and an output data processing layer; the input layer includes a text data input unit and a component identification flag input unit; the text data input unit includes a dish name input unit, a recipe step input unit, and a component input unit for sequentially inputting the different types of data of each text sample in the training sample set; the component identification flag input unit is used to input the flag bit identifying execution of the active-learning component information task; the text feature extraction layer is a transformer-based bidirectional encoder for extracting features from the output information of the input layer; and the output data processing layer is used to actively learn, based on the flag bit, the virtual component label corresponding to the principal component feature extracted by the text feature extraction layer, and to determine the current recipe text feature of the current text sample based on the virtual component label and the component prediction confidence threshold.
As an optional implementation of the above embodiment, the output data processing layer includes a feature selection controller, a principal component output unit, and a recipe mean feature output unit; the recipe mean feature output unit includes a dish name feature output unit, a recipe step feature output unit, and a component feature output unit, and outputs the feature average of the dish name feature, the recipe step feature, and the component feature; the principal component output unit is used to output the principal component feature and to obtain the virtual component label by performing the active learning task; and the feature selection controller is used to determine the current recipe text feature based on the virtual component label and the component prediction confidence threshold and to switch between the principal component output unit and the recipe mean feature output unit to output the current recipe text feature.
As another optional implementation of the above embodiment, the output data processing layer may include a first fully connected layer, a mapping layer, a second fully connected layer, and a loss calculation layer; the first fully connected layer is used to receive the feature information correspondingly output for the component identification flag input unit; the mapping layer is used to perform nonlinear mapping processing on the feature information; the second fully connected layer is used to map the features obtained after the mapping processing to the principal components, obtaining a principal component feature with the same dimension as the recipe component information; and the loss calculation layer is used to actively learn the virtual component label of the principal component feature based on the recipe component information.
As an optional implementation of the above embodiment, the loss calculation layer may also be used to: generate the virtual component label according to the comparison result between the current text sample and the recipe component information, the vector data corresponding to the virtual component label having the same dimension as the vector data corresponding to the principal component feature; and call the loss calculation relation to calculate the loss information between the virtual component label and the principal component feature, the loss calculation relation being the binary cross-entropy relation

    loss_cla = −(1/M) · Σ_{m=1}^{M} [ label_m · log(sigmoid(cla_m)) + (1 − label_m) · log(1 − sigmoid(cla_m)) ]

reconstructed above, where loss_cla is the loss information, M is the dimension of the vector data corresponding to the principal component feature, sigmoid() is the sigmoid function, label_m is the element at the m-th position of the vector data corresponding to the virtual component label, and cla_m is the element at the m-th position of the vector data corresponding to the principal component feature.
Optionally, in some other implementations of this embodiment, the above apparatus may further include a text processing module, used to obtain the flag identifying execution of the active-learning component information task and set a text type identifier value and a position information value for the flag to generate flag information, and to map each word of the flag information to a corresponding high-dimensional flag vector for input into the text information feature encoder.
As an optional implementation of the above embodiment, the text processing module may also be used to: map each word of the dish name, cooking steps, and components of the current text sample to a corresponding high-dimensional text vector, while mapping the position information of each word in the corresponding text data and the text type identifier identifying the data type to which the text data belongs to corresponding high-dimensional auxiliary vectors; and generate, based on each high-dimensional text vector and its corresponding high-dimensional auxiliary vectors, a text vector for input into the text information feature encoder.
Optionally, in some other implementations of this embodiment, the image feature extraction module 604 may also be used to: pre-train the step diagram feature encoder, which includes a feature extraction network and a feature fusion network; input the current operation diagram sample corresponding to the current text sample into the feature extraction network to obtain the image features of all step diagrams contained in the current operation diagram sample; and input the image features of each step diagram into the feature fusion network to obtain the current recipe image feature of the current operation diagram sample.
As an optional implementation of the above embodiment, the image feature extraction module 604 may further be used to: with the feature fusion network being a long short-term memory neural network, call the image feature fusion relation h_i = LSTM_i(φ(v_i), h_{i−1}), i = 1, 2, …, I (symbols reconstructed as above) to process the image features of each step diagram, where h_i is the output of the i-th LSTM unit of the long short-term memory network, LSTM_i is the i-th LSTM unit, φ() is the output of the feature extraction network, v_i is the i-th step image of the current operation diagram sample, h_{i−1} is the output of the (i−1)-th LSTM unit, and I is the total number of step images contained in the current operation diagram sample.
Next, referring to FIG. 7, a structural diagram of a specific implementation of the text operation diagram mutual inspection apparatus provided in an embodiment of the present application, the apparatus may include:
a model training module 701, used to train a text operation diagram mutual inspection model in advance using any one of the above embodiments of the method for training a text operation diagram mutual inspection model;
a feature acquisition module 702, used to acquire the text features to be matched of the text to be retrieved and the image features to be matched of the operation diagram to be retrieved; and
a mutual inspection result generating module 703, used to input the text features to be matched and the image features to be matched into the text operation diagram mutual inspection model to obtain the text operation diagram mutual inspection result.
The functions of the functional modules of the cross-media retrieval apparatus of the embodiments of the present application can be specifically implemented according to the methods in the above method embodiments; for the specific implementation process, reference may be made to the relevant description of the above method embodiments, which is not repeated here.
As can be seen from the above, the embodiments of the present application can achieve high-precision mutual retrieval between recipe text and recipe step diagrams.
The apparatus for training a text-operation-diagram mutual retrieval model and the text-operation-diagram mutual retrieval apparatus mentioned above are both described from the perspective of functional modules. Further, the present application also provides an electronic device, described from the perspective of hardware. FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application in one implementation. As shown in FIG. 8, the electronic device includes a memory 80 configured to store a computer program, and a processor 81 configured to implement, when executing the computer program, the steps of the method for training a text-operation-diagram mutual retrieval model and/or the text-operation-diagram mutual retrieval method according to any of the above embodiments.
The processor 81 may include one or more processing cores, for example a 4-core or 8-core processor; the processor 81 may also be a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 81 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 81 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 81 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 81 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 80 may include one or more non-volatile computer-readable storage media, which may be non-transitory. The memory 80 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash storage devices. In some embodiments, the memory 80 may be an internal storage unit of the electronic device, for example a hard disk of a server. In other embodiments, the memory 80 may be an external storage device of the electronic device, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on a server. Further, the memory 80 may include both an internal storage unit and an external storage device of the electronic device. The memory 80 may be used not only to store application software installed on the electronic device and various data, such as the code of programs used and generated during execution of the method for training a text-operation-diagram mutual retrieval model and/or the text-operation-diagram mutual retrieval method, but also to temporarily store data that has been output or is to be output. In this embodiment, the memory 80 at least stores a computer program 801 which, when loaded and executed by the processor 81, implements the relevant steps of the method for training a text-operation-diagram mutual retrieval model and/or the text-operation-diagram mutual retrieval method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 80 may also include an operating system 802 and data 803, stored either transiently or persistently. The operating system 802 may include Windows, Unix, Linux, and the like. The data 803 may include, but is not limited to, data generated during training of the text-operation-diagram mutual retrieval model, the resulting trained data, and/or data corresponding to text-operation-diagram mutual retrieval results.
In some embodiments, the electronic device may further include a display screen 82, an input/output interface 83, a communication interface 84 (also called a network interface), a power supply 85, and a communication bus 86. The display screen 82 and the input/output interface 83, such as a keyboard, belong to the user interface; an optional user interface may also include a standard wired interface, a wireless interface, and the like. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display, which may also suitably be called a display screen or display unit, is used to display information processed in the electronic device and to display a visual user interface. The communication interface 84 may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface or a Bluetooth interface, and is typically used to establish a communication connection between the electronic device and other electronic devices. The communication bus 86 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 8, but this does not mean that there is only one bus or one type of bus.
Those skilled in the art will understand that the structure shown in FIG. 8 does not limit the electronic device, which may include more or fewer components than shown, for example sensors 87 implementing various functions.
The functions of the functional modules of the electronic device of this embodiment of the present application may be implemented according to the methods in the above method embodiments; for the specific implementation process, reference may be made to the relevant description of the above method embodiments, which is not repeated here.
As can be seen from the above, the embodiments of the present application achieve high-precision mutual retrieval between recipe text and recipe step diagrams.
It will be understood that, if the method for training a text-operation-diagram mutual retrieval model and/or the text-operation-diagram mutual retrieval method in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and executes all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a removable magnetic disk, a CD-ROM, a magnetic disk, or an optical disc.
On this basis, an embodiment of the present application further provides a non-volatile readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for training a text-operation-diagram mutual retrieval model and/or the text-operation-diagram mutual retrieval method according to any of the above embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be cross-referenced. As for the hardware disclosed in the embodiments, including the apparatuses and the electronic device, since it corresponds to the methods disclosed in the embodiments, the description is relatively brief; for relevant details, refer to the description of the method parts.
Professionals may further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
The text-operation-diagram mutual retrieval method and apparatus, the method and apparatus for training a text-operation-diagram mutual retrieval model, the electronic device, and the non-volatile readable storage medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method of the present application and its core idea. It should be noted that those of ordinary skill in the art may make several improvements and modifications to the present application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of the present application.

Claims (20)

  1. A method for training a text-operation-diagram mutual retrieval model, comprising:
    constructing in advance a text-operation-diagram mutual retrieval model comprising a text information feature encoder and a step diagram feature encoder, and generating recipe ingredient information by analyzing the recipe ingredients contained in each recipe sample of a target recipe text sample set;
    for each group of training samples in a training sample set, extracting a main-ingredient feature and a recipe mean feature of a current text sample using the text information feature encoder, and actively learning a virtual ingredient label for the main-ingredient feature based on the recipe ingredient information, wherein the recipe mean feature is determined from all text features of the current text sample extracted by the text information feature encoder;
    determining, based on the virtual ingredient label and an ingredient prediction confidence threshold, whether a current recipe text feature of the current text sample is the main-ingredient feature or the recipe mean feature;
    extracting, using the step diagram feature encoder, a current recipe image feature of a current operation diagram sample corresponding to the current text sample; and
    inputting the current recipe text feature and the current recipe image feature into the text-operation-diagram mutual retrieval model for model training.
  2. The method for training a text-operation-diagram mutual retrieval model according to claim 1, wherein the determining, based on the virtual ingredient label and the ingredient prediction confidence threshold, whether the current recipe text feature of the current text sample is the main-ingredient feature or the recipe mean feature comprises:
    each element of the virtual ingredient label indicating the confidence that the current text sample contains the main ingredient corresponding to the recipe ingredient information;
    determining, from the virtual ingredient label, target ingredients greater than or equal to an ingredient confidence threshold, and determining a main-ingredient probability prediction confidence from the confidences corresponding to the target ingredients; and
    determining, according to the numerical relationship between the main-ingredient probability prediction confidence and the ingredient prediction confidence threshold, whether the current recipe text feature of the current text sample is the main-ingredient feature or the recipe mean feature.
  3. The method for training a text-operation-diagram mutual retrieval model according to claim 2, wherein the determining, according to the numerical relationship between the main-ingredient probability prediction confidence and the ingredient prediction confidence threshold, whether the current recipe text feature of the current text sample is the main-ingredient feature or the recipe mean feature comprises:
    obtaining a current output control mode;
    if the current output control mode is a binary switching mode, determining whether the main-ingredient probability prediction confidence is greater than the ingredient prediction confidence threshold;
    if the main-ingredient probability prediction confidence is greater than the ingredient prediction confidence threshold, the current recipe text feature of the current text sample being the main-ingredient feature; and
    if the main-ingredient probability prediction confidence is less than or equal to the ingredient prediction confidence threshold, the current recipe text feature of the current text sample being the recipe mean feature.
  4. The method for training a text-operation-diagram mutual retrieval model according to claim 2, wherein the determining, according to the numerical relationship between the main-ingredient probability prediction confidence and the ingredient prediction confidence threshold, whether the current recipe text feature of the current text sample is the main-ingredient feature or the recipe mean feature comprises:
    obtaining a current output control mode;
    if the current output control mode is a hybrid switching mode, comparing the main-ingredient probability prediction confidence against the ingredient prediction confidence threshold and a preset confidence limit threshold;
    if the main-ingredient probability prediction confidence is greater than the ingredient prediction confidence threshold, the current recipe text feature of the current text sample being the main-ingredient feature;
    if the main-ingredient probability prediction confidence is less than or equal to the ingredient prediction confidence threshold and greater than or equal to the confidence limit threshold, the current recipe text feature of the current text sample being the feature sum of the recipe mean feature and the main-ingredient feature; and
    if the main-ingredient probability prediction confidence is less than the confidence limit threshold, the current recipe text feature of the current text sample being the recipe mean feature.
  5. The method for training a text-operation-diagram mutual retrieval model according to claim 4, wherein, if the current output control mode is the hybrid switching mode, the comparing of the main-ingredient probability prediction confidence against the ingredient prediction confidence threshold and the confidence limit threshold comprises:
    if the main-ingredient probability prediction confidence is less than or equal to the ingredient prediction confidence threshold and greater than or equal to the confidence limit threshold, the current recipe text feature of the current text sample being the output feature obtained by concatenating the recipe mean feature and the main-ingredient feature and processing the result through a fully connected layer.
  6. The method for training a text-operation-diagram mutual retrieval model according to claim 1, wherein the text information feature encoder comprises an input layer, a text feature extraction layer, and an output data processing layer;
    the input layer comprises a text data input unit and an ingredient recognition flag input unit; the text data input unit comprises a dish name input unit, a recipe step input unit, and an ingredient input unit, configured to input, in turn, the different types of data of each text sample in the training sample set; the ingredient recognition flag input unit is configured to input a flag bit identifying execution of the task of actively learning ingredient information;
    the text feature extraction layer is a transformer-based bidirectional encoder, configured to perform feature extraction on the output information of the input layer; and
    the output data processing layer is configured to actively learn, based on the flag bit, the virtual ingredient label corresponding to the main-ingredient feature extracted by the text feature extraction layer, and to determine the current recipe text feature of the current text sample based on the virtual ingredient label and the ingredient prediction confidence threshold.
  7. The method for training a text-operation-diagram mutual retrieval model according to claim 6, wherein the output data processing layer comprises a feature selection controller, a main-ingredient output unit, and a recipe mean feature output unit;
    the recipe mean feature output unit comprises a dish name feature output unit, a recipe step feature output unit, and an ingredient feature output unit, and is configured to output the feature average of the dish name feature, the recipe step feature, and the ingredient feature;
    the main-ingredient output unit is configured to output the main-ingredient feature and the virtual ingredient label obtained by executing the active learning task; and
    the feature selection controller is configured to determine the current recipe text feature based on the virtual ingredient label and the ingredient prediction confidence threshold, and to switch between the main-ingredient output unit and the recipe mean feature output unit so as to output the current recipe text feature.
  8. The method for training a text-operation-diagram mutual retrieval model according to claim 7, wherein the main-ingredient output unit comprises a first fully connected layer, a mapping layer, a second fully connected layer, and a loss calculation layer;
    the first fully connected layer is configured to receive the feature information correspondingly output by the ingredient recognition flag input unit;
    the mapping layer is configured to perform non-linear mapping on the feature information;
    the second fully connected layer is configured to map the mapped feature onto the main ingredients to obtain a main-ingredient feature with the same dimensionality as the recipe ingredient information; and
    the loss calculation layer is configured to actively learn the virtual ingredient label of the main-ingredient feature based on the recipe ingredient information.
  9. The method for training a text-operation-diagram mutual retrieval model according to claim 8, wherein the actively learning of the virtual ingredient label of the main-ingredient feature based on the recipe ingredient information comprises:
    generating the virtual ingredient label from the result of comparing the current text sample with the recipe ingredient information, the vector data corresponding to the virtual ingredient label having the same dimensionality as the vector data corresponding to the main-ingredient feature; and
    invoking a loss calculation relation to compute loss information between the virtual ingredient label and the main-ingredient feature, the loss calculation relation being:
    loss_{cla} = -\sum_{m=1}^{M} \left[ label_m \cdot \log\left(\mathrm{sigmoid}(cla_m)\right) + (1 - label_m) \cdot \log\left(1 - \mathrm{sigmoid}(cla_m)\right) \right]
    where loss_cla is the loss information, M is the dimensionality of the vector data corresponding to the main-ingredient feature, sigmoid() is the sigmoid function, label_m is the element at the m-th position of the vector data corresponding to the virtual ingredient label, and cla_m is the element at the m-th position of the vector data corresponding to the main-ingredient feature.
  10. The method for training a text-operation-diagram mutual retrieval model according to claim 9, wherein the generating of the recipe ingredient information by analyzing the recipe ingredients contained in each recipe sample of the target recipe text sample set comprises:
    obtaining all original ingredients contained in each recipe sample of the target recipe text sample set;
    merging the data of the original ingredients so that data of identical ingredients are combined;
    counting the merged original ingredients to determine the total number for each ingredient class;
    deleting the original ingredients whose total number is less than a preset number threshold to obtain sample ingredients; and
    generating a main-ingredient table based on the sample ingredients.
  11. The method for training a text-operation-diagram mutual retrieval model according to claim 10, wherein the generating of the virtual ingredient label from the result of comparing the current text sample with the recipe ingredient information comprises:
    comparing, one by one, the existing ingredients contained in the current text sample with the sample ingredients of the main-ingredient table;
    for each existing ingredient, if the current sample ingredient in the main-ingredient table is identical to the current existing ingredient, setting the positional element corresponding to the current sample ingredient to a first preset identification value;
    if the current sample ingredient in the main-ingredient table is not identical to the current existing ingredient, setting the positional element corresponding to the current sample ingredient to a second preset identification value; and
    generating the virtual ingredient label from the values of the positional elements corresponding to the sample ingredients of the main-ingredient table.
  12. The method for training a text-operation-diagram mutual retrieval model according to any one of claims 1 to 11, further comprising, before the extracting of the main-ingredient feature and the recipe mean feature of the current text sample using the text information feature encoder and the actively learning of the virtual ingredient label for the main-ingredient feature based on the recipe ingredient information:
    obtaining a flag identifying execution of the task of actively learning ingredient information, and setting a text type identification value and a position information value for the flag to generate flag information; and
    mapping each word of the flag information to a corresponding high-dimensional flag vector for input into the text information feature encoder.
  13. The method for training a text-operation-diagram mutual retrieval model according to claim 12, further comprising, before the extracting of the main-ingredient feature and the recipe mean feature of the current text sample using the text information feature encoder:
    mapping each word of the dish name, the cooking steps, and the ingredients of the current text sample to a corresponding high-dimensional text vector, and at the same time mapping the position information of each word within its text data and the text type identifier indicating the data type of the text data to corresponding high-dimensional auxiliary vectors; and
    generating, based on each high-dimensional text vector and its corresponding high-dimensional auxiliary vectors, a text vector for input into the text information feature encoder.
  14. The method for training a text-operation-diagram mutual retrieval model according to claim 1, wherein the extracting, using the step diagram feature encoder, of the current recipe image feature of the current operation diagram sample corresponding to the current text sample comprises:
    pre-training the step diagram feature encoder, the step diagram feature encoder comprising a feature extraction network and a feature fusion network;
    inputting the current operation diagram sample corresponding to the current text sample into the feature extraction network to obtain image features of all step diagrams contained in the current operation diagram sample; and
    inputting the image features of each step diagram into the feature fusion network to obtain the current recipe image feature of the current operation diagram sample.
  15. The method for training a text-operation-diagram mutual retrieval model according to claim 14, wherein the feature fusion network is a long short-term memory (LSTM) neural network, and the inputting of the image features of each step diagram into the feature fusion network to obtain the current recipe image feature of the current operation diagram sample comprises:
    invoking an image feature fusion relation to process the image features of each step diagram, the image feature fusion relation being:
    h_i = \mathrm{LSTM}_i\left(\phi(v_i), h_{i-1}\right), \quad i = 1, 2, \ldots, I
    where h_i is the output of the i-th LSTM unit of the long short-term memory neural network, LSTM_i is the i-th LSTM unit, φ() is the output of the feature extraction network, v_i is the i-th step image of the current operation diagram sample, h_{i-1} is the output of the (i-1)-th LSTM unit, and I is the total number of step images contained in the current operation diagram sample.
  16. A text-operation-diagram mutual retrieval method, comprising:
    training a text-operation-diagram mutual retrieval model in advance using the method for training a text-operation-diagram mutual retrieval model according to any one of claims 1 to 15;
    obtaining a text feature to be matched of a text to be retrieved;
    obtaining an image feature to be matched of an operation diagram to be retrieved; and
    inputting the text feature to be matched and the image feature to be matched into the text-operation-diagram mutual retrieval model to obtain a text-operation-diagram mutual retrieval result.
  17. An apparatus for training a text-operation-diagram mutual retrieval model, comprising:
    a model construction module, configured to construct a text-operation-diagram mutual retrieval model comprising a text information feature encoder and a step diagram feature encoder;
    a recognition information generation module, configured to generate recipe ingredient information by analyzing all recipe samples in a training sample set that contain recipe ingredients;
    a text data processing module, configured to, for each group of training samples in the training sample set, extract a main-ingredient feature and a recipe mean feature of a current text sample using the text information feature encoder, and actively learn a virtual ingredient label for the main-ingredient feature based on the recipe ingredient information, wherein the recipe mean feature is determined from all text features of the current text sample extracted by the text information feature encoder; and to determine, based on the virtual ingredient label and an ingredient prediction confidence threshold, whether a current recipe text feature of the current text sample is the main-ingredient feature or the recipe mean feature;
    an image feature extraction module, configured to extract, using the step diagram feature encoder, a current recipe image feature of a current operation diagram sample corresponding to the current text sample; and
    a training module, configured to input the current recipe text feature and the current recipe image feature into the text-operation-diagram mutual retrieval model for model training.
  18. A text-operation-diagram mutual retrieval apparatus, comprising:
    a model training module, configured to train a text-operation-diagram mutual retrieval model in advance using the method for training a text-operation-diagram mutual retrieval model according to any one of claims 1 to 15;
    a feature acquisition module, configured to obtain a text feature to be matched of a text to be retrieved, and to obtain an image feature to be matched of an operation diagram to be retrieved; and
    a mutual retrieval result generation module, configured to input the text feature to be matched and the image feature to be matched into the text-operation-diagram mutual retrieval model to obtain a text-operation-diagram mutual retrieval result.
  19. An electronic device, comprising a processor and a memory, the processor being configured to implement, when executing a computer program stored in the memory, the steps of the method for training a text-operation-diagram mutual retrieval model according to any one of claims 1 to 15 and/or the text-operation-diagram mutual retrieval method according to claim 16.
  20. A non-volatile readable storage medium, storing a computer program which, when executed by a processor, implements the steps of the method for training a text-operation-diagram mutual retrieval model according to any one of claims 1 to 15 and/or the text-operation-diagram mutual retrieval method according to claim 16.
PCT/CN2023/101222 2022-11-08 2023-06-20 Text-operation-diagram mutual retrieval method and model training method, apparatus, device, and medium WO2024098763A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211388902.8A CN115618043B (zh) 2022-11-08 2022-11-08 Text-operation-diagram mutual retrieval method and model training method, apparatus, device, and medium
CN202211388902.8 2022-11-08

Publications (1)

Publication Number Publication Date
WO2024098763A1 true WO2024098763A1 (zh) 2024-05-16

Family

ID=84877991

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101222 WO2024098763A1 (zh) Text-operation-diagram mutual retrieval method and model training method, apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN115618043B (zh)
WO (1) WO2024098763A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618043B (zh) 2022-11-08 2023-04-07 苏州浪潮智能科技有限公司 Text-operation-diagram mutual retrieval method and model training method, apparatus, device, and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651674A (zh) * 2020-06-03 2020-09-11 北京妙医佳健康科技集团有限公司 Bidirectional search method and apparatus, and electronic device
CN112925935A (zh) * 2021-04-13 2021-06-08 电子科技大学 Image recipe retrieval method based on hybrid intra-modal and inter-modal fusion
CN114896373A (zh) * 2022-07-15 2022-08-12 苏州浪潮智能科技有限公司 Image-text mutual retrieval model training method and apparatus, image-text mutual retrieval method, and device
CN114896249A (zh) * 2022-05-18 2022-08-12 河北大学 Unbalanced region tree index structure and reverse nearest neighbor query algorithm in n-dimensional space
CN114896429A (zh) * 2022-07-12 2022-08-12 苏州浪潮智能科技有限公司 Image-text mutual retrieval method, system, device, and computer-readable storage medium
CN114969405A (zh) * 2022-04-30 2022-08-30 苏州浪潮智能科技有限公司 Cross-modal image-text mutual retrieval method
CN115062208A (zh) * 2022-05-30 2022-09-16 苏州浪潮智能科技有限公司 Data processing method, system, and computer device
CN115618043A (zh) * 2022-11-08 2023-01-17 苏州浪潮智能科技有限公司 Text-operation-diagram mutual retrieval method and model training method, apparatus, device, and medium

Also Published As

Publication number Publication date
CN115618043B (zh) 2023-04-07
CN115618043A (zh) 2023-01-17
