CN117011875A - Method, device, equipment, medium and program product for generating multimedia page - Google Patents


Info

Publication number
CN117011875A
Authority
CN
China
Prior art keywords
page
processed
multimedia
target
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310980755.1A
Other languages
Chinese (zh)
Inventor
郑艺秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/413: Classification of content, e.g. text, photographs or tables
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the application disclose a method, a device, equipment, a medium and a program product for generating a multimedia page, which can be applied to scenarios such as computer vision, speech technology and natural language processing. The embodiments detect the page elements to be processed in a multimedia page to be processed, together with their element types; perform semantic recognition on specified page elements to determine semantic information of the multimedia page to be processed from the semantic recognition result; determine, from the page elements to be processed, a target page element to be processed that is associated with the semantic information; convert the content information in the target page element to be processed into a target page element; and generate the processed multimedia page from the target page element. A new processed multimedia page can thus be generated automatically and quickly from the original multimedia page to be processed, which simplifies the production process of multimedia pages and improves production efficiency.

Description

Method, device, equipment, medium and program product for generating multimedia page
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, a medium, and a program product for generating a multimedia page.
Background
A multimedia page is a page containing multimedia content such as text, images, video and audio. In general, a multimedia page may include complex components such as layout elements, interactive effects and animation effects, and presents the corresponding multimedia content through these components.
Because a multimedia page has a certain complexity, a professional usually has to take part in its design, for example adding text, images, video, audio and other multimedia content and setting the layout and style. The professional has to complete the page through a multi-step production process, so the production process is complex.
Disclosure of Invention
The embodiments of the application provide a method, a device, equipment, a medium and a program product for generating a multimedia page, which can simplify the production process of multimedia pages and improve production efficiency.
An embodiment of the application provides a method for generating a multimedia page, which comprises the following steps: detecting page elements to be processed in a multimedia page to be processed, and the element types of the page elements to be processed; performing semantic recognition on specified page elements to determine semantic information of the multimedia page to be processed from the semantic recognition result, wherein the specified page elements are the page elements to be processed whose element type is a specified element type; determining, from the page elements to be processed, a target page element to be processed that is associated with the semantic information; converting the content information in the target page element to be processed into a target page element; and generating the processed multimedia page from the target page element.
An embodiment of the application also provides a device for generating a multimedia page, which comprises the following units: a detection unit, used for detecting page elements to be processed in a multimedia page to be processed and the element types of the page elements to be processed; a recognition unit, used for performing semantic recognition on specified page elements to determine semantic information of the multimedia page to be processed from the semantic recognition result, wherein the specified page elements are the page elements to be processed whose element type is a specified element type; a determining unit, used for determining, from the page elements to be processed, a target page element to be processed that is associated with the semantic information; a conversion unit, used for converting the content information in the target page element to be processed into a target page element; and a generating unit, used for generating the processed multimedia page from the target page element.
The embodiment of the application also provides electronic equipment, which comprises a processor and a memory, wherein the memory stores a plurality of instructions; the processor loads instructions from the memory to execute steps in any of the methods for generating a multimedia page provided by the embodiments of the present application.
The embodiment of the application also provides a computer readable storage medium, which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the steps in any of the methods for generating the multimedia pages provided by the embodiment of the application.
The embodiment of the application also provides a computer program product, which comprises a computer program/instruction, wherein the computer program/instruction realizes the steps in any of the methods for generating the multimedia pages provided by the embodiment of the application when being executed by a processor.
The embodiments of the application can detect the page elements to be processed in the multimedia page to be processed and the element types of the page elements to be processed; perform semantic recognition on specified page elements to determine semantic information of the multimedia page to be processed from the semantic recognition result, wherein the specified page elements are the page elements to be processed whose element type is a specified element type; determine, from the page elements to be processed, a target page element to be processed that is associated with the semantic information; convert the content information in the target page element to be processed into a target page element; and generate the processed multimedia page from the target page element.
In the application, the page elements to be processed in the multimedia page to be processed are detected and converted into new target page elements, so that a new processed multimedia page is generated automatically and quickly from the original multimedia page to be processed. This simplifies the production process of multimedia pages and improves production efficiency. In addition, because the element types are detected, the semantic information of the multimedia page to be processed can be determined from the page elements to be processed that have the specified element type, and only the content information associated with that semantic information is obtained and used to generate the target page element. This reduces redundant information, strengthens the association between the generated target page element and the multimedia page to be processed, and improves the semantic accuracy of the generated processed multimedia page.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1a is a schematic view of a scenario of a method for generating a multimedia page according to an embodiment of the present application;
Fig. 1b is a flowchart illustrating a method for generating a multimedia page according to an embodiment of the present application;
Fig. 1c is a schematic diagram of a detection result of a to-be-processed page element in a to-be-processed multimedia page according to an embodiment of the present application;
Fig. 1d is a schematic diagram of determining target elements and extracting content information according to an embodiment of the present application;
Fig. 1e is a schematic diagram of a conventional page layout provided by an embodiment of the present application;
Fig. 1f is a schematic diagram of partitioning a multimedia page to be processed according to an embodiment of the present application;
Fig. 1g is a schematic diagram of a multimedia page to be processed according to an embodiment of the present application;
Fig. 1h is a schematic diagram of yet another multimedia page to be processed according to an embodiment of the present application;
Fig. 1i is a schematic diagram of generating text effects provided by an embodiment of the present application;
Fig. 1j is a schematic diagram of a second page element provided by an embodiment of the present application;
Fig. 1k is a schematic diagram of an animation setup page provided by an embodiment of the present application;
Fig. 2a is a flowchart illustrating a method for generating a multimedia page according to another embodiment of the present application;
Fig. 2b is a schematic diagram of a landing page setup page provided by an embodiment of the present application;
Fig. 2c is a schematic diagram of a landing page modification page provided by an embodiment of the present application;
Fig. 2d is a schematic diagram of an image modification page provided by an embodiment of the present application;
Fig. 2e is a schematic diagram of a landing page displayed by a client according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a device for generating a multimedia page according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the application.
The embodiment of the application provides a method, a device, equipment, a medium and a program product for generating a multimedia page.
The device for generating the multimedia page can be integrated in an electronic device, and the electronic device can be a terminal, a server or another device. The terminal can be a mobile phone, a tablet computer, a smart Bluetooth device, a notebook computer, a personal computer (PC) or the like; the server may be a single server or a server cluster composed of a plurality of servers.
In some embodiments, the generating device of the multimedia page may be integrated in a plurality of electronic devices, for example, the generating device of the multimedia page may be integrated in a plurality of servers, and the generating method of the multimedia page is implemented by the plurality of servers.
In some embodiments, the server may also be implemented in the form of a terminal.
For example, referring to Fig. 1a, the method for generating a multimedia page may be integrated in a server. The server may obtain a to-be-processed multimedia page uploaded by a terminal, and detect the to-be-processed page elements in the to-be-processed multimedia page and the element types of the to-be-processed page elements; perform semantic recognition on specified page elements to determine semantic information of the multimedia page to be processed from the semantic recognition result, wherein the specified page elements are the page elements to be processed whose element type is a specified element type; determine, from the page elements to be processed, a target page element to be processed that is associated with the semantic information; convert the content information in the target page element to be processed into a target page element; and generate a processed multimedia page from the target page element and return the processed multimedia page to the terminal for display.
Detailed descriptions follow. The order in which the embodiments are described does not imply an order of preference. In the following description, the terms "first/second/third" merely distinguish similar objects and do not represent a particular ordering of the objects. Where permitted, "first/second/third" may be interchanged in a particular order or sequence, so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described.
It will be appreciated that the specific embodiments of the present application involve user-related data such as sound data, avatars and user operations. When the embodiments of the present application are applied to specific products or technologies, user permission or consent must be obtained, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
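The five-step flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of the application: the `PageElement` class and all function names are hypothetical, the page is assumed to arrive as a list of (type, content) pairs, and keyword overlap stands in for the semantic recognition and association steps, which the application leaves to trained models.

```python
from dataclasses import dataclass


@dataclass
class PageElement:
    element_type: str  # e.g. "text", "image", "video"
    content: str       # content information carried by the element


def detect_elements(page):
    # Hypothetical detector: the page is already a list of (type, content) pairs.
    return [PageElement(t, c) for t, c in page]


def recognize_semantics(elements, specified_types=("text",)):
    # Placeholder semantic recognition: collect the words of the elements
    # whose element type is one of the specified element types.
    keywords = set()
    for el in elements:
        if el.element_type in specified_types:
            keywords.update(el.content.lower().split())
    return keywords


def select_targets(elements, semantics):
    # Keep the elements whose content shares at least one word with the
    # semantic information (a stand-in for "associated with").
    return [el for el in elements if semantics & set(el.content.lower().split())]


def convert(element):
    # Placeholder conversion of content information into a target page element.
    return PageElement("target_" + element.element_type, element.content.strip())


def generate_page(page):
    elements = detect_elements(page)
    semantics = recognize_semantics(elements)
    targets = select_targets(elements, semantics)
    return [convert(el) for el in targets]


page = [
    ("text", "New fantasy game launch"),
    ("image", "game poster"),
    ("image", "unrelated banner"),
]
result = generate_page(page)
```

In this toy input, the image described as "unrelated banner" shares no word with the text element and is therefore dropped, which mirrors the reduction of redundant information that the application attributes to the semantic filtering step.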
Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, pre-training model technology, operation/interaction systems, mechatronics and the like. A pre-training model, also called a large model or foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision (CV) is the science of how to make machines "see". More specifically, it uses cameras and computers in place of human eyes to recognize, detect and measure targets, and further performs graphics processing so that the result is an image better suited to human observation or to transmission to an instrument for detection. As a scientific discipline, computer vision studies the related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large model technology has brought important innovation to computer vision: pre-trained vision models such as Swin Transformer, ViT, V-MoE and MAE can be applied quickly and widely to specific downstream tasks through fine-tuning. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
The key technologies of speech technology are automatic speech recognition (ASR), speech synthesis (text-to-speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future direction of human-computer interaction, and speech is expected to become one of the preferred modes of human-computer interaction. Large model technology has also transformed speech technology: pre-training models that use the Transformer architecture, such as WavLM and UniSpeech, generalize well and can perform speech processing tasks in all directions.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing deals with natural language, that is, the language people use in daily life, so it is closely related to linguistics. It also involves computer science and mathematics, and model training, an important technology in artificial intelligence; the pre-training model developed from the large language model (LLM) in the NLP field. Through fine-tuning, a large language model can be widely applied to downstream tasks. Natural language processing technologies typically include text processing, semantic understanding, machine translation, question answering by robots, knowledge graphs and the like.
Machine learning (ML) is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence, and it is applied throughout the areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration. The pre-training model is the latest development of deep learning and integrates the above techniques.
A pre-training model (PTM), also called a foundation model or large model, is a deep neural network (DNN) with a large number of parameters that is trained on massive unlabeled data. It uses the function approximation capability of the large-parameter DNN to extract common features from the data, and is adapted to downstream tasks through techniques such as fine-tuning, parameter-efficient fine-tuning (PEFT) and prompt-tuning. A pre-training model can therefore achieve good results in few-shot or zero-shot scenarios. According to the data modality processed, PTMs can be classified into language models (ELMo, BERT, GPT), vision models (Swin Transformer, ViT, V-MoE), speech models (VALL-E), multi-modal models (ViBERT, CLIP, Flamingo, Gato) and so on, where a multi-modal model is a model that builds feature representations of two or more data modalities. The pre-training model is an important tool for producing artificial intelligence generated content (AIGC), and can also serve as a general interface connecting multiple task-specific models.
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, digital twins, virtual humans, robots, artificial intelligence generated content (AIGC), conversational interaction, smart healthcare, smart customer service and game AI. It is believed that, as technology develops, artificial intelligence will be applied in more fields and will play an increasingly important role.
This embodiment provides an artificial-intelligence-related method for generating a multimedia page. As shown in Fig. 1b, the specific flow of the method may be as follows:
110. Detect the page elements to be processed in the multimedia page to be processed, and the element types of the page elements to be processed.
The multimedia page to be processed is the source multimedia page from which a new multimedia page is generated. A multimedia page is a page containing multimedia content such as text, images, video and audio. Multimedia pages may take various forms, such as HTML pages, web pages, H5 pages, application pages and other pages capable of presenting multimedia content.
In practice, the method for generating a multimedia page provided by the embodiments of the application can be applied to different scenarios, and the multimedia page to be processed varies with the specific requirements of each scenario. For example, in an e-commerce scenario, the multimedia page to be processed may be a product display page, a product promotion page or the like, and may include page elements such as product pictures, video introductions and specification tables; in a news media scenario, it may be a news report page, a feature report page or the like, containing news content in the form of text, pictures, audio and video; in a game scenario, it may be a game promotion page, a game storyline page or the like, including game content presented as text, pictures, audio and video; in a social scenario, it may be a sharing page, a virtual community space page, an application promotion page or the like, including social content such as photos, music, videos and user-edited text.
The page elements to be processed are the page elements in the multimedia page to be processed. For example, they may be multimedia content in various content forms, such as text, audio, images and video, or components presented in the multimedia page in the form of buttons, forms, content areas, navigation, cards, pop-up boxes and the like.
The element type is a name or tag used to distinguish or define different page elements. For example, the element type may be expressed as the content form of the page element (such as image, graphic, text or video), as the component tag of the page element in the multimedia page, or as a custom name or tag. For instance, the component tags corresponding to the page elements to be processed, such as <div> (block-level container), <video> (video), <audio> (audio), <img> (image), <text> (text), <button> (button), <h1> to <h6> (headings of different levels) and <table> (table), may be used as their element types.
For example, with the consent or authorization of the rights holder of the multimedia page to be processed, the page structure of the multimedia page to be processed can be parsed to detect tags related to multimedia page elements, such as <img>, <video> and <audio>. The page elements corresponding to these tags are extracted from the multimedia page as the page elements to be processed, and the corresponding tags are used as their element types.
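A minimal Python sketch of such tag-based detection is shown below, assuming the multimedia page to be processed is available as an HTML string. It uses the standard-library `html.parser` module; the `MediaTagCollector` class, the `detect_media_elements` function and the choice to record the `src` attribute are illustrative assumptions, not details from the application.

```python
from html.parser import HTMLParser

# Tags treated as multimedia page elements (taken from the examples in the text).
MEDIA_TAGS = {"img", "video", "audio"}


class MediaTagCollector(HTMLParser):
    """Collects (element type, source) pairs for multimedia tags."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # The tag name doubles as the element type of the detected element.
        if tag in MEDIA_TAGS:
            self.elements.append((tag, dict(attrs).get("src")))


def detect_media_elements(html):
    collector = MediaTagCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


sample = (
    '<div><img src="a.png"><video src="b.mp4"></video>'
    '<p>hi</p><audio src="c.mp3"></audio></div>'
)
found = detect_media_elements(sample)
```

Elements whose tags are not in `MEDIA_TAGS`, such as the `<div>` and `<p>` above, are ignored; a real page may also nest `<source>` children inside `<video>` or `<audio>`, which this sketch does not handle.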
In some embodiments, the page elements to be processed may include visual elements. A visual element is a page element displayed visually on the multimedia page, such as text, a graphic (for example a button), an image or a video shown in a visual component, so the element type of a page element to be processed may be text, graphic, image, video and so on. Specifically, visual elements in the multimedia page to be processed may be detected by an object detection model. The object detection model may include, but is not limited to, one or more of FCN (fully convolutional network), SegNet (a deep convolutional encoder-decoder architecture for image segmentation), DeepLab (atrous convolution with spatial pyramid pooling), Mask R-CNN (a region-based convolutional neural network), U-Net (a U-shaped image segmentation network), Gated-SCNN (a gated shape convolutional neural network for semantic segmentation), and the like.
In some embodiments, when the page elements to be processed include visual elements, the visual elements can be located accurately by applying prediction frames to the feature map of the multimedia page to be processed. Customized, fine-grained visual elements can thus be detected in the page without accessing the source file data of the multimedia page to be processed. Specifically, the page elements to be processed are detected by the following steps:
extracting features of the multimedia page to be processed to obtain a page feature map;
setting a plurality of prediction frames for the page feature map;
performing regression processing on the prediction frames, and adjusting the center point of each prediction frame according to the regression result to obtain adjusted prediction frames;
and determining the page elements corresponding to the adjusted prediction frames as the page elements to be processed.
The page feature map is a feature map of the multimedia page to be processed, and the feature map contains feature representations of the multimedia page to be processed.
The object detection model may include a feature extraction network and a regression processing network. For example, a multimedia page to be processed may be acquired in the form of an image and preprocessed, for example by scaling and normalization. The preprocessed image is input into the feature extraction network, which may be formed by stacking a plurality of convolution layers and pooling layers; the feature extraction network performs forward propagation on the preprocessed image to extract its features and obtain the page feature map. A set of prediction frames (anchor boxes) may be predefined, the set comprising a plurality of preset rectangular frames that can cover targets of different sizes and aspect ratios. The center point coordinates of each prediction frame may be defined as an offset relative to the upper left corner of the feature region in which the frame is located, and this offset is used to determine the center point position of the prediction frame. The regression processing network performs regression processing on the prediction frames and matches them with the actual targets in the page feature map, so as to obtain the center point of each prediction frame when it matches an actual target (namely, the predicted center point); the center point of the prediction frame is then adjusted to the predicted center point to obtain the adjusted prediction frame. The page element in the area of the multimedia page to be processed that corresponds to the adjusted prediction frame may be used as the page element to be processed. For example, fig. 1c shows the detection result of the page elements to be processed in the multimedia page to be processed, in which the adjusted prediction frames (dashed frames in the figure) frame a plurality of page elements to be processed at their corresponding positions in the multimedia page to be processed.
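The offset-based center adjustment described above can be sketched as follows. This is a minimal illustration only: the `Box` type, the function names, and the `stride` parameter (the number of image pixels covered by one feature region) are assumptions introduced here and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass
class Box:
    cx: float  # center x in image pixels
    cy: float  # center y in image pixels
    w: float   # width in image pixels
    h: float   # height in image pixels


def decode_center(cell_col, cell_row, offset_x, offset_y, stride):
    """Turn an offset measured from the upper left corner of a feature
    region (in region units) into a center point in image pixels."""
    cx = (cell_col + offset_x) * stride
    cy = (cell_row + offset_y) * stride
    return cx, cy


def adjust_box(anchor, predicted_center):
    """Move the frame's center to the predicted center, keeping its size."""
    cx, cy = predicted_center
    return Box(cx, cy, anchor.w, anchor.h)


# Hypothetical values: a frame in region (col=1, row=2) with stride 32.
anchor = Box(cx=48.0, cy=80.0, w=64.0, h=32.0)
center = decode_center(cell_col=1, cell_row=2, offset_x=0.25, offset_y=0.5, stride=32)
adjusted = adjust_box(anchor, center)
```

In a trained model the offsets would come from the regression processing network; here they are fixed numbers chosen only to show the arithmetic.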
In some implementations, the YOLOv8 network model may be used as the target detection model to detect the page elements to be processed in the multimedia page to be processed. The YOLOv8 network model includes a backbone (Backbone) network, a neck (Neck) network, and a head (Head) network. The backbone network comprises a convolution module, a C2f (channel to feature) module, and an SPPF (spatial pyramid pooling, fast) module, wherein the convolution module (Conv module) comprises a convolution layer, a batch normalization layer, and a SiLU activation function layer; the C2f module comprises a convolution module, a Bottleneck module, and a residual structure module; and the SPPF module comprises a convolution layer and a pooling layer. The neck network includes a convolution module, a C2f module, and an upsampling layer. The head network includes a convolution module for detection. Therefore, the backbone network and the neck network can be used as the feature extraction network to extract features of the multimedia page to be processed and obtain the page feature map. Specifically, the backbone network obtains an initial feature map by performing convolution processing on the multimedia page to be processed, and the neck network then performs secondary extraction on the initial feature map to obtain page feature maps of different scales. The convolution module for detection in the head network can be used as the regression processing network to perform regression processing on the input page feature maps of different scales, so as to obtain the adjusted prediction frames. The network model adopts an end-to-end design and treats the target detection task as a single regression problem: target detection is realized by dividing the input image into grids and simultaneously predicting, for each grid, a plurality of bounding boxes together with their confidences and category probabilities.
It can be appreciated that the network model is an anchor-free model, so that the center of the object can be directly predicted instead of the offset of the known anchor frame, the number of frame predictions is reduced, and the detection efficiency is improved.
In some embodiments, in the process of detecting the page element to be processed in the multimedia page to be processed, the page feature map may be divided into a plurality of feature areas, so that whether an object exists in each feature area or not and the position thereof are predicted by using the prediction frame, and object detection is performed on each feature area, so that target information can be effectively captured under different scales.
For example, the page feature map is divided into a plurality of grids, i.e., feature areas, of fixed size, such as n×n, where each feature area corresponds to a local area in the multimedia page to be processed and includes a feature representation of the area, and one feature area may correspond to one or more prediction frames, i.e., one feature area may predict one or more adjusted prediction frames. And regarding any characteristic region, taking one of the adjusted prediction frames with highest confidence in the corresponding adjusted prediction frame as a final adjusted prediction frame, and taking page elements in the region of the multimedia page to be processed corresponding to the prediction frame as page elements to be processed.
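A minimal sketch of the per-region selection described above, assuming each prediction is given as a hypothetical `(region_id, box, confidence)` tuple; the tuple layout and the function name are illustrative and not part of the application.

```python
def best_box_per_region(predictions):
    """Keep, for every feature region, only the adjusted prediction
    frame with the highest confidence."""
    best = {}
    for region_id, box, confidence in predictions:
        if region_id not in best or confidence > best[region_id][1]:
            best[region_id] = (box, confidence)
    return best


# Hypothetical predictions: boxes are (x0, y0, x1, y1) in pixels.
predictions = [
    (0, (10, 10, 90, 40), 0.62),
    (0, (12, 8, 95, 42), 0.91),
    (4, (120, 130, 200, 180), 0.77),
]
final_boxes = best_box_per_region(predictions)
```

Regions for which no frame is predicted simply do not appear in the result.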
In some embodiments, the corresponding page elements to be processed can be classified rapidly and accurately based on the feature representation of the adjusted prediction frame in the page feature map. Specifically, the element type of the page element to be processed is detected by the following steps:
Obtaining object characteristics of a predicted object in the adjusted prediction frame from the page characteristic diagram;
classifying the object characteristics to obtain object types;
the object type is taken as the element type of the page element to be processed.
The prediction object refers to a target object which is responsible for prediction by a prediction frame. The object type refers to a category or category label to which the target object belongs.
For example, the object detection model also includes a category prediction network. For each adjusted prediction frame, the category prediction network may acquire the feature representation at the corresponding position of the page feature map (i.e., the object feature of the predicted object) and predict the object type from that feature representation. The object type is a category or category label, such as text, graphics, image, or video, that characterizes the target object in the prediction frame, and it is used as the element type of the corresponding page element to be processed. For example, in the detection result shown in fig. 1c, the element type of each detected page element to be processed is marked above its adjusted prediction frame. "TextView" indicates that the element type is text, "ImageView" indicates that the element type is image, and "Button" indicates that the element type is button; the number following each element type in the figure represents the confidence of the corresponding prediction frame.
In practical applications, the object features may be classified in a variety of ways, such as classification of the object features using a combination of one or more of a full-connection layer network, a convolutional neural network, a recurrent neural network, a residual network, and the like. For example, the object features may be input into a full-connection layer network, the full-connection layer network multiplies the input object features with a weight matrix, and performs linear transformation through an activation function to obtain a classification result, i.e., an object type.
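The fully connected classification mentioned above can be illustrated with a small sketch. The weights, biases, and labels below are made-up values, and the use of softmax to turn the layer output into class probabilities is an assumption made here; the application does not specify the activation function.

```python
import math


def fully_connected_classify(features, weights, biases, labels):
    """Multiply the object features by a weight matrix, add biases,
    apply softmax, and return the most probable label."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    # Subtract the maximum logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical two-dimensional object feature and three element types.
labels = ["text", "image", "button"]
weights = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
biases = [0.0, 0.0, 0.0]
object_type, probs = fully_connected_classify([1.0, 0.0], weights, biases, labels)
```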
In some implementations, the head network of the YOLOv8 network model may be used to detect the element type of the page element to be processed, i.e., to classify the page element to be processed. The head network adopts a decoupled head (Decoupled Head) structure that separates the detection process from the classification process, which improves the flexibility and transferability of the two processes, accelerates their inference, and improves detection efficiency. Specifically, the head network includes a convolution module for detection and a convolution module for classification, which constitute a detection branch (i.e., the regression processing network) and a classification branch (i.e., the category prediction network), respectively, for detecting the page element to be processed and determining its element type.
120. And carrying out semantic recognition on the designated page elements so as to determine semantic information of the multimedia page to be processed according to the semantic recognition result.
The specified page element is a page element to be processed with the element type being the specified element type. The specified element type refers to a specific element type set according to an application scenario or actual requirements. For example, one or more of the element types of images, graphics, text, video, etc. may be used as the specified element type.
The semantic information refers to information representing the multimedia page semantic. The semantic information may be text descriptions, keywords, tags, etc. of the page elements to express the semantics of the page elements. The presentation of the semantic information may be text, vectors, or other forms of data.
For example, the region of the prediction frame corresponding to the specified page element can be cropped from the multimedia page to be processed, and semantic recognition is then performed on the cropped result to obtain semantic information. Different semantic recognition methods may be employed for specified page elements of different element types. For example, taking image as the specified element type, the images in all prediction frames of type "ImageView" in the multimedia page to be processed shown in fig. 1c can be cropped, each image representing the corresponding specified page element. Feature vectors of the cropped images are then extracted using a computer vision technique, such as a pretrained image classification model, e.g., AlexNet (deep convolutional neural network), VGG (VGG convolutional network), or ResNet (residual network), and a classification result is obtained by inference on the feature vectors. The classification result is the semantics represented by the image, and it can be used as the semantic information of the multimedia page to be processed.
In some implementations, the specified element type may be determined according to the application scenario, so that the semantic information determined from the specified element type is more accurate. For example, for any application scenario, such as multimedia pages in an e-commerce scenario, a large number n of multimedia pages can be obtained. Semantic recognition is performed on the page elements of the various element types in each multimedia page to determine the semantic content contained in each page element, and the element type of the page element whose semantic content best matches the page tag carried by the multimedia page to be processed is recorded (hereinafter referred to as the candidate element type). The candidate element types of the n multimedia pages are counted, and the candidate element type with the highest number of occurrences is taken as the specified element type. The page tag refers to content used for classifying or describing the multimedia page to be processed, and it can be a tag customized according to the content of the multimedia page to be processed.
In some embodiments, the element type with the highest occurrence number in the multimedia page to be processed can be used as a designated element type, so that semantic information can better describe and classify the content in the multimedia page to be processed, and the accuracy of representing the semantic of the multimedia page is improved. For example, in the multimedia page to be processed as shown in fig. 1c, the number of the elements of the page to be processed of the "TextView" type is the largest, and thus the designated element type may be determined as "text".
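As a rough sketch of choosing the most frequent element type, the detections below are hypothetical dictionaries with a `type` key; neither the data layout nor the function name comes from the application.

```python
from collections import Counter


def most_frequent_element_type(detections):
    """Return the element type that occurs most often among the
    detected page elements to be processed."""
    counts = Counter(d["type"] for d in detections)
    return counts.most_common(1)[0][0]


# Hypothetical detection result for one page.
detections = [
    {"type": "TextView"}, {"type": "ImageView"}, {"type": "TextView"},
    {"type": "Button"}, {"type": "TextView"}, {"type": "ImageView"},
]
designated_type = most_frequent_element_type(detections)
```

When two types are tied, `Counter.most_common` returns the one encountered first; a real system would need an explicit tie-breaking rule.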
Since there may be a plurality of specified page elements in the multimedia page to be processed, the semantics represented by these specified page elements may differ. Therefore, in some embodiments, when there are a plurality of specified page elements, semantic recognition may be performed on each of them to obtain initial semantic information. For example, after n specified page elements are respectively recognized to obtain "semantic 1", "semantic 2", …, "semantic n", the initial semantic information with a higher frequency of occurrence may be taken as the semantic information of the multimedia page to be processed; for instance, the initial semantic information ranked within the top preset number by frequency may be used, where the preset number may be 1 or an integer greater than 1.
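The frequency-based selection can be sketched as below. The recognized semantics are invented sample strings, and `preset_number` is an illustrative parameter name for the preset number mentioned in the text.

```python
from collections import Counter


def page_semantics(initial_semantics, preset_number=1):
    """Rank the recognized semantics by frequency of occurrence and
    keep the top preset_number of them."""
    ranked = Counter(initial_semantics).most_common(preset_number)
    return [semantic for semantic, _count in ranked]


# Hypothetical recognition results for six specified page elements.
initial = ["game", "reward", "game", "character", "reward", "game"]
top_one = page_semantics(initial, preset_number=1)
top_two = page_semantics(initial, preset_number=2)
```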
In some embodiments, since text contains a large amount of vivid and easy-to-understand semantic content, text can be used as the specified element type, which reduces the difficulty of semantic recognition and yields more accurate and consistent semantic information. Specifically, the specified element type includes text, and performing semantic recognition on the specified page element to determine semantic information of the multimedia page to be processed according to the semantic recognition result includes:
extracting appointed page elements from a multimedia page to be processed;
Performing text recognition processing on the appointed page element to obtain initial text information;
and carrying out semantic understanding on the initial text information to obtain semantic keywords so as to determine semantic information of the multimedia page to be processed according to the semantic keywords.
For example, taking text as the specified element type, the images in all prediction frames of type "TextView" in the multimedia page to be processed shown in fig. 1c can be cropped, each image representing the corresponding specified page element. In this way, image 1, image 2, and image 3 corresponding to three specified page elements are extracted, and text information, for example text 1, text 2, and text 3, is extracted from the cropped images respectively by using a character recognition (OCR) technology. All the extracted texts are spliced to obtain the initial text information, such as {text 1, text 2, text 3}. Semantic understanding is then performed on the initial text information by using one or more pre-trained models, such as a dialogue generation model (ChatGPT) or a pre-trained language model (BERT), to output semantic keywords that capture the key points of the text, such as "game". The semantic keywords can be used as the semantic information of the multimedia page to be processed, or the semantic keywords ranked within the top preset number can be used as the semantic information of the multimedia page to be processed.
130. And determining a target page element to be processed associated with the semantic information from the page elements to be processed.
The target page element to be processed refers to a page element associated with semantic information.
For example, the page elements to be processed and the semantic information may each be encoded and represented as vectors. Similarity calculation is performed between the vector corresponding to the semantic information and the vector corresponding to each page element to be processed one by one, for example by calculating the cosine similarity, the Euclidean distance, or the Manhattan distance, and the page elements to be processed whose similarity to the semantic information is higher than a preset similarity threshold are used as the target page elements to be processed. In this way, the content related to the semantic information in the multimedia page to be processed is obtained, redundant information is reduced, and the association between the generated target page element and the multimedia page to be processed is strengthened. For example, fig. 1d is a schematic diagram of determining target elements and extracting content information: according to the semantic information "game", a plurality of target page elements to be processed associated with the semantic information can be detected from the multimedia page to be processed shown in (1) of fig. 1d, and these elements are framed by detection frames (solid line frames in the figure) as shown in (2) of fig. 1d. The position and size of the detection frame of any target page element to be processed may be consistent with those of its corresponding prediction frame.
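A minimal sketch of the similarity-based selection, using cosine similarity. The element names, the two-dimensional vectors, and the threshold value are hypothetical; in practice the vectors would come from whatever encoder is used to represent the page elements and the semantic information.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_target_elements(elements, semantic_vector, threshold):
    """Keep the elements whose similarity to the page semantics is
    higher than the preset similarity threshold."""
    return [
        name for name, vector in elements
        if cosine_similarity(vector, semantic_vector) > threshold
    ]


# Hypothetical encoded page elements and page semantics.
elements = [("banner", [1.0, 0.0]), ("footer", [0.0, 1.0]), ("title", [1.0, 1.0])]
targets = select_target_elements(elements, [1.0, 0.0], threshold=0.5)
```

If a distance such as the Euclidean or Manhattan distance is used instead, the comparison direction flips: smaller distances indicate closer matches.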
It should be noted that the specified page elements are determined according to the specified element type, so there may be a plurality of specified page elements in the multimedia page to be processed, and they may include page elements with low similarity to the semantic information. The target page elements to be processed are therefore not necessarily the same as the specified page elements.
In some embodiments, the region of the prediction frame corresponding to each page element to be processed may be cropped from the multimedia page to be processed, and semantic recognition is then performed on the cropped result to obtain semantic information of the page element to be processed (hereinafter referred to as to-be-processed semantic information). Different semantic recognition methods can be adopted for page elements to be processed of different element types. For example, when the element type of the page element to be processed is image, the image in the corresponding prediction frame can be cropped, the image representing the corresponding page element to be processed. Feature vectors of the cropped image are then extracted using a computer vision technique, such as a pretrained image classification model, e.g., AlexNet (deep convolutional neural network), VGG (VGG convolutional network model), or ResNet (residual network), and a classification result is obtained by inference on the feature vectors; the classification result is the semantics represented by the image, namely the to-be-processed semantic information. The to-be-processed semantic information and the semantic information of the multimedia page to be processed can each be expressed as vectors by using a trained word vector model, such as Word2Vec, GloVe, or FastText. Similarity calculation is performed between the vector corresponding to the semantic information and the vector corresponding to each page element to be processed one by one, and the page elements to be processed whose similarity is higher than a preset similarity threshold are used as the target page elements to be processed.
It should be noted that, if the page element to be processed is the specified page element, the semantic information obtained by performing semantic recognition on the specified page element in the foregoing step may be directly obtained as the semantic information to be processed.
140. And converting the content information in the target page element to be processed into the target page element.
The target page element refers to a page element generated by content information conversion.
In practical applications, the content information can be converted into the target page element in a plurality of different ways according to the element type corresponding to the content information. For example, content information such as text or images may be input into a generation model, and the target page element is generated by the generation model. The generation model may include, but is not limited to, a combination of one or more of an image generation model such as a generative adversarial network (GAN), a video generation model such as a conditional generative adversarial network (cGAN), a text generation model such as a recurrent neural network (RNN), and an audio generation model such as a text-to-speech (TTS) model.
In some embodiments, before converting the content information in the target pending page element into the target page element, the method further includes:
and extracting content information from the target page element to be processed.
The content information refers to specific content in the page element to be processed. The content information in the different types of page elements to be processed may be different. For example, when the element type of the page element to be processed is text, the content information may be text information; when the element type of the page element to be processed is an image, the content information can be image information and/or text information of a text displayed by the image; when the element type of the page element to be processed is audio, the content information may be audio information, and/or text information of text contained in the audio; when the element type of the page element to be processed is video, the content information may be image information of a video frame in the video, and/or audio information of audio contained in the video, and/or text information of text contained in the video.
In practical applications, the content information may be extracted from the page elements to be processed in a number of different ways. For example, text, images or video in the page element to be processed may be directly used as its content information.
For another example, when the page element to be processed contains content information in the form of text, an image of the page element to be processed may be extracted from the multimedia page to be processed, and text information (hereinafter referred to as text information to be processed) may be extracted from the extracted image by using a character recognition (OCR) technology. Alternatively, an image of the text information may be cut out of the extracted image using an image segmentation model such as SAM.
For another example, when the page element to be processed contains content information in the form of an image or video, the content information may be segmented from the image or video frame using one or more computer vision techniques, such as a pre-trained image segmentation model (SAM), a semantic segmentation network (SegNet), DeepLab, a mask region-based convolutional neural network (Mask R-CNN), a U-shaped network (U-Net), or a gated shape convolutional neural network (Gated-SCNN). For example, as shown in the schematic diagram of determining target elements and extracting content information in fig. 1d, the content information in image form corresponding to each target page element to be processed is segmented from the images of the plurality of target page elements to be processed shown in (2) of fig. 1d, and the segmented content information is shown in (3) of fig. 1d.
It should be noted that, when the content information of the page element to be processed includes multiple content information in the text, the image and the video, only the content information corresponding to the element type may be extracted according to the element type of the page element to be processed, for example, if the page element to be processed is an image, only the content information in the form of the image is extracted, the text in the image may not be identified, and all the content information may also be extracted.
In some embodiments, the content information in the specific page area can be determined quickly and accurately by matching the target page area corresponding to the content information with the designated page area, and enhancement processing is performed, so that the efficiency of generating the target page element is improved, the visual effect of the key information is highlighted, and the information conveying effect is improved. Specifically, the target page element includes a first page element, the multimedia page to be processed includes a plurality of page areas, and content information in the target page element to be processed is converted into the target page element, including:
determining a target page area corresponding to the content information from the plurality of page areas;
determining content information corresponding to a target page area matched with the designated page area as target content information;
and carrying out enhancement processing on the target content information to obtain the first page element.
The designated page area is set according to the application scenario or actual requirements. Multimedia pages generally have a relatively fixed page layout, and the page layouts of multimedia pages in the same application scenario are especially consistent. For example, for any application scenario, such as multimedia pages in an e-commerce scenario, a large number n of multimedia pages can be obtained, and the conventional page layout of the application scenario is determined by analyzing the page layouts of the n multimedia pages. For example, the conventional page layout in the e-commerce scenario is shown in fig. 1e, which includes layout elements such as "background", "main title", "main body", and "benefit point", with the corresponding page areas shown by dotted frames. The main title refers to the core content in the multimedia page, which is used for conveying the core characteristics of the product or service and is usually placed in the top area of the page; the main body refers to the main content in the multimedia page, which is used for conveying detailed descriptions, characteristics, advantages, and the like of the product or service; the benefit point refers to content in the multimedia page that benefits the user, which is used for conveying advantages, benefits, value, and the like related to the product or service.
In some embodiments, the corresponding subdivision application scene can be further determined according to the page tag carried by the multimedia page to be processed, such as a game, a cartoon, a channel tag carried by the multimedia page, and the like, so as to obtain a conventional layout page in the subdivision scene, and the conventional layout page is used for determining the designated page area. Or the to-be-processed multimedia page after detecting the to-be-processed page element and the element type of the to-be-processed page element can be matched with a plurality of conventional page layouts, and one conventional page layout with the highest matching degree is used for determining the designated page area.
The first page element is a page element obtained by target content information enhancement processing. The enhancement processing refers to a processing method for enhancing the perceived effect of the target content information. For example, the enhancement processing may include, but is not limited to, one or more of zooming in, adding dynamic effects, adding color effects, increasing contrast, blurring the background, enhancing details, and the like.
For example, as shown in a schematic diagram of partitioning a multimedia page to be processed in fig. 1f, the multimedia page to be processed may be divided into 3×3 grids, that is, page areas, and then a conventional page layout of a corresponding scene is acquired and matched with a target page area, so as to determine layout elements to which the target page area belongs. As shown in a schematic diagram of a to-be-processed multimedia page in fig. 1g, content information 1-6 is extracted from a plurality of target to-be-processed page elements of the to-be-processed multimedia page, and occupies page areas 1-3, page area 4, page areas 7-9, page areas 2-3, page area 1, and page areas 4-6, respectively. From the association relationship between each page area and the layout element of the page layout, it is possible to determine the contents of the content information 1 and the content information 4 belonging to the layout element "main title", the contents of the content information 2 and the content information 5 belonging to the layout element "main body", and the contents of the content information 3 and the content information 6 belonging to the layout element "benefit point". If the layout element "main title" is a preset layout element that needs to be subjected to enhancement processing, the page area corresponding to the layout element can be used as a designated page area, the content information 1 and the content information 4 corresponding to the page area are determined as target content information, and the dynamic effect of flicker is added to the content information 1 and the content information 4.
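The grid partitioning and layout matching can be sketched as follows. The `LAYOUT` table, the bounding boxes, and the page size are hypothetical values chosen for illustration and do not reproduce the figures; the rule of assigning a box to the layout element that covers most of its page areas is likewise an assumption made here.

```python
from collections import Counter

# Hypothetical conventional layout: the layout element of each of the
# nine page areas, numbered 1..9 row by row.
LAYOUT = {
    1: "main title", 2: "main title", 3: "main title",
    4: "main body", 5: "main body", 6: "main body",
    7: "benefit point", 8: "benefit point", 9: "benefit point",
}


def areas_for_box(box, page_w, page_h, n=3):
    """Return the 1-based indices of the n x n page areas that the
    box (x0, y0, x1, y1) overlaps."""
    x0, y0, x1, y1 = box
    cell_w, cell_h = page_w / n, page_h / n
    eps = 1e-9  # keeps a box ending exactly on a grid line out of the next cell
    c0 = min(n - 1, int(x0 // cell_w))
    c1 = min(n - 1, int((x1 - eps) // cell_w))
    r0 = min(n - 1, int(y0 // cell_h))
    r1 = min(n - 1, int((y1 - eps) // cell_h))
    return {r * n + c + 1 for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}


def layout_element_for(areas, layout=LAYOUT):
    """Pick the layout element that covers most of the given areas."""
    counts = Counter(layout[a] for a in sorted(areas))
    return counts.most_common(1)[0][0]


def select_target_content(contents, page_w, page_h, designated="main title"):
    """Return the content whose target page area matches the
    designated layout element."""
    return [
        name for name, box in contents.items()
        if layout_element_for(areas_for_box(box, page_w, page_h)) == designated
    ]


contents = {
    "content 1": (0, 0, 300, 100),
    "content 2": (0, 100, 100, 200),
    "content 3": (0, 200, 300, 300),
}
targets = select_target_content(contents, page_w=300, page_h=300)
```

The selected content would then be passed to whichever enhancement processing is configured for the designated layout element.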
In some embodiments, enhancement processing can be performed on the target content information in the target page area, so as to reduce the influence of the enhancement processing on the multimedia page layout and avoid influencing the overall page effect.
In some embodiments, the corresponding enhancement processing method may be selected according to the type of the layout element corresponding to the target content information. For example, different enhancement processing methods may be preset for different types of layout elements, for example, any target content information belongs to a "main title", a method corresponding to the "main title" may be selected from the preset enhancement processing methods, for example, flashing, and a dynamic effect of the flashing may be added to the target content information.
In some embodiments, three-dimensional reconstruction can be performed on the target content information to highlight the visual effect of the key information and improve the information conveying effect. For example, if the designated page area is the page area corresponding to the layout element "main body" and the corresponding target content information in the designated page area is in image form, for example a shoe image, a corresponding 3D special effect can be drawn on the shoe image by a 3D rendering engine, for example Three.js, so as to obtain a 3D dynamic display effect, that is, the first page element.
In some embodiments, enhancement processing may be performed on the semantically-related target content information using a specific method to highlight the visual effect of the key information and enhance the conveying effect of the information. Specifically, enhancement processing is performed on the target content information to obtain a first page element, including:
extracting content semantic information from the target content information;
determining target content information corresponding to the content semantic information with the association relation as associated content information;
and carrying out enhancement processing on the associated content information through a specified enhancement processing method corresponding to the associated content information to obtain the first page element.
The content semantic information refers to information representing content information semantics. The content semantic information may be a text description, keywords, tags, etc. of the content information to express the semantics of the content information. The presentation of the semantic information may be text, vectors, or other forms of data.
The association relationship refers to a relationship between pieces of content information. For example, the association relationship may include, but is not limited to, a similarity association, a semantic association, and the like. For example, any plurality of pieces of target content information whose similarity is greater than a preset similarity threshold may be determined as the associated content information, or any plurality of pieces of target content information whose semantics are synonyms, antonyms, or in an inclusion relationship may be determined as the associated content information.
The specified enhancement method is a preset method for enhancing the associated content information. For example, a corresponding preset enhancement processing method can be selected according to the type of the layout element corresponding to the associated content information to carry out enhancement processing on the layout element.
For example, semantic recognition may be performed on target content information in image form using a computer vision technique, such as a pre-trained image classification model, e.g., AlexNet (deep convolutional neural network), VGG (VGG convolutional network model), or ResNet (residual network), to obtain content semantic information; or semantic recognition may be performed on target content information in text form using one or more pre-trained models, such as a dialogue generation model (ChatGPT) or a pre-trained language model (BERT), to obtain content semantic information. For example, suppose the semantics of the content information 1-6 in the multimedia page to be processed are recognized as "game rewards", "character 1", "get rewards", "download channels", "character 2", and "information", respectively. Since the similarity between "game rewards" and "get rewards" is higher than the preset similarity threshold, the content information 1 and the content information 3 can be determined as associated content information, and the same enhancement processing, such as enlargement, can be performed on them. Enhancement processing may be omitted for other target content information that does not belong to the associated content information.
In some embodiments, the specified enhancement processing method may be a method for interactively displaying the semantically related target content information, so as to highlight the visual effect of the key information and improve the transmission effect of the information. Specifically, enhancement processing is performed on the associated content information by a specified enhancement processing method corresponding to the associated content information, so as to obtain a first page element, including:
determining an interaction path according to target page areas corresponding to the plurality of associated content information;
and generating a first page element corresponding to the associated content information according to the interaction path, wherein the first page element comprises the associated content information interacted along the interaction path.
The interaction path refers to the path along which the associated content information moves when interacting.
For example, as shown in the schematic diagram of the multimedia page to be processed in fig. 1h, the associated content information 1 and content information 3 in the multimedia page to be processed correspond to page areas 1 to 3 and page areas 7 to 8, respectively. An interaction path 1 from page areas 1 to 3 to page areas 7 to 8 may be generated, together with a dynamic effect in which content information 1 moves along that path from page areas 1 to 3 to page areas 7 to 8. An interaction path 2 from page areas 7 to 8 to page areas 1 to 3 may also be generated, together with a dynamic effect in which content information 3 moves along that path from page areas 7 to 8 to page areas 1 to 3.
Multiple interaction paths can be generated among different pieces of associated content information, particularly when there are many pieces of associated content information; in this case, the dynamic effects corresponding to all the interaction paths can be displayed in the page. In some embodiments, to simplify the page content and improve the conveying effect of information, the multiple interaction paths may be displayed alternately. Alternatively, for any piece of associated content information, if it has multiple interaction paths pointing to the page areas corresponding to different content information, only the interaction path pointing to the content information with the highest degree of association may be displayed.
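The last option, keeping only the path toward the most strongly associated content, can be sketched as below. The score table `SCORES` and the function name are hypothetical; the application does not specify how the degree of association is computed.

```python
def strongest_paths(association_scores):
    """For each source content, keep only the path to its most associated target.

    association_scores maps (source_id, target_id) -> degree of association.
    On a tie, the pair encountered first is kept.
    """
    best = {}
    for (source, target), score in association_scores.items():
        if source not in best or score > best[source][1]:
            best[source] = (target, score)
    return {source: target for source, (target, _) in best.items()}


# Hypothetical association scores between pieces of content information.
SCORES = {
    (1, 3): 0.98,
    (1, 6): 0.55,
    (3, 1): 0.98,
    (6, 1): 0.55,
    (6, 3): 0.63,
}

DISPLAYED_PATHS = strongest_paths(SCORES)
```

Here content information 6 has two candidate paths, and only the one toward content information 3 is kept for display.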
Because the page areas corresponding to different pieces of associated content information may differ in size, in some embodiments, when content information moves along the interaction path, it may be scaled according to the size of the page area the interaction path points to, so that the size of the content information matches that page area. For example, suppose content information 1 and content information 2 are associated content information, content information 1 corresponds to page areas 1 to 3, so its size is large, and content information 2 corresponds to page area 4, so its size is small. In the dynamic effect in which content information 1 moves along the interaction path from page areas 1 to 3 to page area 4, content information 1 may be gradually reduced so that it fits within page area 4 when it arrives there. Meanwhile, in the dynamic effect in which content information 2 moves along the interaction path from page area 4 to page areas 1 to 3, content information 2 may be gradually enlarged so that it is displayed at a larger size when it arrives at page areas 1 to 3, for example filling page areas 1 to 3.
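One way to picture the combined movement and scaling is linear interpolation between the source and target page areas. The sketch below assumes rectangular page areas and a straight-line path; `Rect`, `frame_at`, and `animation_frames` are illustrative names, and the application does not prescribe linear interpolation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A rectangular page area: top-left corner plus size."""
    x: float
    y: float
    width: float
    height: float


def lerp(start, end, t):
    """Linear interpolation between start and end for t in [0, 1]."""
    return start + (end - start) * t


def frame_at(source, target, t):
    """Position and size of the moving content at progress t along the path."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return Rect(
        lerp(source.x, target.x, t),
        lerp(source.y, target.y, t),
        lerp(source.width, target.width, t),
        lerp(source.height, target.height, t),
    )


def animation_frames(source, target, steps):
    """Frames from the source area to the target area, endpoints included."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [frame_at(source, target, i / steps) for i in range(steps + 1)]


# Hypothetical layout: page areas 1 to 3 form a wide block, page area 4 is smaller.
AREAS_1_TO_3 = Rect(0, 0, 300, 100)
AREA_4 = Rect(0, 200, 100, 100)
FRAMES = animation_frames(AREAS_1_TO_3, AREA_4, steps=4)
```

The first frame coincides with the source area and the last with the target area, so the content shrinks gradually as it travels; swapping the arguments gives the enlarging animation for the opposite path.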
In some embodiments, new display style parameters may be set for text-form content information to quickly generate corresponding target page elements, thereby improving efficiency of generating target page elements. Specifically, the content information in the target to-be-processed page element includes text information, the target page element includes a second page element, and converting the content information in the target to-be-processed page element into the target page element includes:
acquiring preset display style parameters;
and displaying the text information according to the preset display style parameters to obtain a second page element.
Wherein the display style parameter refers to a parameter for defining appearance characteristics of text information in a page. For example, display style parameters may include, but are not limited to, parameters of one or more of font, size, color, transparency, spacing, italics, etc., and may also include parameters of text effects such as shading, neon lights, fading, flames, etc. The preset display style parameters refer to display style parameters set according to the needs or application scenes.
For example, different display style parameters may be preset for different scenes. In an e-commerce scene, for instance, bold fonts, larger font sizes, and striking colors may be used to enhance the conveying of information. A second page element with text special effects, such as an artistic-style dynamic effect, a 3D font effect, or a font dynamic effect, can also be generated using special effect generation tools such as an image processing tool (Adobe Photoshop), a 3D text generator, or an intelligent generation tool (Midjourney), and typesetting such as dynamic typesetting, intelligent layout, and text background restoration can also be applied to the text. As shown in fig. 1i, when the page element to be processed is an image that contains text information, the text information can be identified from the image by methods such as standard pattern recognition, non-standard pattern recognition, and OCR text recognition. Special effect text with a flame effect is then generated using a preset text special effect such as a flame special effect. The special effect text can be used to replace the text in the original page element to be processed, resulting in a second page element as shown in fig. 1j. The special effect text can also be used directly as a second page element, or added to a target page element obtained by converting other content information.
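As a rough illustration of displaying text according to preset display style parameters, the sketch below renders text as an HTML span with inline CSS. The `DisplayStyle` fields, the default values, and the choice of HTML output are assumptions made for the example; the application does not fix a rendering format.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class DisplayStyle:
    """A hypothetical bundle of preset display style parameters."""
    font_family: str = "sans-serif"
    font_size_px: int = 16
    color: str = "#000000"
    opacity: float = 1.0
    letter_spacing_px: int = 0
    italic: bool = False
    bold: bool = False

    def to_css(self):
        """Serialize the parameters as an inline CSS declaration list."""
        rules = [
            f"font-family: {self.font_family}",
            f"font-size: {self.font_size_px}px",
            f"color: {self.color}",
            f"opacity: {self.opacity}",
            f"letter-spacing: {self.letter_spacing_px}px",
            f"font-style: {'italic' if self.italic else 'normal'}",
            f"font-weight: {'bold' if self.bold else 'normal'}",
        ]
        return "; ".join(rules)


def render_text(text, style):
    """Wrap the text in a span styled by the preset parameters."""
    css = escape(style.to_css(), quote=True)
    return f'<span style="{css}">{escape(text)}</span>'


# An assumed preset for an e-commerce scene: bold, large, striking color.
ECOMMERCE_STYLE = DisplayStyle(font_size_px=32, color="#ff3300", bold=True)
SECOND_ELEMENT = render_text("Get rewards <now>", ECOMMERCE_STYLE)
```

Keeping the parameters in one object makes it easy to hold several presets, one per scene, and apply whichever matches the page.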
In some embodiments, to avoid duplicate content in the generated multimedia page, the text information used to generate the second page element does not include the text information in the target content information. For example, the content information of the layout element corresponding to the specified page area, such as the "main title", may be subjected to enhancement processing, while content information in text form other than the "main title" may be displayed according to the preset display style parameters.
In some embodiments, the text information in the multimedia page to be processed may be combined, and corresponding multimedia content may be generated, so as to generate new multimedia content matched with the whole text content of the page, so as to improve the transmission effect of the whole page information. Specifically, the content information in the target to-be-processed page element includes text information, the target page element includes a third page element, and converting the content information in the target to-be-processed page element into the target page element includes:
combining the text information to obtain total text information;
and generating target multimedia content corresponding to the total text information to obtain a third page element.
The merging process integrates multiple pieces of text information, and the integrated text is the total text information. The multiple pieces of text information can be merged by various methods, such as splicing them to obtain the total text information, or reasoning over and summarizing the spliced text to form one unified and complete piece of text information.
The target multimedia content refers to multimedia content corresponding to the total text information. The target multimedia content may include, but is not limited to, at least one of images, video, audio, and the like.
For example, text 1, text 2, and text 3 extracted from the multimedia page to be processed may be spliced to obtain initial total text information such as {text 1, text 2, text 3}. A pre-trained model, such as one or more of a dialogue generation model (ChatGPT) and a pre-trained language model (BERT), is then used to reason over and summarize the initial total text information to obtain the total text information. Specifically, {text 1, text 2, text 3} can be used as the input of the model: an encoder module in the model converts the input sentences or phrases into vector representations that capture semantic and context information, and a decoder in the model gradually generates new text from the generated context vectors. During decoding, transition words, transitional phrases, and the like can be inserted as needed to improve the coherence of the text. The total text information is then input into a generation model, such as an image generation model, a video generation model, or an audio generation model, to generate a new image, video, or audio, that is, the target multimedia content, which serves as the third page element.
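The splice-then-summarize flow can be sketched as follows. The `summarize` argument is a hypothetical callable standing in for whatever pre-trained model is used; no real model API is called here, and the whitespace normalization is an added convenience rather than something the text requires.

```python
def splice_texts(texts):
    """Splice text fragments in page order, normalizing whitespace and skipping blanks."""
    cleaned = (" ".join(text.split()) for text in texts)
    return [text for text in cleaned if text]


def merge_texts(texts, summarize=None):
    """Return the total text information.

    Without a summarizer, the spliced text itself is the total text information.
    With one, the spliced text is passed to it for reasoning and summarization.
    """
    initial_total = " ".join(splice_texts(texts))
    if summarize is None:
        return initial_total
    return summarize(initial_total)


# Example page texts (made up for illustration).
PAGE_TEXTS = ["New  season", "", "Log in for rewards"]
SPLICED_TOTAL = merge_texts(PAGE_TEXTS)

# A trivial stand-in summarizer; a real system would call a language model here.
SUMMARIZED_TOTAL = merge_texts(PAGE_TEXTS, summarize=lambda text: text.upper())
```

Passing the summarizer in as a parameter keeps the splicing logic testable without any model dependency.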
In some embodiments, the total text information can be obtained by combining all text information in the content information, so that the whole text content of the multimedia page to be processed is obtained, and the transmission effect of the whole page information is improved.
In some embodiments, key text may be extracted from the total text information to guide the generation of the target multimedia content of the corresponding semantics to increase the relevance of the generated target multimedia content to the semantic representation of the original pending page. Specifically, generating the target multimedia content corresponding to the total text information to obtain a third page element includes:
extracting text key information from the total text information;
and taking the text key information as semantic guidance information, and generating target multimedia content corresponding to the semantic guidance information to obtain a third page element.
Text key information refers to information, such as characters, words, or phrases, that has a particular meaning or importance in a given text.
Semantic guidance refers to guiding the generation of corresponding content by semantics. Semantic guidance information refers to the information used for semantic guidance, such as characters, words, or phrases that can represent semantic content; it can be expressed as text, vectors, or other forms of data.
For example, text keywords may be extracted from the total text information using statistics-based keyword extraction methods such as term frequency-inverse document frequency (TF-IDF) or a graph-based ranking algorithm (TextRank), or using natural language processing (NLP) models such as gated recurrent units (GRU) or sequence-to-sequence models (seq2seq). The extracted keywords are used as the prompt words of a stable diffusion model (Stable Diffusion), that is, as the semantic guidance information. Guided by the prompt words, the model exploits the randomness of the diffusion process and applies a series of progressive diffusion (denoising) steps to an initial noise image, gradually generating a high-resolution, realistic sample image.
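A small standard-library sketch of TF-IDF keyword extraction followed by prompt assembly is shown below. The stop-word list, the reference corpus, the smoothed IDF variant, and the comma-joined prompt format are all assumptions for illustration; the image generation call itself is deliberately left out.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "to", "is", "in", "for", "of"}


def tokenize(text):
    """Lowercase word tokens with stop words removed."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def tfidf_keywords(document, corpus, top_k=3):
    """Rank the document's terms by TF-IDF against a reference corpus."""
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return []
    term_counts = Counter(doc_tokens)
    corpus_tokens = [set(tokenize(other)) for other in corpus]
    n_docs = len(corpus_tokens)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for tokens in corpus_tokens if term in tokens)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_k]


def build_prompt(keywords, style=None):
    """Join keywords (and an optional style keyword) into a prompt string."""
    parts = list(keywords)
    if style:
        parts.append(f"{style} style")
    return ", ".join(parts)


CORPUS = [
    "log in to claim the reward",
    "download the app to play",
    "the new character is here",
]
TOTAL_TEXT = "claim the game reward and unlock the game character"
KEYWORDS = tfidf_keywords(TOTAL_TEXT, CORPUS, top_k=2)
PROMPT = build_prompt(KEYWORDS, style="anime")
```

Terms that are frequent in the total text but rare in the reference corpus rank highest, so they are the ones that end up steering the generated image.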
In some implementations, audio of the total text information may be generated to provide sound effects for the multimedia page, enhancing content presentation effects of the multimedia page from the auditory dimension. Specifically, the target multimedia content includes target audio, and generating target multimedia content corresponding to the total text information to obtain a third page element includes:
acquiring sound resources;
and generating target audio corresponding to the total text information according to the sound resource to obtain a third page element.
Sound resources refer to data used for producing audio from text. Sound resources may include various sound elements such as human voices, instrument performances, and environmental sound effects. Sound resources may be obtained by capturing, adjusting, and editing real-world sounds, or by computer generation or simulation. It should be noted that the sound resources acquired in the present application require the consent or permission of the rights holders of those sound resources.
The target audio is the audio corresponding to the total text information.
For example, when real-person voice data has been obtained with the consent or permission of the corresponding rights holder, a sound resource acquisition tool such as so-vits-svc may be trained on that voice data to obtain a trained timbre model (i.e., the sound resource), which can then generate the corresponding target audio based on the total text information.
In some embodiments, when the target multimedia content includes target audio, the third page element includes an avatar animation whose audio is the target audio. For example, with the user's consent or permission, the user's avatar picture may be acquired and an avatar animation that speaks the target audio may be generated from it. Alternatively, an avatar matching the timbre characteristics of the sound resource may be acquired, and an avatar animation of that avatar generated. For example, in an animation setup page as shown in fig. 1k, in which the total text information is also displayed, the user may select a sound resource and an avatar, or upload a new avatar; after selecting sound resource A and the avatar, the user may click the "generate" control in the page to generate the avatar animation shown in the figure (i.e., a third page element).
In some embodiments, when the content information includes text information, pieces of text information of different lengths (that is, different numbers of words) may be converted by different methods to obtain different page elements, so as to avoid duplicate content in the generated multimedia page.
The number of words refers to the number of characters contained in the text information; for example, it can be obtained by counting the meaning-bearing characters in the text information (i.e., excluding punctuation marks, spaces, and other meaningless characters).
For example, after extracting the content information from the target element to be processed, if the extracted 10 pieces of content information include 5 pieces of text information, according to the number of words of each piece of text information in the content information, the text information with the number of words lower than the preset number, such as the text information 1 and the text information 3, may be displayed with preset display style parameters, so as to obtain the second page element. The preset number is set according to application scenes or actual needs.
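The length test can be sketched as below. Treating Unicode punctuation categories and whitespace as the "meaningless" characters is an assumption, as is the preset number 12; what happens to the longer text (for example, merging it for other conversions) is left open here.

```python
import unicodedata


def meaningful_length(text):
    """Count characters that are neither whitespace nor punctuation."""
    count = 0
    for ch in text:
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            continue
        count += 1
    return count


def split_by_length(texts, preset_number):
    """Split texts into (short, long) by comparing their length to the preset number."""
    short, long_texts = [], []
    for text in texts:
        if meaningful_length(text) < preset_number:
            short.append(text)
        else:
            long_texts.append(text)
    return short, long_texts


TEXTS = [
    "Get rewards!",
    "Log in every day this week to unlock the new character.",
]
SHORT_TEXTS, LONG_TEXTS = split_by_length(TEXTS, preset_number=12)
```

The short texts are the candidates for display with preset display style parameters, as in the example above.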
150. Generate the processed multimedia page from the target page element.
For example, the target page element may be embedded into a specified page to generate the processed multimedia page. The specified page can be a page customized according to the application scene or actual needs, or a page with the same page format or page template as the multimedia page to be processed. For example, a new page, such as a web page, may be created as the specified page, or a page template of the multimedia page to be processed, such as its web page template, may be obtained as the specified page.
In practical application, the target page element can be embedded into the specified page according to the element type of the target page element, the page area corresponding to its content information, and the like. For example, the first page element may be embedded into the page area corresponding to the target content information in the specified page, or displayed interactively at the corresponding position in the specified page according to the interaction path; the second page element can be embedded into the page area of the specified page that corresponds to the content information of that page element; a third page element in video or image form can be embedded into the page area corresponding to the main body of a layout element in the specified page; or a third page element in audio form may be embedded into the specified page as background sound. If the page areas corresponding to several target page elements overlap fully or partially, the overlapping target page elements can be displayed alternately, or displayed according to preset display layer priorities; for example, the second page element is displayed on the highest-priority layer, the third page element on the lowest-priority layer, and the first page element on the middle layer.
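The layer-priority option can be sketched as a simple ordering by element kind. The numeric priorities follow the example above (second on top, third at the bottom); the dictionary layout of an element and the function names are hypothetical.

```python
# Assumed priorities from the example above: a larger number is drawn on top.
LAYER_PRIORITY = {"second": 3, "first": 2, "third": 1}


def assign_z_index(elements):
    """Return {element id: z-index} so overlapping elements stack by priority."""
    return {element["id"]: LAYER_PRIORITY[element["kind"]] for element in elements}


def paint_order(elements):
    """Return element ids from the bottom layer to the top layer."""
    ordered = sorted(elements, key=lambda element: LAYER_PRIORITY[element["kind"]])
    return [element["id"] for element in ordered]


# Hypothetical overlapping target page elements.
OVERLAPPING = [
    {"id": "title_text", "kind": "second"},
    {"id": "background_image", "kind": "third"},
    {"id": "reward_animation", "kind": "first"},
]
Z_INDEX = assign_z_index(OVERLAPPING)
ORDER = paint_order(OVERLAPPING)
```

In a web page the resulting numbers could be written directly into each element's CSS `z-index`.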
In some embodiments, when converting the content information into target page elements, one or more preset styles may be predefined, target page elements in one or more styles are generated for each piece of content information, and a multimedia page of each style is generated from the target page elements of that style, so that multimedia pages in one or more styles can be obtained. The preset style can be a display style, such as animation, realistic, abstract, sketch, or gouache. The preset style can also be a page tag carried by the multimedia page to be processed, such as games or cartoons. It should be noted that, in order to generate a target page element of any style, a corresponding generation model, such as an image generation model or a video generation model, may be trained using samples of that style; keywords related to the style may be added to the semantic guidance information to generate page elements of that style; or display style parameters, enhancement processing methods such as special effects, and the like corresponding to the style may be preset.
In some embodiments, after the processed multimedia page is generated, the page may be modified, such as by modifying the displayed text, images, 3D special effects, avatar animations, and the like. For example, the user may modify the sound resource and avatar used for generating an avatar animation in an animation setup page as shown in fig. 1k to generate a new avatar animation, and replace the avatar animation in the processed multimedia page with the new one. For another example, user-defined text, or an image in the processed multimedia page, may be input into the image generation model to generate a new image, and the new image may be used to replace the image in the processed multimedia page.
The generation scheme of the multimedia page provided by the embodiment of the application can be applied to various multimedia page generation scenes. For example, taking a multimedia page in an e-commerce scene as an example, detecting a to-be-processed page element in the to-be-processed multimedia page and an element type of the to-be-processed page element; carrying out semantic recognition on the appointed page element to determine semantic information of the multimedia page to be processed according to a semantic recognition result, wherein the appointed page element is the page element to be processed with the element type being the appointed element type; determining a target page element to be processed associated with the semantic information from the page elements to be processed; converting content information in the target page element to be processed into a target page element; and generating the processed multimedia page by the target page element.
From the above, the embodiment of the application detects the page elements to be processed in the multimedia page to be processed and converts the detected page elements into new target page elements, so that a new processed multimedia page can be generated automatically and quickly from the original multimedia page to be processed, which simplifies the production process of multimedia pages and improves production efficiency. In addition, the element types of the page elements to be processed are detected, so that the semantic information of the multimedia page to be processed is determined based on the page elements to be processed of the specified element type, and the content information associated with that semantic information is obtained and used to generate the target page elements. This reduces redundant information, strengthens the association between the generated target page elements and the multimedia page to be processed, and improves the semantic accuracy of the generated processed multimedia page.
The method described in the above embodiments will be described in further detail below.
In this embodiment, the method of the embodiment of the present application will be described in detail, taking a landing page as an example of the multimedia page.
As shown in fig. 2a, a specific flow of a method for generating a multimedia page is as follows:
210. Acquire the multimedia page to be processed.
For example, the method for generating the multimedia page may be implemented by a landing page advertisement tool system. The advertiser may turn on the landing page derivative function switch at the management end of the landing page advertisement tool system to enter a landing page setting page as shown in fig. 2b. In this page, the advertiser may select an existing landing page in the landing page advertisement tool system as the landing page to be processed, or click the "upload landing page" control to upload the landing page to be processed (i.e., the multimedia page to be processed). After the advertiser selects a landing page to be processed, such as landing page A, the landing page advertisement tool system may determine whether landing page A has passed review. If so, the derivative function is triggered automatically; it runs for approximately 3 minutes, and the advertiser may choose to run it in the background. If landing page A has not passed review, the process ends.
The landing page advertisement refers to a strategy of directing advertisement links to a specially manufactured webpage so as to improve advertisement conversion rate. This web page typically contains information, pictures, and video related to the advertisement to entice the user to click or submit a form for purchase or other action. The derivative function refers to a function of generating a new landing page (i.e., a processed multimedia page) from a landing page to be processed.
220. Detect the page elements to be processed in the multimedia page to be processed and the element types of the page elements to be processed.
For example, after the derivative function is triggered, the landing page advertisement tool system may invoke a pre-trained landing page model, such as a YOLOv8 network model, to detect the landing page elements (i.e., page elements to be processed) in landing page A and the element types of the landing page elements, in preparation for subsequent element reconstruction.
The YOLOv8 network model can first be trained for classification on a multi-industry data set so that it can detect landing page elements and their element types, and then fine-tuned for the downstream task (generalizing to specific landing page tasks and scenes) to form the pre-trained landing page model.
230. Perform semantic recognition on the specified page elements to determine semantic information of the multimedia page to be processed according to the semantic recognition result.
The specified page element is a page element to be processed whose element type is the specified element type. For example, if m landing page elements are detected, the text in the landing page elements of type "TextView" may be identified from the m landing page elements, for example to obtain text 1, text 2, and text 3, and all the extracted text is spliced to obtain {text 1, text 2, text 3}. A dialogue generation model (ChatGPT) is then used to perform semantic understanding on the spliced text and output the key point of the text, namely a semantic keyword such as "game", which is taken as the semantic information of landing page A.
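The filter-and-splice part of this step can be sketched as follows. The dictionary layout of a detected element is an assumption, the text is assumed to have been recognized already, and `understand` is a hypothetical callable standing in for the dialogue generation model; the stand-in used here is a trivial keyword check, not a real model.

```python
def splice_specified_text(elements, specified_type="TextView"):
    """Collect recognized text from elements of the specified type, in page order."""
    return [
        element["text"]
        for element in elements
        if element["type"] == specified_type and element.get("text")
    ]


def page_semantic_info(elements, understand, specified_type="TextView"):
    """Splice the specified elements' text and ask `understand` for the keyword."""
    return understand(splice_specified_text(elements, specified_type))


# Hypothetical detection results for a landing page.
DETECTED = [
    {"type": "ImageView"},
    {"type": "TextView", "text": "New game season"},
    {"type": "TextView", "text": "Get rewards"},
    {"type": "Button", "text": "Download"},
]


def toy_understand(texts):
    """Stand-in for a language model: report "game" if any text mentions it."""
    return "game" if any("game" in text.lower() for text in texts) else ""


SPLICED = splice_specified_text(DETECTED)
SEMANTIC_INFO = page_semantic_info(DETECTED, toy_understand)
```

Only the "TextView" elements contribute text, so the button label does not influence the page-level keyword.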
240. Determine, from the page elements to be processed, a target page element to be processed associated with the semantic information.
For example, landing page element 1, landing page element 2, landing page element 4, and landing page element 5, which are associated with the semantic information "game", may be detected from the m landing page elements of landing page A as the landing page elements to be reconstructed (i.e., target page elements to be processed).
250. Extract content information from the target page element to be processed.
For example, content information such as text and images may be extracted from landing page elements to be reconstructed of different element types, for instance by cutting with an image segmentation model (SAM). SAM, short for Segment Anything Model, is a computer vision model that aims to separate target objects in an image from the background. Such techniques can be used in many applications, such as image segmentation, object detection, virtual reality, video editing, and image recognition. Segmentation techniques use various algorithms, such as convolutional neural networks, region segmentation, and edge detection, to segment the target object. The development of this technology has enabled computers to better understand images and to identify and locate the different objects in them, making computers more intelligent and useful in a variety of application areas.
260. Convert the content information into a target page element.
For example, reconstruction may be performed based on the content information of the landing page element to be reconstructed, such as: generating an operational special effect; cutting out text through an image segmentation model (SAM), fusing it with a style map, and generating artistic-style text with an intelligent generation tool (Midjourney); summarizing the advertisement information content with a dialogue generation model (ChatGPT) to form prompt words, and generating images by calling the text-to-image function or the image-to-image function of the stable diffusion model (Stable Diffusion) API; and, to generate an avatar animation, training a timbre model with a sound resource acquisition tool (so-vits-svc) to generate the audio of an AI virtual person (i.e., avatar), integrating the information in the landing page with ChatGPT, driving an avatar picture into a virtual person video through a talking-picture tool (SadTalker), and extracting an alpha channel to obtain virtual person action frames.
A virtual person, also called a digital person, is a virtual character simulated on a computer to resemble a real person; the research field of virtual persons covers human appearance, movement, and behavior. Midjourney is an artificial intelligence program, developed by the research laboratory of the same name, that can generate images from text. Stable Diffusion is a deep learning text-to-image generation model. It is mainly used to generate detailed images from text descriptions, although it can also be applied to other tasks such as inpainting, outpainting, and image-to-image translation guided by prompt words. SadTalker is an AI tool that makes a picture speak; it can generate a virtual person video from input text, a timbre resource, and an avatar picture. so-vits-svc, i.e., AI timbre cloning, is an audio-to-audio timbre conversion algorithm that supports both normal speech and singing voice conversion. ChatGPT, in full Chat Generative Pre-trained Transformer, is an artificial intelligence chatbot program developed by OpenAI. The program uses large language models based on the GPT-3.5 and GPT-4 architectures and is trained with reinforcement learning.
Specifically, by defining and classifying the interface positions (page areas) of the landing page elements to be reconstructed on landing page A, the path along which the special effect moves (i.e., the interaction path) can be planned for the target content information matching the specified page area, and special effect fusion is finally performed to output the picture (i.e., the first page element).
Text information can be obtained by cutting from an image through an image segmentation model (SAM), and the text information is fused with a preset style sheet by using an intelligent generation tool (Midjourney) to generate special effect words (namely, second page elements).
All text information in the landing page can be summarized through a dialogue generation model (ChatGPT) to obtain the total text information and to generate a prompt (i.e., semantic guidance information) for the stable diffusion model (Stable Diffusion). A picture is then reconstructed through Stable Diffusion in text-to-image mode to obtain a third page element containing the target multimedia content. In addition, using a landing page element in image form among the landing page elements to be reconstructed, or the target multimedia content, more pictures in different styles can be generated in image-to-image mode as auxiliary options.
The sound resource can be obtained by training a timbre model with a sound resource acquisition tool (so-vits-svc). The total text information summarized by the dialogue generation model (ChatGPT) is then obtained, and, according to the avatar picture selected by the user, the voice, text, and avatar picture are fused through SadTalker to generate a virtual person video. An alpha channel is then extracted to generate an AI virtual person (i.e., avatar animation) with a transparent background, which can be embedded into the advertisement.
3D commodity reconstruction and spatial visual simplification can be performed, through reconstruction with the three.js 3D engine, on target content information such as commodity pictures, or on the landing page elements to be reconstructed that correspond to spatial pictures.
270. Generate the processed multimedia page from the target page element.
For example, a new landing page (i.e., a specified page) with the same or similar page target as landing page A may be created, and all the reconstructed landing page elements may be embedded into the new landing page to obtain landing page B (i.e., the processed multimedia page). One landing page may include multiple pages; for example, a landing page may include load pages, curtain bounce pages, unlock character pages, and character video pages.
For any landing page element to be reconstructed, the derivative function can reconstruct from it a plurality of reconstructed landing page elements in different styles. All the reconstructed landing page elements of each style can be embedded into a new landing page to obtain the landing page of that style. Thus, if there are N styles, N landing pages B in different styles can be obtained and returned to the management end of the landing page advertisement tool system for display.
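Grouping reconstructed elements by style, one landing page per style, can be sketched as below. The element records and style names are made up for the example; embedding the grouped elements into an actual page is not shown.

```python
from collections import defaultdict


def pages_by_style(reconstructed_elements):
    """Group reconstructed element ids by style: one landing page per style."""
    pages = defaultdict(list)
    for element in reconstructed_elements:
        pages[element["style"]].append(element["element_id"])
    return dict(pages)


# Hypothetical output of the derivative function: each source element
# has been reconstructed once per style.
RECONSTRUCTED = [
    {"element_id": "title", "style": "anime"},
    {"element_id": "title", "style": "realistic"},
    {"element_id": "hero_image", "style": "anime"},
    {"element_id": "hero_image", "style": "realistic"},
]
LANDING_PAGES_B = pages_by_style(RECONSTRUCTED)
```

With N distinct styles in the input, the result has N entries, matching the N landing pages B described above.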
280. Modifying the processed multimedia page to obtain a modified multimedia page.
For example, an advertiser can select a satisfactory one from the N landing pages B displayed by the landing page advertisement tool system and perform secondary AI-assisted modification, such as modifying text, pictures, virtual persons (i.e., avatars), and 3D resources. The system can regenerate multiple candidates for selection, and the effect can be viewed in real time.
For example, the landing page advertisement tool system may include a landing page modification page as shown in FIG. 2c. In this page, an advertiser may select the loading page of the generated landing page for modification, may modify the logo, the virtual person, and the virtual person actions displayed by the loading page, and may upload new sound resources.
For example, the landing page advertisement tool system may include an image modification page as shown in FIG. 2d, in which an advertiser may change the image generation model, edit the prompt, and modify the style of the generated image; after the advertiser clicks the "generate" control in the figure, the modified image is generated. The image modification page may also provide the functionality of a holy.
290. Sending the modified multimedia page to the content delivery system so that the client acquires the modified multimedia page from the content delivery system.
For example, after any landing page is modified, the advertiser may click the "submit landing page" control in the landing page modification page shown in FIG. 2c. The modified landing page (i.e., the modified multimedia page) is then packed into a compressed package in H5 format and uploaded to the content delivery system, which provides a corresponding online link. The client application can obtain the compressed package from the content delivery system through the online link in order to display the landing page, or the compressed package can be delivered directly to the client application for display.
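The packaging step can be sketched with Python's standard `zipfile` module, assuming (for illustration only) that the compressed package is an ordinary ZIP archive of HTML, script, and asset files. The function name `package_landing_page` and the file names are hypothetical, and the upload to the content delivery system is omitted.

```python
import io
import zipfile


def package_landing_page(files: dict) -> bytes:
    """Pack landing page files (name -> str or bytes) into an in-memory ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Sort the names so the archive layout is deterministic.
        for name, data in sorted(files.items()):
            zf.writestr(name, data)
    return buf.getvalue()


package = package_landing_page({
    "index.html": "<html><body>Landing page B</body></html>",
    "js/app.js": b"console.log('loaded');",
})
print(len(package) > 0)
```

The returned bytes are what would be uploaded; the client side would download the package through the online link and unpack it before rendering.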
For example, when the landing page is displayed in the client application, the landing page shown in FIG. 2e displays a virtual person animation whose audio is the target audio, such as a broadcast in a male or female voice or a played-back recorded dubbing. At key nodes the virtual person can perform corresponding actions together with spoken narration, for example introducing the advertised commodity with gestures while narrating, and the commodity can be displayed in 3D. The text and pictures in the landing page are given corresponding stylized special effects to obtain special-effect text and stylized pictures.
As can be seen from the above, according to the embodiment of the application, for the different types of landing pages configured by an advertiser, the corresponding derivative functions can be triggered after detection by the landing page model. Landing page detection, classification, and page reconstruction are thereby provided, labor and material costs are reduced, and landing pages can be produced quickly and in volume. Meanwhile, vision, 3D reconstruction, text special effects, operation special effects, and AI virtual person technologies are introduced, which makes the landing page advertisement more interesting, increases realism and immersion, and improves user retention and the conversion rate.
In order to better implement the method, the embodiment of the application also provides a device for generating the multimedia page, which can be integrated in electronic equipment, wherein the electronic equipment can be a terminal, a server and other equipment. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices; the server may be a single server or a server cluster composed of a plurality of servers.
For example, in this embodiment, a method according to an embodiment of the present application will be described in detail by taking a specific integration of a device for generating a multimedia page in a server as an example.
For example, as shown in fig. 3, the generating device of the multimedia page may include a detecting unit 310, an identifying unit 320, a determining unit 330, a converting unit 340, and a generating unit 350, as follows:
(1) Detection unit 310
Configured to detect the to-be-processed page elements in the to-be-processed multimedia page and the element types of the to-be-processed page elements.
In some embodiments, the detection unit 310 may specifically be configured to: extracting features of a multimedia page to be processed to obtain a page feature map; setting a plurality of prediction frames for the page feature map; carrying out regression processing on the prediction frame to adjust the center point of the prediction frame according to the regression processing result so as to obtain an adjusted prediction frame; and determining the page element corresponding to the adjusted prediction frame as the page element to be processed.
In some embodiments, the detection unit 310 may specifically be configured to: obtaining object characteristics of a predicted object in the adjusted prediction frame from the page characteristic diagram; classifying the object characteristics to obtain object types; the object type is taken as the element type of the page element to be processed.
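The two detection steps above (regressing a prediction frame and classifying the object inside it) can be sketched as follows. The offset parameterization shown is the one commonly used in anchor-based detectors and is an assumption here, since the embodiment only states that the center point is adjusted according to the regression result. The function names, labels, and scores are hypothetical.

```python
import math


def adjust_box(box, offsets):
    """Apply regression offsets to a prediction frame given as (cx, cy, w, h).

    Assumed parameterization: the center moves by a fraction of the frame
    size, and the width and height are scaled exponentially.
    """
    cx, cy, w, h = box
    dx, dy, dw, dh = offsets
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def classify(scores, labels):
    """Return the most likely label and its softmax probability."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


adjusted = adjust_box((50.0, 50.0, 20.0, 10.0), (0.5, -1.0, 0.0, 0.0))
label, prob = classify([0.1, 2.0, -1.0], ["text", "image", "button"])
print(adjusted, label)
```

The adjusted frame identifies the page element to be processed, and the winning label is taken as its element type.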
(2) Identification unit 320
Configured to perform semantic recognition on a specified page element so as to determine semantic information of the to-be-processed multimedia page according to the semantic recognition result, wherein the specified page element is a to-be-processed page element whose element type is the specified element type.
In some embodiments, the specified element type includes text, and the identifying unit 320 may be specifically configured to: extracting the specified page element from the multimedia page to be processed; performing text recognition processing on the specified page element to obtain initial text information; and carrying out semantic understanding on the initial text information to obtain semantic keywords so as to determine semantic information of the multimedia page to be processed according to the semantic keywords.
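As a simplified stand-in for the semantic understanding step, the sketch below derives semantic keywords from already recognized text by word frequency. The text recognition itself is assumed to have been done; the stop-word list, the function name `semantic_keywords`, and the frequency heuristic are assumptions for illustration and are not the model used by the embodiment.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not from the source).
STOPWORDS = {"the", "a", "an", "and", "of", "for", "to", "in", "on", "with"}


def semantic_keywords(texts, top_k=3):
    """Return the top_k most frequent non-stop-words across the texts."""
    words = []
    for text in texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        words.extend(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in Counter(words).most_common(top_k)]


initial_text = ["Running shoes sale", "Shoes for running", "shoes"]
print(semantic_keywords(initial_text, top_k=2))
```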
(3) Determination unit 330
Configured to determine, from the to-be-processed page elements, a target to-be-processed page element associated with the semantic information.
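One simple way to decide whether a page element is associated with the semantic information is keyword overlap, sketched below. The element representation (a dictionary with a `tags` list) and the function name `select_target_elements` are hypothetical; the embodiment does not limit how the association is computed.

```python
def select_target_elements(elements, keywords):
    """Keep the elements whose tags share at least one semantic keyword."""
    wanted = {k.lower() for k in keywords}
    return [
        e for e in elements
        if wanted & {t.lower() for t in e.get("tags", [])}
    ]


page_elements = [
    {"id": "hero", "tags": ["Shoes", "photo"]},
    {"id": "footer", "tags": ["copyright"]},
    {"id": "price", "tags": ["sale", "price"]},
]
targets = select_target_elements(page_elements, ["shoes", "sale"])
print([e["id"] for e in targets])
```

Elements that share no keyword with the page semantics, such as the footer here, are dropped, which corresponds to the reduction of redundant information described in this embodiment.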
(4) Conversion unit 340
Configured to convert the content information in the target to-be-processed page element into a target page element.
In some embodiments, the target page element includes a first page element, the multimedia page to be processed includes a plurality of page areas, and the converting unit 340 may specifically be configured to: determining a target page area corresponding to content information in a target page element to be processed from the plurality of page areas; determining content information corresponding to a target page area matched with the designated page area as target content information; and carrying out enhancement processing on the target content information to obtain the first page element.
In some embodiments, the enhancing the target content information to obtain a first page element includes: extracting content semantic information from the target content information; determining target content information corresponding to the content semantic information with the association relation as associated content information; and carrying out enhancement processing on the associated content information through a specified enhancement processing method corresponding to the associated content information to obtain the first page element.
In some embodiments, performing enhancement processing on the associated content information by a specified enhancement processing method corresponding to the associated content information, to obtain the first page element, includes: determining an interaction path according to the target page areas corresponding to the plurality of associated content information; and generating a first page element corresponding to the associated content information according to the interaction path, wherein the first page element comprises the associated content information interacted along the interaction path.
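A minimal sketch of deriving an interaction path from page areas is given below, under the assumption (not stated in the source) that the path simply visits the areas in reading order, top to bottom and then left to right by area center. The area format `(x, y, w, h)` and the function name `interaction_path` are hypothetical.

```python
def interaction_path(areas):
    """Order area names by the center of each area, top-to-bottom then left-to-right.

    `areas` maps an area name to its (x, y, w, h) rectangle on the page.
    """
    def center_key(item):
        _, (x, y, w, h) = item
        return (y + h / 2, x + w / 2)

    return [name for name, _ in sorted(areas.items(), key=center_key)]


areas = {
    "price": (200, 300, 80, 40),
    "title": (20, 10, 300, 60),
    "photo": (20, 300, 160, 160),
}
print(interaction_path(areas))
```

The associated content information would then be animated or linked along the returned sequence of areas.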
In some embodiments, the content information in the target page element to be processed includes text information, the target page element includes a second page element, and the converting unit 340 may specifically be configured to: acquiring preset display style parameters; and displaying the text information according to the preset display style parameters to obtain a second page element.
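Displaying text according to preset display style parameters can be illustrated by rendering the text as an HTML fragment with inline CSS. The parameter names, the choice of a `span` element, and the function name `render_styled_text` are assumptions for this example.

```python
import html


def render_styled_text(text, style):
    """Render text as a <span> whose inline CSS comes from the style parameters."""
    css = "; ".join(f"{prop}: {value}" for prop, value in style.items())
    # Escape both the attribute value and the text so the fragment stays valid HTML.
    return f'<span style="{html.escape(css, quote=True)}">{html.escape(text)}</span>'


preset_style = {"color": "red", "font-size": "24px"}
print(render_styled_text("50% off <today>", preset_style))
```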
In some embodiments, the content information in the target page element to be processed includes text information, the target page element includes a third page element, and the converting unit 340 may specifically be configured to: combining all the text information to obtain total text information; and generating target multimedia content corresponding to the total text information to obtain a third page element.
In some embodiments, generating the target multimedia content corresponding to the total text information to obtain the third page element includes: extracting text key information from the total text information; and taking the text key information as semantic guidance information, and generating target multimedia content corresponding to the semantic guidance information to obtain a third page element.
In some embodiments, the target multimedia content includes target audio, and generating target multimedia content corresponding to the total text information to obtain a third page element includes: acquiring sound resources; and generating target audio corresponding to the total text information according to the sound resource to obtain a third page element.
(5) Generation unit 350
Configured to generate the processed multimedia page from the target page element.
In specific implementations, each of the above units may be implemented as an independent entity, or the units may be combined arbitrarily and implemented as one or several entities. For the specific implementation of each unit, refer to the foregoing method embodiments, which are not repeated here.
As can be seen from the above, the generating device of the multimedia page of the present embodiment includes a detecting unit, an identifying unit, a determining unit, a converting unit, and a generating unit. The detection unit is used for detecting the page elements to be processed in the multimedia page to be processed and the element types of the page elements to be processed; the identification unit is used for carrying out semantic recognition on the specified page element so as to determine semantic information of the multimedia page to be processed according to a semantic recognition result, wherein the specified page element is the page element to be processed with the specified element type; the determining unit is used for determining a target page element to be processed associated with the semantic information from the page elements to be processed; the conversion unit is used for converting the content information in the target page element to be processed into a target page element; and the generating unit is used for generating the processed multimedia page from the target page element.
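How the five units cooperate can be sketched as a small pipeline in which each unit is a pluggable callable. The class name `PagePipeline`, the dictionary-based page representation, and the toy unit implementations are all hypothetical and serve only to show the data flow between units.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PagePipeline:
    detect: Callable      # page -> list of elements (each with a type)
    identify: Callable    # elements -> semantic information
    determine: Callable   # (elements, semantics) -> target elements
    convert: Callable     # target element -> target page element
    generate: Callable    # target page elements -> processed page

    def run(self, page):
        elements = self.detect(page)
        semantics = self.identify(elements)
        targets = self.determine(elements, semantics)
        converted = [self.convert(e) for e in targets]
        return self.generate(converted)


# Toy unit implementations, used only to exercise the data flow.
pipeline = PagePipeline(
    detect=lambda page: page["elements"],
    identify=lambda els: {
        w for e in els if e["type"] == "text" for w in e["content"].lower().split()
    },
    determine=lambda els, sem: [
        e for e in els if sem & set(e["content"].lower().split())
    ],
    convert=lambda e: {**e, "content": e["content"].upper()},
    generate=lambda els: {"elements": els},
)

page = {"elements": [
    {"type": "text", "content": "shoes sale"},
    {"type": "image", "content": "shoes photo"},
    {"type": "image", "content": "footer logo"},
]}
result = pipeline.run(page)
print([e["content"] for e in result["elements"]])
```

Because each unit is passed in as a callable, the units can be implemented as independent entities or combined, in line with the statement above that they may be implemented separately or in any combination.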
Therefore, the embodiment of the application detects the to-be-processed page elements of the to-be-processed multimedia page and converts them into new target page elements, so that a new processed multimedia page can be generated automatically and quickly from the original to-be-processed multimedia page, which simplifies the production process of multimedia pages and improves production efficiency. In addition, by detecting the element types of the to-be-processed page elements, the semantic information of the to-be-processed multimedia page is determined from the page elements of the specified element type, and the content information associated with that semantic information is obtained and used to generate the target page elements. This reduces redundant information, strengthens the association between the generated target page elements and the to-be-processed multimedia page, and improves the semantic accuracy of the generated processed multimedia page.
The embodiment of the application also provides electronic equipment which can be a terminal, a server and other equipment. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and the like; the server may be a single server, a server cluster composed of a plurality of servers, or the like.
In some embodiments, the generating device of the multimedia page may be integrated in a plurality of electronic devices, for example, the generating device of the multimedia page may be integrated in a plurality of servers, and the generating method of the multimedia page is implemented by the plurality of servers.
In this embodiment, a detailed description is given taking the case in which the electronic device is a server as an example. FIG. 4 shows a schematic structural diagram of the server according to the embodiment of the present application. Specifically:
The server may include a processor 410 with one or more processing cores, a memory 420 comprising one or more computer-readable storage media, a power supply 430, an input module 440, a communication module 450, and other components. Those skilled in the art will appreciate that the server structure shown in FIG. 4 does not limit the server, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:
the processor 410 is a control center of the server, connects various parts of the entire server using various interfaces and lines, performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 420, and calling data stored in the memory 420. In some embodiments, processor 410 may include one or more processing cores; in some embodiments, processor 410 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 410.
The memory 420 may be used to store software programs and modules, and the processor 410 may perform various functional applications and data processing by executing the software programs and modules stored in the memory 420. The memory 420 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the server, etc. In addition, memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, memory 420 may also include a memory controller to provide processor 410 with access to memory 420.
The server also includes a power supply 430 that provides power to the various components, and in some embodiments, the power supply 430 may be logically connected to the processor 410 via a power management system, such that charge, discharge, and power consumption management functions are performed by the power management system. Power supply 430 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The server may also include an input module 440, which input module 440 may be used to receive entered numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The server may also include a communication module 450, and in some embodiments the communication module 450 may include a wireless module, through which the server may wirelessly transmit over short distances, thereby providing wireless broadband internet access to the user. For example, the communication module 450 may be used to assist a user in e-mail, browsing web pages, accessing streaming media, and the like.
Although not shown, the server may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 410 in the server loads executable files corresponding to the processes of one or more application programs into the memory 420 according to the following instructions, and the processor 410 executes the application programs stored in the memory 420, so as to implement various functions as follows:
detecting an element type of a to-be-processed page element in the to-be-processed multimedia page; carrying out semantic recognition on the appointed page element to determine semantic information of the multimedia page to be processed according to a semantic recognition result, wherein the appointed page element is the page element to be processed with the element type being the appointed element type; determining a target page element to be processed associated with the semantic information from the page elements to be processed; converting content information in the target page element to be processed into a target page element; and generating the processed multimedia page by the target page element.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
As can be seen from the above, the embodiment of the application detects the to-be-processed page elements of the to-be-processed multimedia page and converts them into new target page elements, so that a new processed multimedia page can be generated automatically and quickly from the original to-be-processed multimedia page, which simplifies the production process of multimedia pages and improves production efficiency. In addition, by detecting the element types of the to-be-processed page elements, the semantic information of the to-be-processed multimedia page is determined from the page elements of the specified element type, and the content information associated with that semantic information is obtained and used to generate the target page elements. This reduces redundant information, strengthens the association between the generated target page elements and the to-be-processed multimedia page, and improves the semantic accuracy of the generated processed multimedia page.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the methods for generating a multimedia page provided by the embodiments of the present application. For example, the instructions may perform the steps of:
detecting an element type of a to-be-processed page element in the to-be-processed multimedia page; carrying out semantic recognition on the appointed page element to determine semantic information of the multimedia page to be processed according to a semantic recognition result, wherein the appointed page element is the page element to be processed with the element type being the appointed element type; determining a target page element to be processed associated with the semantic information from the page elements to be processed; converting content information in the target page element to be processed into a target page element; and generating the processed multimedia page by the target page element.
The storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer programs/instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer program/instructions from the computer-readable storage medium, and the processor executes the computer program/instructions to cause the electronic device to perform the methods provided in the various alternative implementations provided in the above-described embodiments.
Because the instructions stored in the storage medium can execute the steps in any of the methods for generating a multimedia page provided by the embodiments of the present application, they can achieve the beneficial effects of any of those methods; these are detailed in the previous embodiments and are not repeated here.
The foregoing has described in detail the methods, apparatuses, devices, media and program products for generating multimedia pages provided by the embodiments of the present application, and specific examples have been applied to illustrate the principles and embodiments of the present application, where the foregoing examples are provided to assist in understanding the methods and core ideas of the present application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in light of the ideas of the present application, the present description should not be construed as limiting the present application.

Claims (15)

1. A method for generating a multimedia page, comprising:
detecting a to-be-processed page element in a to-be-processed multimedia page, and an element type of the to-be-processed page element;
carrying out semantic recognition on specified page elements to determine semantic information of the multimedia page to be processed according to a semantic recognition result, wherein the specified page elements are the page elements to be processed with the element types being specified element types;
determining a target page element to be processed associated with the semantic information from the page elements to be processed;
converting the content information in the target page element to be processed into a target page element;
and generating the processed multimedia page by the target page element.
2. The method for generating a multimedia page according to claim 1, wherein the page element to be processed is detected by:
extracting the characteristics of the multimedia page to be processed to obtain a page characteristic diagram;
setting a plurality of prediction frames for the page feature map;
carrying out regression processing on the prediction frame to adjust the center point of the prediction frame according to the regression processing result so as to obtain an adjusted prediction frame;
and determining the page element corresponding to the adjusted prediction frame as a page element to be processed.
3. The method for generating a multimedia page according to claim 2, wherein the element type of the page element to be processed is detected by:
acquiring object features of a predicted object in the adjusted prediction frame from the page feature map;
classifying the object features to obtain object types;
and taking the object type as the element type of the page element to be processed.
4. The method for generating a multimedia page according to claim 1, wherein the specified element type includes text, the performing semantic recognition on the specified page element to determine semantic information of the multimedia page to be processed according to a semantic recognition result includes:
extracting the specified page element from the multimedia page to be processed;
performing text recognition processing on the specified page element to obtain initial text information;
and carrying out semantic understanding on the initial text information to obtain semantic keywords so as to determine semantic information of the multimedia page to be processed according to the semantic keywords.
5. The method for generating a multimedia page according to claim 1, wherein the target page element includes a first page element, the multimedia page to be processed includes a plurality of page areas, and the converting the content information in the target page element to the target page element includes:
determining a target page area corresponding to the content information in the target page element to be processed from the plurality of page areas;
determining content information corresponding to the target page area matched with the designated page area as target content information;
and carrying out enhancement processing on the target content information to obtain the first page element.
6. The method for generating a multimedia page according to claim 5, wherein the enhancing the target content information to obtain the first page element comprises:
extracting content semantic information from the target content information;
determining the target content information corresponding to the content semantic information with the association relation as associated content information;
and carrying out enhancement processing on the associated content information through a specified enhancement processing method corresponding to the associated content information to obtain a first page element.
7. The method for generating a multimedia page according to claim 6, wherein the enhancing the associated content information by the specified enhancement processing method corresponding to the associated content information to obtain a first page element includes:
determining an interaction path according to the target page areas corresponding to the plurality of associated content information;
and generating a first page element corresponding to the associated content information according to the interaction path, wherein the first page element comprises the associated content information interacted along the interaction path.
8. The method for generating a multimedia page according to claim 1, wherein the content information in the target page element to be processed includes text information, the target page element includes a second page element, and the converting the content information in the target page element to be processed into the target page element includes:
acquiring preset display style parameters;
and displaying the text information according to the preset display style parameters to obtain the second page element.
9. The method for generating a multimedia page according to claim 1, wherein the content information in the target page element to be processed includes text information, the target page element includes a third page element, and the converting the content information in the target page element to be processed into the target page element includes:
combining all the text information to obtain total text information;
and generating target multimedia content corresponding to the total text information to obtain the third page element.
10. The method for generating a multimedia page according to claim 9, wherein said generating the target multimedia content corresponding to the total text information to obtain the third page element includes:
extracting text key information from the total text information;
and taking the text key information as semantic guidance information, and generating target multimedia content corresponding to the semantic guidance information to obtain the third page element.
11. The method for generating a multimedia page according to claim 9, wherein the target multimedia content includes target audio, and the generating the target multimedia content corresponding to the total text information to obtain the third page element includes:
acquiring sound resources;
and generating target audio corresponding to the total text information according to the sound resource to obtain the third page element.
12. A multimedia page generation apparatus, comprising:
the detection unit is used for detecting to-be-processed page elements in the to-be-processed multimedia page and element types of the to-be-processed page elements;
the identification unit is used for carrying out semantic recognition on the specified page element so as to determine semantic information of the multimedia page to be processed according to a semantic recognition result, wherein the specified page element is the page element to be processed with the element type being the specified element type;
the determining unit is used for determining a target page element to be processed associated with the semantic information from the page elements to be processed;
the conversion unit is used for converting the content information in the target page element to be processed into a target page element;
and the generating unit is used for generating the processed multimedia page by the target page element.
13. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps in the method of generating a multimedia page as claimed in any one of claims 1 to 11.
14. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps in the method of generating a multimedia page according to any one of claims 1 to 11.
15. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps in the method of generating a multimedia page as claimed in any one of claims 1 to 11.
CN202310980755.1A 2023-08-04 2023-08-04 Method, device, equipment, medium and program product for generating multimedia page Pending CN117011875A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310980755.1A CN117011875A (en) 2023-08-04 2023-08-04 Method, device, equipment, medium and program product for generating multimedia page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310980755.1A CN117011875A (en) 2023-08-04 2023-08-04 Method, device, equipment, medium and program product for generating multimedia page

Publications (1)

Publication Number Publication Date
CN117011875A true CN117011875A (en) 2023-11-07

Family

ID=88570676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310980755.1A Pending CN117011875A (en) 2023-08-04 2023-08-04 Method, device, equipment, medium and program product for generating multimedia page

Country Status (1)

Country Link
CN (1) CN117011875A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117708337A (en) * 2024-02-05 2024-03-15 杭州杰竞科技有限公司 Man-machine interaction method and system oriented to complex localization
CN117708337B (en) * 2024-02-05 2024-04-26 杭州杰竞科技有限公司 Man-machine interaction method and system oriented to complex localization

Similar Documents

Publication Publication Date Title
Uppal et al. Multimodal research in vision and language: A review of current and emerging trends
Zhan et al. Multimodal image synthesis and editing: A survey and taxonomy
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
US20210209289A1 (en) Method and apparatus for generating customized content based on user intent
CN112257661A (en) Identification method, device and equipment of vulgar image and computer readable storage medium
CN114390217A (en) Video synthesis method and device, computer equipment and storage medium
CN114969282B (en) Intelligent interaction method based on rich media knowledge graph multi-modal emotion analysis model
Wu et al. Inferring emotional tags from social images with user demographics
CN113239961A (en) Method for generating sequence images based on text for generating confrontation network
Khurram et al. Dense-captionnet: a sentence generation architecture for fine-grained description of image semantics
He et al. Deep learning in natural language generation from images
Ge et al. Exploring local detail perception for scene sketch semantic segmentation
Wu et al. Sentimental visual captioning using multimodal transformer
Mei et al. Vision and language: from visual perception to content creation
Meo et al. Aesop: A visual storytelling platform for conversational ai and common sense grounding
CN116611496A (en) Text-to-image generation model optimization method, device, equipment and storage medium
Huang et al. Recent advances in artificial intelligence for video production system
Huang et al. A Survey for Graphic Design Intelligence
CN111062207B (en) Expression image processing method and device, computer storage medium and electronic equipment
KR102446305B1 (en) Method and apparatus for sentiment analysis service including highlighting function
CN117011875A (en) Method, device, equipment, medium and program product for generating multimedia page
CN114529635A (en) Image generation method, device, storage medium and equipment
Javaid et al. Manual and non-manual sign language recognition framework using hybrid deep learning techniques
Newnham Machine Learning with Core ML: An iOS developer's guide to implementing machine learning in mobile apps

Legal Events

Date Code Title Description
PB01 Publication