CN120086355B - Model quantitative reasoning acceleration method, device, equipment and medium

Model quantitative reasoning acceleration method, device, equipment and medium

Info

Publication number
CN120086355B
CN120086355B (application CN202510525474.6A)
Authority
CN
China
Prior art keywords
precision
processing
token
configuration
processing block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510525474.6A
Other languages
Chinese (zh)
Other versions
CN120086355A (en)
Inventor
瞿晓阳
王健宗
陶伟
卢昊骋
Current Assignee
Shenzhen Ping An Communication Technology Co Ltd
Original Assignee
Shenzhen Ping An Communication Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Ping An Communication Technology Co Ltd filed Critical Shenzhen Ping An Communication Technology Co Ltd
Priority to CN202510525474.6A priority Critical patent/CN120086355B/en
Publication of CN120086355A publication Critical patent/CN120086355A/en
Application granted granted Critical
Publication of CN120086355B publication Critical patent/CN120086355B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the field of artificial intelligence and can be applied in business scenarios such as healthcare and financial technology. Disclosed are a model quantization inference acceleration method, device, equipment, and medium, comprising: dividing the input text into multiple processing blocks; scoring the importance of each processing block other than the first; allocating a computation precision format according to the scoring results and determining a unified quantization configuration for each processing block; dividing the network modules into configuration-sharing groups, within which the quantization configuration of the corresponding processing block is shared; and performing block-level quantized inference according to the unified quantization configurations to generate the model's inference result. By uniformly determining each processing block's quantization configuration from token importance scores and reusing that configuration within each network-module group, the invention achieves block-level precision allocation and parallel quantized inference, greatly reducing GPU memory overhead and configuration-time overhead while preserving inference accuracy, and effectively improving execution efficiency and GPU memory utilization in long-text inference tasks.

Description

Model quantitative reasoning acceleration method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a model quantization inference acceleration method, device, equipment, and storage medium.
Background
In recent years, large language models (LLMs) have performed excellently in natural language processing tasks and are widely applied in scenarios such as dialogue systems, machine translation, and question-answering systems. However, as model scale continues to expand, they face significant memory-footprint issues when performing long-text inference tasks. Most mainstream LLMs have billions or even hundreds of billions of parameters, and the model and its intermediate states must be fully loaded into GPU memory during inference, placing extremely high demands on hardware resources.
When applied to long-text inference tasks, such as long-document generation, complex document summarization, or multi-turn dialogue processing, input sequences often contain thousands to tens of thousands of tokens, and such ultra-long sequences significantly increase the computational complexity and memory requirements of the attention mechanism. Specifically, the size of the attention matrix grows with the square of the input sequence length, and a large number of intermediate activation values must also be temporarily stored in GPU memory, which can easily cause memory overflow or inference failure. Existing memory-optimization techniques, such as gradient checkpointing and activation offloading, mainly target the training phase and are difficult to apply directly to inference. Distributed inference can alleviate the resource bottleneck of a single device, but in actual deployment it places high demands on infrastructure, is complex to configure, and is difficult to operate stably in edge or low-to-medium resource environments.
In the healthcare field, LLMs are used in medical-record summary generation, medical question-answering systems, diagnosis and treatment record analysis, and other scenarios. These tasks often involve a large number of medical terms and much contextual information, and rely heavily on the model's reasoning capability over long text. General-purpose quantization compression methods can reduce GPU memory usage, but they often neglect the preservation of certain key information in medical text, reducing model accuracy or even causing misunderstanding of medical information, which affects the reliability of results.
In the financial technology field, large language models are widely used for text-intensive tasks such as contract parsing, financial summary generation, and customer risk assessment. These tasks typically involve structured, text-based compliance reports or historical transaction records that place extremely high demands on memory efficiency during inference. However, current models are prone to failure due to insufficient GPU memory when processing long text input such as financial documents, seriously affecting their stability and scalability.
To relieve GPU memory pressure, quantization has become one of the mainstream compression techniques and is widely adopted in inference scenarios in particular. Prior studies have attempted mixed-precision quantization, i.e., tokens of high importance are assigned a higher bit width while tokens of low importance use a lower bit width, thereby reducing overall GPU memory usage while maintaining model precision. However, mixed-precision quantization suffers from inefficient configuration selection during inference. Because current methods must dynamically analyze the importance of each token and determine a bit-width strategy in every round of inference, the process is highly time-consuming and erodes the speed advantage brought by quantization; the impact is especially pronounced in long-text tasks.
In summary, the prior art has key shortcomings in coping with the memory-overhead problem of long-text inference with large language models, especially in the efficiency of quantization configuration strategies, importance-recognition mechanisms across task domains, and block-level inference resource scheduling, and still needs further optimization to meet the practical requirements of efficient and accurate inference in fields such as financial technology and healthcare.
Disclosure of Invention
The invention mainly aims to provide a model quantization inference acceleration method, device, equipment, and storage medium, to solve the technical problem that in the prior art the quantization bit width must be determined for each token during inference, making quantization configuration inefficient and severely affecting inference speed and GPU memory optimization, especially in long-text inference tasks.
In order to achieve the above object, the present invention provides a model quantitative reasoning acceleration method, comprising:
dividing an input text into a plurality of processing blocks, fixing the processing precision format of a first processing block to a high-precision format, and disabling quantization processing of the first processing block;
generating, for the processing blocks other than the first processing block, a self-attention matrix of each such processing block through a language model, determining the sum of all element values of the column corresponding to each token position in the self-attention matrix, and taking that sum as the importance score of the token position;
allocating token positions with importance scores greater than a first threshold to a high-precision format, token positions with importance scores not less than a second threshold and not greater than the first threshold to a medium-precision format, and token positions with importance scores less than the second threshold to a low-precision format;
counting the number of token positions allocated to the high-precision, medium-precision, and low-precision formats in each processing block, and selecting the precision format with the largest count as the unified quantization configuration of the corresponding processing block;
dividing the network modules of the language model into a plurality of configuration-sharing groups, wherein each configuration-sharing group comprises at least two network modules;
sharing, within each configuration-sharing group, the unified quantization configuration of the processing block corresponding to the first network module with the other network modules in the same group;
and executing block-level batch quantization on all the processing blocks according to the unified quantization configuration corresponding to each processing block, completing model inference, and generating a model inference result.
Further, in order to achieve the above object, the present invention provides a model quantitative reasoning acceleration apparatus, comprising:
an input text preprocessing module, configured to divide an input text into a plurality of processing blocks, fix the processing precision format of a first processing block to a high-precision format, and disable quantization processing of the first processing block;
a self-attention analysis module, configured to generate, for the processing blocks other than the first processing block, a self-attention matrix of each such processing block through a language model, determine the sum of all element values of the column corresponding to each token position in the self-attention matrix, and take that sum as the importance score of the token position;
a precision allocation module, configured to allocate token positions with importance scores greater than a first threshold to a high-precision format, token positions with importance scores not less than a second threshold and not greater than the first threshold to a medium-precision format, and token positions with importance scores less than the second threshold to a low-precision format;
a quantization configuration decision module, configured to count the number of token positions allocated to the high-precision, medium-precision, and low-precision formats in each processing block, and select the precision format with the largest count as the unified quantization configuration of the corresponding processing block;
a network module grouping control module, configured to divide the network modules of the language model into a plurality of configuration-sharing groups, each configuration-sharing group comprising at least two network modules;
a configuration sharing management module, configured to share, within each configuration-sharing group, the unified quantization configuration of the processing block corresponding to the first network module with the other network modules in the same group;
and an inference execution module, configured to execute block-level batch quantization on all the processing blocks according to the unified quantization configuration corresponding to each processing block, complete model inference, and generate a model inference result.
Further, to achieve the above object, the present invention also provides a computer device comprising a memory, a processor, and a model quantization inference acceleration program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps of the model quantization inference acceleration method described above.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium on which a model quantization inference acceleration program is stored, the program, when executed by a processor, implementing the steps of the model quantization inference acceleration method described above.
The invention discloses a model quantization inference acceleration method comprising: dividing an input text into a plurality of processing blocks; fixing the computation precision format of the first processing block to a high-precision format and disabling quantization; generating, through a language model, a self-attention matrix for each processing block other than the first; computing importance scores from the sums of the column values corresponding to each token position in the self-attention matrix; allocating each token position to a high-, medium-, or low-precision format based on two preset thresholds; counting the number of tokens of each precision format in each processing block and selecting the most numerous format as the unified quantization configuration; dividing the network modules into a plurality of configuration-sharing groups and sharing the unified quantization configuration of the processing blocks within each group; and executing block-level quantization according to each processing block's unified quantization configuration to complete model inference and generate inference results. By uniformly determining each processing block's quantization configuration from token importance scores and reusing that configuration within each network-module group, the invention achieves block-level precision allocation and parallel quantized inference, greatly reducing memory overhead and configuration-time overhead while preserving inference accuracy, and effectively improving execution efficiency and memory utilization in long-text inference tasks.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic diagram of an application environment of a model quantitative reasoning acceleration method according to an embodiment of the present invention;
FIG. 2 is a flow chart of an embodiment of a method for model quantitative reasoning acceleration of the present invention;
FIG. 3 is a schematic diagram of a functional module of a preferred embodiment of the model quantitative reasoning acceleration apparatus of the present invention;
FIG. 4 is a schematic diagram of a computer device according to an embodiment of the invention;
Fig. 5 is a schematic diagram of another structure of a computer device according to an embodiment of the invention.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The model quantization inference acceleration method provided by the embodiments of the invention can be applied in an application environment as shown in FIG. 1, in which a user side communicates with a server side through a network. The server side divides input text received from the user side into a plurality of processing blocks, fixes the computation precision format of the first processing block to a high-precision format and disables quantization, generates a self-attention matrix through a language model for each processing block other than the first, computes importance scores from the sums of the column values corresponding to each token position in the self-attention matrix, allocates token positions to high-, medium-, or low-precision formats based on two preset thresholds, counts the number of tokens of each format in each processing block and selects the most numerous format as the unified quantization configuration, divides the network modules into a plurality of configuration-sharing groups and shares the unified quantization configuration of the processing blocks within each group, and executes block-level batch quantization according to each processing block's unified quantization configuration to complete model inference and generate inference results. By uniformly determining each processing block's quantization configuration from token importance scores and reusing that configuration within each network-module group, the invention achieves block-level precision allocation and parallel quantized inference, greatly reducing memory overhead and configuration-time overhead while preserving inference accuracy, and effectively improving execution efficiency and memory utilization in long-text inference tasks.
The user terminal may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented by a stand-alone server or a server cluster formed by a plurality of servers. The present invention will be described in detail with reference to specific examples.
Referring to fig. 2, fig. 2 is a flowchart of an embodiment of a model quantization inference acceleration method provided by the present invention. It should be noted that although a logical order is depicted in the flowchart, in some cases the steps depicted or described may be performed in a different order than presented herein.
As shown in FIG. 2, the model quantitative reasoning acceleration method provided by the invention comprises the following steps:
s10, dividing an input text into a plurality of processing blocks, fixing the processing precision format of a first processing block into a high-precision format, and disabling quantization processing of the first processing block;
In this embodiment, the input text is divided into a plurality of processing blocks in order to control the distribution of computing resources during model inference on long text inputs and to locally manage the model's activation values, attention matrices, and memory usage. In practice, the processing blocks may be generated by a sliding window of fixed length, e.g., one block every 512 tokens, determined by the maximum length of the model's training or inference window. This partitioning is derived from the window-slicing mechanism widely used in sequence modeling; it converts long text into local context blocks, preserving local semantic coherence while avoiding the computational bottleneck of loading all tokens at once.
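A minimal sketch of this fixed-length block division (the 512-token block length is the example value from the text; the function name is an assumption):

```python
def split_into_blocks(token_ids, block_len=512):
    """Split a token sequence into consecutive fixed-length
    processing blocks; the last block may be shorter."""
    return [token_ids[i:i + block_len]
            for i in range(0, len(token_ids), block_len)]
```

For an 1100-token input this yields blocks of 512, 512, and 76 tokens, with the first block later pinned to the high-precision path.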
The processing precision format of the first processing block is fixed to a high-precision format to provide the model with a stable computational starting point in the initial stage of inference. A high-precision format generally refers to a floating-point format, such as FP32 or FP16, which is more robust and fault-tolerant in numerical range and expressive accuracy than low-bit-width fixed-point formats (such as INT8 or INT4).
Quantization of the first processing block is disabled to further ensure that this segment of the input operates under full-precision representation and to avoid cumulative interference of quantization errors on subsequent blocks' propagation paths at the initial stage of model propagation. Disabling quantization means not only applying no quantization to the weights and activation values of the processing block, but also skipping the block's quantization-configuration generation process. The core aim is to provide a reference standard for quantization decisions, so that the first processing block becomes a relative anchor for the importance scoring and precision-configuration judgments of subsequent blocks.
In practice, quantization typically involves discretizing the model's weight parameters and compressing activation values into a limited bit-width representation. If this were still performed on the first processing block, the initial attention results would be distorted, reducing the accuracy of the precision assignment of subsequent token positions. Disabling quantization for the first processing block is therefore not only a numerical-stability consideration but also a basic premise of the dynamic precision-adjustment strategy along the entire inference path.
Because models generally adopt residual connections and normalization mechanisms, a high-precision first block helps stabilize the statistical behavior of the early layers, improving the accuracy of quantization-strategy decisions for subsequent processing blocks. This approach also generalizes well and does not depend on a specific model structure, so it can be applied to large language models of different architectures such as GPT, T5, and BERT.
In a specific implementation, after the input text is tokenized into a token sequence, it is segmented according to the preset processing-block length, each segment forming a processing block. The token range of the first processing block may be the first 512 tokens of the sequence; after division, this block's computation precision format is configured as FP16 or FP32. During model inference, the block's quantization-configuration generation is skipped: it neither participates in importance scoring nor executes bit-width decisions. To disable quantization, the quantization operators deployed in the model are forcibly bypassed for the first processing block when model execution is invoked. This can be accomplished by configuring the inference engine's quantization mask parameters, or by routing the block onto a non-quantized path during model graph transformation. Further, floating-point execution can be maintained throughout, without invoking low-precision compute kernels, as the processing block passes through the model's embedding layer, multi-head attention layers, and feed-forward layers. In an actual deployment, if an inference framework such as TensorRT is used, the head block can be marked as a resident high-precision region when the engine is built, or a head-block precision-locking instruction can be added at compile time. In a multi-block execution flow, the output of the first processing block can also serve as a reference input for the importance analysis and precision-allocation policy generation of subsequent processing blocks.
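The head-block handling described above could be recorded in a per-block configuration table like the following sketch; all field names are assumptions for illustration and do not correspond to any real inference-engine API:

```python
def build_block_configs(num_blocks, unified_formats):
    """Build an illustrative per-block quantization table.
    Block 0 is pinned to a high-precision floating-point path and
    excluded from quantization; unified_formats holds the unified
    quantization configuration chosen for each remaining block."""
    configs = [{"block": 0, "format": "fp16", "quantize": False}]
    for b in range(1, num_blocks):
        configs.append({"block": b,
                        "format": unified_formats[b - 1],
                        "quantize": True})
    return configs
```

A table of this kind makes the "skip quantization for block 0" rule explicit, so the execution engine never generates or applies a quantization configuration for the head block.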
For example, in the healthcare field, when processing text containing a long medical diagnostic record, critical condition descriptions often appear at the beginning of the text. If the first processing block lost information due to quantization, the model might fail to extract the disease symptoms and the corresponding analysis correctly. With fixed high-precision processing of the head block, the model can more accurately capture the high-value content of the opening section, which benefits the subsequent generation of accurate diagnostic summaries or disease-course predictions.
In the financial technology field, when a complete risk report is processed, the report header typically contains the global risk classification and summary information. If it were placed on a low-precision computation path, the model could easily misjudge the risk level, affecting the subsequent decision process. By keeping high-precision computation for the first processing block and disabling its quantization, the key points of the report can be captured effectively, ensuring the inference accuracy and stability of the risk-control model and improving the reliability and safety of financial service decisions.
Fixing the first processing block to a high-precision format and disabling its quantization during inference can significantly improve the stability and accuracy of precision-policy discrimination for subsequent processing blocks. The high-precision first block provides complete contextual representation capability, so its output can serve as a benchmark for weight assignment and quantization-level selection, while avoiding model drift caused by quantization errors propagating from the initial information. In long-text scenarios, this strategy effectively reduces global precision degradation caused by first-stage misjudgment, improving overall inference performance and stability.
S20, generating, for the processing blocks other than the first processing block, a self-attention matrix of each such processing block through a language model, determining the sum of all element values of the column corresponding to each token position in the self-attention matrix, and taking that sum as the importance score of the token position;
In this embodiment, for the processing blocks other than the first, the self-attention mechanism is used to generate a corresponding self-attention matrix, which measures the information dependencies between different token positions within the same processing block. In natural language processing tasks, large language models typically employ the self-attention mechanism of the Transformer architecture to build contextual representations between tokens. By constructing a self-attention matrix for each processing block separately, computation is localized, effectively controlling memory occupation and improving computational efficiency.
The sum of all element values of the column corresponding to each token position serves as its importance score; it is essentially a column-vector sum whose technical meaning is the overall attention a token receives within the processing block. The larger the value, the stronger the links between that token and the others, and the wider the propagation of its semantic information within the block, so it is considered more "important".
In self-attention computation, the column vector corresponding to each token position represents the attention weights assigned to that token by all source tokens when it acts as the target. Summing the column vector therefore yields the token's global degree of attention. An importance score obtained this way does not depend on task-specific labels, so it has good generality and suits the needs of token-selective precision control in tasks such as machine translation, question answering, and information extraction.
To avoid introducing unnecessary noise and redundant computation, padding tokens should be skipped when generating the self-attention matrix, i.e., a mask strategy can be applied to padding-token positions so that their attention values are forced to zero, preventing interference with the importance evaluation of other tokens. This design applies equally under multi-head attention, where the local matrices computed by each attention head can first be averaged and unified into a final self-attention matrix.
In one embodiment, the token sequence of each processing block is fed into a multi-head self-attention module for forward propagation to obtain the attention matrices of the individual attention heads. The final self-attention matrix of the current processing block is generated by weighted or simple arithmetic averaging of these local matrices. Each column of the matrix is then traversed, and all values in the column vector are summed as the importance score of the token position corresponding to that column. In this process, the column vector corresponding to each padding token is replaced by an all-zero vector, so its importance score is constantly zero, ensuring it does not participate in subsequent precision-configuration selection.
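The embodiment above — averaging the per-head matrices, zeroing padding columns, and summing each column — can be sketched as follows (NumPy and the function signature are assumptions for illustration):

```python
import numpy as np

def importance_scores(head_attns, pad_mask):
    """Per-token importance scores for one processing block.

    head_attns: array of shape (heads, seq, seq), one attention
                matrix per head.
    pad_mask:   bool array of shape (seq,), True at padding positions.
    """
    attn = head_attns.mean(axis=0)   # average heads into one matrix
    attn[:, pad_mask] = 0.0          # padding columns become all-zero
    return attn.sum(axis=0)          # column sums = importance scores
```

Padding positions thus always score exactly zero, so they can never be promoted into a higher-precision format during the subsequent configuration selection.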
For scenarios with higher performance requirements, the generation of the self-attention matrix and the column-sum operation can be fused into a single GPU kernel (CUDA kernel), and multiple processing blocks can be batched under the GPU's parallel framework to improve overall throughput. In addition, when the attention matrix is held at lower numerical precision (such as FP16), a normalization or numerical-smoothing mechanism can be introduced so that numerical drift does not cause token scores to be misjudged.
As an illustration, in the medical and health field, doctors' consultation records, patients' self-descriptions, test reports, and the like often contain a large number of tokens, while the words that actually affect subsequent decisions are a very small set of highly specialized, context-dependent medical terms. Taking an electronic medical record as an example, it may include pieces of information such as patient age, underlying diseases, chief complaints, and examination results. When the self-attention matrix of the processing block is generated, a phrase such as "liver dysfunction" shows strong attention links in the matrix with token positions such as "elevated ALT", "jaundice", and "hepatitis B surface antigen positive". By computing the element sum of the token's corresponding column, its semantic importance across the whole fragment can be quantified. These high-scoring tokens are then labeled as important tokens, receiving greater computational precision in subsequent diagnostic classification or automated medical-record summarization tasks. Highly generic tokens such as "patient gender" or "this review", which carry little contextual semantic weight and whose column elements are correspondingly small, naturally receive lower importance scores and are computed at lower precision during subsequent inference, saving compute resources without affecting the expression of the core semantics.
In intelligent customer-service dialogues, user-complaint handling, or transaction-log parsing in the financial field, important contextual information is typically concentrated in tokens containing words such as "risk", "freeze", "fraud", and "delayed arrival", and such words also tend to be highly context-dependent. In the self-attention matrix, the token "fraud" forms strong links with tokens such as "transfer failure", "funds frozen", and "unknown contact"; the elements of its corresponding column vector are significantly higher than those of background words such as "hello", "may I ask", and "thank you", so it is marked as a high-importance position in this step. During subsequent inference by the risk-judgment model, the key tokens are assigned high-precision computation, ensuring that the understanding of sensitive expressions is not distorted by low-bit-width quantization, which is especially critical for user inputs with fuzzy semantic boundaries and irregular phrasing. Through this attention-based token-level precision-allocation strategy, memory footprint and computational load during inference can be effectively compressed without sacrificing risk-control sensitivity or accuracy.
These example scenarios show that in long-text tasks, the degree of semantic coupling between tokens is captured through the self-attention matrix and importance is determined by column-vector summation, achieving the technical goal of token-level precision control together with cross-domain adaptability and a guarantee of high inference efficiency.
By summing the self-attention matrix of each processing block column by column and computing importance scores, the token positions that contribute most to semantic expression can be accurately identified from the strength of each token's contextual links. When precision allocation is subsequently performed on this basis, no external feature extraction or complex rule matching is required, greatly reducing preprocessing cost in the inference stage and unifying the identification of important tokens with precision control.
S30, allocating token positions whose importance score is greater than a first threshold to a high-precision format, token positions whose importance score is at or above a second threshold and not greater than the first threshold to a medium-precision format, and token positions whose importance score is less than the second threshold to a low-precision format;
In this embodiment, using the importance score of each token position to drive the configuration of subsequent computational precision is the key path to implementing a differentiated quantization strategy. The "importance score" here is the numerical representation generated from the self-attention matrix in the previous stage, reflecting the informational influence of each token in its context. The higher the score, the more strongly the token is attended to by other tokens in the current processing block, the stronger its role in semantic propagation, and the lower its tolerance for semantic loss. This value therefore directly determines the computational precision the token requires during inference.
The high-precision format typically corresponds to a floating-point calculation format such as half-precision (FP16) or single-precision (FP32), balancing computational complexity against expressive precision. The medium-precision format can be a medium-bit-width fixed-point quantization such as INT4, which preserves the accuracy of the main semantic expression while saving part of the GPU-memory and computation budget. The low-precision format usually adopts a narrower fixed-point quantization such as INT2, or even an INT1 (binarized) representation, suitable for token positions with high redundancy and strong semantic fault tolerance, significantly compressing the computational load.
Hierarchical control is achieved by setting a first threshold and a second threshold, where the first threshold is greater than the second. The second threshold identifies the least important tokens, the first threshold identifies the most important ones, and tokens in between fall into the medium-precision range. Both thresholds can be set dynamically according to the processing-block length, model scale, and currently available hardware resources, or determined empirically through model training. This dual-threshold mechanism provides both controllable policy flexibility and fine-grained scheduling of computing resources for precision configuration.
The score-versus-threshold comparisons can generally be performed in parallel in a weight-calculation thread during preprocessing; each token position determines its precision level through a single floating-point comparison and is then marked with the corresponding precision tag for subsequent configuration generation and inference allocation. The whole allocation mechanism does not modify the model structure; it acts only at the quantization-strategy layer and therefore has good system compatibility and deployment adaptability.
Hierarchical precision configuration may be achieved with static threshold settings. For example, with the second threshold at 0.2, the first threshold at 0.8, and token importance scores distributed over [0, 1], tokens scoring above 0.8 are assigned FP16, the middle band (scores in [0.2, 0.8], inclusive) uses INT4, and scores below 0.2 receive INT2. Dynamic thresholds may also be computed from the score distribution of each processing block, for example with a quantile method: the top 10% of scores receive high precision, the bottom 30% low precision, and the remainder medium precision. In addition, task-awareness factors can be incorporated, with tokens known to be important in a specific semantic-annotation task marked high-precision by priority, i.e., attention-guided precision enhancement.
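A hedged sketch of the two threshold policies just described; the thresholds, quantile fractions, and format labels are the illustrative values from the text, and the function names are assumptions:

```python
def assign_precision_static(scores, t_low=0.2, t_high=0.8):
    """Static double-threshold assignment over scores in [0, 1]."""
    labels = []
    for s in scores:
        if s > t_high:
            labels.append("FP16")   # high precision
        elif s < t_low:
            labels.append("INT2")   # low precision
        else:
            labels.append("INT4")   # medium precision, band inclusive
    return labels

def assign_precision_quantile(scores, hi_frac=0.10, lo_frac=0.30):
    """Quantile variant: top 10% high precision, bottom 30% low precision."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    labels = ["INT4"] * n
    for rank, i in enumerate(order):
        if rank < max(1, round(n * hi_frac)):
            labels[i] = "FP16"
        elif rank >= n - round(n * lo_frac):
            labels[i] = "INT2"
    return labels
```

The quantile variant adapts per processing block, which matches the dynamic-threshold option in the text.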
In a specific implementation, each token position records a precision tag (for example, a 2-bit identifier: 00 for low precision, 01 for medium precision, 10 for high precision), stored in GPU memory in bitmap form; after reading the identifier, the quantization-configuration module invokes the quantization kernel of the corresponding bit width to complete the low-level inference binding. Multiple tokens can be tagged in batches, and with SIMD (Single Instruction Multiple Data) instruction sets the allocation can run concurrently to improve throughput. In a distributed scenario, precision tagging can be completed locally on each GPU, with the configuration bitmap uploaded to a shared control module to achieve consistent cross-block scheduling.
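A small sketch of the 2-bit bitmap tagging described above; the packing order (four tags per byte, low bits first) is an assumption chosen for illustration:

```python
PREC_BITS = {"low": 0b00, "medium": 0b01, "high": 0b10}

def pack_labels(labels):
    """Pack 2-bit precision tags into a bitmap, four tokens per byte."""
    out = bytearray((len(labels) + 3) // 4)
    for i, lab in enumerate(labels):
        out[i // 4] |= PREC_BITS[lab] << (2 * (i % 4))
    return bytes(out)

def unpack_label(packed, i):
    """Recover the precision class of token i from the bitmap."""
    inv = {v: k for k, v in PREC_BITS.items()}
    return inv[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
```

At one quarter byte per token, the bitmap keeps the per-token tags cheap enough to hold in GPU memory alongside the activations.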
By allocating precision formats according to importance scores at the token level, computing resources can be tilted toward what matters: higher compute is spent on key tokens, while the precision budget for low-weight information is reduced. GPU-memory overhead and inference latency are effectively controlled while the model as a whole maintains high inference accuracy.
S40, counting the number of token positions assigned to the high-precision, medium-precision, and low-precision formats in each processing block, and selecting the precision format with the largest count as the unified quantization configuration of the corresponding processing block;
In this embodiment, after the multi-level precision allocation is completed, the distribution of the different precision formats inside each processing block must be tallied to provide a decision basis for selecting the subsequent unified quantization configuration. Each token position has been assigned a high-, medium-, or low-precision format by the preceding importance-score comparison, and the goal of this step is to traverse the valid token positions of the whole processing block and count the token positions in each of the three formats.
The quantized configuration of each processing block must be uniform to match the scheduling requirements of block-level parallel execution in the hardware computing unit. For example, when a GPU Tensor Core performs INT8 or FP16 computation, each operation requires uniform matrix precision, so the overall computation format of the processing block must be determined in advance. The dominant precision distribution of the current processing block is determined from the precision statistics.
Invalid token positions are excluded from the statistics during traversal, so that padding tokens do not skew the actual distribution. In practice, the mask recorded during the padding stage is used to quickly exclude the padding region, and only the precision tags of valid positions are counted. After counting, the counts of the three precision formats are compared and the format with the largest count is selected as the unified quantization configuration of the processing block.
In the special case where two or three precision formats tie for the maximum count, the format of the higher precision grade is preferred, to ensure the stability and conservatism of model inference accuracy. For example, when the high-precision and medium-precision counts are equal, the high-precision format is selected as the quantization configuration of the processing block. This conservative bias preserves the expressive capacity of key semantic paths as far as possible without increasing the model's computational burden.
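The S40 counting rule with the conservative tie-break can be sketched as follows (function name and label strings are illustrative):

```python
PRIORITY = {"high": 2, "medium": 1, "low": 0}

def block_config(labels, valid_mask):
    """Pick the most frequent precision among valid tokens; on a tie the
    higher precision grade wins, implementing the conservative bias."""
    counts = {"high": 0, "medium": 0, "low": 0}
    for lab, ok in zip(labels, valid_mask):
        if ok:                      # padding positions are excluded
            counts[lab] += 1
    # compare by (count, precision grade) so ties fall to higher precision
    return max(counts, key=lambda p: (counts[p], PRIORITY[p]))
```

Because Python's `max` compares the tuples lexicographically, the count dominates and the precision grade only decides ties, exactly as described above.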
This operation is not only a configuration strategy based on statistical dominant trends, but also a technical mechanism to ensure the consistency of quantization strategies inside the processing block. Through this mechanism, each processing block is assigned a unique unified quantization configuration, thereby invoking the corresponding kernel function and achieving hardware-level block parallel acceleration in subsequent reasoning.
The precision tag of each token position may be represented with a bitmap structure, with each two-bit field denoting the precision class of the corresponding token, e.g., a two-bit binary identifier distinguishing high, medium, and low precision. In the statistics stage, the precision flag of each processing block is extracted with a logical AND operation, invalid token positions are skipped according to the mask, and the three precision marks are counted separately.
In a multi-core system, the statistics can be performed block-parallel, with each thread handling the statistics of one processing block and writing the result into shared memory or a quantization-configuration register. Once the three count values are in shared memory, a comparison determines the maximum, and a precision-configuration identifier is generated according to the priority strategy. If a dynamic precision-adjustment strategy is adopted, the dynamic quantization bit-width setting of the current processing block can be updated synchronously once the statistics are complete.
To further improve the intelligence and adaptability of the processing block's unified quantization configuration, various semantic and historical-context features can be introduced on top of the statistics-based precision-format selection, and the raw counts can be fine-tuned by weighting so as to reflect the language model's sensitivity to local semantic changes and dynamically adapt to the actual compute demands of different semantic regions in the inference task.
First, when processing long natural-language input, token semantic density tends to be unevenly distributed as the contextual semantic field changes. In highly structured documents such as legal contracts and medical test reports, text fragments contain many repeated, high-frequency, information-redundant tokens (terms of art, connectives, unit names, and the like) that appear often on the surface but carry no real informational emphasis. If precision formats were divided purely by the raw attention-based importance scores, the counts of medium- and low-precision tokens in the statistics could shrink, causing processing blocks to be wrongly configured as high-precision and increasing computational redundancy.
To this end, a sliding-window-based mechanism for suppressing the importance of high-frequency tokens may be introduced. During the statistics, the mechanism counts how often each token appears across several adjacent processing blocks; if a token's frequency within the sliding window exceeds a preset threshold (for example, it occurs in 70% of the blocks), its importance score is scaled down by an appropriate factor (for example, multiplied by 0.8), so that it tends to be placed in the medium- or low-precision format and high-frequency redundant tokens do not dominate the overall precision-configuration choice. The sliding window can span a sequence of 3 to 5 processing blocks, with the window length adjusted dynamically according to the total input length to balance response speed against local-feature capture.
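The suppression mechanism above can be sketched with a small stateful helper; the class name is hypothetical, and the window, threshold, and scaling factor default to the example values in the text:

```python
from collections import deque

class HighFreqSuppressor:
    """Scale down scores of tokens seen in at least `thresh` fraction of
    the last `window` processing blocks."""

    def __init__(self, window=4, thresh=0.7, factor=0.8):
        self.window, self.thresh, self.factor = window, thresh, factor
        self.history = deque(maxlen=window)   # token sets of recent blocks

    def adjust(self, tokens, scores):
        """Record the current block, then return frequency-adjusted scores."""
        self.history.append(set(tokens))
        adjusted = []
        for tok, s in zip(tokens, scores):
            freq = sum(tok in blk for blk in self.history) / len(self.history)
            adjusted.append(s * self.factor if freq >= self.thresh else s)
        return adjusted
```

The `deque(maxlen=...)` drops the oldest block automatically, so the window slides without extra bookkeeping.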
Second, a preceding-semantic-block influence factor may be introduced. Adjacent processing blocks often exhibit strong semantic continuity; in financial public-opinion analysis, for example, a "credit card overdue" processing block is typically followed by a semantic block about collection, default, and credit reporting. The precision tendency of the previous processing block should therefore be consulted during the statistics, to keep the current block's precision configuration consistent with the preceding semantics. In one implementation, the proportions of tokens in each precision format in the previous processing block are recorded, and a weighting factor adjusts the current block's precision counts; for example, if the current medium-precision count is 200 and the previous block's high-precision proportion was high, the current high-precision count can be multiplied by a forward adjustment factor of 1.1 to raise its chance of being selected.
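A minimal sketch of this count adjustment, assuming (hypothetically) that the factor fires when the previous block's high-precision share exceeds some gate; the 1.1 boost is the example value from the text, while `ratio_gate` is an assumed tuning parameter:

```python
def adjust_counts(counts, prev_high_ratio, boost=1.1, ratio_gate=0.4):
    """Nudge the current high-precision count upward before the majority
    vote when the previous block was strongly high-precision."""
    out = dict(counts)                 # leave the raw counts untouched
    if prev_high_ratio > ratio_gate:
        out["high"] = out["high"] * boost
    return out
```

The adjusted counts then feed the same largest-count selection as before, so the mechanism stays a soft bias rather than an override.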
Furthermore, for the specific requirements of certain tasks, a context-weight perception model can be introduced: a lightweight forward network module (such as a single FFN layer plus Softmax) dynamically generates the precision-configuration weight vector of the current processing block. Its input features include the raw counts of the three precision classes, the high-frequency-token proportion within the sliding window, the precision distribution of the preceding block, and so on; the weights learned by the forward model adjust the relative weight of each precision class, controlling the final configuration-selection logic more finely. For example, in a medical-health text-generation scenario, the perception model can identify time points of "pathological state transition" or "treatment response change" and tend toward a higher-precision configuration.
Finally, when the counts of the high-, medium-, and low-precision formats are close, a dynamic gating mechanism can be introduced to avoid discontinuities in contextual precision caused by quantization jitter. The gating module judges whether the precision distribution of the current processing block is balanced, based on a precision-distribution entropy (such as Shannon entropy or a Top-k ratio); when no precision is clearly dominant, it forces the medium-precision configuration, softly regulating the system's overall computing resources. This mechanism is especially suited to large-scale model deployment, where it can effectively cap peak power consumption and improve the model's overall inference throughput.
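The Shannon-entropy variant of the gate can be sketched as follows; the entropy threshold is an assumed tuning value (the maximum possible entropy over three classes is log2(3), about 1.585 bits):

```python
import math

def entropy_gate(counts, entropy_thresh=1.5):
    """Force medium precision when the high/medium/low distribution is
    too balanced; otherwise fall back to the largest-count choice."""
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total)
             for c in counts.values() if c > 0)
    if h >= entropy_thresh:
        return "medium"             # no clear dominance: soft regulation
    return max(counts, key=counts.get)
```

A perfectly even split yields entropy log2(3) and triggers the gate, while a strongly skewed block passes through to the ordinary majority choice.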
In the medical and health field, the input of a medical-record summarization task may include diagnostic conclusions, symptom descriptions, treatment plans, and so on. For example, a processing block contains "the patient has intermittent headaches and blurred vision, suspected elevated intracranial pressure". In the preceding steps, tokens such as "headache", "blurred", and "intracranial pressure" are labeled high-precision, tokens such as "patient", "has", and "suspected" medium-precision, and the remaining connectives low-precision. The statistics find the medium-precision token count slightly higher than, though close to, the high-precision count. Under the conservative strategy the system selects high precision as the unified configuration of the processing block, ensuring the inference accuracy of the downstream diagnosis model.
In a financial-business scenario, when analyzing a user's historical transaction behavior to generate a risk prediction, a processing block might contain "the user transferred money three times today, each amount exceeding the limit". In the importance-scoring stage, "transfer", "limit", and "amount" are assigned the high-precision format, with the rest medium or low precision. After the precision statistics, the high-precision token count is dominant, so the system configures the processing block to FP16 precision and calls the floating-point kernel to process it uniformly. This strategy ensures that the system can accurately identify and warn of abnormal transaction patterns.
Counting the tokens in each precision format and selecting the configuration by the largest-share principle effectively standardizes the precision strategy within the processing block, allows the subsequent inference stage to perform parallel quantized computation on a unified block-level configuration, improves system execution efficiency, and simplifies the kernel-scheduling path. Meanwhile, the precision-priority rule ensures that, when configurations conflict, the model's semantic expressive capacity is protected first, maintaining precision stability while saving resources.
S50, dividing the network module of the language model into a plurality of configuration sharing groups, wherein each configuration sharing group at least comprises two network modules;
In this embodiment, the language model is generally composed of a plurality of successively stacked network modules, each containing components such as an embedding layer, a feed-forward network layer (Feed-Forward Network), and a multi-head self-attention layer (Multi-Head Self-Attention). To control memory overhead and improve inference efficiency in long-text input scenarios, these network modules may be divided by processing order into several "configuration sharing groups", with the modules in each group sharing the same precision configuration and quantization parameters to reduce unnecessary repeated computation.
Dividing by processing order means grouping according to the natural front-to-back arrangement of the network modules along the model's execution path, rather than random shuffling or dynamic task-based organization; this sequential division ensures the stability of the precision-configuration propagation logic and simplicity of implementation. Each configuration sharing group contains several adjacent network modules, the number determined by a preset parameter that can be set according to model depth, per-layer computation, available computing resources, and other dimensions. Common preset sizes include 4, 8, and 12, with the specific value determined by experimental tuning. Each configuration sharing group contains at least two network modules.
The core purpose of introducing the configuration sharing group is to form a "block-level inference, group-level reuse" execution structure. By unifying the quantization schemes of all modules in a shared group, the number of quantization configurations that must be managed independently during execution is significantly reduced, as is the memory and scheduling overhead of frequently loading configurations for different modules during inference. Because the groups are fixed-size and divided in order, the grouping aligns structurally with the module count, enabling uniform scheduling and allocation in hardware orchestration.
Note that the configuration sharing group and the processing blocks generated by the earlier input-text chunking are structures in two different dimensions: the former acts inside the model structure to optimize quantization configuration across network modules, while the latter acts on the organization of the input data to control input length and computation partitioning. A binding mechanism connects the two (for example, binding a processing block to the first module in each group).
For example, a large language model containing 96 network modules may be divided into 12 configuration sharing groups, each containing 8 consecutive modules. The division follows module order, e.g., Group_1 contains Modules 1 to 8, Group_2 contains Modules 9 to 16, and so on. In the configuration stage, each shared group is given a unique group identifier, and a mapping between group identifiers and processing blocks is established.
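The ordered partitioning in this example can be sketched directly (the function name and dictionary layout are illustrative):

```python
def partition_modules(num_modules, group_size):
    """Split module indices 1..num_modules into consecutive configuration
    sharing groups, mirroring the 96-module / 12-group example above."""
    if group_size < 2:
        raise ValueError("each sharing group needs at least two modules")
    groups = {}
    for g, start in enumerate(range(1, num_modules + 1, group_size), 1):
        end = min(start + group_size, num_modules + 1)
        groups[f"Group_{g}"] = list(range(start, end))  # Group_1 = 1..8, ...
    return groups
```

The guard enforces the constraint from S50 that every sharing group must contain at least two network modules.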
In the execution stage, when the model-inference task reaches a given configuration sharing group, the system first reads the unified quantization-configuration parameters from the bound processing block and maps them for use by every module in the group. None of the modules in the group needs to perform independent quantization-configuration initialization, saving time and storage overhead.
In very deep models under special scenarios, network modules at different depths differ in semantic-modeling strength, information-flow characteristics, and quantization sensitivity, so dividing the configuration sharing groups into uniform sizes may not achieve the best balance of precision and efficiency. A non-uniform division can then be introduced: flexibly adjusting the number of network modules in each configuration sharing group makes the division strategy fit the model hierarchy and the specific task requirements better.
For example, modules near the input side typically perform low-level semantic coding (position embedding, lexical modeling, and the like); their computational structure is relatively simple and insensitive to precision loss, so multiple adjacent modules can be consolidated into a larger configuration sharing group, improving parameter-reuse efficiency and reducing configuration-switching frequency. Conversely, modules near the output side mainly perform high-level semantic abstraction and decision-information integration and are more sensitive to quantization precision; a uniform sharing strategy there easily degrades accuracy. These modules can be divided into smaller configuration sharing groups, or even configured independently per pair of modules, preserving more quantization flexibility at the output layers to fit the representational demands of high-dimensional output features.
To further improve adaptivity, the non-uniform partitioning strategy may be combined with a neural architecture search (Neural Architecture Search, NAS) mechanism. Specifically, the different partitioning choices (shared-group size, shared-group placement, number of bound blocks, and so on) are treated as part of the search space, and control variables are introduced during pre-training or distillation to jointly optimize performance and efficiency. The search results reveal which depth regions are more precision-sensitive, guiding dynamic adjustment of the sharing-group layout and ultimately forming an adaptive configuration-sharing mechanism that balances compute allocation and precision stability.
In addition, when deploying a deep Transformer structure for tasks such as medical-health dialogue systems or financial transaction-behavior modeling, a semantic-density-driven dynamic partitioning mechanism can be introduced; for example, the group division of each network segment can be adjusted in real time according to the gradient-activation distribution induced by the input tokens. Regions of high semantic density (such as disease-diagnosis conclusions or transaction-behavior breakpoints) can be given finer-grained group divisions to improve configuration fidelity, while information-redundant regions (such as repeated queries or invalid behavior) can use large shared groups, reducing unnecessary waste of computing resources.
In summary, the non-uniform division not only provides a more flexible configuration multiplexing path, but also provides a structure perception optimization path for high-performance quantitative reasoning, and is an important enhancement strategy for realizing efficient deployment of a complex model in a resource-constrained environment.
By dividing the network module of the language model into a plurality of configuration sharing groups comprising a preset number of modules, the time and the memory consumption for repeatedly calculating the quantized configuration for each layer of modules in the inference stage can be remarkably reduced, and the configuration scheduling logic is simplified. The strategy effectively introduces a configuration multiplexing mechanism of a module structure level, and realizes batch control of computing resources while guaranteeing quantization flexibility. By sharing the precision configuration in the group, a great number of redundant configuration loading and switching operations are avoided, so that the execution efficiency of the long text reasoning task is improved integrally.
S60, sharing the unified quantization configuration of the processing block corresponding to the first network module to other network modules in the same configuration sharing group in each configuration sharing group;
In this embodiment, in order to improve the execution efficiency and accuracy consistency of block-level quantization reasoning, the network modules are divided into a plurality of configuration sharing groups, and the unified quantization configuration of the processing blocks associated with the first network module is multiplexed within each group. The essence of the operation is to construct a parameter sharing mechanism of the cross-module, and to avoid independently generating or loading quantitative configuration for each network module in the reasoning process, thereby reducing redundant configuration overhead and improving the overall operation efficiency and the memory utilization rate.
In a specific implementation, the network modules of the language model are first divided by processing order into several configuration sharing groups, each containing a preset number of network modules; for example, each group contains 4 consecutive Transformer sublayers, or the modules are divided equally by model depth. The first network module is then selected from each shared group, and the unified quantization configuration of its associated processing block is taken as the group's reference configuration. This processing block may come from actual inference data at the model-input stage, or be generated from simulated data or task-prior data, ensuring that the selected configuration is representative.
To realize efficient sharing, the unified quantization configuration is written into a shared memory region, and all network modules in the same configuration sharing group are mapped to the same memory address, avoiding repeated configuration loading and conversion. The shared quantization configuration typically includes the quantization bit width (e.g., INT8 or INT4), the scale factor (scale), the zero point (zero point), and a computation-precision format flag (e.g., whether mixed-precision floating-point computation is used). Configuration parameters produced by weighted averaging, historical accumulation, or semantic-importance aggregation strategies can likewise be written directly into the shared region, so the sharing mechanism does not constrain the flexibility of the original configuration-generation strategy.
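The shared-by-reference configuration can be sketched in miniature; the class and field names are assumptions that simply mirror the parameters listed above, with a single object standing in for the shared memory region:

```python
from dataclasses import dataclass

@dataclass
class QuantConfig:
    """Shared quantization parameters (fields mirror the text above)."""
    bit_width: int          # e.g. 8 for INT8, 4 for INT4
    scale: float            # quantization scale factor
    zero_point: int         # quantization zero point
    mixed_fp: bool = False  # mixed-precision floating-point flag

class SharingGroup:
    """All modules in a group read one config object by reference, so a
    refresh triggered by the first module is seen by every member."""

    def __init__(self, module_ids, config):
        self.module_ids = module_ids
        self._config = config           # single shared instance

    def config_for(self, module_id):
        assert module_id in self.module_ids
        return self._config             # same object for every module

    def refresh(self, new_config):
        """Swap in a new config when the bound block's config updates."""
        self._config = new_config
```

In a real deployment the shared instance would live in GPU shared memory behind a pointer or IPC mapping, but the aliasing behavior is the same.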
The configuration-sharing mechanism not only improves the efficiency of the inference stage but also enhances local precision consistency; in particular, in architectures with semantic-recursion enhancement between the model's deep modules, a consistent quantization configuration reduces error accumulation and quantization jitter. To remain effective in dynamic task scenarios, a lightweight synchronization mechanism can further be introduced within the sharing group: if a configuration update of the processing block bound to the first network module is detected, the system automatically triggers a shared-memory refresh via the configuration mapping table, so that subsequent modules read the latest configuration.
In a Transformer architecture, assume there are 48 network modules divided into 12 configuration-sharing groups of 4 modules each. The first module in each group binds a processing block with a representative importance-score distribution (such as a block extracted from a high-frequency word segment or a key dialogue region), and generates the unified quantization configuration according to the token precision distribution of that block. The generated configuration is written into the shared memory area, and the other modules of each sharing group access it through the same memory pointer during inference to complete the unified quantization of weights and activation values. In a multi-card deployment environment, the group's quantization configuration can be synchronously stored in each card's shared memory, with communication latency reduced through pointers or IPC mapping. If a drastic change in processing-block structure or a cross-scenario input switch is detected, the system dynamically updates the bound processing block and configuration and remaps the shared memory address, ensuring the adaptability of configuration sharing.
By constructing configuration-sharing groups based on processing blocks and reusing the quantization configuration associated with the first network module in each group, the frequency of quantization-parameter generation and switching during model inference can be significantly reduced, semantic fidelity preserved, and repeated computation cost lowered. Computational consistency inside the model is enhanced, improving the execution stability and scalability of large-model inference across multi-scenario deployments.
S70, executing block-level batch quantization on all the processing blocks according to the unified quantization configuration corresponding to each processing block, completing model reasoning, and generating a model reasoning result.
In this embodiment, to improve the computational efficiency and video-memory utilization of the large language model in long-text inference tasks, block-level batch quantized computation is performed based on the unified quantization configuration of each processing block, completing the corresponding model inference flow. This operation combines configuration-driven quantized computation with parallel scheduling of processing blocks, ensuring that each block is processed at the precision its importance demands, and the complete inference result is finally produced by splicing.
The unified quantization configuration of each processing block, derived from the preceding importance-score analysis and precision statistics, generally includes a precision format identifier (e.g., FP16, INT8, INT4), quantization information for the weight parameters (e.g., scale_w and zero_point_w), and quantization information for the activation values (scale_a and zero_point_a). Quantized computation is usually performed by selecting an efficient compute kernel: once the block's precision configuration takes effect, an execution path is assigned according to the precision format; for example, floating-point computation uses an FP16 kernel implemented on CUDA or ROCm, while the fixed-point quantization path calls an INT kernel and loads the related quantization parameters.
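As a hedged illustration of how such scale/zero-point parameters are typically applied, a generic affine INT8 scheme (not necessarily the exact kernel arithmetic of the embodiment) might look like:

```python
def quantize_int8(values, scale, zero_point):
    """Affine quantization: q = clamp(round(x / scale) + zero_point, -128, 127)."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

def dequantize_int8(q_values, scale, zero_point):
    """Restore a computable representation: x ~= (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]
```

The same pair of parameters (scale_w, zero_point_w) would govern the weight path and (scale_a, zero_point_a) the activation path.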
To improve efficiency, all processing blocks are mapped to parallel processing units (Streaming Multiprocessors, SMs) on a Graphics Processing Unit (GPU) or Tensor Processing Unit (TPU), and tasks are scheduled according to each block's start index or processing order. During scheduling, each processing block loads its bound unified quantization configuration and executes the corresponding operations such as matrix multiplication, normalization, activation functions, and residual connections. In this process, the activation quantization parameters dynamically adjust the input range, and the weight quantization parameters restore the model weights from the fixed-point representation to a computable quantized representation.
After the inference computation is completed, the intermediate result of each processing block undergoes an accuracy check and invalid-position clearing. This step is usually completed according to the recorded index table of invalid token positions: the intermediate tensor values at the positions of invalid tokens are set to zero or removed, avoiding interference with the final output. All valid computation results are then spliced and reconstructed into a complete model output sequence according to the blocks' start position indices.
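The invalid-position clearing and start-index splicing described here can be sketched as follows (the data layout and helper names are assumptions for illustration):

```python
def zero_invalid_positions(block_output, invalid_positions):
    """Zero the per-token intermediate results at recorded padding positions."""
    invalid = set(invalid_positions)
    return [[0.0] * len(row) if i in invalid else row
            for i, row in enumerate(block_output)]

def splice_outputs(blocks_with_start, valid_lengths):
    """Concatenate each block's valid rows, ordered by block start index."""
    ordered = sorted(zip(blocks_with_start, valid_lengths), key=lambda p: p[0][0])
    result = []
    for (start, rows), n_valid in ordered:
        result.extend(rows[:n_valid])  # drop padded tail rows
    return result
```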
Finally, standard post-processing is performed on the spliced output sequence, including dimension alignment (e.g., padding tensors from different blocks to a uniform shape), normalization (e.g., LayerNorm over the activation output), and decoding (e.g., generating natural-language text via greedy decoding or beam search), finally producing inference result text or structured vectors directly usable by the user.
For example, in a text-generation task, the input text is divided into a sequence of processing blocks of 512 tokens, and each block is analyzed in steps 4 and 5 to obtain its unified quantization configuration; for example, processing block A uses the high-precision format (FP16), block B the medium-precision format (INT8), and block C the low-precision format (INT4). In the inference execution stage, after the GPU loads each block's unified quantization configuration, block A is dispatched to the high-precision execution channel and blocks B and C to the fixed-point execution channels, with matrix multiplication and nonlinear transformations computed by different kernels. After each block's computation, the system clears the results corresponding to padding tokens according to the token position indices recorded during preprocessing; for example, if block C contains 12 padding tokens, the output at its last 12 positions is zeroed. The valid outputs of blocks A, B, and C are then spliced in start-index order, and post-processing (such as normalization and decoding) yields the complete text-generation result. In a multi-input batch inference scenario, a micro-batch scheduling mechanism can be combined to schedule the processing blocks of multiple input samples on the same GPU, sharing the weight caches of high-frequency structures when loading shared configurations, further improving throughput.
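The precision-to-execution-path dispatch in this example can be sketched as a simple kernel table, with placeholder functions standing in for real FP16/INT8/INT4 compute kernels:

```python
# Placeholder kernels standing in for real floating-point / fixed-point kernels.
def fp16_kernel(block):
    return ("fp16_path", block)

def int8_kernel(block):
    return ("int8_path", block)

def int4_kernel(block):
    return ("int4_path", block)

KERNEL_TABLE = {"FP16": fp16_kernel, "INT8": int8_kernel, "INT4": int4_kernel}

def run_block(block_id, precision_format):
    """Route a processing block to the execution path its configuration selects."""
    return KERNEL_TABLE[precision_format](block_id)
```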
As an example, in the medical-health business domain, an intelligent auxiliary diagnosis platform wishes to deploy a large language model for structured analysis and inference over ultra-long electronic medical-record text, aiming to extract the main diagnosis, the evolution of key symptoms, and the etiology conclusion from thousands of hospitalization records. In this task, the original input text length exceeds the model's standard window limit, so segmented inference via processing blocks is required. First, the entire medical-record text is divided into processing blocks of a preset length (e.g., every 512 tokens). The system marks the first processing block as a high-precision processing area: it usually contains semantically dense parts such as the patient's basic information, chief complaint, and present illness history, which directly determine the credibility of subsequent diagnostic inference, so quantization of this block is disabled and only high-precision floating-point operators are used. For the remaining blocks, the platform automatically generates the self-attention matrix for each block before model execution and sums each column of the matrix to obtain each token's degree of attention over the whole context, i.e., its importance score. On this basis, the system compares the scores against two dynamic thresholds: tokens with higher importance scores are configured in the high-precision format, generally corresponding to core disease-course nodes or etiology information; tokens with middle scores are configured in the medium-precision format, covering descriptive paragraphs or observation records; and tokens with lower scores, such as formatting information and table data, are configured in the low-precision format.
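The two-dynamic-threshold assignment described above can be sketched as follows (the concrete format names are illustrative choices, not mandated by the embodiment):

```python
def assign_precision(importance_scores, high_threshold, low_threshold):
    """Map each token's importance score to a precision format via two thresholds."""
    formats = []
    for score in importance_scores:
        if score >= high_threshold:
            formats.append("FP16")  # high precision: core disease-course / etiology tokens
        elif score >= low_threshold:
            formats.append("INT8")  # medium precision: descriptive paragraphs
        else:
            formats.append("INT4")  # low precision: formatting and table data
    return formats
```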
After each processing block finishes token-level precision labeling, the system counts the number of tokens in each precision format and selects the level with the largest count as the block's unified quantization configuration. For example, if medium-precision tokens have the highest share in a block, the whole block is configured on the medium-precision computation path. This strategy avoids frequent precision switching and improves execution efficiency. The system then sequentially divides the model's inference modules into configuration-sharing groups, each comprising several consecutive network modules. The block processed by the first module of each group is bound and its precision parameters configured, and the other modules share that configuration, avoiding repeated execution of precision-selection logic at every layer. Finally, all processing blocks call the quantization compute kernels of the corresponding precision to execute inference in parallel: high-precision blocks use floating-point kernels, while medium- and low-precision blocks call fixed-point kernels and run on GPU or TPU parallel units. During inference, positions corresponding to padding tokens are automatically skipped, ensuring valid output. After inference, the system splices the valid results of all blocks into a single inference-result text according to the original token position indices, performs unified dimension alignment and normalization, and finally decodes to generate the diagnosis conclusion, a key-symptom timeline, and a list of suspected etiologies.
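A minimal sketch of the majority-vote selection of the unified configuration; the tie-break toward higher precision is an assumption, consistent with the preference described later for the credit scenario:

```python
from collections import Counter

# Precedence used only to break ties, biased toward higher precision (assumed).
_PRECEDENCE = {"FP16": 0, "INT8": 1, "INT4": 2}

def unified_precision(token_formats):
    """Select the most frequent per-token precision as the block-level config."""
    counts = Counter(token_formats)
    level, _ = max(counts.items(), key=lambda kv: (kv[1], -_PRECEDENCE[kv[0]]))
    return level
```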
The overall inference process controls video-memory usage while ensuring the accuracy and stability of core diagnostic information, realizing deployable high-performance long-text medical understanding.
In the field of financial technology, a credit-evaluation platform for small and medium-sized enterprises needs to run large-language-model inference on enterprise credit risk based on long-text data such as the complete business data submitted by the enterprise, historical credit contracts, public annual reports, and customer feedback records. Because these documents are often loosely structured and complex, the total token count reaches tens of thousands, and feeding them into the model directly causes severe video-memory explosion and response latency. The system first divides all text information into processing blocks of a preset length (e.g., 1024 tokens), ensuring that each block's length is controllable and its computational cost acceptable. The first processing block usually contains high-weight fields such as the enterprise's basic information, registered capital, core business, and revenue for the most recent three years, which are central to the credit score; the platform therefore strategically fixes this block to a high-precision computation path and disables quantization, so as to preserve to the greatest extent the detailed expression of key financial features and legal statements. For the remaining blocks, the platform uses a multi-head self-attention mechanism to analyze how much attention each token receives in context, and after constructing the attention matrix extracts from each column the sum of global attention weights received by each token as its importance index.
Then, the system assigns different precision levels to each token according to two dynamic thresholds, ensuring that high-precision processing is used for high-risk factors such as tax anomalies, contract violations, and bank credit records, while medium or low precision is used for relatively minor information such as financial-statement notes and customer descriptions. The system then counts the tokens at each precision level within each processing block and selects the most frequent level as the block's unified quantization configuration, simplifying the execution path and reducing scheduling overhead. If the high-, medium-, and low-precision counts are close, high precision is preferentially selected, to avoid weakening key semantics in credit judgment. The platform divides the entire large language model into configuration-sharing groups by layer, each containing a fixed number of network modules; the quantization configuration adopted by the block processed by the first module in the group serves as the shared configuration and is recorded in a configuration mapping table binding the group identifier and block identifier, so that all modules invoke the same quantization path when processing that group's task. In the execution stage, the system loads the corresponding unified quantization configuration parameters for each processing block, including the weight scaling, the activation calibration range, and the current block's precision format identifier.
Different blocks are scheduled onto GPU parallel units supporting mixed precision: high-precision blocks call floating-point processing units, and medium- and low-precision blocks use the INT4/INT8 instruction sets for inference, while intermediate results at invalid token positions are removed to avoid redundant interference. Finally, the platform sorts and splices all valid inference results into a complete output sequence according to the original token indices, and through the post-processing module performs structure alignment, risk-tag normalization, and multi-label scoring decoding to generate the enterprise's credit rating, risk-exposure range, and early-warning suggestion list under the current financial scenario, assisting credit reviewers or an automated credit engine in decision-making.
By executing block-level batch quantized computation based on unified quantization configurations, a precision-aware heterogeneous execution strategy is realized and the parallel computing capacity of multi-core processors is fully released. Differentiated computation of token segments by importance is preserved while the cost of repeated configuration loading and dynamic quantization generation is effectively reduced. Average inference video-memory usage is lowered and processing throughput improved, making the method suitable for resource-sensitive, large-scale deployment in long-text tasks.
The invention relates to the technical field of artificial intelligence and can be applied to business scenarios such as medical health and financial technology. It discloses a model quantized-inference acceleration method comprising: dividing an input text into a plurality of processing blocks, fixing the computation precision format of the first processing block to a high-precision format and disabling quantization; for processing blocks other than the first, generating a self-attention matrix through the language model, calculating importance scores from the numerical sums of the columns corresponding to each token position, and assigning token positions to high-, medium-, or low-precision formats based on two preset thresholds; counting the number of tokens in each format within each processing block and selecting the most numerous format as the unified quantization configuration; dividing the network modules into a plurality of configuration-sharing groups and sharing the unified quantization configuration of the processing blocks within each group; and executing block-level batch quantization according to each block's unified quantization configuration to complete model inference and generate inference results. By determining each block's quantization configuration uniformly from token importance scores and reusing the configuration within network-module groups, the invention realizes block-level precision assignment and parallel quantized inference, greatly reducing memory and configuration-time overhead while preserving inference precision, and effectively improving execution efficiency and memory utilization in long-text inference tasks.
In one embodiment, the step S10 includes:
S101, dividing an input text into a plurality of processing blocks with equal length according to a preset block length;
S102, disabling quantization parameter adjustment for all token positions of the first processing block;
S103, fixing the processing precision formats of the embedded layer, the self-attention layer and the feedforward network layer of the first processing block in the language model into high-precision formats;
S104, when residual tokens with total length smaller than the preset block length exist at the end of the input text, taking the text segment formed by the residual tokens as an independent processing block;
S105, filling invalid token into the independent processing block to the preset block length, fixing the processing precision format of the filled independent processing block into a high-precision format, and disabling the quantization operation of the independent processing block;
S106, recording the start position index and end position index of all processing blocks.
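Steps S101 and S104 to S106 above can be sketched together as a single partitioning routine (a simplified illustration; `pad_id` and the return layout are assumptions):

```python
def partition_text(token_ids, block_len, pad_id=0):
    """Split tokens into fixed-length processing blocks, pad the tail block
    with invalid tokens, and record each block's (start, end) indices."""
    blocks, indices, pad_positions = [], [], []
    for start in range(0, len(token_ids), block_len):
        chunk = list(token_ids[start:start + block_len])
        if len(chunk) < block_len:
            # In-block positions occupied by padding in the independent tail block.
            pad_positions = list(range(len(chunk), block_len))
            chunk += [pad_id] * (block_len - len(chunk))
        blocks.append(chunk)
        end = min(start + block_len, len(token_ids)) - 1
        indices.append((start, end))
    return blocks, indices, pad_positions
```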
In this embodiment, to relieve the memory pressure of the large language model in long-text inference, a processing-block (Processing Block) division strategy is adopted: the input text is divided into subsections according to a preset block length. Each processing block is the smallest quantization scheduling unit in model inference, with explicit start and end position indices. The partitioning not only controls the locality of video-memory usage but also provides boundary support for subsequent quantization-strategy generation.
The processing precision format of the first processing block is fixed to a high-precision format, typically FP16 or FP32 floating point, meaning that the model takes the full-precision computation path when processing this block. The block is explicitly prohibited from participating in any form of quantization-parameter adjustment, including but not limited to bit-width compression of weight parameters and activation values and dynamic scale-factor generation, so the block is skipped in quantization-configuration generation.
The embedding layer (Embedding Layer), self-attention layer (Self-Attention Layer), and feed-forward network layer (Feed-Forward Network Layer) together form the core modules for token representation transformation in the language model and play a key role in the semantic representation of tokens. Keeping these core structures of the first processing block in a high-precision format helps stabilize the model's context-awareness capability in the initial stage of inference and improves the inference of subsequent tokens; the effect is especially notable when key topic prompts appear at the start of long text input.
If the text length is not divisible by the preset block length, the remaining tail tokens are assembled into an independent processing block. To maintain the consistency of the block-level inference structure and the alignment required for parallel scheduling, the independent block is padded with invalid tokens (padding tokens) to the full block length and is likewise processed in high-precision format. The padding tokens do not participate in the effective computation of the model output during actual inference; they only keep the input dimensions consistent.
After the system completes text blocking, the start and end position index of each processing block is recorded as a key auxiliary identifier for subsequent computation scheduling, output splicing, and precision assignment. The index information may be built as a bidirectional mapping table (e.g., a hash table) to quickly locate, at runtime, the processing block a token belongs to and its precision-control policy.
The input text may be partitioned sequentially by setting the block length to a subset of the maximum sequence length supported by the model (e.g., 512 or 1024 tokens). The partitioning may be done offline in the preprocessing stage or dynamically at runtime based on streaming input. In a GPU inference framework, position-encoding masks can distinguish actual tokens from padding tokens, working with high-precision kernels to process the designated blocks.
In current mainstream quantized-inference hardware architectures, FPGAs (Field-Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), and TPUs (Tensor Processing Units) are widely used to deploy high-performance large language models. On these platforms, flexible precision-scheduling capability must be provided at actual deployment to accommodate the high-precision processing requirements of the first processing block.
In a programmable chip architecture (e.g., an FPGA or AI accelerator chip), the individual operations on a model's execution path are typically mapped into a series of reconfigurable logic units or operator chains (operators). To achieve high-precision execution of the first processing block, the computation precision mode of the current block can be set on the intermediate computation path through a bit-width control register or an operator scheduling instruction table. When the token positions being processed are identified as belonging to the first processing block, the system controller can, in a hardware-software cooperative manner, set the precision control field of the corresponding operator chain to the high-precision flag: for example, enabling the FP16 or FP32 channel, bypassing the quantization scaling unit (skipping zero-point adjustment, discrete mapping, and similar operations), and passing data to the operator input port in its original floating-point format, realizing a so-called high-precision passthrough path (precision passthrough path).
The key to this mechanism is to bind the position information of the processing block to the register configuration policy. Dynamic switching of precision paths at block-level granularity can be achieved by maintaining a processing-block-to-precision-policy mapping table in the scheduling unit. When processing streaming token input, the mechanism can apply runtime configuration updates via a boundary-detection module.
In highly integrated heterogeneous computing platforms such as TPUs, the interior typically contains multiple processing cores (Cores) supporting different computation precisions, with some acting as high-performance main cores supporting floating-point computation and others optimized as dedicated fixed-point paths. A scheduling engine (Scheduler) and a resource manager (Resource Allocator) are often integrated into the TPU architecture to dynamically schedule workloads between the different cores.
When the first processing block is processed, the scheduling engine can identify it as a high-priority computation unit and dispatch its tasks directly to a main core, using the floating-point cores to execute the embedding, self-attention, and feed-forward network computations and avoiding the introduction of any quantization; the inference tasks of the remaining processing blocks can be dispatched to the other low-bit-width cores and executed as low-precision, high-throughput quantized inference. To achieve dynamic multiplexing of resources, the TPU's scheduling policy may also support time slicing and task prefetching, ensuring low latency and high parallelism of scheduling across multiple processing blocks.
If the method is deployed on a general-purpose GPU platform, a high-precision execution flag can be passed via CUDA kernel launch parameters to disable the quantization flow on a specific kernel path. For example, a set of kernels containing no quantization kernels can be launched for the first processing block, or a precision-tag-injection (precision tag injection) mechanism can force the generic model framework to schedule the float path instead of the int path.
In addition, in higher-order systems, dynamic fine-tuning logic (a precision adaptation controller) can be introduced to automatically decide whether to enable the high-precision path based on the importance-score distribution or semantic weight of the first processing block, achieving a flexible balance between performance and precision.
According to the method and the device, the input text is divided into the equal-length processing blocks, and quantization processing is disabled for the first processing block, so that more original semantics and context information can be reserved in the initial stage of model inference, and the stability of importance evaluation and precision distribution of subsequent token is improved. The method is beneficial to improving the initial inference precision of the long text input on the premise of not significantly increasing the total memory overhead, and provides more reliable baseline representation for subsequent block quantization configuration.
In one embodiment, the step S20 includes:
S201, inputting each other processing block into the multi-head self-attention layer of the language model to obtain local self-attention matrices of the plurality of attention heads corresponding to each other processing block;
S202, performing weighted averaging or arithmetic averaging on the local self-attention matrices of the plurality of attention heads to generate the final self-attention matrix corresponding to each other processing block;
S203, extracting the column vector corresponding to each token position from the final self-attention matrix;
S204, determining the sum of the values of all elements in each column vector, and taking the sum of the values as an importance score of a token position corresponding to each column vector;
S205, if the filled invalid token position exists in other processing blocks, when the importance score of the filled invalid token position is determined, setting all element values of the column vector corresponding to the invalid token position to zero.
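Steps S203 to S205 can be sketched as plain column sums with padded columns zeroed (a minimal illustration on nested lists rather than real tensors):

```python
def importance_scores(attn_matrix, invalid_positions=()):
    """Per-token importance = column sum of the final self-attention matrix
    (S203/S204); columns at padded invalid positions are forced to zero (S205)."""
    n = len(attn_matrix)
    invalid = set(invalid_positions)
    return [0.0 if col in invalid
            else sum(attn_matrix[row][col] for row in range(n))
            for col in range(n)]
```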
In this embodiment, before precision is assigned to a processing block, the importance of each token in its current semantic context must first be obtained. To this end, the processing flow starts from the multi-head self-attention layer, building an attention matrix for each processing block. After a processing block is input to the language model, it passes through multiple attention heads; each head independently computes a local self-attention matrix, and each matrix describes, token by token, the mutual attention relationships among all tokens in the block. These matrices reflect, from different attention angles, the language model's ability to model the internal structure of a sentence.
In order to obtain a unified and comparable attention representation, it is necessary to fuse these local matrices. The fusion mode can be flexibly set according to actual task demands, common modes include arithmetic average of matrixes of all attention heads, namely simple average calculation, and weighted average can be realized by configuring different weight coefficients for each attention head. This fusion process generates a final self-attention matrix for each processing block as a comprehensive representation of the token inter-dependencies within the processing block.
In the final matrix, each column vector represents the degree to which a particular token receives attention focus from other tokens. Specifically, the position of a certain token is fixed, and the column vector corresponding to the position is extracted, which can be regarded as "the degree set of the token that is focused on by all other tokens". This is an important basis for evaluating whether a token occupies a significant semantic place in the current processing block. Namely:
the current token, the token whose attention output is being calculated, is denoted as the ith position token.
The other tokens include all valid tokens in the same processing block other than position i, i.e., positions 0 to N-1 in the block (where N is the processing block length), excluding position i itself and excluding the positions of invalid tokens (e.g., padding tokens).
To extract such importance signals, all elements in each column vector are numerically summed to obtain an aggregate value that is used to measure the global interest of the token. The aggregate value is the importance score of the token, and the higher the score, the greater the impact of the token on the overall context semantic structure. In the subsequent precision assignment process, this score will be used directly to determine whether the token should be processed using higher precision computing resources.
Considering that the processing block may contain invalid tokens padded to fill the block length, and that these tokens carry no semantic content, they would interfere with the importance scores if they participated in the attention-matrix calculation. To prevent such interference, after the final self-attention matrix is generated, the positions corresponding to all invalid tokens are identified and their column vectors in the matrix are uniformly set to zero. This effectively prevents invalid tokens from obtaining spuriously high attention values and ensures the accuracy and robustness of the importance calculation. The operation may be implemented through structured masking or explicit indexing mechanisms, with the specific method adjusted to the model architecture and deployment platform.
In practical deployment, when a processing block is fed into the language model for inference, each layer's self-attention mechanism outputs the attention matrices of its multiple attention heads. These matrices are typically organized as a three-dimensional tensor whose first dimension is the number of attention heads and whose second and third dimensions are each the number of tokens in the processing block, i.e. a tensor of shape (number of attention heads) × (number of tokens) × (number of tokens).
The model may cache these attention tensors at the self-attention module output of each layer through an intermediate-layer interception mechanism during forward inference. To reduce interrupt overhead, the cache operation should be inserted into the model's computational graph in a non-blocking manner, with its lifecycle managed uniformly by the scheduler. After caching, different fusion strategies can be realized by operating on the first dimension (the attention-head dimension): one option is to simply average the matrices of all heads to obtain a two-dimensional matrix; another is to assign each head a different weighting coefficient and perform a weighted average.
After attention-head fusion is completed, a two-dimensional matrix of size (number of tokens) × (number of tokens) is obtained, representing the comprehensive attention relations among the tokens in the processing block. The global focus of each token is determined by the sum of the elements of the corresponding column in the matrix. In a concrete implementation, a column vector can be extracted from the whole matrix by fixing a column index, which can be implemented in a tensor framework with standard matrix-slicing functions (PyTorch, TensorFlow, JAX, etc. all support such slice access). All values in the column vector are then accumulated to obtain the attention aggregate value for that token, i.e. its importance score. Because vector addition is highly optimized in modern tensor computation libraries, the importance scores can be computed on a GPU/TPU with extremely low latency, giving the method good engineering feasibility.
For padded tokens, the model typically generates a padding mask during forward inference to mark which token positions are invalid as a result of padding. The mask is typically a boolean vector or 0/1 tensor of the same length as the processing block, where 1 marks a valid position and 0 a fill position. The mask can be expanded into a matrix form aligned with the column structure of the final self-attention matrix: in practice, it is broadcast to a two-dimensional matrix and applied to each column of the attention matrix, so that all elements of the column vectors corresponding to fill positions are uniformly zeroed at the matrix level.
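The column-level zeroing via a broadcast padding mask might look like this minimal sketch (NumPy stand-in; `mask_invalid_columns` is a hypothetical helper name):

```python
import numpy as np

def mask_invalid_columns(fused, valid_mask):
    """Zero the columns of the fused attention matrix that correspond
    to padding positions, so invalid tokens cannot accumulate spurious
    attention mass. valid_mask: 1 for a valid token, 0 for padding.
    """
    # Broadcasting the (N,) mask across rows zeroes whole columns.
    return fused * np.asarray(valid_mask)[None, :]
```

After this step, summing each column still yields the importance scores, with padded positions guaranteed a score of zero.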
For optimal efficiency, the mask-application process may be embedded as a pre-quantization processing module during the model-graph construction stage, attached as a post-operation after the attention-matrix computation node. For different platforms, column-level zeroing logic can be realized on the GPU via a CUDA kernel, while on the TPU batched column masking can be completed within a single clock cycle by means of built-in high-throughput tensor mapping operations.
In addition, to ensure that special structures appearing in the context do not cause erroneous decisions (for example, in some tasks padding tokens may sit in the middle of the sequence rather than at the end, as with multi-segment spliced inputs), the original input structure and the padding-mask generation logic need to be recorded in advance to guarantee mask accuracy. An alternative is to introduce a position-tag assist mechanism that combines each token's positional semantic coding with the mask to improve the recognition stability of invalid tokens.
In this embodiment, by extracting column vectors from the fused self-attention matrix and summing their elements, the importance score of each token is obtained, constructing a quantization-precision sensing mechanism rooted in the language model's own structure. Without introducing an external scoring module, the attention information learned by the model is fully utilized, internalizing the basis for precision allocation. Explicit label dependence is avoided, while good computational-graph compatibility is retained, which benefits subsequent efficient deployment on acceleration chips.
In one embodiment, the step S30 includes:
S301, determining a first threshold and a second threshold according to the length of the processing block;
S302, if invalid token positions exist, assigning the processing precision format of each invalid token position to the low-precision format, and excluding invalid token positions when counting the precision formats of each processing block;
S303, comparing the importance score of each valid token position with the first threshold and the second threshold;
S304, assigning the processing precision format of valid token positions whose importance score is greater than the first threshold to the high-precision format;
S305, assigning the processing precision format of valid token positions whose importance score is greater than the second threshold and not greater than the first threshold to the medium-precision format;
S306, assigning the processing precision format of valid token positions whose importance score is less than the second threshold to the low-precision format.
In this embodiment, assigning token positions whose importance score exceeds the first threshold to the high-precision format is the key operation of differentiated precision configuration, performed after evaluating the attention information of all valid tokens in each processing block. The high-precision format refers to representing and computing with higher precision during inference, e.g. using higher-bit-width numerical representations or disabling quantization. The aim is to preserve the expressive capacity of key semantic tokens and avoid the negative impact of reduced precision on model predictions. The medium- and low-precision formats correspond to medium- and low-bit-width quantization configurations for resource-limited scenarios, used to process tokens of lower importance and thereby save GPU memory and compute resources.
The length of the processing block dynamically affects the threshold setting. As the number of tokens in a long processing block grows, the overall information-density distribution tends toward the average; to ensure the screening mechanism retains consistent discriminative power across blocks of different lengths, the first and second thresholds need to be adjusted downward accordingly, enlarging the range judged high-precision. This dynamic adaptation can be realized by setting initial standard thresholds and scaling both thresholds, linearly or nonlinearly, by the ratio of the current block length to a standard length.
In scenarios where invalid token positions exist in the processing block, a special judgment must be added to precision allocation. Since invalid tokens do not participate in semantic modeling, their attention information carries no semantic contribution, and their importance scores can be treated as structurally null or forced to zero. Uniformly assigning them the low-precision format saves computation, and actively excluding them from the later precision-proportion statistics avoids interfering with the block's unified quantization-configuration decision.
The importance score of each valid token is compared against the two thresholds in turn. A score above the first threshold indicates the token plays a significant semantic role in global attention aggregation and must be assigned the high-precision format; a score between the second and first thresholds is regarded as the sub-important information interval and can be processed in the medium-precision format; a score below the second threshold indicates the token contributes only weakly to context construction, so inference can proceed in the low-precision format. This multi-level precision scheme balances semantic expressiveness, performance control, and computational efficiency.
If there are no invalid token positions in the processing block, then all token positions are valid token positions.
In a concrete implementation, when the model initializes, a preset dynamic-threshold function is called, which reads the processing-block length and generates the first and second thresholds for the current block. The function can be realized by table lookup or formula calculation, and can support mechanisms such as interval interpolation, exponential decay, and minimum-threshold protection. All token positions of the current block are then traversed in order. During traversal, the system consults the padding mask or a token-validity flag to determine whether the current token is invalid. If so, its processing precision format is marked low precision, the importance-score comparison is skipped, and the next token is examined. For a valid token, its importance score is read from the cache or the current attention module and compared with the two thresholds of the current block: a score greater than the first threshold is marked high precision in the block's precision-allocation vector; a score greater than the second threshold and not greater than the first is marked medium precision; and a score not greater than the second threshold is marked low precision. After processing, the precision labels of all valid tokens are stored in the block's precision-control vector for the subsequent unified quantization-configuration decision and kernel selection. If acceleration is needed, the comparison logic can be vectorized so that allocation decisions complete in parallel on the GPU.
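The traversal and threshold comparison above can be sketched in vectorized form (a minimal NumPy illustration; the base thresholds, standard length, and linear scaling law are illustrative assumptions, while the 0/1/2 label encoding follows the convention used later in this document):

```python
import numpy as np

LOW, MID, HIGH = 0, 1, 2  # precision labels, as in the statistics step

def assign_precision(scores, valid_mask, block_len,
                     base_high=6.0, base_low=2.0, std_len=128):
    """Per-token precision assignment with length-adaptive thresholds.

    The thresholds shrink as the block grows past the standard length
    (a simple linear scaling; the exact law is a free design choice).
    Invalid positions are forced to LOW and skip the comparison.
    """
    scores = np.asarray(scores, dtype=float)
    valid = np.asarray(valid_mask, dtype=bool)
    scale = min(std_len / max(block_len, 1), 1.0)
    t_high, t_low = base_high * scale, base_low * scale
    labels = np.full(scores.shape, LOW, dtype=int)
    labels[valid & (scores > t_high)] = HIGH
    labels[valid & (scores > t_low) & (scores <= t_high)] = MID
    return labels, (t_high, t_low)
```

All comparisons run as whole-array boolean operations, so the same code maps directly onto GPU tensor libraries.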
Through the above steps, this embodiment configures each token's computation precision according to its actual semantic importance without notably increasing model complexity, greatly compressing overall GPU memory consumption and computational load while preserving the model's inference accuracy. The dynamic threshold mechanism further improves the algorithm's adaptability to input sequences of different lengths, avoiding the uneven precision distribution caused by a fixed threshold. The specialized handling of invalid tokens ensures the accuracy of resource allocation and simplifies the subsequent statistical logic.
In one embodiment, the step S40 includes:
S401, traversing all token positions of the current processing block and identifying the positions assigned the high-precision, medium-precision, and low-precision formats;
S402, during counting, if invalid token positions exist, excluding all of them and counting only the precision-format allocation results of valid token positions;
S403, counting, among the valid token positions, those assigned the high-, medium-, and low-precision formats respectively, generating a high-precision count, a medium-precision count, and a low-precision count;
S404, comparing the numerical values of the high-precision, medium-precision, and low-precision counts;
S405, taking the precision format corresponding to the largest count as the unified quantization configuration of the current processing block;
S406, if the counts of multiple precision formats are equal and maximal, selecting the highest precision format among them as the unified quantization configuration of the current processing block.
In this embodiment, counting the number of token positions allocated in different precision formats within each processing block is a key analysis operation performed to determine the overall quantization strategy of the current processing block after the token precision allocation is completed. The process first needs to traverse all token positions in the current processing block, and identify the precision label corresponding to each position, which may be in high precision, medium precision or low precision format.
If an invalid token is encountered during traversal, for example a padding token appended at the end of the block, it needs to be excluded from the statistics. This is because invalid tokens do not participate in the actual inference computation and should not affect the overall precision distribution. Counting only the precision labels of valid tokens ensures the rationality and accuracy of the subsequent unified quantization-configuration decision.
Counting each precision label is the core of the step, and the system generates three counting results, namely high precision counting, medium precision counting and low precision counting. Each counting result corresponds to the number of positions, which are allocated to the precision format, of the valid token in the processing block and is used for describing the distribution trend of the precision in the block.
Then, the system compares the sizes of the three counting results, finds the precision format with the largest number, and uses the precision format as the unified quantization configuration of the current processing block, namely, the current processing block will perform unified quantization and calculation kernel function call in the format later. The design purpose of the unified configuration mechanism is to avoid frequent switching of precision calculation paths in the same processing block, so that the context switching cost is reduced, and the reasoning throughput efficiency is improved.
If multiple precision counts are equal and maximal, for example when the high-precision and medium-precision counts are exactly the same, the system applies a precision-priority principle and selects, among the tied maxima, the format with the higher precision level as the final configuration. This strategy reflects a bias toward protecting semantic quality: within the range allowed by resources, higher computation precision is preferentially retained, reducing the risk of token misinterpretation or inference deviation.
For example, during model execution, after token precision allocation for a processing block completes, the system may launch a precision-statistics module that first traverses all token index positions in the block through loop or vectorized logic. The precision label of each token is stored in the precision-label vector from the preceding step; each element of the vector is one token's label, e.g. 0 for low precision, 1 for medium precision, and 2 for high precision.
During the traversal process, the system checks whether each token is a valid token. The decision basis may be PADDING MASK or a token validity vector. If the token is invalid, the corresponding position is skipped in the precision counting process and does not participate in any precision counting.
And for the valid token, adding the valid token into a corresponding precision counter according to the precision identifier of the valid token. The system maintains three independent integer variables simultaneously, and respectively accumulates the number of the token with high, medium and low precision. After traversing, the system compares the values of the three counters to find out the precision format corresponding to the maximum value.
If two or three precision counts are all maximum, for example, the high precision and the medium precision are both 100, the system executes the priority judgment, and can compare the precision grade values, and the precision format with higher grade is preferentially selected. The system can also be expanded to introduce additional factors in the logic, such as historical block configuration inertia, context semantic density trend, and the like, so as to enhance the stability and adaptability of precision configuration.
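The counting and tie-breaking logic just described can be sketched as follows (a hedged Python illustration using the 0/1/2 label encoding mentioned above; `block_precision` is a hypothetical function name):

```python
from collections import Counter

LOW, MID, HIGH = 0, 1, 2  # label encoding from the preceding step

def block_precision(labels, valid_mask):
    """Unified block precision: the most frequent label among valid
    tokens; a tie in count resolves to the higher precision level,
    since the max key orders by (count, label) and HIGH > MID > LOW.
    """
    counts = Counter(lab for lab, v in zip(labels, valid_mask) if v)
    best_label, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return best_label
```

The same comparison could incorporate extra tie-break factors (e.g. the previous block's configuration) by extending the sort key.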
The finally determined precision format is written into a processing block control structure as a unified quantization configuration of the current processing block and is transferred to subsequent quantization kernel function selection logic. The configuration can be applied in batch on the GPU in the form of batch, and the efficiency of the quantization preparation stage is improved.
Through comprehensive counting and priority analysis of the precision labels in each processing block, this embodiment assigns to each block, in a data-driven way, the unified quantization configuration that best matches its semantic density and precision requirements, notably reducing computation-path complexity and precision-switching overhead. While retaining high-precision handling of key tokens, the block-level merging strategy improves inference efficiency and compresses GPU memory. Excluding invalid tokens keeps the statistics free of interference from redundant tokens, further improving the system's configuration accuracy and execution stability. The precision-priority strategy guarantees the processing quality of tokens with high semantic contribution when the precision distribution is ambiguous, benefiting the stability and reliability of model output.
In one embodiment, the step S60 includes:
S601, assigning a unique block identifier to each processing block and a unique group identifier to each configuration sharing group;
S602, establishing a binding relation between each unique group identifier and a unique block identifier in a configuration mapping table;
S603, associating the first network module of each configuration sharing group with the unified quantization configuration parameters of the corresponding processing block, based on the binding relation between the unique group identifier and the unique block identifier;
S604, writing the unified quantization configuration parameters into a shared memory area of each configuration sharing group, and allocating the same memory address mapping to all network modules of the group;
S605, when other network modules in the same configuration sharing group execute quantization processing, reading the unified quantization configuration parameters from the shared memory area through the memory address mapping.
In this embodiment, to compress GPU memory and computation paths during large language model (LLM) inference, a configuration sharing mechanism is introduced so that multiple network modules within the same group multiplex a unified quantization configuration. The key of the mechanism is to organize the mapping between processing blocks and network modules in a structured way and to multiplex the parameter access path via shared memory.
First, the system generates a globally unique block identifier for each processing block during the processing block partitioning phase, for identifying a set of token sequences and their quantization configurations corresponding to the block. In different implementation platforms, the identifier may be an integer index, a hash result, or a combination code combining the processing block position and the length, which has stability and traceability, and facilitates rapid positioning of the quantization configuration in a subsequent step.
Meanwhile, the system can divide the network module into a plurality of configuration sharing groups according to the processing sequence of the network module in the model inference graph. A configuration sharing group is a logical structure that internally contains a number of adjacent or semantically related network modules, typically arranged in a fixed number (e.g., 4, 8, or 16 modules per group). Each configuration share group is also assigned a unique group identifier, the generation mechanism of which should maintain uniqueness and mappability consistent with the processing block identifier.
Based on the two types of identifiers, the system forms a binding relation between the group identifier and the block identifier by constructing a configuration mapping table. This mapping table may take the form of key-value pairs, the group identifier being a key and the corresponding block identifier being a value. This structure allows the model to quickly find the processing blocks and their quantized configurations bound to the current configuration share group at runtime, thereby determining the shared access path.
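The key-value structure of the configuration mapping table can be sketched as follows (a minimal Python illustration; `build_config_map` is a hypothetical helper name, and identifiers are assumed hashable, e.g. integers or strings):

```python
def build_config_map(group_ids, block_ids):
    """Key-value binding table: unique group identifier -> unique block
    identifier, so a group can locate the quantization configuration of
    its bound processing block at run time.
    """
    table = dict(zip(group_ids, block_ids))
    if len(table) != len(group_ids):
        raise ValueError("group identifiers must be unique")
    return table
```

At inference time a single dictionary lookup then resolves a group to its block, matching the constant-time access the mechanism aims for.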
Further, in order to implement multiplexing of quantization configuration among multiple modules in a group, the system writes the unified quantization configuration parameters of the processing block into a shared memory area after the binding is completed. This shared region may be a shared cache (shared memory) in GPU video memory at the implementation level, or may be a tensor accelerator register region that supports concurrent access, or a cache block allocated by middleware in a system that supports heterogeneous architecture.
The unified quantization configuration parameters typically include weight quantization parameters, activation value quantization parameters, processing precision format information, bit width coding strategies, quantization range information, auxiliary bias, and the like. These parameters are stored in a structured format, such as a key dictionary, a compressed tensor, or a configuration structure, and are exposed to all network modules within the configuration sharing group through the shared memory address.
In terms of address mapping, the system directs the quantized-configuration access paths of all modules within a configuration sharing group to one uniform shared memory address. The mapping can be completed statically during model loading or dynamically during inference, depending on the temporal coupling between processing-block generation and intra-module invocation. In particular, to guarantee the performance and consistency of concurrent reads by multiple modules, the shared memory region should be thread-safe and non-blocking, which can be achieved through a lock mechanism, a read-write cache, or a communication channel.
Once the mapping is complete, the rest of the network modules within the configuration sharing group will no longer independently generate or load the quantized configuration when performing the quantization process, but will directly read the required parameter data from the sharing area by looking up the mapping address. The method avoids repeated calculation, memory copying and configuration switching, greatly compresses the execution diagram width of the reasoning path, and improves the overall calculation density and throughput rate.
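A minimal stand-in for this publish-once, read-many pattern might look like the following (a Python sketch of the shared region; the class and method names are hypothetical, and a real deployment would back the store with GPU shared memory rather than a dictionary):

```python
class SharedQuantConfig:
    """Illustrative shared-configuration region: the first module of a
    group publishes the block's unified parameters once; every other
    module in the group reads the same object instead of loading or
    regenerating its own copy.
    """
    def __init__(self):
        self._store = {}                 # group id -> config object

    def publish(self, group_id, config):
        self._store[group_id] = config   # written once per group

    def read(self, group_id):
        return self._store[group_id]     # shared by all group modules
```

Because every module receives the identical object, there is no memory copy or configuration regeneration on the read path, mirroring the constant-latency access described above.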
Furthermore, if the mechanism is combined with a quantization perception training (QAT) or post-training quantization (PTQ) strategy, configuration sharing not only reduces the memory consumption of a hardware layer, but also improves semantic consistency, so that modules in the same group can cooperatively express local semantics under consistent quantization configuration, and representation jitter or semantic drift caused by quantization strategy differences is effectively avoided.
In addition, the configuration sharing mechanism has good extensibility and hardware compatibility. Graph-execution frameworks (such as TensorRT and ONNX Runtime) can support it through a unified layer-parameter injection strategy, while at lower hardware layers (such as TPU and NPU) dynamic sharing can be completed by cooperating a memory-address binding table with an interrupt synchronization mechanism.
Through this configuration sharing mechanism, the embodiment achieves efficient reuse of processing-block quantization configurations within each configuration sharing group, notably reducing the GPU memory consumption and computational burden caused by repeated configuration loading. Compared with the traditional mode in which each module loads its configuration independently, unified memory-address mapping reduces configuration-access latency to a constant level and lowers the system's overall peak memory footprint. The unified in-group quantization strategy also improves the consistency of multi-module collaborative execution, effectively reducing quantization jitter along the inference path and enhancing the stability of model output. When multiple blocks execute concurrently, the mechanism effectively avoids semantic drift or output faults caused by inconsistent quantization strategies.
In one embodiment, the step S70 includes:
S701, loading the corresponding unified quantization configuration parameters for each processing block, the parameters including a precision format identifier;
S702, configuring independent processing kernel functions for different processing blocks according to the precision format identifier;
S703, in the parallel processing units of a graphics processor or tensor processor, allocating processing resources according to the index order of the processing blocks and concurrently executing the quantization processing of all processing blocks;
S704, checking the token positions in each processing block to eliminate the intermediate results corresponding to invalid token positions;
S705, sorting the valid intermediate results of all processing blocks by their initial position indexes and splicing them into a complete model output sequence;
S706, performing post-processing operations on the spliced output sequence to generate the final model inference result.
In this embodiment, loading the unified quantization configuration parameters for each processing block is the preparatory phase before model execution. These parameters comprise three key items: (1) weight quantization parameters, the parameter set (scaling factor and zero point) that maps model weights from floating-point to quantized representation, widely used in both static and dynamic quantization; (2) activation-value quantization parameters, which control the compression precision of input activations so that the input feature distribution remains stable during inference; and (3) the precision format identifier, a mark that guides the processing block to select the corresponding computation path, usually an enumeration type or control-register flag used for subsequent dynamic kernel scheduling. Loading can be implemented via configuration-table lookup, memory pre-reading, or a configuration cache, so as to reduce access latency during inference.
Configuring independent processing kernels according to the precision format identifier is the key link in block-level heterogeneous computation. Processing kernels are execution paths customized for the different precision formats: the high-precision format generally corresponds to FP16 or FP32 floating-point kernels, preserving higher numerical precision for semantically dense or key-position tokens; the medium- and low-precision formats correspond to fixed-point quantized kernels such as INT8 and INT4, accelerating matrix computation and memory transfer through mechanisms such as tensor quantization and weight sharing. The kernel-scheduling strategy can be realized via a function-pointer mapping table or a hardware-supported kernel-selection mechanism, so that each processing block's execution path is allocated on demand and overall throughput improves.
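The function-pointer mapping table can be illustrated with a small NumPy sketch (the kernels below are simplified stand-ins for real FP16/INT8 GEMM paths, and the 0/1/2 identifiers follow the encoding used elsewhere in this document):

```python
import numpy as np

def matmul_fp16(x, w):
    # High-precision path: compute in float16, return float32.
    return (x.astype(np.float16) @ w.astype(np.float16)).astype(np.float32)

def matmul_int8(x, w):
    # Quantized path: symmetric per-tensor int8 weight quantization.
    scale = max(float(np.abs(w).max()) / 127.0, 1e-8)
    wq = np.round(w / scale).astype(np.int8)
    return x @ (wq.astype(np.float32) * scale)

# Dispatch table keyed by the precision format identifier
# (0 = low, 1 = medium, 2 = high).
KERNELS = {0: matmul_int8, 1: matmul_int8, 2: matmul_fp16}

def run_block(precision_id, x, w):
    """Select and invoke the kernel bound to this block's identifier."""
    return KERNELS[precision_id](x, w)
```

A production dispatch table would map the medium identifier to a distinct kernel (e.g. INT8 vs INT4); both share one here purely for brevity.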
In a Graphics Processor (GPU) or Tensor Processor (TPU), allocating parallel processing resources to all processing blocks is a physical guarantee to achieve block-level batch quantization. The resource allocation is performed based on the index order of the processing blocks, so that the output order of the processing results is consistent with the input order, and meanwhile, the resource competition among threads is avoided. In the GPU platform, the processing kernel distribution can be controlled through CUDA block and stream mechanisms, and in the TPU, each processing block can be mapped through a matrix multiplication unit (MXU) for execution. In the process, the weight quantization parameter and the activation value quantization parameter are synchronously loaded into corresponding execution contexts to respectively act on the quantization mapping of the weight tensor in the model and the dynamic calibration operation of the input tensor.
After processing completes, the token positions within each block are checked against the intermediate results. Since some processing blocks contain padding-filled invalid tokens, the computation results at those positions should be filtered out to prevent them from affecting subsequent model output. In this step, the token mask recorded in the preceding step can be used to screen the intermediate results by position, removing the results corresponding to invalid tokens while maintaining positional consistency for the splicing stage.
When the valid intermediate results of all processing blocks are spliced, they must be sorted and concatenated according to the starting position indexes of the original input, so that the semantic information is restored in the correct contextual order. This operation not only requires data reorganization based on token indexes, but also requires that the output formats of the different processing blocks remain consistent, including tensor dimensions and sequence alignment, to avoid dimension misalignment or information overlap during splicing.
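The splice step can be sketched as a sort on each block's recorded starting position index followed by concatenation (names and the pair layout are illustrative):

```python
def splice(block_results):
    """block_results: list of (start_index, tokens) in arbitrary order.

    Sorting on the recorded start index restores the original
    contextual order before the per-block results are concatenated.
    """
    ordered = sorted(block_results, key=lambda pair: pair[0])
    merged = []
    for _, tokens in ordered:
        merged.extend(tokens)
    return merged
```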
Finally, the spliced sequence is input to a post-processing module, which performs dimension alignment, normalization, and decoding. Dimension alignment normalizes the tensor shapes output by each segment; normalization compresses the dynamic range of the values and improves decoding accuracy; decoding generates the final inference result, such as text, labels, or feature values, according to the model's task type. The entire post-processing flow can be embedded at the tail of the inference graph as a unified output layer to ensure efficient execution.
In some implementations, the weight quantization parameters and activation-value quantization parameters may be obtained through external quantization training, saved in an offline configuration file, and preloaded before inference begins. An alternative is run-time dynamic quantization, in which the scaling factor is computed in real time from the minimum and maximum of the current input activation values. The precision format identifiers may be represented as integer encodings in the quantization-configuration generation step, e.g., 0 for low precision, 1 for medium precision, and 2 for high precision, for parsing by the processing-kernel selector.
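The run-time dynamic quantization variant can be sketched as follows, assuming an asymmetric UINT8 scheme whose scale and zero point are derived from the min/max interval of the current activations; the exact scheme is an assumption for illustration, not prescribed by this description:

```python
import numpy as np

def dynamic_int8_params(activations):
    # Derive scale and zero point from the current activation range,
    # widening the range to include 0 so that zero stays representable.
    lo = min(float(np.min(activations)), 0.0)
    hi = max(float(np.max(activations)), 0.0)
    scale = (hi - lo) / 255.0 or 1.0          # avoid zero scale
    zero_point = int(round(-lo / scale))
    return scale, zero_point

def quantize(activations, scale, zero_point):
    # Map real values onto the 0..255 integer grid.
    q = np.round(np.asarray(activations) / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)
```

A static (offline) scheme would instead read `scale` and `zero_point` from the preloaded configuration file rather than recomputing them per input.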
The allocation of processing kernel functions may employ a hardware-supported adaptive kernel-call policy. For example, in the NVIDIA TensorRT framework, different GEMM implementation modules are matched automatically according to the precision identifier, and on the TPU platform the precision configuration can be compiled into corresponding hardware instruction sequences by the XLA compiler. The allocation of processing resources can also be optimized for the hardware architecture; some platforms support preferentially scheduling high-precision blocks to core resource regions with stronger computing performance to ensure model inference accuracy.
In practical deployment, to reduce bandwidth overhead during splicing, the valid result of each processing block can be written into a pre-allocated global output buffer immediately after the block's output, arranged by token index position, thereby avoiding a subsequent sorting operation. During dimension alignment, a unified mapping strategy or a packing-and-padding mechanism can be introduced to handle output-tensor differences caused by the different precision formats, ensuring consistency of the data structures.
In this embodiment, a corresponding unified quantization configuration is loaded for each processing block and the processing kernel function is selected dynamically according to it, realizing cooperative optimization of computational accuracy and resource allocation. By introducing the precision format identifier to control the kernel-function path, the computing resources of low-importance regions are compressed as much as possible while the processing precision of high-importance regions is preserved. Executing the blocks on parallel processing units further improves execution efficiency, and eliminating invalid token positions ensures the precision and consistency of the spliced result. After the final output undergoes unified post-processing, contextual semantic consistency is maintained, and overall inference latency and video-memory consumption are effectively reduced.
In one embodiment, a model quantized inference acceleration device is provided, and the device corresponds one-to-one to the model quantized inference acceleration method in the foregoing embodiments. Referring to fig. 3, fig. 3 is a schematic diagram of the functional modules of a model quantitative reasoning acceleration apparatus according to a preferred embodiment of the present invention. The device includes an input text preprocessing module 10, a self-attention analysis module 20, a precision allocation module 30, a quantized configuration decision module 40, a network module grouping control module 50, a configuration sharing management module 60, and an inference execution module 70. The functional modules are described in detail as follows:
an input text preprocessing module 10, configured to divide an input text into a plurality of processing blocks, fix a processing precision format of a first processing block to a high precision format, and disable quantization processing of the first processing block;
A self-attention analysis module 20, configured to generate, for each of the processing blocks except for the first processing block, a self-attention matrix of each of the other processing blocks through a language model, determine a sum of total element values of a corresponding column of each token position in the self-attention matrix, and use the sum of total element values as an importance score of each token position;
a precision allocation module 30, configured to allocate token positions with importance scores greater than a first threshold to a high precision format, to allocate token positions with importance scores above a second threshold and below the first threshold to a medium precision format, and to allocate token positions with importance scores less than the second threshold to a low precision format;
A quantization configuration decision module 40, configured to count the number of token positions allocated to the high-precision format, the medium-precision format and the low-precision format in each processing block, and select the precision format with the largest number as the unified quantization configuration of the corresponding processing block;
A network module grouping control module 50, configured to divide the network modules of the language model into a plurality of configuration sharing groups, where each configuration sharing group includes at least two network modules;
A configuration sharing management module 60, configured to share, in each configuration sharing group, a unified quantization configuration of the processing block corresponding to the first network module to other network modules in the same configuration sharing group;
The reasoning execution module 70 is configured to execute block-level batch quantization on all the processing blocks according to the unified quantization configuration corresponding to each processing block, complete model reasoning, and generate a model reasoning result.
In one embodiment, the input text preprocessing module 10 is specifically configured to:
dividing an input text into a plurality of processing blocks with equal length according to a preset block length;
disabling quantization parameter adjustment for all token positions of the first processing block;
Fixing the processing precision formats of the embedded layer, the self-attention layer and the feedforward network layer of the first processing block in the language model into a high-precision format;
when residual tokens with a total length smaller than the preset block length exist at the tail of the input text, taking the text segment formed by the residual tokens as an independent processing block;
filling the independent processing block with invalid tokens up to the preset block length, fixing the processing precision format of the padded independent processing block to the high-precision format, and disabling the quantization operation of the independent processing block;
the starting position index and the ending position index of all the processing blocks are recorded.
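The preprocessing steps listed above (equal-length splitting, tail padding, quantization disabling for the first and padded blocks, and start/end index recording) can be sketched as follows; the PAD symbol and the dictionary layout are illustrative assumptions:

```python
PAD = "<pad>"  # hypothetical invalid-token symbol

def split_into_blocks(tokens, block_len):
    blocks = []
    for start in range(0, len(tokens), block_len):
        chunk = tokens[start:start + block_len]
        mask = [1] * len(chunk)
        if len(chunk) < block_len:
            # Tail shorter than the preset block length: pad with
            # invalid tokens and mark them in the token mask.
            pad_n = block_len - len(chunk)
            chunk = chunk + [PAD] * pad_n
            mask = mask + [0] * pad_n
        blocks.append({
            "tokens": chunk,
            "mask": mask,
            "start": start,                       # starting position index
            "end": start + block_len - 1,         # ending position index
            # The first block and any padded tail block stay in the
            # high-precision format with quantization disabled.
            "quantize": start != 0 and 0 not in mask,
        })
    return blocks
```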
In one embodiment, the self-attention analysis module 20 is specifically configured to:
inputting each other processing block into a multi-head self-attention layer of the language model to obtain a local self-attention matrix of a plurality of attention heads corresponding to each other processing block;
Performing weighted average or arithmetic average processing on the local self-attention matrixes of the plurality of attention heads to generate a final self-attention matrix corresponding to each other processing block;
extracting a column vector corresponding to each token position from the final self-attention matrix;
Determining the sum of the values of all elements in each column vector, and taking the sum of the values as an importance score of a token position corresponding to each column vector;
if the filled invalid token position exists in other processing blocks, when the importance score of the filled invalid token position is determined, setting all element values of the column vector corresponding to the invalid token position to zero.
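A sketch of the importance-score computation described above, using an arithmetic mean over the attention heads and zeroing the contribution of padded positions; multiplying the column sums by the token mask is equivalent to zeroing the column elements of invalid positions before summing:

```python
import numpy as np

def importance_scores(head_matrices, token_mask):
    """head_matrices: array of shape (H, L, L), one attention matrix
    per head; token_mask: length-L sequence, 1 = valid, 0 = padded."""
    attn = np.mean(head_matrices, axis=0)   # arithmetic mean over heads
    scores = attn.sum(axis=0)               # column sum per token position
    return scores * np.asarray(token_mask)  # invalid positions score 0
```

A weighted average over heads, as also permitted above, would replace `np.mean` with `np.average(head_matrices, axis=0, weights=...)`.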
In one embodiment, the precision allocation module 30 is specifically configured to:
determining a first threshold and a second threshold according to the length of the processing block;
If the invalid token position exists, the processing precision format of the invalid token position is distributed into a low-precision format, and the invalid token position is eliminated when the number of the precision formats of each processing block is counted;
comparing the importance score of each valid token position to the first and second thresholds;
Assigning the processing precision format of the effective token position with the importance score larger than the first threshold value to a high-precision format;
assigning a processing accuracy format for valid token positions having an importance score above a second threshold and below a first threshold to a medium accuracy format;
the processing precision format of the valid token positions with importance scores less than the second threshold is assigned as the low precision format.
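The three-way assignment can be sketched as below. The treatment of scores exactly equal to a threshold is not fully specified above, so the boundary handling here (at or above the second threshold counts as medium precision) is an assumption:

```python
# Integer precision encoding consistent with the description:
LOW, MEDIUM, HIGH = 0, 1, 2

def assign_precision(score, first_threshold, second_threshold):
    # first_threshold > second_threshold is assumed.
    if score > first_threshold:
        return HIGH
    if score >= second_threshold:   # above second, below first
        return MEDIUM
    return LOW
```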
In one embodiment, the quantized configuration decision module 40 is specifically configured to:
traversing all token positions of the current processing block, and identifying token positions distributed into a high-precision format, a medium-precision format and a low-precision format;
in the counting process, if invalid token positions exist, all the invalid token positions are eliminated, and only the precision format distribution result of the valid token positions is counted;
Counting the counts distributed into a high-precision format, a medium-precision format and a low-precision format in the effective token positions respectively to generate a high-precision count, a medium-precision count and a low-precision count;
Comparing the numerical values of the high-precision count, the medium-precision count and the low-precision count;
Taking the precision format corresponding to the count with the largest value as the unified quantization configuration of the current processing block;
And if the counts of the plurality of precision formats are the same and are the maximum value, selecting the highest precision format in the plurality of precision formats as the unified quantization configuration of the current processing block.
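A sketch of the per-block vote with the tie-break toward the highest precision; because the integer encoding orders precision as 0 < 1 < 2, the tie-break reduces to a `max` over the tied labels. At least one valid token per block is assumed:

```python
from collections import Counter

def unified_config(precision_labels, token_mask):
    # Count only valid token positions, excluding padded ones.
    counts = Counter(
        p for p, valid in zip(precision_labels, token_mask) if valid
    )
    best = max(counts.values())
    # Among formats tied at the maximum count, prefer higher precision.
    return max(p for p, c in counts.items() if c == best)
```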
In one embodiment, the configuration sharing management module 60 is specifically configured to:
Assigning a unique block identifier to each processing block and assigning a unique group identifier to each configuration share group;
establishing a binding relation between the unique group identifier and the unique block identifier in a configuration mapping table;
Associating a first network module of each configuration sharing group with a unified quantized configuration parameter of a corresponding processing block based on a binding relationship of the unique group identifier and the unique block identifier;
Writing the unified quantization configuration parameters into a shared memory area of each configuration shared group, and distributing the same memory address mapping for all network modules of each configuration shared group;
and when other network modules in the same configuration sharing group execute quantization processing, reading the unified quantization configuration parameters from the shared memory area through the memory address mapping.
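The sharing mechanism can be illustrated in ordinary Python by letting every module in a group hold a reference to the same configuration object, standing in for the shared memory area and common address mapping; all names are hypothetical:

```python
def build_shared_configs(group_to_block, block_configs, groups):
    """group_to_block: {group_id: block_id} binding (the mapping table);
    block_configs: {block_id: config dict}; groups: {group_id: [modules]}.
    Returns a module -> config mapping."""
    module_config = {}
    for group_id, modules in groups.items():
        # One shared object per group, resolved through the binding.
        shared = block_configs[group_to_block[group_id]]
        for module in modules:
            module_config[module] = shared  # same reference, not a copy
    return module_config
```

Because every module in a group references the same object, a later recalibration of the group's configuration is immediately visible to all of its modules, mirroring the shared-memory read described above.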
In one embodiment, the inference execution module 70 is specifically configured to:
Loading corresponding unified quantization configuration parameters for each processing block, wherein the unified quantization configuration parameters comprise precision format identifiers;
according to the precision format identifier, configuring independent processing kernel functions for different processing blocks;
in a parallel processing unit of a graphic processor or a tensor processor, processing resources are allocated according to an index sequence of processing blocks, and quantization processing of all the processing blocks is executed concurrently;
Checking the token positions in each processing block to eliminate the intermediate results corresponding to invalid token positions;
Sequencing the effective intermediate results of all the processing blocks according to the initial position index, and splicing the effective intermediate results into a complete model output sequence;
and executing post-processing operation on the spliced output sequences to generate a final model reasoning result.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile and/or volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external user terminal through a network connection. The computer program, when executed by a processor, implements a function or step on the server side of a model quantitative reasoning acceleration method.
In one embodiment, a computer device is provided, which may be a user terminal, and the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external server through a network connection. The computer program, when executed by the processor, implements a function or step on the user side of a model quantitative reasoning acceleration method.
In one embodiment, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
dividing an input text into a plurality of processing blocks, fixing the processing precision format of a first processing block into a high-precision format, and disabling quantization processing of the first processing block;
Generating a self-attention matrix of each other processing block through a language model for the other processing blocks except for the first processing block, determining the sum of total element values of corresponding columns of each token position in the self-attention matrix, and taking the sum of total element values as an importance score of each token position;
Allocating token positions with importance scores greater than a first threshold to a high-precision format, allocating token positions with importance scores above a second threshold and below the first threshold to a medium-precision format, and allocating token positions with importance scores less than the second threshold to a low-precision format;
Counting the number of token positions distributed into a high-precision format, a medium-precision format and a low-precision format in each processing block, and selecting the precision format with the largest number as the unified quantization configuration of the corresponding processing block;
Dividing the network modules of the language model into a plurality of configuration sharing groups, wherein each configuration sharing group at least comprises two network modules;
sharing the unified quantization configuration of the processing block corresponding to the first network module to other network modules in the same configuration sharing group in each configuration sharing group;
And executing block-level batch quantization on all the processing blocks according to the unified quantization configuration corresponding to each processing block, completing model reasoning, and generating a model reasoning result.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
dividing an input text into a plurality of processing blocks, fixing the processing precision format of a first processing block into a high-precision format, and disabling quantization processing of the first processing block;
Generating a self-attention matrix of each other processing block through a language model for the other processing blocks except for the first processing block, determining the sum of total element values of corresponding columns of each token position in the self-attention matrix, and taking the sum of total element values as an importance score of each token position;
Allocating token positions with importance scores greater than a first threshold to a high-precision format, allocating token positions with importance scores above a second threshold and below the first threshold to a medium-precision format, and allocating token positions with importance scores less than the second threshold to a low-precision format;
Counting the number of token positions distributed into a high-precision format, a medium-precision format and a low-precision format in each processing block, and selecting the precision format with the largest number as the unified quantization configuration of the corresponding processing block;
Dividing the network modules of the language model into a plurality of configuration sharing groups, wherein each configuration sharing group at least comprises two network modules;
sharing the unified quantization configuration of the processing block corresponding to the first network module to other network modules in the same configuration sharing group in each configuration sharing group;
And executing block-level batch quantization on all the processing blocks according to the unified quantization configuration corresponding to each processing block, completing model reasoning, and generating a model reasoning result.
It should be noted that the functions or steps that can be implemented by the computer-readable storage medium or the computer device correspond to the descriptions of the server side and the client side in the foregoing method embodiments and, to avoid repetition, are not described again here.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division of the functional units and modules described above is illustrated; in practical applications, the functions may be distributed among different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
It should be noted that any third-party software tools or components that appear in the embodiments of the present application are presented merely by way of example and do not represent actual use. The foregoing embodiments are merely illustrative of the technical solutions of the present application and are not restrictive; although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions made for some of their technical features, without departing from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. The model quantitative reasoning acceleration method is characterized by comprising the following steps of:
dividing an input text into a plurality of processing blocks, fixing the processing precision format of a first processing block into a high-precision format, and disabling quantization processing of the first processing block;
Generating a self-attention matrix of each other processing block through a language model for the other processing blocks except for the first processing block, determining the sum of total element values of corresponding columns of each token position in the self-attention matrix, and taking the sum of total element values as an importance score of each token position;
Allocating token positions with importance scores greater than a first threshold to a high-precision format, allocating token positions with importance scores above a second threshold and below the first threshold to a medium-precision format, and allocating token positions with importance scores less than the second threshold to a low-precision format;
Counting the number of token positions distributed into a high-precision format, a medium-precision format and a low-precision format in each processing block, and selecting the precision format with the largest number as the unified quantization configuration of the corresponding processing block;
Dividing the network modules of the language model into a plurality of configuration sharing groups, wherein each configuration sharing group at least comprises two network modules;
sharing the unified quantization configuration of the processing block corresponding to the first network module to other network modules in the same configuration sharing group in each configuration sharing group;
And executing block-level batch quantization on all the processing blocks according to the unified quantization configuration corresponding to each processing block, completing model reasoning, and generating a model reasoning result.
2. The model quantized reasoning acceleration method of claim 1, wherein dividing the input text into a plurality of processing blocks, fixing a processing precision format of a first processing block to a high precision format, and disabling quantization processing of the first processing block, comprises:
dividing an input text into a plurality of processing blocks with equal length according to a preset block length;
disabling quantization parameter adjustment for all token positions of the first processing block;
Fixing the processing precision formats of the embedded layer, the self-attention layer and the feedforward network layer of the first processing block in the language model into a high-precision format;
when residual tokens with a total length smaller than the preset block length exist at the tail of the input text, taking the text segment formed by the residual tokens as an independent processing block;
filling the independent processing block with invalid tokens up to the preset block length, fixing the processing precision format of the padded independent processing block to the high-precision format, and disabling the quantization operation of the independent processing block;
the starting position index and the ending position index of all the processing blocks are recorded.
3. The model quantitative reasoning acceleration method of claim 1, wherein generating a self-attention matrix for each of the other processing blocks except for a first processing block by a language model for the other processing blocks, determining a sum of total element values of a corresponding column of each token position in the self-attention matrix, and taking the sum of total element values as an importance score of each token position, comprising:
inputting each other processing block into a multi-head self-attention layer of the language model to obtain a local self-attention matrix of a plurality of attention heads corresponding to each other processing block;
Performing weighted average or arithmetic average processing on the local self-attention matrixes of the plurality of attention heads to generate a final self-attention matrix corresponding to each other processing block;
extracting a column vector corresponding to each token position from the final self-attention matrix;
Determining the sum of the values of all elements in each column vector, and taking the sum of the values as an importance score of a token position corresponding to each column vector;
if the filled invalid token position exists in other processing blocks, when the importance score of the filled invalid token position is determined, setting all element values of the column vector corresponding to the invalid token position to zero.
4. The model quantitative reasoning acceleration method of claim 1, wherein assigning token positions with importance scores greater than a first threshold to a high precision format, token positions with importance scores above a second threshold and below the first threshold to a medium precision format, and token positions with importance scores less than the second threshold to a low precision format, comprises:
determining a first threshold and a second threshold according to the length of the processing block;
If the invalid token position exists, the processing precision format of the invalid token position is distributed into a low-precision format, and the invalid token position is eliminated when the number of the precision formats of each processing block is counted;
comparing the importance score of each valid token position to the first and second thresholds;
Assigning the processing precision format of the effective token position with the importance score larger than the first threshold value to a high-precision format;
assigning a processing accuracy format for valid token positions having an importance score above a second threshold and below a first threshold to a medium accuracy format;
the processing precision format of the valid token positions with importance scores less than the second threshold is assigned as the low precision format.
5. The model quantization inference acceleration method of claim 1, wherein counting the number of token positions allocated in high precision format, medium precision format and low precision format in each processing block, selecting the precision format with the largest number as the unified quantization configuration of the corresponding processing block, comprises:
traversing all token positions of the current processing block, and identifying token positions distributed into a high-precision format, a medium-precision format and a low-precision format;
in the counting process, if invalid token positions exist, all the invalid token positions are eliminated, and only the precision format distribution result of the valid token positions is counted;
Counting the counts distributed into a high-precision format, a medium-precision format and a low-precision format in the effective token positions respectively to generate a high-precision count, a medium-precision count and a low-precision count;
Comparing the numerical values of the high-precision count, the medium-precision count and the low-precision count;
Taking the precision format corresponding to the count with the largest value as the unified quantization configuration of the current processing block;
And if the counts of the plurality of precision formats are the same and are the maximum value, selecting the highest precision format in the plurality of precision formats as the unified quantization configuration of the current processing block.
6. The method for accelerating model quantization reasoning as set forth in claim 1, wherein sharing the unified quantization configuration of the processing block corresponding to the first network module to other network modules in the same configuration sharing group in each configuration sharing group includes:
Assigning a unique block identifier to each processing block and assigning a unique group identifier to each configuration share group;
establishing a binding relation between the unique group identifier and the unique block identifier in a configuration mapping table;
Associating a first network module of each configuration sharing group with a unified quantized configuration parameter of a corresponding processing block based on a binding relationship of the unique group identifier and the unique block identifier;
Writing the unified quantization configuration parameters into a shared memory area of each configuration shared group, and distributing the same memory address mapping for all network modules of each configuration shared group;
and when other network modules in the same configuration sharing group execute quantization processing, reading the unified quantization configuration parameters from the shared memory area through the memory address mapping.
7. The model quantization reasoning acceleration method of claim 1, wherein performing block-level batch quantization on all processing blocks according to the unified quantization configuration corresponding to each processing block and completing model reasoning to generate a model reasoning result comprises:
Loading the corresponding unified quantization configuration parameters for each processing block, wherein the unified quantization configuration parameters comprise a precision format identifier;
Configuring an independent processing kernel function for each processing block according to its precision format identifier;
In a parallel processing unit of a graphics processor or a tensor processor, allocating processing resources according to the index order of the processing blocks, and concurrently executing the quantization processing of all processing blocks;
Checking the token positions in each processing block to eliminate intermediate results corresponding to invalid token positions;
Sorting the valid intermediate results of all processing blocks by their starting position indices and concatenating them into a complete model output sequence;
And performing post-processing operations on the concatenated output sequence to generate the final model reasoning result.
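The assembly steps at the end of claim 7 (drop invalid token positions, order blocks by starting index, concatenate) can be sketched as follows. The block record layout ('start', 'valid', 'results') is an assumption made for illustration.

```python
# Hypothetical sketch of the output-assembly steps in claim 7: each processed
# block carries its starting position index, a validity mask over its token
# positions, and its intermediate results. Invalid (e.g. padded) positions are
# eliminated, blocks are sorted by starting index, and the survivors are
# concatenated into one output sequence.
def assemble_output(blocks):
    """blocks: list of dicts with keys 'start', 'valid' (mask), 'results'."""
    ordered = sorted(blocks, key=lambda b: b["start"])
    output = []
    for block in ordered:
        # Keep only the intermediate results at valid token positions.
        output.extend(r for r, ok in zip(block["results"], block["valid"]) if ok)
    return output

blocks = [
    {"start": 4, "valid": [True, False], "results": ["t4", "pad"]},
    {"start": 0, "valid": [True, True], "results": ["t0", "t1"]},
]
print(assemble_output(blocks))  # ['t0', 't1', 't4']
```

Sorting happens before filtering here, but the two steps commute; what matters is that the padded position ("pad") never reaches the final sequence and that block order follows the starting indices, not arrival order.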
8. A model quantized reasoning acceleration apparatus, characterized in that the model quantized reasoning acceleration apparatus comprises:
The input text preprocessing module is used for dividing an input text into a plurality of processing blocks, fixing the processing precision format of the first processing block to a high-precision format, and disabling quantization processing for the first processing block;
The self-attention analysis module is used for generating, through a language model, a self-attention matrix for each processing block other than the first processing block, computing the sum of all element values in the column corresponding to each token position in the self-attention matrix, and taking that sum as the importance score of the token position;
The precision allocation module is used for assigning a high-precision format to token positions with importance scores greater than a first threshold, a medium-precision format to token positions with importance scores not less than a second threshold and not greater than the first threshold, and a low-precision format to token positions with importance scores smaller than the second threshold;
The quantization configuration decision module is used for counting, in each processing block, the number of token positions assigned to the high-precision, medium-precision, and low-precision formats, and selecting the precision format with the largest count as the unified quantization configuration of the corresponding processing block;
The network module grouping control module is used for dividing the network modules of the language model into a plurality of configuration sharing groups, each configuration sharing group comprising at least two network modules;
The configuration sharing management module is used for sharing, in each configuration sharing group, the unified quantization configuration of the processing block corresponding to the first network module with the other network modules in the same configuration sharing group;
And the reasoning execution module is used for executing block-level batch quantization on all processing blocks according to the unified quantization configuration corresponding to each processing block and completing model reasoning to generate a model reasoning result.
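The scoring and allocation performed by the self-attention analysis and precision allocation modules can be sketched directly: an importance score is the column sum of the self-attention matrix at a token position, and two thresholds split positions into three precision bands. The matrix values and threshold values below are made-up illustrations.

```python
# Illustrative sketch of claim 8's self-attention analysis and precision
# allocation modules: the importance score of token position j is the sum of
# column j of the self-attention matrix; positions above the first threshold
# get high precision, positions below the second threshold get low precision,
# and positions in between get medium precision.
def importance_scores(attn):
    """attn: square self-attention matrix as a list of rows."""
    n = len(attn)
    return [sum(row[j] for row in attn) for j in range(n)]

def allocate_precision(scores, first_thresh, second_thresh):
    formats = []
    for s in scores:
        if s > first_thresh:
            formats.append("high")
        elif s < second_thresh:
            formats.append("low")
        else:
            formats.append("medium")  # between the two thresholds, inclusive
    return formats

attn = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
]
scores = importance_scores(attn)  # column sums, roughly [1.8, 0.6, 0.6]
print(allocate_precision(scores, 1.0, 0.3))  # ['high', 'medium', 'medium']
```

Column sums are used (rather than row sums) because column j aggregates how much attention every token pays *to* position j, which is what makes it a plausible importance measure for that position.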
9. A computer device, comprising a memory, a processor, and a model quantized inference acceleration program stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the model quantized inference acceleration method according to any one of claims 1-7.
10. A computer-readable storage medium, wherein the storage medium stores a model quantized inference acceleration program which, when executed by a processor, implements the steps of the model quantized inference acceleration method according to any one of claims 1-7.
CN202510525474.6A 2025-04-25 2025-04-25 Model quantitative reasoning acceleration method, device, equipment and medium Active CN120086355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510525474.6A CN120086355B (en) 2025-04-25 2025-04-25 Model quantitative reasoning acceleration method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN120086355A CN120086355A (en) 2025-06-03
CN120086355B true CN120086355B (en) 2025-07-22

Family

ID=95845292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510525474.6A Active CN120086355B (en) 2025-04-25 2025-04-25 Model quantitative reasoning acceleration method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN120086355B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120494006B (en) * 2025-07-17 2025-09-12 红有软件股份有限公司 Dynamic optimization of large model inference performance and hardware-aware compression method
CN121070445B (en) * 2025-11-10 2026-02-06 上海壁仞科技股份有限公司 Neural network computation methods and devices, electronic devices and storage media

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113902108A (en) * 2021-11-24 2022-01-07 贵州电网有限责任公司 Neural network acceleration hardware architecture and method for quantizing bit width dynamic selection
CN118394895A (en) * 2024-03-19 2024-07-26 北京百度网讯科技有限公司 Text reasoning acceleration method and related device applied to large language model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8566259B2 (en) * 2009-09-04 2013-10-22 The Regents Of The University Of California Method and system for parallel statistical inference on highly parallel platforms
US20250094712A1 (en) * 2024-12-02 2025-03-20 Intel Corporation Multi-granular clustering-based solution for key-value cache compression



Similar Documents

Publication Publication Date Title
CN120086355B (en) Model quantitative reasoning acceleration method, device, equipment and medium
Zheng et al. AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures
CN120235181A (en) An automated construction method for end-to-end intelligent agents based on graph structure semantic fusion
CN118245373B (en) A software defect prediction method, system, terminal and storage medium
Kleine Büning et al. Verifying equivalence properties of neural networks with relu activation functions
CN114661665B (en) Execution engine determination method, model training method and device
CN120654816A (en) Large model inference operator optimization method, device and equipment based on rising chip
CN112949825B (en) Resource adjustment method, device and equipment
CN117764057A (en) Utilize syntax parsing to implement instruction execution method for unified data across platforms
US11488006B2 (en) Discretization of numerical values with adaptive accuracy
Luan et al. Timing performance benchmarking of out-of-distribution detection algorithms
CN120610944A (en) A bloodline path tracing system and method based on quantum coding and GPU acceleration
Escobar et al. Improving memory accesses for heterogeneous parallel multi-objective feature selection on EEG classification
CN118445557A (en) Missing data complement method, device method and electronic equipment
Li et al. An attack detection mechanism in smart contracts based on deep learning and feature fusion
CN116360960A (en) Memory allocation method and memory allocation device based on many-core chip
CN120803962B (en) Platform and method for automatically testing and generating kernel export function of operating system
CN120631736B (en) Library entity dependence conflict recognition method for Python environment
Zhao et al. Funcgnn: Learning functional semantics of logic circuits with graph neural networks
US20250383989A1 (en) Non-contiguous attention mask for key-value (kv) cache management for fixed-length transformer models
US20250378381A1 (en) Task-specific modification of pre-trained language models
CN121210525A (en) Graph query processing method and system based on multi-agent collaboration
CN120950354A (en) Front-end code performance testing methods, devices, electronic equipment and storage media
Yu et al. Intra-node transaction parallelism in blockchains: Models, solutions, and trends
CN120510497A (en) Fake image detection method, device, equipment and medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant