WO2023093015A1 - Data screening method and apparatus, device and storage medium - Google Patents

Data screening method and apparatus, device and storage medium

Info

Publication number
WO2023093015A1
WO2023093015A1 PCT/CN2022/099815 CN2022099815W
Authority
WO
WIPO (PCT)
Prior art keywords
data
model
business
business data
updating
Prior art date
Application number
PCT/CN2022/099815
Other languages
English (en)
Chinese (zh)
Inventor
秦铎浩
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 filed Critical 北京百度网讯科技有限公司
Publication of WO2023093015A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present disclosure relates to the field of computer technology, in particular to the fields of artificial intelligence, big data, deep learning, and data reflow technology, and more particularly to a data screening method, apparatus, device, and storage medium.
  • the neural network model based on deep learning can be applied to more and more scenarios, such as target detection, target recognition, target classification, etc.
  • the disclosure provides a data screening method, device, equipment and storage medium.
  • a data screening method including:
  • the business data is screened based on the degree of influence of the business data on the model to obtain data for updating the model; wherein the degree of influence reflects the magnitude of the impact on the update performance of the model.
  • a data screening device comprising:
  • An acquisition module used to acquire business data
  • a screening module configured to screen the business data based on the degree of influence of the business data on the model to obtain data for updating the model; wherein the degree of influence reflects the magnitude of the impact on the update performance of the model.
  • an electronic device including:
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method described in the first aspect.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method according to the first aspect.
  • a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect.
  • the disclosure screens business data, avoiding retraining on all business data to update the model, and can reduce the amount of data used for model updating.
  • FIG. 1 is a flowchart of a data screening method according to an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a data screening method according to another embodiment of the present disclosure.
  • FIG. 3 is a schematic structural diagram of a data screening device provided by an embodiment of the present disclosure.
  • Fig. 4 is another schematic structural diagram of a data screening device provided by an embodiment of the present disclosure.
  • FIG. 5 is a block diagram of an electronic device used to implement the data screening method of the embodiment of the present disclosure.
  • the business data generated by the prediction service is collected and stored in the corresponding file storage; organization personnel then manually label the business data and store it in the training set after labeling is completed, and retraining is then performed based on the training set to update the prediction service.
  • the prediction service is realized by using the neural network model, and updating the prediction service means updating the neural network model.
  • labeling business data is very cumbersome: the overall workload of manual labeling is very large, all business data generated every day must be labeled, the amount of data generated every day is very large, and the cost of manual labeling is therefore very high.
  • the update of the neural network model is implemented based on the labeled data.
  • Data reflow is the process from the business data involved in the prediction service to the regeneration of a new data set. It can also be understood as the process of using business data to obtain data for model update.
  • the embodiment of the present disclosure provides a data screening method to screen business data, avoid retraining on all business data to update the model, reduce the amount of data used for model update, and improve the efficiency of model update.
  • Simply understood, this optimizes data reflow, returning unlabeled data to the training data faster and more effectively.
  • filtering the business data used for retraining for model update can reduce the amount of retraining data during the model update process, thereby improving the efficiency of model update.
  • filtering the data before labeling to update the model can avoid labeling all business data, reduce the amount of labeled data, and reduce the cost of labeling; using the filtered data to update the model and selectively labeling data can reduce the time consumed by model update and improve the efficiency of model update.
  • electronic devices may include servers, terminals, and so on.
  • An embodiment of the present disclosure provides a data screening method, which may include:
  • the business data is screened based on the degree of influence of the business data on the model to obtain the data used to update the model; wherein the degree of influence reflects the magnitude of the impact on the model update performance.
  • the business data can be screened based on the degree of influence of the business data on the model, which avoids retraining on all business data to update the model and reduces the amount of data needed for model update, thereby improving the efficiency of model updating.
  • Fig. 1 is a flowchart of a data screening method provided by an embodiment of the present disclosure.
  • the data screening method provided by the embodiment of the present disclosure may include the following steps:
  • Business data refers to data in business scenarios.
  • in the target detection scenario, the business data is the data corresponding to the target detection result; in the target classification scenario, it is the data corresponding to the classification result; in the target recognition scenario, it is the data corresponding to the recognition result.
  • the business data may be data generated in a business scenario using a model.
  • the degree of influence reflects the degree of influence on the model update performance.
  • a high influence degree reflects a large influence on the model update performance, and a low influence degree reflects a small influence on the model update performance.
  • Simply understood, the business data with a high degree of influence on the model, that is, with a relatively large impact on the model update performance, is filtered out from the multiple business data, and the model is updated based on the filtered business data.
  • Update performance may include update rate and/or accuracy.
  • the degree of influence of each business data on the model can be determined, that is, the influence of each business data on the model update performance; for example, the influence of each business data on the model update rate and/or accuracy can be determined respectively. Then, the business data with a high degree of influence, that is, the business data with a large impact on the model update performance, is selected as the filtered data for updating the model, so that the business data that makes the update rate higher and the accuracy higher can be selected as the data used to update the model.
  • an influence degree threshold may be set in advance, and for each business data, when the influence degree of the business data on the model update performance is not less than the influence degree threshold, the business data may be used as data for updating the model.
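The threshold rule above can be sketched as follows; the `influence_of` scoring function and the threshold value are hypothetical placeholders, since the disclosure leaves the concrete influence measure open (e.g. label/prediction difference or information gain):

```python
def screen_by_influence(business_data, influence_of, threshold):
    """Keep only business data whose influence degree is not less than the threshold."""
    return [x for x in business_data if influence_of(x) >= threshold]

# Toy influence scores for three business-data items:
scores = {"sample_a": 0.9, "sample_b": 0.2, "sample_c": 0.7}
selected = screen_by_influence(sorted(scores), scores.get, threshold=0.5)
# `selected` now holds only the high-influence items
```

Items below the preset influence threshold are simply excluded from the model-update set.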
  • S102 may include:
  • the business data is screened based on the business tags and prediction data corresponding to the business data to obtain data for updating the model; the prediction data is data obtained by using the model for the business data.
  • the business data is screened based on the information gain corresponding to the business data to obtain data for updating the model, and the information gain is proportional to the degree of influence.
  • in both cases, the business data is screened, so as to realize the screening of data participating in model update, reduce the amount of data used for model update, and reduce the complexity of model update, thereby improving the efficiency of model updates.
  • screening the business data used for retraining for model update can reduce the amount of retraining data and improve the efficiency of model update.
  • selectively labeling data and reducing the amount of labeled data can reduce the cost of labeling, and using the filtered data to update the model can reduce the time consumed by model update and improve the efficiency of model update.
  • in response to the business data containing business tags, screening the business data based on the business tags and the prediction data corresponding to the business data to obtain data for updating the model may include:
  • the business label can be compared with the predicted data; in response to the difference between the business label and the predicted data being not less than a preset difference value, the business data is used as data for updating the model.
  • if the difference between the business label corresponding to the business data and the predicted data is relatively small, it can be understood that the accuracy of the model prediction is relatively high. In this case, the contribution of the business data to the model update is relatively small. In order to reduce the amount of retraining data during the model update process, such business data can be deleted; that is, it is no longer used as business data for model update.
  • if the difference between the business label corresponding to the business data and the predicted data is relatively large, it can be understood that the accuracy of the model prediction is relatively low, and the business data can be understood as a failed or wrong sample.
  • the model can be updated based on the difference between the business label and the predicted data, adjusting the model parameters to make the model prediction more accurate. That is, it can be understood that such business data contributes a lot to the model update; therefore, it can be used as the business data for model update, and the model is updated based on it.
  • the business data can be input into the model, and the prediction data corresponding to the business data can be output through the model. The business label is then compared with the predicted data: if the difference between the business label and the predicted data is relatively small, for example smaller than the preset difference value, the business data is not used as data to update the model; if the difference is relatively large, for example not less than the preset difference value, the business data is used as data for updating the model.
  • the preset difference value may be determined according to actual requirements.
  • the difference between the business label corresponding to the business data and the predicted data reflects the degree of influence of the business data on the model.
  • the difference is proportional to the degree of influence, that is, the greater the difference, the higher the degree of influence, that is, the greater the impact on the model update performance.
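A minimal sketch of this label-versus-prediction screening; the `model` and `diff` callables and the toy data are illustrative assumptions, not the disclosure's concrete implementation:

```python
def screen_labeled_business_data(samples, model, diff, preset_diff):
    """Keep samples whose label/prediction difference is not less than preset_diff.

    samples: iterable of (features, business_label) pairs
    model:   callable mapping features -> prediction data
    diff:    callable measuring the difference between label and prediction
    """
    selected = []
    for features, label in samples:
        prediction = model(features)
        if diff(label, prediction) >= preset_diff:
            # a "failed" sample: the model got it wrong, so it is kept
            # as data for updating the model
            selected.append((features, label))
    return selected

# Toy example: a regression "model" that doubles its input.
data = [(1, 2), (2, 5), (3, 6)]
kept = screen_labeled_business_data(
    data, model=lambda x: 2 * x, diff=lambda a, b: abs(a - b), preset_diff=1
)
```

Only the sample the model mispredicts survives the screening, mirroring the "failed case" selection described above.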
  • Using the screened data for retraining to update the model can optimize the model with a relatively small amount of data compared with using all business data, making the prediction result of the model more accurate, and the model optimization effect can be achieved faster.
  • a picture of a handwritten signature can be understood as business data, and the name field can be extracted from the picture of the handwritten signature.
  • The picture contains the business label: the name field.
  • the picture can be named with the name field, so that the corresponding relationship between the picture and the business tags contained in the picture can be obtained by using the file name of the picture.
  • the picture of the handwritten signature can be input into an image recognition model, and the model can output a predicted value, that is, the predicted name field. The predicted name field is compared with the name field extracted from the business system, and the failed cases (samples), that is, the business data such as pictures of handwritten signatures whose difference between the predicted name field and the extracted name field is not less than the preset difference value, are screened out and used as the data for updating the model. This realizes the screening of business data and reduces the amount of data for retraining.
  • using failed cases for retraining to update the model can optimize the model faster and better.
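As a hedged sketch of the handwritten-signature example: the recognizer call, the 0.2 difference threshold, and difflib's similarity ratio are all assumptions standing in for whatever field comparison the business system actually uses.

```python
import difflib
from pathlib import Path

def screen_signature_pictures(picture_paths, recognize, min_difference=0.2):
    """Keep pictures whose predicted name field differs enough from the
    business label stored in the file name (a "failed case")."""
    failed_cases = []
    for path in picture_paths:
        label = Path(path).stem                  # picture named with the name field
        predicted = recognize(path)              # hypothetical recognition model call
        similarity = difflib.SequenceMatcher(None, label, predicted).ratio()
        if 1.0 - similarity >= min_difference:   # large difference: failed case
            failed_cases.append(path)
    return failed_cases

# Stand-in recognizer: correct on the first picture, wrong on the second.
predictions = {"zhang_san.png": "zhang_san", "li_si.png": "wang_wu"}
retrain_set = screen_signature_pictures(sorted(predictions), predictions.get)
```

The correctly recognized picture is dropped; only the mispredicted one is kept for retraining.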
  • in response to the business data not containing business tags, the business data is screened based on the information gain corresponding to the business data to obtain data for updating the model.
  • Information gain reflects the degree of influence of the business data on the model and is directly proportional to it. Simply understood, the greater the information gain, the higher the influence on the model, that is, the greater the impact on the model update performance; it can also be understood that the data is more useful for updating the model.
  • business data that does not contain business tags must first be labeled manually, and the cost of manually labeling a large amount of business data is relatively high.
  • the business data that does not contain business tags are screened before labeling, so that the amount of data that needs to be labeled can be reduced, the cost of labeling can be reduced, the efficiency of model update can be improved, and the amount of labeled data can be reduced. In turn, the amount of data for retraining is reduced, and the efficiency of model updating is further improved.
  • business data with a relatively large information gain is selected for the subsequent model update.
  • the information gain corresponding to the business data can be calculated; in response to the information gain being not less than the preset gain value, the business data is used as the data to be marked.
  • the preset gain value may be determined according to actual requirements.
  • the information gain corresponding to the business data can be calculated through the following information gain function:
  • I(y; θ | X, D_train) = H[ p(y | X, D_train) ] − E_{p(θ | D_train)}[ H[ p(y | X, θ) ] ]
  • where D_train represents the training data of the model to be updated, X represents the business data, θ represents the model parameters, p(y | X, D_train) represents the predictive distribution for a given X and D_train, p(θ | D_train) represents the probability of θ based on the given D_train, and H[·] denotes the entropy.
  • randomness can be added on the basis of the initial model, for example by means of the Monte Carlo dropout method: on each pass, some neurons in the model are randomly selected and temporarily hidden (dropped), and the model is then used to obtain the prediction data of that iteration.
  • For details of Monte Carlo dropout, refer to the dropout mechanism in the related art, which will not be repeated here.
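As an illustrative sketch of Monte Carlo dropout, the toy one-hidden-layer classifier below (not the disclosure's actual model) runs several stochastic forward passes, hiding hidden units at random so that each pass yields its own predictive distribution:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mc_dropout_predict(x, hidden_w, out_w, n_passes=50, p_drop=0.5, seed=0):
    """Run n_passes stochastic forward passes; on each pass some hidden
    neurons are randomly hidden (dropped), giving one predictive
    distribution per pass."""
    rng = random.Random(seed)
    distributions = []
    for _ in range(n_passes):
        # ReLU hidden layer
        hidden = [max(0.0, sum(w * xi for w, xi in zip(ws, x))) for ws in hidden_w]
        # dropout: each hidden unit is kept with probability 1 - p_drop
        masked = [h if rng.random() >= p_drop else 0.0 for h in hidden]
        logits = [sum(w * h for w, h in zip(ws, masked)) for ws in out_w]
        distributions.append(softmax(logits))
    return distributions

# Toy 2-input / 2-hidden / 2-class network:
dists = mc_dropout_predict([2.0, 0.0],
                           hidden_w=[[1.0, 0.0], [0.0, 1.0]],
                           out_w=[[1.0, 0.0], [0.0, 1.0]])
```

Each element of `dists` is one pass's class distribution; averaging them approximates p(y | X, D_train).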
  • through multiple such stochastic forward passes, the multiple prediction data corresponding to the sampled parameters θ can be calculated, so that both p(y | X, D_train) and the per-pass distributions p(y | X, θ), that is, the prediction under the condition of a given θ, can be obtained.
  • the information gain can be compared with the preset gain value, and in response to the information gain being not less than the preset gain value, the service data X is used as the data to be labeled.
  • the information gain function can be used to accurately calculate the information gain, accurately reflecting the impact of the business data on the model update, so that the data with a high impact on the model update performance, that is, the data most useful for the model update, can be screened out more accurately, thereby greatly reducing the amount and cost of manually labeled data and improving the efficiency of model updating.
  • the quantity calculated by the above information gain function can be understood as the mutual information between the prediction and the model parameters θ.
  • a form of mutual information is adopted so as to maximize the information gain with respect to the model parameters.
  • when the model prediction is uncertain, the entropy H[ p(y | X, D_train) ] is relatively large; when the model predicts a single point with a higher probability (that is, with certainty), H[ p(y | X, θ) ] is small. The goal of screening through the information gain function is to screen out the samples (business data) that can most reduce the parameter uncertainty, that is, to screen out the business data that makes the information gain large.
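Under this mutual-information reading, the gain can be estimated from Monte Carlo dropout outputs; the sketch below assumes the input is a list of per-pass class distributions:

```python
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def information_gain(mc_distributions):
    """Mutual-information style gain from Monte Carlo dropout passes:
    H[mean prediction] minus the mean per-pass entropy. The gain is
    large when the passes are individually confident yet disagree with
    each other, i.e. when labeling the sample would most reduce the
    parameter uncertainty."""
    n = len(mc_distributions)
    n_classes = len(mc_distributions[0])
    mean = [sum(d[c] for d in mc_distributions) / n for c in range(n_classes)]
    expected_entropy = sum(entropy(d) for d in mc_distributions) / n
    return entropy(mean) - expected_entropy

# Confident but disagreeing passes -> high gain; agreeing passes -> near zero.
gain_disagree = information_gain([[0.99, 0.01], [0.01, 0.99]])
gain_agree = information_gain([[0.99, 0.01], [0.99, 0.01]])
```

Samples whose gain is not less than the preset gain value would then be kept as data to be labeled.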
  • a preset number of data, such as 10 or 20, is randomly selected from the reflowed data for labeling, and a preliminary model is first trained based on this labeled data; the remaining data is then screened through the filter function (the information gain function above); the screened data is then labeled, and finally the model is updated by using the labeled screened data.
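The loop just described can be sketched as follows; the callables `label_fn`, `train_fn`, and `gain_fn`, the seed size, and the gain threshold are illustrative assumptions rather than the disclosure's concrete implementation:

```python
import random

def active_learning_reflow(unlabeled, label_fn, train_fn, gain_fn,
                           seed_size=10, gain_threshold=0.5, rounds=3, seed=0):
    """Label a small random seed set, train a preliminary model, then in
    each round screen the remaining pool with the information gain
    function, label the selected data, and retrain."""
    rng = random.Random(seed)
    pool = list(unlabeled)
    rng.shuffle(pool)
    labeled = [(x, label_fn(x)) for x in pool[:seed_size]]
    pool = pool[seed_size:]
    model = train_fn(labeled)
    for _ in range(rounds):
        selected = [x for x in pool if gain_fn(model, x) >= gain_threshold]
        if not selected:
            break                      # nothing left with a high enough gain
        labeled += [(x, label_fn(x)) for x in selected]
        pool = [x for x in pool if x not in selected]
        model = train_fn(labeled)      # update the model on the screened data
    return model, labeled

# Toy run: "training" just counts labeled samples; items below 5 are high-gain.
model, labeled = active_learning_reflow(
    range(20),
    label_fn=lambda x: x % 2,
    train_fn=len,
    gain_fn=lambda m, x: 1.0 if x < 5 else 0.0,
)
```

Only the seed set plus the high-gain items ever get labeled; the low-gain remainder of the pool is never sent for annotation.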
  • after the data to be labeled is obtained, it can be labeled, and the labeled data can be used to update the model.
  • the process of using the labeled data to update the model is similar to the training process of the neural network model in the related art.
  • the embodiments of the present disclosure perform screening on unstructured data to realize reflow of unstructured data.
  • according to the type of data, data can be divided into structured data and unstructured data.
  • Structured data is highly organized and neatly formatted data. It is the type of data that can be put into tables and spreadsheets. Structured data is also known as quantitative data. It is information that can be represented by numbers or uniform structures, such as numbers and symbols.
  • Unstructured data is essentially everything except structured data. It does not conform to any predefined model, is stored in a non-relational database, may be textual or non-textual, and may be human or machine-generated. Simply put, unstructured data is field variable data. Unstructured data is not easily organized or formatted, and collecting, processing and analyzing unstructured data is also a major challenge. For example, structured data is in the form of text tables, and unstructured data is in the form of images.
  • unstructured data is more difficult to collect, process and analyze. It is also understandable that labeling for unstructured data is more cumbersome.
  • data screening can be performed on unstructured data, that is, the business data is unstructured data, so as to selectively label the unstructured data, reduce the amount and cost of labeling, and thereby improve the speed of model update.
  • the process of unstructured data reflow is also optimized to further reduce the cost of labeling during unstructured data reflow; the reflow data is filtered through active learning to select the samples that are more useful for the final result, significantly reducing the amount and cost of manual labeling.
  • the business data can be understood as the data to be returned, which can be the data generated in the business scenario, and the model is updated based on these data.
  • the business data may be data generated in a business scenario using the model.
  • the business data is business data generated in other ways in the business scenario. A preset number of business data can first be selected from the business data for training to obtain an initial model; then, the initial model is updated based on the business data other than the preset number of business data.
  • if the business data contains business tags, it can be screened based on the actual labels (the above business labels) and the predicted data.
  • the business label corresponding to the business data is compared with the predicted data; in response to the difference between the business label and the predicted data being not less than a preset difference value, the business data is used as data for updating the model.
  • the screening results can be saved in the data set.
  • the data can be obtained from the data set to retrain the model to update the model.
  • when the business data does not contain business tags, it can be screened through active learning, and the screening can be repeated multiple times; the business data is screened based on the information gain corresponding to the business data.
  • the information gain corresponding to the service data is calculated through the above information gain function; in response to the information gain being not less than a preset gain value, the service data is used as the data to be labeled.
  • the screening of service data based on the information gain corresponding to the service data has been described in detail in the foregoing embodiments, and will not be repeated here.
  • the active learning method may be repeated multiple times for multiple business data, for example, repeated N times, where N is greater than 1.
  • the information gain is calculated separately for multiple business data, and the business data is screened based on the information gain corresponding to the business data.
  • the process of calculating the information gain for one business data can also be repeated multiple times; that is, for one business data the information gain is calculated multiple times, and one of the results is selected for subsequent screening (for example, selected randomly), or statistics over the multiple information gains, such as the average value and the variance, can be calculated and the subsequent screening performed based on these statistical values.
  • the data to be labeled can be labeled and saved to the dataset.
  • the filtered data for updating the model may be saved.
  • the data used to update the model can be saved by means of incremental saving; or, the data used for updating the model can be saved by means of full saving.
  • Incremental saving means saving changed data. Specifically, only the data obtained by the current screening may be saved, and the data before the current screening may be deleted.
  • Saving in full means saving all the data used to update the model, specifically, saving the currently filtered data on the basis of the data before the current filtering.
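The two saving modes can be sketched as follows; the function names and the day-tagged toy data are illustrative:

```python
def save_incremental(saved, newly_screened):
    """Incremental saving: keep only the data obtained by the current
    screening; data saved before it is discarded."""
    return list(newly_screened)

def save_full(saved, newly_screened):
    """Full saving: keep the currently screened data on top of all data
    saved before the current screening."""
    return list(saved) + list(newly_screened)

existing = ["day1_a", "day1_b"]
new_batch = ["day2_a"]
time_sensitive_set = save_incremental(existing, new_batch)   # e.g. tracking models
generalization_set = save_full(existing, new_batch)          # e.g. classification models
```

The incremental set biases retraining toward recent samples, while the full set preserves the whole distribution for better generalization.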
  • the incremental storage method is more suitable for the training of time-sensitive data.
  • the model can be more focused on the distribution of sample data that is closer in time; retraining on fully saved data gives the model better generalization to the whole.
  • in the model update process, when the model is more time-sensitive, for example when training a model for object tracking, the incrementally saved data can be selected for retraining to update the model.
  • when the model needs better generalization to the whole, for example in classification scenarios and detection scenarios, the fully saved data can be selected for retraining to update the model.
  • An embodiment of the present disclosure also provides a data screening device, as shown in FIG. 3 , which may include:
  • An acquisition module 301 configured to acquire business data
  • the screening module 302 is configured to screen the business data based on the degree of influence of the business data on the model to obtain data for updating the model; wherein the degree of influence reflects the magnitude of the influence on the model update performance.
  • the screening module 302 is also configured to: in response to the business data containing business tags, screen the business data based on the business tags and the prediction data corresponding to the business data to obtain data for updating the model, the prediction data being data obtained by applying the model to the business data; and, in response to the business data not containing business tags, screen the business data based on the information gain corresponding to the business data to obtain data for updating the model, the information gain being proportional to the degree of influence.
  • the screening module 302 is also configured to: compare the business label with the predicted data; and use the business data as the data for updating the model in response to the difference between the business label and the predicted data being not less than a preset difference value.
  • the screening module 302 is also configured to: calculate the information gain corresponding to the business data; in response to the information gain being not less than the preset gain value, use the business data as the data to be labeled; and label the data to be labeled to obtain the data for updating the model.
  • the screening module 302 is also used to: calculate the information gain corresponding to the business data through the following information gain function;
  • I(y; θ | X, D_train) = H[ p(y | X, D_train) ] − E_{p(θ | D_train)}[ H[ p(y | X, θ) ] ]
  • where D_train represents the training data of the model to be updated, X represents the business data, θ represents the model parameters, p(y | X, D_train) represents the predictive distribution for a given X and D_train, and p(θ | D_train) represents the probability of θ based on the given D_train.
  • the device also includes:
  • the saving module 401 is configured to save the data used for updating the model by way of incremental saving; or save the data used for updating the model by way of full saving.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 5 shows a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure.
  • Electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 500 includes a computing unit 501 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 can also store the various programs and data necessary for the operation of the device 500.
  • the computing unit 501, ROM 502, and RAM 503 are connected to each other through a bus 504.
  • An input/output (I/O) interface 505 is also connected to the bus 504 .
  • the I/O interface 505 includes: an input unit 506, such as a keyboard, a mouse, etc.; an output unit 507, such as various types of displays, speakers, etc.; a storage unit 508, such as a magnetic disk, an optical disk, etc.; and a communication unit 509, such as a network card, a modem, a wireless communication transceiver, and the like.
  • the communication unit 509 allows the device 500 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • the computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, etc.
  • the computing unit 501 executes the various methods and processes described above, such as the data screening method.
  • the data screening method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 508 .
  • part or all of the computer program may be loaded and/or installed on the device 500 via the ROM 502 and/or the communication unit 509.
  • When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the data screening method described above may be performed.
  • the computing unit 501 may be configured to execute the data screening method in any other suitable manner (for example, by means of firmware).
  • Various implementations of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips Implemented in a system of systems (SOC), complex programmable logic device (CPLD), computer hardware, firmware, software, and/or combinations thereof.
  • the programmable processor may be a special-purpose or general-purpose programmable processor, and may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • program codes for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing device, so that the program codes, when executed by the processor or controller, cause the functions/actions specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • more specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • to provide interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
  • the systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include: a local area network (LAN), a wide area network (WAN), and the Internet.
  • a computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • the server can be a cloud server, a server of a distributed system, or a server combined with a blockchain.
  • steps may be reordered, added, or deleted using the various forms of flow shown above.
  • each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure relates to the technical field of computers and, in particular, to the technical fields of artificial intelligence, big data, deep learning, and reverse data flow, and provides a data screening method and apparatus, a device, and a storage medium. The specific implementation mechanism consists in: acquiring business data; and screening the business data based on the degree of influence of the business data on a model to obtain data used for updating the model, the degree of influence reflecting the influence on model-update performance. The business data is screened, and retraining on all the business data to update the model is avoided, so that the volume of model-update data can be reduced.
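The influence-based screening summarized in the abstract can be sketched roughly as follows. This is a hypothetical illustration, not the patent's actual implementation: the function names, the linear model, and the use of per-sample squared prediction error as a stand-in proxy for "degree of influence on the model" are all assumptions made for the example.

```python
# Hypothetical sketch: score each business-data sample by an assumed influence
# proxy (squared prediction error of a simple linear model), then keep only the
# most influential samples as the data used for updating the model.

def influence_scores(weights, samples):
    """Score each (features, label) pair by squared prediction error."""
    scores = []
    for features, label in samples:
        pred = sum(w * x for w, x in zip(weights, features))
        scores.append((pred - label) ** 2)
    return scores

def screen_data(weights, samples, keep_ratio=0.2):
    """Keep the fraction of samples with the highest influence scores."""
    scores = influence_scores(weights, samples)
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    k = max(1, int(len(samples) * keep_ratio))
    return [samples[i] for i in ranked[:k]]

if __name__ == "__main__":
    weights = [0.5, -0.2]
    business_data = [
        ([1.0, 2.0], 0.1),   # well predicted -> low influence
        ([3.0, 1.0], 5.0),   # badly predicted -> high influence
        ([0.5, 0.5], 0.15),
        ([2.0, 2.0], -4.0),  # badly predicted -> high influence
        ([1.5, 1.0], 0.55),
    ]
    kept = screen_data(weights, business_data, keep_ratio=0.4)
    print(len(kept))  # prints 2
```

Only the retained subset would then be fed into the model-update step, which is how screening avoids retraining on all the business data.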
PCT/CN2022/099815 2021-11-23 2022-06-20 Procédé et appareil de filtrage de données, dispositif et support de stockage WO2023093015A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111394304.7A CN114090601B (zh) 2021-11-23 2021-11-23 一种数据筛选方法、装置、设备以及存储介质
CN202111394304.7 2021-11-23

Publications (1)

Publication Number Publication Date
WO2023093015A1 true WO2023093015A1 (fr) 2023-06-01

Family

ID=80303226

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099815 WO2023093015A1 (fr) 2021-11-23 2022-06-20 Procédé et appareil de filtrage de données, dispositif et support de stockage

Country Status (2)

Country Link
CN (1) CN114090601B (fr)
WO (1) WO2023093015A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114090601B (zh) * 2021-11-23 2023-11-03 北京百度网讯科技有限公司 一种数据筛选方法、装置、设备以及存储介质
CN117998295A (zh) * 2022-10-31 2024-05-07 维沃移动通信有限公司 数据标注方法、装置、终端设备及网络侧设备

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827169A (zh) * 2019-10-30 2020-02-21 云南电网有限责任公司信息中心 一种基于分级指标的分布式电网业务监控方法
CN111242195A (zh) * 2020-01-06 2020-06-05 支付宝(杭州)信息技术有限公司 模型、保险风控模型训练方法、装置及电子设备
CN111767712A (zh) * 2019-04-02 2020-10-13 北京地平线机器人技术研发有限公司 基于语言模型的业务数据筛选方法和装置、介质、设备
CN112399448A (zh) * 2020-11-18 2021-02-23 中国联合网络通信集团有限公司 无线通讯优化方法、装置、电子设备及存储介质
CN112598326A (zh) * 2020-12-31 2021-04-02 五八有限公司 模型迭代方法、装置、电子设备及存储介质
CN112906902A (zh) * 2020-12-22 2021-06-04 上海有个机器人有限公司 一种基于主动学习技术的机器人数据收集迭代训练方法、系统以及储存介质
CN114090601A (zh) * 2021-11-23 2022-02-25 北京百度网讯科技有限公司 一种数据筛选方法、装置、设备以及存储介质

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8099373B2 (en) * 2008-02-14 2012-01-17 Microsoft Corporation Object detector trained using a working set of training data
US20160071017A1 (en) * 2014-10-15 2016-03-10 Brighterion, Inc. Method of operating artificial intelligence machines to improve predictive model training and performance
CN107122327B (zh) * 2016-02-25 2021-06-29 阿里巴巴集团控股有限公司 一种利用训练数据训练模型的方法和训练系统
CN106778357B (zh) * 2016-12-23 2020-02-07 北京神州绿盟信息安全科技股份有限公司 一种网页篡改的检测方法及装置
CN106780258B (zh) * 2016-12-23 2020-08-14 东方网力科技股份有限公司 一种未成年人犯罪决策树的建立方法及装置
GB201705189D0 (en) * 2017-03-31 2017-05-17 Microsoft Technology Licensing Llc Sensor data processor with update ability
CN108447055A (zh) * 2018-03-26 2018-08-24 西安电子科技大学 基于spl和ccn的sar图像变化检测方法
CN108804512B (zh) * 2018-04-20 2020-11-24 平安科技(深圳)有限公司 文本分类模型的生成装置、方法及计算机可读存储介质
CN110689038B (zh) * 2019-06-25 2024-02-02 深圳市腾讯计算机系统有限公司 神经网络模型的训练方法、装置和医学图像处理系统
CN110544100A (zh) * 2019-09-10 2019-12-06 北京三快在线科技有限公司 基于机器学习的业务识别方法、装置及介质
CN111813931B (zh) * 2020-06-16 2021-03-16 清华大学 事件检测模型的构建方法、装置、电子设备及存储介质
CN112560993A (zh) * 2020-12-25 2021-03-26 北京百度网讯科技有限公司 数据筛选方法、装置、电子设备及存储介质
CN112734195B (zh) * 2020-12-31 2023-07-07 平安科技(深圳)有限公司 数据处理方法、装置、电子设备及存储介质
CN112446441B (zh) * 2021-02-01 2021-08-20 北京世纪好未来教育科技有限公司 模型训练数据筛选方法、装置、设备及存储介质
CN113205880B (zh) * 2021-04-30 2022-09-23 广东省人民医院 基于LogitBoost的心脏疾病预后预测方法及装置
CN113033713B (zh) * 2021-05-24 2021-07-23 天津所托瑞安汽车科技有限公司 事故片段的识别方法、装置、设备及可读存储介质
CN113642659B (zh) * 2021-08-19 2023-06-20 上海商汤科技开发有限公司 一种训练样本集生成的方法、装置、电子设备及存储介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767712A (zh) * 2019-04-02 2020-10-13 北京地平线机器人技术研发有限公司 基于语言模型的业务数据筛选方法和装置、介质、设备
CN110827169A (zh) * 2019-10-30 2020-02-21 云南电网有限责任公司信息中心 一种基于分级指标的分布式电网业务监控方法
CN111242195A (zh) * 2020-01-06 2020-06-05 支付宝(杭州)信息技术有限公司 模型、保险风控模型训练方法、装置及电子设备
CN112399448A (zh) * 2020-11-18 2021-02-23 中国联合网络通信集团有限公司 无线通讯优化方法、装置、电子设备及存储介质
CN112906902A (zh) * 2020-12-22 2021-06-04 上海有个机器人有限公司 一种基于主动学习技术的机器人数据收集迭代训练方法、系统以及储存介质
CN112598326A (zh) * 2020-12-31 2021-04-02 五八有限公司 模型迭代方法、装置、电子设备及存储介质
CN114090601A (zh) * 2021-11-23 2022-02-25 北京百度网讯科技有限公司 一种数据筛选方法、装置、设备以及存储介质

Also Published As

Publication number Publication date
CN114090601A (zh) 2022-02-25
CN114090601B (zh) 2023-11-03

Similar Documents

Publication Publication Date Title
CN113326764B (zh) 训练图像识别模型和图像识别的方法和装置
WO2023093015A1 (fr) Procédé et appareil de filtrage de données, dispositif et support de stockage
WO2020244336A1 (fr) Procédé et dispositif de classification d'alarme, dispositif électronique et support d'informations
CN112507098B (zh) 问题处理方法、装置、电子设备、存储介质及程序产品
US20230215136A1 (en) Method for training multi-modal data matching degree calculation model, method for calculating multi-modal data matching degree, and related apparatuses
CN112883730B (zh) 相似文本匹配方法、装置、电子设备及存储介质
US11954084B2 (en) Method and apparatus for processing table, device, and storage medium
CN113657483A (zh) 模型训练方法、目标检测方法、装置、设备以及存储介质
WO2023093014A1 (fr) Procédé et appareil de reconnaissance de facture, et dispositif et support de stockage
US20240070454A1 (en) Lightweight model training method, image processing method, electronic device, and storage medium
JP7357114B2 (ja) 生体検出モデルのトレーニング方法、装置、電子機器および記憶媒体
JP2023060846A (ja) モデル決定方法、装置、電子機器及びメモリ
US20220198358A1 (en) Method for generating user interest profile, electronic device and storage medium
CN113392920B (zh) 生成作弊预测模型的方法、装置、设备、介质及程序产品
CN114970540A (zh) 训练文本审核模型的方法和装置
WO2024016680A1 (fr) Procédé et appareil de recommandation de flux d'informations et produit programme d'ordinateur
EP4246365A1 (fr) Procédé et appareil d'identification de page web, dispositif électronique et support
WO2023011093A1 (fr) Procédé et appareil d'apprentissage de modèle de tâche, et dispositif électronique et support de stockage
CN114417974B (zh) 模型训练方法、信息处理方法、装置、电子设备和介质
CN113420174B (zh) 难样本挖掘方法、装置、设备以及存储介质
CN114817476A (zh) 语言模型的训练方法、装置、电子设备和存储介质
CN114330576A (zh) 模型处理方法、装置、图像识别方法及装置
CN114610953A (zh) 一种数据分类方法、装置、设备及存储介质
CN114117248A (zh) 数据处理方法、装置及电子设备
CN113590774A (zh) 事件查询方法、装置以及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22897112

Country of ref document: EP

Kind code of ref document: A1