Background
A thread, also called a lightweight process (LWP), is a single sequential flow of control within a process and is the smallest unit of program execution. In an operating system that supports threads, the process is generally the basic unit of resource allocation, while the thread is the basic unit of independent execution and independent scheduling. Threads may execute concurrently: multiple threads within a process may execute concurrently, and threads in different processes may also execute concurrently. In particular, in a multi-core computer system, such as a computer system with multiple CPU cores, threads on different cores may likewise execute concurrently.
When multiple threads execute concurrently, they often need to access the same data; such data is shared among the threads. When multiple threads access shared data, the integrity of the shared data must be guaranteed: for example, two threads must not modify the shared data at the same time, and one thread must not read data that another thread has only half modified. The classical approach is a lock mechanism. For example, a read lock is applied to data while a thread performs a read operation on it, and a write lock is applied while a thread performs a write operation. Before a thread reads a piece of data, it adds a read lock to the data, and after the read operation finishes, it releases the read lock. Similarly, before a thread performs a write operation on a piece of data, it adds a write lock to the data, and after the write operation completes, it releases the write lock. A variable read_ref is typically used as the reference count of reader threads, and a variable writer_ID is used to record the ID of the writer thread.
Multiple read locks may be held at once for read operations performed on the same data by different threads. For example, if thread 1 is to read a piece of data, it first adds a read lock before reading, specifically by incrementing the value of read_ref by 1 (e.g., read_ref is of integer type with an initial value of 0), and then reads the data. While thread 1 is reading, thread 2 also performs a read operation on the same data, likewise incrementing read_ref by 1 and then reading; the value of read_ref is now 2. After the read operation of thread 1 completes, read_ref is decremented by 1 and thread 1's read lock is released; the value of read_ref is now 1. After that, thread 2 finishes its read operation, decrements read_ref by 1, and releases its read lock; the value of read_ref is now 0. Read locks on the same data can thus coexist, i.e., read locks are shared.
For write operations performed on the same data by different threads, a write lock can be held only once at a time. For example, if thread 1 is to write a piece of data, it first adds a write lock before writing, specifically by updating the value of writer_ID to the ID of thread 1 (e.g., writer_ID is of integer type with an initial value of 0, and no thread has ID 0), and then writes the data. While thread 1 is writing, thread 2 also attempts a write operation on the same data; but because the value of writer_ID is not 0 at this point, thread 2 cannot add a write lock and cannot write the data. After the write operation of thread 1 completes, the lock is released, i.e., the value of writer_ID is updated to 0. After its earlier failure to add the write lock, thread 2 waits for a period of time, finds that writer_ID is now 0, and can add the write lock: it updates writer_ID to the ID of thread 2 and then writes the data. After the write operation of thread 2 completes, the lock is released, i.e., the value of writer_ID is updated to 0. As can be seen, write locks on the same data cannot coexist; write locks are thus mutually exclusive.
In addition, the write lock and the read lock are mutually exclusive: at any time, a write lock cannot be added to data that already holds a read lock, and a read lock cannot be added to data that already holds a write lock. Thus, before a thread reads data, it checks whether the data's writer_ID value is 0. If it is 0, the read operation may proceed; if not, the thread must wait for writer_ID to become 0. Similarly, before a thread writes data, it checks whether the data's read_ref value is 0. If it is 0, the write operation may proceed; if not, the thread must wait for read_ref to become 0. In fact, when adding a read lock, to guard against another thread adding a write lock between the check of the writer_ID value and the corresponding read operation, i.e., to avoid the conflict detection failing in that window, the thread checks again whether writer_ID is 0 after incrementing read_ref by 1; only if it is still 0 does it perform the read operation. Similarly, when adding a write lock, to guard against another thread adding a read lock between the check of the read_ref value and the corresponding write operation, the thread checks again whether read_ref is 0 after updating writer_ID to its own ID; only if it is still 0 does it perform the write operation.
The operations by which a thread changes the values of read_ref and writer_ID are atomic operations. Atomic operations are typically instructions provided by the CPU that execute atomically: while one thread executes an atomic operation, it cannot be interrupted by other threads, and execution cannot be switched to another thread. In other words, once started, an atomic operation runs until it completes.
In the course of implementing the present application, the inventors found that the prior art has at least the following problems:
in a multi-core computer system, threads on different cores may perform read and write operations on the same data. In particular, it is common for a large number of read operations, and no write operation, to be performed on the same data over a period of time. Each core typically has a corresponding cache, and each core maintains a read_ref value in its cache. According to the prior art, the read_ref values in the caches of all cores must be kept consistent. Thus, in a multi-core computer system, whenever the read_ref value in one core's cache changes, that core communicates with the other cores to notify them of the change, and the other cores, upon receiving the notification, update the read_ref value in their own caches.
Thus, in the prior art, when threads on different cores read the same data, the atomic operation that changes the read_ref value in each core's cache takes considerable time because inter-core communication takes time, and execution efficiency is therefore low.
Detailed Description
The embodiments of the present application provide a read lock operation method, a write lock operation method, and corresponding systems.
To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, and not all, of the embodiments of the present application. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
An embodiment of the read lock operation method of the present application is described first.
FIG. 1 is a flow chart illustrating one embodiment of the read lock operation method of the present application. As shown in FIG. 1, the method of this embodiment includes:
S100: the private reference count corresponding to each core is set.
Modern CPUs employ a number of techniques to hide the latency of memory accesses; in the time it takes to read or write memory data, a CPU can execute hundreds of instructions. A multi-level Static Random Access Memory (SRAM) cache (hereinafter simply "cache") is the main means of reducing the impact of this latency.
For example, in a dual-core computer system, core1 and core2 have corresponding cache1 and cache2, respectively. The cache may be the cache of a computing core. A CPU often has a first-level cache and a second-level cache, and some CPUs also have a third-level cache. For a CPU with a first-level and a second-level cache, data to be operated on is read from memory into the second-level cache, from the second-level cache into the first-level cache, and then from the first-level cache into the CPU for execution. Generally, the closer a memory is to the CPU, the faster but more expensive it is; the farther from the CPU, the slower but cheaper. Data frequently read and written by the CPU is therefore generally stored in memory close to the CPU, improving the utilization of this costly memory.
In this step, the private reference count (private_read_ref) may preferably be placed in the cache, for example in the CPU's first-level cache. Of course, depending on the CPU architecture and the capacities of the memories at the different levels, the private reference count may also be placed in the second-level cache, or in another memory whose read speed is of the same order as the CPU's atomic operation speed; the embodiments of the present application do not specifically limit this. In fact, caches are usually transparent to a program: the program cannot control whether a variable is placed in the cache, or in which level of cache. When a program needs to operate on a variable, the CPU checks whether the variable is in the first-level cache and, if so, reads it directly from there; if not, it checks the second-level cache: if the variable is there, it is loaded from the second-level cache into the first-level cache, and if not, it is loaded from memory into the second-level and first-level caches.
In the prior art, read operations by different threads on the same data involve the same reference count, i.e., they all operate on one shared count. Following common usage in the field, this reference count read_ref is called the global read_ref. Specifically, different threads of the same computing core, whether in the same process or in different processes, perform increment (++) or decrement (--) operations on the same global read_ref when reading the same data. If, in a multi-core computer system, only one global reference count is still used for the multiple cores, the problems analyzed in the background arise.
In this step, a private reference count is set for each core. For example, for core1, a corresponding private reference count is set, e.g., read_ref_core1; for core2, a corresponding private reference count is likewise set, e.g., read_ref_core2; and so on for any other cores.
The private reference count corresponding to each core need not be permanently (or fixedly) assigned; it may be assigned temporarily. For example, it may be allocated before a thread of the core first adds a read lock to the data, and released after the core's threads finish reading the data. Specifically, an array of private reference counts, [read_ref], may be set up. Before a thread of a core first locks the data, it applies for one entry of the [read_ref] array. The array can be made large enough, each entry can be of integer (int) type, and the initial value of each entry can be initialized to 0. Of course, for read operations on a given piece of data, the entries of the [read_ref] array may also be fixedly assigned to the cores.
Preferably, in actual operation, each entry of the [read_ref] array may be allocated its own cache line in the cache. The cache line is the smallest unit by which a multi-core CPU maintains cache consistency, and is also the actual unit of memory exchange. On most platforms a cache line is larger than 8 bytes, most commonly 64 bytes. If the [read_ref] array is defined as int type, each element occupies 8 bytes, so one cache line could store 8 read_ref values. If more than one read_ref is stored in a cache line, conflicts arise when different elements of the array are operated on. To avoid such conflicts, each read_ref in the [read_ref] array may be stored in its own cache line: for example, each entry of the [read_ref] array may be declared as a structure of 64 bytes. Each entry then occupies one cache line exclusively, avoiding conflicts during operation.
S110: in the process of threads of different cores reading the same data, the read lock is added and released using the private reference counts corresponding to the different cores.
For example, suppose a computer system includes two computing cores, core1 and core2, and both core1 and core2 read the same data. According to S100, core1 may apply for one private reference count, labeled read_ref_core1; similarly, core2 may apply for one private reference count, labeled read_ref_core2.
In this way, when a thread of core1 reads the data, it first adds the read lock; that is, it performs an add-1 operation on core1's private reference count read_ref_core1, which changes from the initial value 0 to 1. The thread of core1 then reads the data. After the read operation completes, it releases the read lock; that is, it performs a subtract-1 operation on read_ref_core1, which changes from 1 back to 0.
Similarly, when a thread of core2 reads the data, it first adds the read lock; that is, it performs an add-1 operation on core2's private reference count read_ref_core2, which changes from the initial value 0 to 1. The thread of core2 then reads the data. After the read operation completes, it releases the read lock; that is, it performs a subtract-1 operation on read_ref_core2, which changes from 1 back to 0.
With the above approach of the embodiments of the present application, when threads of different cores read the same data, each independently operates on the private reference count corresponding to its own core. The private reference counts of different cores need not be synchronized between cores, so execution efficiency is improved. Moreover, the scalability of the read lock is improved: no matter how many cores' threads add and release the read lock simultaneously, the time to add and release it hardly increases.
In addition, because the private reference counts of different cores need not be synchronized between cores, the inter-core communication process is eliminated, saving the bandwidth, time, and other overhead that inter-core communication requires.
S110 may specifically include the following steps:
S111: in the process of a thread of the first core reading the data, the read lock is added and released using the private reference count corresponding to the first core.
S112: in the process of a thread of the second core reading the data, the read lock is added and released using the private reference count corresponding to the second core.
Adding the read lock with the private reference count specifically includes: the thread of each core performs an add-1 operation on the private reference count corresponding to that core. Releasing the read lock with the private reference count specifically includes: the thread of each core performs a subtract-1 operation on the private reference count corresponding to that core. Between adding and releasing the read lock, the thread of each core may read the data.
It should be noted that when several different threads of the same core read the same data, they may add and release the read lock with the same private reference count. For example, when thread 1 of core1 reads the data, it first adds the read lock: core1's private reference count read_ref_core1 is incremented by 1, changing from the initial value 0 to 1, and thread 1 of core1 then reads the data. While thread 1 of core1 is reading, thread 2 of core1 also performs a read operation on the same data, likewise incrementing read_ref_core1 by 1 before reading; the value of read_ref_core1 is now 2. After the read operation of thread 1 of core1 completes, read_ref_core1 is decremented by 1 to release that read lock; its value becomes 1. After that, thread 2 of core1 finishes its read operation and decrements read_ref_core1 by 1 to release its read lock; the value becomes 0. Thus, for the same core, no matter how many threads add and release the lock simultaneously, the time to add and release it hardly increases.
It should be further noted that, to avoid data inconsistency, the read lock in the embodiments of the present application is still mutually exclusive with the write lock. For example, in a multi-core computer system, a global write lock is set, e.g., global_writer_id. If a thread is to write data, it adds the write lock to the data before performing the write. For example, thread 1 of some core updates the value of global_writer_id to the ID of thread 1 (e.g., global_writer_id is of integer type with an initial value of 0, and no thread has ID 0), and then writes the data. During the write, a thread of some core (the same core as the write-locking thread, or a different one), here called thread 2, attempts to read the same data and applies for the private reference count corresponding to its core, which is initialized to 0. However, because the value of global_writer_id is not 0 at this point, thread 2 cannot add the read lock and cannot read the data. After the write operation of thread 1 completes, the lock is released, i.e., the value of global_writer_id is updated to 0. After its earlier failure to add the read lock, thread 2 waits for a period of time, finds that global_writer_id is now 0, and can add the read lock; alternatively, thread 2 may retry adding the read lock at regular intervals after the failure, succeeding once the value of global_writer_id is 0. Thread 2 then increments the private reference count it applied for by 1, so that its value is 1, and reads the data. After the read operation of thread 2 completes, the lock is released, i.e., the corresponding private reference count is decremented by 1 back to 0.
Based on this, before the threads of the different cores in S110 add the read lock with their corresponding private reference counts, the method may further include:
S101: the threads of the different cores check whether the data is in the process of a write operation, and if the check result is no, execution of S110 is triggered.
Whether a write operation is in progress may be determined by checking the state of the global write lock. For example, it may be checked whether the global write lock is 0, and S110 is executed when the check result is 0.
Conversely, if the value of the global write lock is not 0, a write operation on the data is currently in progress. Because the write lock and the read lock are mutually exclusive, the read lock cannot be added to the data and the data cannot be read. In this case, S110 is executed only after the global write lock becomes 0.
S101 may be executed after S100 or before S100.
It should be noted that, when adding the read lock, to guard against another thread adding the write lock for a write operation between the check of the global_writer_id value and the corresponding read operation, i.e., to avoid the conflict detection failing in that window, the thread checks again whether global_writer_id is 0 after incrementing its core's private reference count by 1; only if it is still 0 is the read operation performed.
One embodiment of the read lock operating system of the present application is described below. FIG. 2 shows a block diagram of this embodiment of the system.
As shown in FIG. 2, the read lock operating system in an embodiment of the present application includes a first computing core 11a, a second computing core 11b, a first cache unit 12a, a second cache unit 12b, and a data unit 13, where each computing core corresponds to a unique cache unit.
Wherein:
a data unit 13 for storing data;
a first cache unit 12a, configured to store a first private reference count allocated for the first computing core;
a second cache unit 12b, configured to store a second private reference count allocated for the second computing core;
a first computing core 11a and a second computing core 11b, configured to read the same data in the data unit; and,
in the process of a thread of the first computing core 11a reading the data, the read lock is added and released using the private reference count corresponding to the first core;
and in the process of a thread of the second computing core 11b reading the data, the read lock is added and released using the private reference count corresponding to the second core.
Wherein:
the first cache unit 12a may be a cache of a first computing core;
the second cache unit 12b may be a cache of a second computing core.
As in the foregoing method embodiment, the private reference count corresponding to each core may be allocated before a thread of the core first adds a read lock to the data, or may be fixedly allocated. For example, an array of private reference counts, [read_ref], may be set up; before a thread of a core first locks the data, it applies for one entry of the array. The array can be made large enough, each entry can be of integer (int) type, and each entry can be initialized to 0. Of course, for read operations on a given piece of data, the entries of the [read_ref] array may also be fixedly assigned to the cores. Preferably, in actual operation, each entry of the [read_ref] array may be allocated its own cache line (cacheline) in the cache. The cache line is the smallest unit by which a multi-core CPU maintains cache consistency, and is also the actual unit of memory exchange. On most platforms a cache line is larger than 8 bytes, most commonly 64 bytes. If the [read_ref] array is defined as int type, each element occupies 8 bytes, so one cache line could store 8 read_ref values, and conflicts would arise when different elements of the array are operated on. To avoid such conflicts, each read_ref in the [read_ref] array may be stored in its own cache line; for example, each entry may be declared as a structure of 64 bytes, so that it occupies one cache line exclusively, avoiding conflicts during operation.
In combination with the above, in an embodiment of the read lock operating system of the present application, caches of different cores may correspond to different cache lines. For example, the first cache unit corresponds to a first cache line, and the second cache unit corresponds to a second cache line.
In this embodiment, the read lock operating system may further include a checking unit 14, configured to check whether the data is in the process of a write operation and, if not, trigger each computing core to add and release the read lock with its corresponding private reference count.
Adding the read lock with the private reference count includes: the thread of each core performs an add-1 operation on the private reference count corresponding to that core. Releasing the read lock with the private reference count specifically includes: the thread of each core performs a subtract-1 operation on the private reference count corresponding to that core. Between adding and releasing the read lock, the thread of each core may read the data.
One embodiment of a write lock operation method of the present application is described below. FIG. 3 shows a flow chart of this embodiment. As shown in FIG. 3, the write lock operation method of this embodiment includes:
S300: before the data is written, it is judged whether any of the computing cores is in the process of a read operation on the data.
Judging whether any computing core is reading the data may specifically be implemented by traversing the private reference counts of the computing cores corresponding to the data and checking whether each is 0. If all are 0, the data is not in the process of a read operation; if any is not 0, the data is in the process of a read operation.
S310: before the data is written, whether the data is in another writing operation process is judged.
S310 may specifically be implemented by judging whether the global write lock for the data is 0. If it is 0, the data is not in the process of another write operation; if not 0, another write operation is in progress.
S320: if the judgment results of S300 and S310 are both negative, the write lock is added and released using the global write lock in the process of performing the write operation on the data.
Specifically, the write lock is added to the data before the write operation is performed, and released after the write operation completes.
The global write lock in S320 is, for example, the variable global_writer_id. Adding the write lock updates the value of global_writer_id to the ID of the writing thread; releasing the write lock updates the value of global_writer_id to 0.
Similarly, when adding the write lock, to guard against another thread adding a read lock for a read operation between the check of each core's private reference count and the corresponding write operation, i.e., to avoid the conflict detection failing in that window, the thread checks again whether every core's private reference count is 0 after updating the value of global_writer_id to its own ID; only if all are still 0 is the write operation performed.
The write lock operation method may be based on the foregoing read lock operation method or read lock operating system.
One embodiment of the write lock operating system of the present application is described below. FIG. 4 shows a block diagram of this embodiment. As shown in FIG. 4, the write lock operating system of this embodiment includes:
a data unit 3 for storing data;
a first judging unit 21a, configured to judge, before a write operation is performed on data, whether any of the computing cores is in the process of a read operation on the data;
a second judging unit 21b, configured to judge, before the write operation is performed on the data, whether the data is in the process of another write operation;
and a write lock adding and releasing unit 22, configured to, when the judgment results of the first judging unit and the second judging unit are both negative, add and release the write lock using the global write lock in the process of performing the write operation on the data.
Specifically, the write lock is added to the data before the write operation is performed, and released after the write operation completes.
The global write lock is, for example, global_writer_id. Adding the write lock updates the value of global_writer_id to the ID of the writing thread; releasing the write lock updates the value of global_writer_id to 0.
The write lock operating system may be based on the foregoing read lock operation method or read lock operating system.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
While the present application has been described with examples, those of ordinary skill in the art will appreciate that there are numerous variations and permutations of the present application without departing from the spirit of the application, and it is intended that the appended claims encompass such variations and permutations without departing from the spirit of the application.