Clock sources, Clock events, sched_clock() and delay timers
-----------------------------------------------------------

This document tries to briefly explain some basic kernel timekeeping
abstractions. It partly pertains to the drivers usually found in
drivers/clocksource in the kernel tree, but the code may be spread out
across the kernel.

If you grep through the kernel source you will find a number of
architecture-specific implementations of clock sources, clockevents and
several likewise architecture-specific overrides of the sched_clock()
function and some delay timers.

To provide timekeeping for your platform, the clock source provides
the basic timeline, whereas clock events fire interrupts at certain points
on this timeline, providing facilities such as high-resolution timers.
sched_clock() is used for scheduling and timestamping, and delay timers
provide an accurate delay source using hardware counters.


Clock sources
-------------

The purpose of the clock source is to provide a timeline for the system that
tells you where you are in time. For example, issuing the command 'date' on
a Linux system will eventually read the clock source to determine exactly
what time it is.

Typically the clock source is a monotonic, atomic counter which provides
n bits that count from 0 to (2^n)-1 and then wrap around to 0 and start over.
It will ideally NEVER stop ticking as long as the system is running. It
may stop during system suspend.

The clock source shall have as high a resolution as possible, and the
frequency shall be as stable and correct as possible as compared to a
real-world wall clock. It should not move unpredictably back and forth in
time or miss a few cycles here and there.

It must be immune to the kind of effects that occur in hardware where e.g.
the counter register is read in two phases on the bus, lowest 16 bits first
and the higher 16 bits in a second bus cycle, with the counter bits
potentially being updated in between, leading to the risk of very strange
values from the counter.

When the wall-clock accuracy of the clock source isn't satisfactory, there
are various quirks and layers in the timekeeping code for e.g. synchronizing
the user-visible time to RTC clocks in the system or against networked time
servers using NTP, but all they basically do is update an offset against
the clock source, which provides the fundamental timeline for the system.
These measures do not affect the clock source per se; they only adapt the
system to its shortcomings.

The clock source struct shall provide means to translate the provided counter
into a nanosecond value as an unsigned long long (unsigned 64-bit) number.
Since this operation may be invoked very often, doing this in a strict
mathematical sense is not desirable: instead the number is taken as close as
possible to a nanosecond value using only the arithmetic operations
multiply and shift, so in clocksource_cyc2ns() you find:

  ns ~= (clocksource * mult) >> shift

You will find a number of helper functions in the clock source code intended
to aid in providing these mult and shift values, such as
clocksource_khz2mult() and clocksource_hz2mult(), which help determine the
mult factor from a fixed shift, and clocksource_register_hz() and
clocksource_register_khz(), which assign both the shift and mult factors
using the frequency of the clock source as the only input.

For really simple clock sources accessed from a single I/O memory location
there is nowadays even clocksource_mmio_init() which will take a memory
location, bit width, a parameter telling whether the counter in the
register counts up or down, and the timer clock rate, and then conjure all
necessary parameters.
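As a rough illustration of the mult/shift arithmetic, the following user-space C sketch picks a mult for a fixed shift (rounding to nearest) and then converts cycles to nanoseconds with one multiply and one shift. The sketch_hz2mult() and sketch_cyc2ns() names and the local NSEC_PER_SEC macro are assumptions for illustration; the rounding follows the same idea as clocksource_hz2mult() but this is not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper: pick mult for a fixed shift so that
 * mult / 2^shift ~= NSEC_PER_SEC / freq_hz, rounded to nearest.
 */
static uint32_t sketch_hz2mult(uint32_t freq_hz, uint32_t shift)
{
	uint64_t tmp;

	assert(freq_hz > 0 && shift < 32);
	tmp = NSEC_PER_SEC << shift;	/* fits: 10^9 * 2^31 < 2^64 */
	tmp += freq_hz / 2;		/* round to nearest instead of down */
	tmp /= freq_hz;
	assert(tmp <= UINT32_MAX);	/* mult must fit in 32 bits */
	return (uint32_t)tmp;
}

/*
 * Hypothetical helper: ns ~= (cycles * mult) >> shift.
 * The caller must keep cycles small enough that cycles * mult
 * does not overflow 64 bits.
 */
static uint64_t sketch_cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

With a 24 MHz counter and a shift of 24, mult comes out as 699050667, and 24000000 cycles convert to 1000000000 ns after truncation. A lower-frequency counter needs a smaller shift, otherwise mult no longer fits in 32 bits, which the second assert catches.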

Since a 32-bit counter at, say, 100 MHz will wrap around to zero after about
43 seconds, the code handling the clock source has to compensate for this.
That is the reason why the clock source struct also contains a 'mask'
member telling how many bits of the source are valid. This way the timekeeping
code knows when the counter will wrap around and can insert the necessary
compensation code on both sides of the wrap point so that the system timeline
remains monotonic.


Clock events
------------

Clock events are the conceptual reverse of clock sources: they take a
desired time specification value and calculate the values to poke into
hardware timer registers.

Clock events are orthogonal to clock sources. The same hardware
and register range may be used for the clock event, but it is essentially
a different thing. The hardware driving clock events has to be able to
fire interrupts, so as to trigger events on the system timeline. On an SMP
system, it is ideal (and customary) to have one such event-driving timer per
CPU core, so that each core can trigger events independently of any other
core.

You will notice that the clock event device code is based on the same basic
idea about translating counters to nanoseconds using mult and shift
arithmetic, and you find the same family of helper functions again for
assigning these values. The clock event driver does not need a 'mask'
attribute, however: the system will not try to plan events beyond the time
horizon of the clock event.


sched_clock()
-------------

In addition to the clock sources and clock events there is a special weak
function in the kernel called sched_clock(). This function shall return the
number of nanoseconds since the system was started. An architecture may or
may not provide an implementation of sched_clock() on its own.
If a local implementation is not provided, the system jiffy counter will be
used as sched_clock().

As the name suggests, sched_clock() is used for scheduling the system,
determining the absolute timeslice for a certain process in the CFS scheduler
for example. It is also used for printk timestamps when you have chosen to
include time information in printk for things like bootcharts.

Compared to clock sources, sched_clock() has to be very fast: it is called
much more often, especially by the scheduler. If you have to trade off
accuracy against speed, you may sacrifice some accuracy compared to the clock
source to make sched_clock() faster. However, it requires some of the same
basic characteristics as the clock source, i.e. it should be monotonic.

The sched_clock() function may wrap only on unsigned long long boundaries,
i.e. after 64 bits. Since this is a nanosecond value this will mean it wraps
after circa 585 years. (For most practical systems this means "never".)

If an architecture does not provide its own implementation of this function,
it will fall back to using jiffies, limiting its resolution to one jiffy,
i.e. 1/HZ seconds. This will affect scheduling accuracy and will likely show
up in system benchmarks.

The clock driving sched_clock() may stop or reset to zero during system
suspend/sleep. This does not matter for its purpose of scheduling events on
the system. However, it may result in interesting timestamps in printk().

The sched_clock() function should be callable in any context, be IRQ- and
NMI-safe, and return a sane value in any context.

Some architectures may have a limited set of time sources and lack a nice
counter to derive a 64-bit nanosecond value, so for example on the ARM
architecture, special helper functions have been created to provide a
sched_clock() nanosecond base from a 16- or 32-bit counter.
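One way to build a 64-bit nanosecond value from a 16- or 32-bit counter, while applying the 'mask' idea from the clock source section, is to keep an epoch: a (cycles, nanoseconds) pair that is refreshed more often than the counter wraps, and to convert only the masked delta since that epoch. The sketch below assumes this approach; struct narrow_clock and the narrow_* functions are hypothetical names, not the API of the ARM helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical epoch-based wrapper around a narrow free-running counter. */
struct narrow_clock {
	uint64_t mask;		/* valid counter bits, e.g. 0xffffffff for 32 bits */
	uint32_t mult;		/* cycles-to-ns multiplier */
	uint32_t shift;		/* cycles-to-ns shift */
	uint64_t epoch_cyc;	/* counter value at the last update */
	uint64_t epoch_ns;	/* nanoseconds accumulated up to epoch_cyc */
};

static void narrow_init(struct narrow_clock *c, unsigned int bits,
			uint32_t mult, uint32_t shift, uint64_t now)
{
	assert(bits > 0 && bits < 64);
	c->mask = (UINT64_C(1) << bits) - 1;
	c->mult = mult;
	c->shift = shift;
	c->epoch_cyc = now & c->mask;
	c->epoch_ns = 0;
}

/* Cycles since the epoch; masking keeps the result correct across one wrap. */
static uint64_t narrow_delta(const struct narrow_clock *c, uint64_t now)
{
	return (now - c->epoch_cyc) & c->mask;
}

/* 64-bit nanosecond value: epoch plus the converted delta. */
static uint64_t narrow_read_ns(const struct narrow_clock *c, uint64_t now)
{
	return c->epoch_ns + ((narrow_delta(c, now) * c->mult) >> c->shift);
}

/* Must be called more often than the counter wraps, to move the epoch forward. */
static void narrow_update(struct narrow_clock *c, uint64_t now)
{
	c->epoch_ns = narrow_read_ns(c, now);
	c->epoch_cyc = now & c->mask;
}
```

With mult = 167772160 and shift = 24 (exactly 10 ns per cycle, i.e. 100 MHz), a 32-bit counter needs narrow_update() to run well within the roughly 43-second wrap period; a 16-bit counter at the same rate wraps after about 655 microseconds.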
Sometimes the same counter that is also used as the clock source is used for
this purpose.

On SMP systems, it is crucial for performance that sched_clock() can be called
independently on each CPU without any synchronization performance hits.
Some hardware (such as the x86 TSC) will cause the sched_clock() function to
drift between the CPUs on the system. The kernel can work around this by
enabling the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK option. This is another aspect
that makes sched_clock() different from the ordinary clock source.


Delay timers (some architectures only)
--------------------------------------

On systems with variable CPU frequency, the various kernel delay() functions
will sometimes behave strangely. These delays usually use a busy loop to
delay a certain number of jiffy fractions using a "lpj" (loops per jiffy)
value, calibrated on boot.

Let's hope that your system is running at maximum frequency when this value
is calibrated: if it is, then when the frequency is later geared down to half
the full frequency, any delay() will be twice as long. Usually this does not
hurt, as you're commonly requesting that amount of delay *or more*. But
basically the semantics are quite unpredictable on such systems.

Enter timer-based delays. Using these, a timer read may be used instead of
a hard-coded loop for providing the desired delay.

This is done by declaring a struct delay_timer and assigning the appropriate
function pointers and rate settings for this delay timer.

This is available on some architectures like OpenRISC or ARM.
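A timer-based delay can be sketched as a busy-wait that polls a free-running counter until enough cycles have elapsed. The structure below only loosely follows the description above (a read function pointer plus a rate); struct delay_timer_sketch, its fields, timer_udelay() and the simulated fake_read() counter are hypothetical and exist only so the sketch runs in user space.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical delay timer description: a counter read hook plus its rate. */
struct delay_timer_sketch {
	uint32_t (*read_counter)(void);	/* free-running 32-bit up-counter */
	uint32_t freq_hz;		/* counter rate in Hz */
};

/* Simulated counter so the sketch runs in user space: each read advances it by 5. */
static uint32_t fake_counter;

static uint32_t fake_read(void)
{
	fake_counter += 5;
	return fake_counter;
}

/*
 * Busy-wait for at least usecs microseconds of counter time.
 * Returns the number of counter cycles actually waited.
 */
static uint32_t timer_udelay(const struct delay_timer_sketch *t, uint32_t usecs)
{
	uint64_t want;
	uint32_t start, waited;

	assert(t->freq_hz > 0);
	/* Round up so the delay is never shorter than requested. */
	want = ((uint64_t)usecs * t->freq_hz + 999999u) / 1000000u;
	assert(want < UINT32_MAX);	/* must fit within one counter wrap */

	start = t->read_counter();
	do {
		/* Unsigned 32-bit subtraction stays correct across a wrap. */
		waited = (uint32_t)(t->read_counter() - start);
	} while (waited < want);
	return waited;
}
```

Because the wait is measured in counter cycles rather than loop iterations, its length no longer depends on the current CPU frequency, which is the problem with lpj-based loops described above.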