Per-task statistics interface
-----------------------------


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is thread group
leader - a process is deemed alive as long as it has any task belonging to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks allows the data received by one listener
to be limited and assists in flow control over the netlink interface and is
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.
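
The cpumask given with a register command is an ASCII string of
comma-separated cpu ranges, for example "1-3,5,7-8". As a minimal sketch, a
listener could build that string from a list of cpu numbers as shown here.
format_cpumask() is a hypothetical helper and is not part of taskstats.h or
getdelays.c; it assumes the input list is sorted and free of duplicates.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a sorted, duplicate-free list of cpu numbers as comma-separated
 * ranges, e.g. {1,2,3,5,7,8} -> "1-3,5,7-8".
 * Returns 0 on success, -1 if buf is too small.
 * Hypothetical helper (an assumption for illustration only).
 */
static int format_cpumask(const int *cpus, size_t n, char *buf, size_t len)
{
    size_t used = 0;
    size_t i = 0;

    assert(buf != NULL && len > 0);
    memset(buf, 0, len);

    while (i < n) {
        size_t j = i;
        int w;

        /* Extend the run while the next cpu number is consecutive. */
        while (j + 1 < n && cpus[j + 1] == cpus[j] + 1)
            j++;

        if (j == i)
            w = snprintf(buf + used, len - used, "%s%d",
                         used ? "," : "", cpus[i]);
        else
            w = snprintf(buf + used, len - used, "%s%d-%d",
                         used ? "," : "", cpus[i], cpus[j]);

        if (w < 0 || (size_t)w >= len - used)
            return -1;  /* output would be truncated */

        used += (size_t)w;
        i = j + 1;
    }
    return 0;
}
```

The resulting string would then be placed in the payload of the register
(or deregister) cpumask attribute.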

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+


The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
the task/process for which userspace wants statistics.

Commands to register/deregister interest in exit data from a set of cpus
consist of one attribute, of type
TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the
attribute payload. The cpumask is specified as an ascii string of
comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8
the cpumask would be "1-3,5,7-8".
If userspace forgets to deregister interest
in cpus before closing the listening socket, the kernel cleans up its interest
set over time. However, for the sake of efficiency, an explicit deregistration
is advisable.

2. Response for a command: sent from the kernel in response to a userspace
command. The payload is a series of three attributes of type:

a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute that contains no payload but
indicates that a pid/tgid will be followed by some stats.

b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
are being returned.

c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
   series of attributes of the following type:

a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
b) TASKSTATS_TYPE_PID: contains exiting task's pid
c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process


per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task stats, within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the process level data accumulated also
gets sent to userspace (along with the per-task data).

When a user queries per-tgid data, the stats of all live threads in the group
are summed and added to the accumulated total for previously exited threads of
the same thread group.

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
each listener. In the extreme case, there could be one listener for each cpu.
Users may also consider setting the cpu affinity of the listener to the subset
of cpus to which it listens, especially if they are listening to just one cpu.

Despite these measures, if userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.

----
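
As a rough illustration of the first flow-control measure above, a listener
could enlarge its socket receive buffer with the standard SO_RCVBUF socket
option. grow_rcvbuf() is a hypothetical helper, not taken from getdelays.c;
the sketch assumes fd is an already opened socket and that the system-wide
limits permit the requested size.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Request a receive buffer of at least `want` bytes on socket fd.
 * Returns the size in effect afterwards, or -1 on error.
 * Hypothetical helper (an assumption for illustration only).
 */
static int grow_rcvbuf(int fd, int want)
{
    int cur = 0;
    socklen_t len = sizeof(cur);

    assert(want > 0);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &cur, &len) < 0)
        return -1;
    if (cur >= want)
        return cur;     /* already large enough */

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want)) < 0)
        return -1;

    /* Read the value back: the kernel may clamp or adjust the request. */
    len = sizeof(cur);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &cur, &len) < 0)
        return -1;
    return cur;
}
```

Because the kernel may not grant the full request, the caller should compare
the returned size with the size it asked for, and still be prepared to see
ENOBUFS under a heavy rate of task exits.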