Design goals

We want to catch kernel oopses, binary program crashes (coredumps)
and crashes in interpreted languages (Python exceptions, maybe more
in the future).

We want to support the following use cases:

* Home/office user with minimal administration

In this scenario, the user expects that abrt will work "out of the box"
with minimal configuration. It is sufficient if crashes just show
a GUI notification, and the user can invoke a GUI tool to process
the crash and report it to bugzilla etc. The configuration (like
bugzilla address, username, password) needs to be done via GUI dialogs
from the same GUI tool.

* Standalone server

The server is installed by an admin. It may lack a GUI. The admin is
willing to do somewhat more complex configuration. Crashes should be
recorded, and either processed at once or reported to the admin
by email etc. The admin may log in and manually request crash(es)
to be processed and reported, using GUI or CLI tools.

* Mission critical servers, server farms etc.

Admins are expected to be competent and willing to set up complex
configurations. They might want to avoid any complex crash processing
on the servers - for example, it does not make much sense and/or
can be considered insecure to download debuginfo packages
to such servers. Admins may want to send "raw" crash dumps
to dedicated server(s) for processing (backtrace, etc).


Design

The abrt design should be flexible enough to accommodate all
of the above usage scenarios.

The description below is not what abrt does now. These are (currently
incomplete) design notes on how we want it to achieve the design goals.

Since we currently do not know how to dump an oops on demand,
we can only poll for it. There is a small daemon which polls
the kernel message buffer and dumps oopses when it sees them.
The dump is written into /var/spool/abrt/DIR. After this, the daemon
spawns "abrt-process -d /var/spool/abrt/DIR", which processes it
according to the configuration in /etc/abrt/*.conf.

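A minimal sketch of the polling approach described above, in Python.
The marker strings, the "kernel" file name, the "kerneloops-" directory
prefix and all function names are assumptions made for illustration and
are not abrt's actual implementation; only the spool location and the
"abrt-process -d DIR" invocation are taken from these notes.

```python
import os
import re
import subprocess
import tempfile
import time

# Assumption: a deliberately small set of start markers. Real oops
# detection has to recognise many more message shapes.
OOPS_START = re.compile(r"Oops: |BUG: unable to handle kernel|general protection fault")
OOPS_END = "---[ end trace"


def extract_oopses(log_lines):
    """Return the text of every oops found in the given kernel log lines."""
    oopses, current = [], None
    for line in log_lines:
        if current is None:
            if OOPS_START.search(line):
                current = [line]          # a new oops begins here
        else:
            current.append(line)
            if OOPS_END in line:
                oopses.append("\n".join(current))
                current = None            # oops is complete
    if current is not None:               # log ended in the middle of an oops
        oopses.append("\n".join(current))
    return oopses


def dump_oops(spool_dir, text):
    """Write one oops into a fresh directory under spool_dir, return its path."""
    dump_dir = tempfile.mkdtemp(prefix="kerneloops-", dir=spool_dir)
    # Hypothetical file name; the notes only say the dump lands in DIR.
    with open(os.path.join(dump_dir, "kernel"), "w") as f:
        f.write(text)
    return dump_dir


def process_command(dump_dir):
    """Command line the daemon would spawn for a finished dump."""
    return ["abrt-process", "-d", dump_dir]


def poll_forever(spool_dir="/var/spool/abrt", interval_sec=10):
    """Poll the kernel message buffer and hand new oopses to abrt-process."""
    seen = set()
    while True:
        out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        for oops in extract_oopses(out.splitlines()):
            if oops not in seen:          # do not dump the same oops twice
                seen.add(oops)
                subprocess.Popen(process_command(dump_oops(spool_dir, oops)))
        time.sleep(interval_sec)
```

Keeping the set of already seen oopses in memory is the simplest way to
avoid duplicate dumps between polls; a real daemon would also need to
survive restarts and buffer wrap-around.
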
In order to catch binary crashes, we install a handler for them
in /proc/sys/kernel/core_pattern (by setting it to
"|/usr/libexec/abrt-hook-ccpp /var/spool/abrt %p %s %u").
When a process dumps core, the dump is written into
/var/spool/abrt/DIR. After this, abrt-hook-ccpp spawns
"abrt-process -d /var/spool/abrt/DIR" and terminates.

When a python program crashes, it invokes an internal python subroutine
which dumps crash info into ~/abrt/spool/DIR.
[this is a tentative plan, currently we dump in /var/spool/abrt/DIR]
After this, it spawns "abrt-process -d ~/abrt/spool/DIR"
and terminates.

[Problem: dumping to /var/spool/abrt/DIR needs a world-writable
/var/spool/abrt and allows a user to go way over his disk quota.
Dumping to ~/abrt/spool/DIR makes it difficult to present a list
of all crashes which happened on the machine - for example,
root-owned processes cannot even access user data in ~user/*
if /home is on NFS4...
]

When a user (admin) wants to see the list of dumped crashes and
process them, he runs abrt-gui or abrt-cli. These programs
perform a dbus call to "com.redhat.abrt" on the system dbus.
If there is no program with this name on it, dbus autostart
will invoke "abrt-process", which registers "com.redhat.abrt"
and processes the call(s).

abrt-process will terminate after a timeout (a few minutes)
if no new dbus calls arrive.

The key dbus calls served by abrt-process are:

- GetCrashInfos(): returns a vector_map_crash_data_t
  (vector_map_vector_string_t) of crashes for the given uid
        v[N]["executable"/"uid"/"kernel"/"backtrace"][N] = "contents"
  [see above the problem with producing this list]
- CreateReport(UUID): starts creating a report for
  /var/spool/abrt/DIR with this UUID. Returns job id (uint64).
  After it returns, when the report creation thread has finished,
  the JobDone(client_dbus_ID,UUID) dbus signal is emitted.
  [Problem: how to do privileged plugin specific actions?]
  Solution: if a plugin needs access to some root-only accessible
  dir, then abrt should be run by root anyway
        - debuginfo gets installed using pk-debuginfo-install,
          which takes care of privileges itself, so no problem here
- GetJobResult(UUID): returns map_crash_data_t (map_vector_string_t)
- Report(map_crash_data_t (map_vector_string_t)):
  "Please report this crash": calls Report() of all registered
  reporter plugins.
  Returns report_status_t (map_vector_string_t) - the status
  of each call
- DeleteDebugDump(UUID): deletes the corresponding
  /var/spool/abrt/DIR. Returns bool


Development plan

Since the current code does not match the planned design, we need to
gradually change the code to "morph" it into the desired shape.

Done:

* Make abrtd dbus startable.
* Add -t TIMEOUT_SEC option to abrtd. {done}
* Make abrt-gui start abrtd on demand, so that abrt-gui can be started
  even if abrtd does not run at the moment.
  (doesn't work in some cases!)

Planned steps:

* turn the kerneloops plugin into a separate daemon (convert it
  to a hook and get rid of "cron plugins", which were a wrong idea
  from the beginning)
    - and make it a service (write an initscript)
* make the C/C++ hook start from an init script
    - the init script would run ccpp-hook --init, which should just
      set the core_pattern; this is now done by the C analyzer plugin
* hooks will start the daemon on demand using dbus
    - I'm not sure this is a good idea, but since dbus is becoming
      "un-installable" (impossible to remove) on Fedora,
      it's probably ok
* simplify abrt.conf:
    - move all plugin related info to plugins/.conf
        - enabled, action association, etc ...
    - make abrtd parse plugins/*.conf and set the config options
      that it understands
        - this will fix the case when this is in abrt.conf:
                [Cron]
                KerneloopsScanner = 120
          because this should be in plugins/kerneloops.conf and thus
          shouldn't exist if kerneloops-addon is not installed
* ???
* ???
* ???
* ???
* ???
* Take over the world
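
The "simplify abrt.conf" step above could be sketched roughly as follows,
in Python. This assumes that abrt.conf and plugins/*.conf are plain
INI-style files in which every option lives under a section header, which
may not hold for the real files; load_config() and cron_jobs() are
hypothetical helper names, not part of abrt.

```python
import configparser
import glob
import os


def load_config(conf_dir):
    """Read abrt.conf plus every plugins/*.conf found under conf_dir."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names such as KerneloopsScanner as-is
    files = [os.path.join(conf_dir, "abrt.conf")]
    files += sorted(glob.glob(os.path.join(conf_dir, "plugins", "*.conf")))
    parser.read(files)        # files that do not exist are silently skipped
    return parser


def cron_jobs(parser):
    """Return {job name: interval in seconds} from the [Cron] section, if any."""
    if not parser.has_section("Cron"):
        return {}
    return {name: int(value) for name, value in parser.items("Cron")}
```

With this layout the [Cron] entry for KerneloopsScanner exists only while
plugins/kerneloops.conf is installed, so removing the addon removes
the setting as well.
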