Design goals

We want to catch kernel oopses, binary program crashes (coredumps) and
interpreted language crashes (Python exceptions, maybe more in the
future).

We want to support the following use cases:

* Home/office user with minimal administration

  In this scenario, the user expects that abrt will work "out of the box"
  with minimal configuration. It is sufficient if a crash just shows
  a GUI notification, and the user can invoke a GUI tool to process
  the crash and report it to bugzilla etc. The configuration (like
  bugzilla address, username, password) needs to be done via GUI dialogs
  from the same GUI tool.

* Standalone server

  The server is installed by an admin. It may lack a GUI. The admin is
  willing to do somewhat more complex configuration. Crashes should be
  recorded, and either processed at once or reported to the admin
  by email etc. The admin may log in and manually request crash(es)
  to be processed and reported, using GUI or CLI tools.

* Mission critical servers, server farms etc.

  Admins are expected to be competent and willing to set up complex
  configurations. They might want to avoid any complex crash processing
  on the servers - for example, it does not make much sense and/or
  can be considered insecure to download debuginfo packages
  to such servers. Admins may want to send "raw" crash dumps
  to dedicated server(s) for processing (backtrace, etc).


Design

Abrt design should be flexible enough to accommodate all
of the above usage scenarios.

Since currently we do not know how to dump an oops on demand,
we can only poll for it. There is a small daemon, abrt-dump-oops,
which polls the syslog file and saves oopses when it sees them.
The oops dump is written into /var/spool/abrt/DIR.

[TODO?
abrt-dump-oops spawns "abrt-handle-crashdump -d /var/spool/abrt/DIR"
which processes it according to configuration in /etc/abrt/*.conf]

In order to catch binary crashes, we install a handler for them
in /proc/sys/kernel/core_pattern (by setting it to
"|/usr/libexec/abrt-hook-ccpp /var/spool/abrt ....").
When a process dumps core, the dump is written into /var/spool/abrt/DIR.

[TODO? after this, abrt-hook-ccpp spawns
"abrt-handle-crashdump -d /var/spool/abrt/DIR"]

Then abrt-hook-ccpp terminates.

When a python program crashes, it invokes an internal python subroutine
which connects to abrtd via /var/run/abrt/abrt.socket and sends
crash data to abrtd. abrtd creates the dump dir /var/spool/abrt/DIR.

The abrtd daemon watches /var/spool/abrt for new directories.
When a new directory is noticed, abrtd runs the "post-create" event
on it, and emits a dbus signal. This dbus signal is used
to inform online users about the new crash: if abrt-applet
is running in an X session, it sees the signal and starts blinking.

[The above scheme is somewhat suboptimal. It's stupid that abrtd
uses inotify to watch for crashes. Instead, the programs which create
crashes can trigger their initial ("post-create") processing themselves]

Crashes conceptually go through "events" in their lives.
Apart from the "post-create" event described above, they may have
an "analyze" event, a "reanalyze" event, "report[_FOO]" events,
and arbitrarily-named other events.
The abrt-handle-crashdump tool can be used to "run" an event on a directory,
or to query the list of possible events for a directory.
The /etc/abrt/abrt_event.conf file describes what should be done
on each event.

When a user (admin) wants to see the list of dumped crashes and
process them, he runs abrt-gui or abrt-cli. These programs
perform a dbus call to "com.redhat.abrt" on the system dbus.
If there is no program with this name on it, dbus autostart
will invoke abrtd, which registers "com.redhat.abrt"
and processes the call(s).

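The event mechanism described above ("run" an event on a directory, or
list the possible events) can be illustrated with a small Python sketch.
This is not abrt's implementation: the rule table is a hypothetical
in-memory stand-in for /etc/abrt/abrt_event.conf (whose real syntax is
not shown here), the command names are made up, and both the "-d" argument
on those commands and the rule that a bare "report" selects every
"report_FOO" entry are assumptions made for illustration.

```python
# Hypothetical stand-in for the rules in /etc/abrt/abrt_event.conf.
# Event names follow the text; the command names are invented.
RULES = [
    ("post-create", "hypothetical-save-package-info"),
    ("analyze", "hypothetical-generate-backtrace"),
    ("report_Bugzilla", "hypothetical-report-bugzilla"),
    ("report_Mailx", "hypothetical-report-mail"),
]


def list_events(rules):
    """Return the distinct event names, in first-seen order."""
    seen = []
    for name, _command in rules:
        if name not in seen:
            seen.append(name)
    return seen


def commands_for(rules, event):
    """Return the commands configured for `event`.

    Assumption: asking for "report" also selects "report_FOO" rules,
    loosely mirroring the "report[_FOO]" naming in the text.
    """
    return [
        command
        for name, command in rules
        if name == event or name.startswith(event + "_")
    ]


def run_event(rules, event, dump_dir):
    """Build (but do not execute) the command lines for one event."""
    return [f"{command} -d {dump_dir}" for command in commands_for(rules, event)]
```

Calling run_event(RULES, "post-create", "/var/spool/abrt/DIR") yields the
single command line a hook would spawn for a fresh dump directory; an event
with no rules yields an empty list, so nothing is run.
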
The key dbus calls served by abrtd are:

- GetCrashInfos(): returns a vector_map_crash_data_t
    (vector_map_vector_string_t) of crashes for the given uid:
    v[N]["executable"/"uid"/"kernel"/"backtrace"][N] = "contents"
- CreateReport(/var/spool/abrt/DIR): starts creating a report.
    Returns job id (uint64).
    Then abrtd runs the "analyze" event on the DIR.
    After it completes, when the report creation thread has finished,
    the JobDone(client_dbus_ID,/var/spool/abrt/DIR) dbus signal is emitted.
    [Problem: how to do privileged plugin specific actions?]
        Solution: if a plugin needs access to some root-only
        accessible dir, then abrt should be run by root anyway
        - debuginfo gets installed using pk-debuginfo-install,
          which takes care of privileges itself, so no problem here
- GetJobResult(/var/spool/abrt/DIR): returns map_crash_data_t
    (map_vector_string_t)
- Report(map_crash_data_t (map_vector_string_t)):
    "Please report this crash": abrtd runs "report[_FOO]" event(s)
    on the DIR.
    Returns report_status_t (map_vector_string_t) - the status
    of each event
- DeleteDebugDump(/var/spool/abrt/DIR): delete /var/spool/abrt/DIR.
    Returns bool

[Note: we are in the process of reducing/eliminating
dbus communication between abrt-gui/abrt-cli and abrtd.
It seems we will be able to reduce dbus messages to a "crash occurred"
signal and a "DeleteDebugDump(DIR)" call]


Development plan

Since the current code does not match the planned design, we need
to gradually change the code to "morph" it into the desired shape.

Done:

* Make abrtd dbus startable.
* Add -t TIMEOUT_SEC option to abrtd.
* Make abrt-gui start abrtd on demand, so that abrt-gui can be started
  even if abrtd does not run at the moment.
* make kerneloops plugin into separate daemon (convert it to a hook
  and get rid of "cron plugins", which were a wrong idea
  since the beginning)
* make C/C++ hook to be started by init script
* add "include FILE" feature to abrt_event.conf

Planned steps:

* make kerneloops plugin into separate service?
* hooks will start the daemon on-demand using dbus
    - I'm not sure this is a good idea, but since dbus is becoming
      effectively impossible to uninstall on Fedora, it's probably ok
* simplify abrt.conf:
    - move all plugin related info to plugins/.conf
        - enabled, action association, etc ...
    - make abrtd parse plugins/*.conf and set the config options
      that it understands
        - this will fix the case when this is in abrt.conf:
            [Cron]
            KerneloopsScanner = 120
          because this should be in plugins/kerneloops.conf and thus
          shouldn't exist if kerneloops-addon is not installed
* ???
* ???
* ???
* ???
* ???
* Take over the world
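
The planned per-plugin configuration split can be sketched in Python.
This is only an illustration of the idea, not abrtd's parser: it assumes
plugin files contain bare "key = value" lines, it takes file contents
from an in-memory dict instead of reading plugins/*.conf from disk,
and the option names "Enabled" and "ScanInterval" are hypothetical.

```python
import configparser


def parse_plugin_conf(name, text):
    """Parse one plugin's settings into a plain dict.

    Assumption: the file holds bare "key = value" lines, so they are
    wrapped in a section named after the plugin to satisfy configparser.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # preserve the case of option names
    parser.read_string(f"[{name}]\n{text}")
    return dict(parser[name])


def load_plugin_settings(installed):
    """Map plugin name -> settings, for installed plugins only.

    `installed` maps a plugin name to the text of its conf file;
    a plugin that is not installed simply has no entry.
    """
    return {name: parse_plugin_conf(name, text) for name, text in installed.items()}
```

With this layout a setting such as the kerneloops scan interval exists
only when the plugin's own file is present, which is the property
the plan asks for.
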