diff options
| author | Karel Klic <kklic@redhat.com> | 2011-03-08 19:38:38 +0100 |
|---|---|---|
| committer | Karel Klic <kklic@redhat.com> | 2011-03-08 19:38:38 +0100 |
| commit | cb552a48f1b1fe95e2ea9f63d3c860eeccfdbe08 (patch) | |
| tree | e11f538c85f66e31b204589ac5583d0ff5da42fc | |
| parent | 5fbd48fabf60b2e6e9470a9315307c154126d971 (diff) | |
retrace server documentation in texinfo format
| -rw-r--r-- | Makefile.am | 2 | ||||
| -rw-r--r-- | configure.ac | 1 | ||||
| -rw-r--r-- | doc/Makefile.am | 1 | ||||
| -rw-r--r-- | doc/retrace-server.texi | 800 |
4 files changed, 803 insertions, 1 deletions
diff --git a/Makefile.am b/Makefile.am index a536b7c7..6f38997f 100644 --- a/Makefile.am +++ b/Makefile.am @@ -1,5 +1,5 @@ ACLOCAL_AMFLAGS = -I m4 -SUBDIRS = src po icons tests +SUBDIRS = src po icons tests doc DISTCHECK_CONFIGURE_FLAGS = \ --with-systemdsystemunitdir=$$dc_install_base/$(systemdsystemunitdir) diff --git a/configure.ac b/configure.ac index bf10138d..b6999daf 100644 --- a/configure.ac +++ b/configure.ac @@ -121,6 +121,7 @@ AC_CONFIG_HEADERS([config.h]) AC_CONFIG_FILES([ Makefile abrt.pc + doc/Makefile src/include/Makefile src/lib/Makefile src/report-python/Makefile diff --git a/doc/Makefile.am b/doc/Makefile.am new file mode 100644 index 00000000..43c0bc6d --- /dev/null +++ b/doc/Makefile.am @@ -0,0 +1 @@ +info_TEXINFOS = retrace-server.texi diff --git a/doc/retrace-server.texi b/doc/retrace-server.texi new file mode 100644 index 00000000..43f05627 --- /dev/null +++ b/doc/retrace-server.texi @@ -0,0 +1,800 @@ +\input texinfo +@c retrace-server.texi - Retrace Server Documentation +@c +@c .texi extension is recommended in GNU Automake manual +@setfilename retrace-server.info +@include version.texi + +@settitle Retrace server for ABRT @value{VERSION} Manual + +@dircategory Retrace server +@direntry +* Retrace server: (retrace-server). Remote coredump analysis via HTTP. +@end direntry + +@titlepage +@title Retrace server +@subtitle for ABRT version @value{VERSION}, @value{UPDATED} +@author Karel Klic (@email{kklic@@redhat.com}) +@page +@vskip 0pt plus 1filll +@end titlepage + +@contents + +@ifnottex +@node Top +@top Retrace server + +This manual is for retrace server for ABRT version @value{VERSION}, +@value{UPDATED}. The retrace server provides a coredump analysis and +backtrace generation service over a network using HTTP protocol. 
+@end ifnottex + +@menu +* Overview:: +* HTTP interface:: +* Retrace worker:: +* Package repository:: +* Traffic and load estimation:: +* Security:: +* Future work:: +@end menu + +@node Overview +@chapter Overview + +A client sends a coredump (created by the Linux kernel) together with +some additional information to the server, and gets a backtrace +generation task ID in response. Then the client, after some time, asks +the server for the task status, and when the task is done (the backtrace +has been generated from the coredump), the client downloads the +backtrace. If the backtrace generation fails, the client gets an error +code and downloads a log indicating what happened. Alternatively, the +client sends a coredump, and keeps receiving the server response +message. The server then periodically sends the status of the task via +the response's body, and delivers the resulting backtrace as soon as +it's ready. + +The retrace server must be able to support multiple operating +systems and their releases (Fedora N-1, N, Rawhide, Branched Rawhide, +RHEL), and multiple architectures within a single installation. + +The retrace server consists of the following parts: +@enumerate +@item +abrt-retrace-server: a HTTP interface script handling the +communication with clients, task creation and management +@item +abrt-retrace-worker: a program doing the environment preparation +and coredump processing +@item +package repository: a repository placed on the server containing +all the application binaries, libraries, and debuginfo necessary for +backtrace generation +@end enumerate + +@node HTTP interface +@chapter HTTP interface + +@menu +* Creating a new task:: +* Task status:: +* Requesting a backtrace:: +* Requesting a log:: +* Task cleanup:: +* Limiting traffic:: +@end menu + +The HTTP interface application is a script written in Python.
The +script is named @file{abrt-retrace-server}, and it uses the +@uref{http://www.python.org/dev/peps/pep-0333/, Python Web Server +Gateway Interface} (WSGI) to interact with the web server. +Administrators may use +@uref{http://code.google.com/p/modwsgi/, mod_wsgi} to run +@command{abrt-retrace-server} on Apache. mod_wsgi is part of +both Fedora 12 and RHEL 6. The Python language is a good choice for +this application, because it supports HTTP handling well, and it is +already used in ABRT. + +Only secure (HTTPS) connections must be allowed for the communication +with @command{abrt-retrace-server}, because coredumps and backtraces are +private data. Users may decide to publish their backtraces in a bug +tracker after reviewing them, but the retrace server doesn't do +that. The HTTPS requirement must be specified in the server's man +page. The server must support HTTP persistent connections to avoid +frequent SSL renegotiations. The server's manual page should include a +recommendation for administrators to check that the persistent +connections are enabled. + +@node Creating a new task +@section Creating a new task + +A client might create a new task by sending a HTTP request to the +@indicateurl{https://server/create} URL, and providing an archive as the +request content. The archive must contain crash data files. The crash +data files are a subset of the local +@file{/var/spool/abrt/ccpp-time-pid/} directory contents, so the client +only needs to pack and upload them. + +The server must support uncompressed tar archives, and tar archives +compressed with gzip and xz. Uncompressed archives are the most +efficient way for local network delivery, and gzip can be used there +as well because of its good compression speed.
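The client-side packing step described above can be sketched in Python. This is a minimal illustration, not the actual ABRT client; the required file list follows this specification, and the @file{packages} file is (as noted later in this manual) still hypothetical:

```python
import os
import tarfile
import tempfile

# Files the server expects in the uploaded archive (per this spec);
# the "packages" file does not exist in ABRT yet.
REQUIRED_FILES = ["coredump", "architecture", "release", "packages"]

def pack_crash_dir(crash_dir, archive_path):
    """Pack only the required crash-data files into an xz-compressed
    tar archive; the server rejects archives containing extra files."""
    with tarfile.open(archive_path, "w:xz", preset=2) as tar:
        for name in REQUIRED_FILES:
            tar.add(os.path.join(crash_dir, name), arcname=name)
    return archive_path
```

The xz preset 2 matches the compression level this manual recommends for coredump upload.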
+ +The xz compression file format is well suited for public server setup +(slow network), as it provides a good compression ratio, which is +important for compressing large coredumps, and it provides reasonable +compress/decompress speed and memory consumption. See @ref{Traffic and +load estimation} for the measurements. The @uref{http://tukaani.org/xz/, XZ Utils} +implementation with compression level 2 should be used to compress +the data. + +The HTTP request for a new task must use the POST method. It must +contain proper @var{Content-Length} and @var{Content-Type} fields. If +the method is not POST, the server must return the "405 Method Not +Allowed" HTTP error code. If the @var{Content-Length} field is missing, +the server must return the "411 Length Required" HTTP error code. If a +@var{Content-Type} other than @samp{application/x-tar}, +@samp{application/x-gzip}, @samp{application/x-xz} is used, the server +must return the "415 Unsupported Media Type" HTTP error code. If the +@var{Content-Length} value is greater than a limit set in the server +configuration file (50 MB by default), or the real HTTP request size +gets larger than the limit + 10 KB for headers, then the server must +return the "413 Request Entity Too Large" HTTP error code, and provide +an explanation, including the limit, in the response body. The limit +must be changeable from the server configuration file. + +If there is less than 20 GB of free disk space in the +@file{/var/spool/abrt-retrace} directory, the server must return +the "507 Insufficient Storage" HTTP error code. The server must return +the same HTTP error code if decompressing the received archive would +cause the free disk space to become less than 20 GB. The 20 GB limit +must be changeable from the server configuration file.
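The header checks above can be sketched as a small validation helper. The limits are hard-coded here for illustration only; the real server is required to read them from its configuration file:

```python
# Content types the /create endpoint accepts (per this spec).
ALLOWED_TYPES = {"application/x-tar", "application/x-gzip", "application/x-xz"}
SIZE_LIMIT = 50 * 1024 * 1024       # 50 MB default, configurable
HEADER_SLACK = 10 * 1024            # extra 10 KB allowed for headers

def check_create_request(method, content_type, content_length):
    """Return the HTTP error status a /create request should get,
    or None if the request passes all the header checks."""
    if method != "POST":
        return "405 Method Not Allowed"
    if content_length is None:
        return "411 Length Required"
    if content_type not in ALLOWED_TYPES:
        return "415 Unsupported Media Type"
    if int(content_length) > SIZE_LIMIT + HEADER_SLACK:
        return "413 Request Entity Too Large"
    return None
```

The spec does not prescribe the order of these checks; the order shown is one reasonable choice.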
+ +If the data from the received archive would take more than 500 +MB of disk space when uncompressed, the server must return the "413 +Request Entity Too Large" HTTP error code, and provide an explanation, +including the limit, in the response body. The size limit must be +changeable from the server configuration file. It can be set pretty +high because coredumps, which take most of the disk space, are stored on the +server only temporarily until the backtrace is generated. When the +backtrace is generated, the coredump is deleted by the +@command{abrt-retrace-worker}, so most disk space is released. + +The uncompressed data size for xz archives can be obtained by calling +@code{`xz --list file.tar.xz`}. The @option{--list} option has been +implemented only recently, so it might be necessary to implement a +method to get the uncompressed data size by extracting the archive to +stdout while counting the extracted bytes, and to call this method if +@option{--list} doesn't work on the server. Likewise, the +uncompressed data size for gzip archives can be obtained by calling +@code{`gzip --list file.tar.gz`}. + +If an upload from a client succeeds, the server creates a new directory +@file{/var/spool/abrt-retrace/@var{id}} and extracts the +received archive into it. Then it checks that the directory contains all +the required files, checks their sizes, and then sends a HTTP +response. After that it spawns a subprocess with +@command{abrt-retrace-worker} on that directory. + +To support multiple architectures, the retrace server needs a GDB +package compiled separately for every supported target architecture +(see the avr-gdb package in Fedora for an example). This is a +technically and economically better solution than using a standalone +machine for every supported architecture and resending coredumps +depending on the client's architecture. However, GDB's support for using a +target architecture different from the host architecture seems to be +fragile.
If it doesn't work, the QEMU user mode emulation should be +tried as an alternative approach. + +The following files from the local crash directory are required to be +present in the archive: @file{coredump}, @file{architecture}, +@file{release}, @file{packages} (this one does not exist yet). If one or +more files are not present in the archive, or some other file is present +in the archive, the server must return the "403 Forbidden" HTTP error +code. If the size of any file except the coredump exceeds 100 KB, the +server must return the "413 Request Entity Too Large" HTTP error code, +and provide an explanation, including the limit, in the response +body. The 100 KB limit must be changeable from the server configuration +file. + +If the file check succeeds, the server HTTP response must have the "201 +Created" HTTP code. The response must include the following HTTP header +fields: +@itemize +@item +@var{X-Task-Id} containing a new server-unique numerical +task id +@item +@var{X-Task-Password} containing a newly generated +password, required to access the result +@item +@var{X-Task-Est-Time} containing a number of seconds the +server estimates it will take to generate the backtrace +@end itemize + +The @var{X-Task-Password} is a random alphanumeric (@samp{[a-zA-Z0-9]}) +sequence 22 characters long. 22 alphanumeric characters correspond to a +128-bit password, because @samp{[a-zA-Z0-9]} = 62 characters, and +@math{2^128} < @math{62^22}. The source of randomness must be, +directly or indirectly, @file{/dev/urandom}. The @code{rand()} function +from glibc and similar functions from other libraries cannot be used +because of their poor characteristics (in several aspects). The password +must be stored in the @file{/var/spool/abrt-retrace/@var{id}/password} file, +so passwords sent by a client in subsequent requests can be verified. + +The task id is intentionally not used as a password, because it is +desirable to keep the id readable and memorable for +humans.
Password-like ids would be a loss when a user authentication +mechanism is added, and the server-generated password is no longer +necessary. + +The algorithm for the @var{X-Task-Est-Time} time estimation +should take the previous analyses of coredumps with the same +corresponding package name into account. The server should store +a simple history in a SQLite database to know how long it takes to +generate a backtrace for a certain package. It could be as simple as +this: +@itemize +@item + initialization step one: @code{CREATE TABLE package_time (id INTEGER + PRIMARY KEY AUTOINCREMENT, package, release, time)}; we need the + @var{id} for the database cleanup - to know the insertion order of + rows, so the @code{AUTOINCREMENT} is important here; the @var{package} + is the package name without the version and release numbers, the + @var{release} column stores the operating system, and the @var{time} + is the number of seconds it took to generate the backtrace +@item + initialization step two: @code{CREATE INDEX package_release ON + package_time (package, release)}; we compute the time only for a single + package on a single supported OS release per query, so it makes sense to + create an index to speed it up +@item + when a task is finished: @code{INSERT INTO package_time (package, + release, time) VALUES ('??', '??', '??')} +@item + to get the average time: @code{SELECT AVG(time) FROM package_time + WHERE package == '??' AND release == '??'}; the arithmetic mean seems + to be sufficient here +@end itemize + +So the server knows that crashes from an OpenOffice.org package +take 5 minutes to process on average, and it can return the value 300 +(seconds) in the field. The client does not waste time asking about +that task every 20 seconds, but the first status request comes after +300 seconds. And even when the package changes (rebases etc.), the +database provides good estimations after some time anyway +(the @ref{Task cleanup} chapter describes how the +data are pruned).
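The bookkeeping steps above could look like this in Python, using the schema from the list. This is a sketch, not the real server's code; the default estimate for an unknown package is an assumption, and the @code{release} column is quoted in SQL here because @code{RELEASE} is an SQLite keyword:

```python
import sqlite3

def init_db(conn):
    # Schema from the steps above; "release" is quoted because RELEASE
    # is an SQLite keyword (RELEASE SAVEPOINT).
    conn.execute('CREATE TABLE IF NOT EXISTS package_time '
                 '(id INTEGER PRIMARY KEY AUTOINCREMENT, package, "release", time)')
    conn.execute('CREATE INDEX IF NOT EXISTS package_release '
                 'ON package_time (package, "release")')

def record_retrace_time(conn, package, release, seconds):
    # Called when a task is finished.
    conn.execute('INSERT INTO package_time (package, "release", time) '
                 'VALUES (?, ?, ?)', (package, release, seconds))

def estimate_seconds(conn, package, release, default=60):
    # Arithmetic mean of the recorded times, as suggested above; the
    # fallback for packages with no history is a hypothetical choice.
    row = conn.execute('SELECT AVG(time) FROM package_time '
                       'WHERE package = ? AND "release" = ?',
                       (package, release)).fetchone()
    return int(row[0]) if row[0] is not None else default
```

The `AVG` query returns `NULL` for an unseen package, which is why the fallback branch exists.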
+ +The server response HTTP body is generated and sent +gradually as the task is performed. The client chooses either to receive +the body, or to terminate after getting all headers and ask the server +for status and backtrace asynchronously. + +The server re-sends the output of abrt-retrace-worker (its stdout and +stderr) to the response body. In addition, a line with the task +status is added in the form @code{X-Task-Status: PENDING} to the body +every 5 seconds. When the worker process ends, either a +@samp{FINISHED_SUCCESS} or @samp{FINISHED_FAILURE} status line is +sent. If it's @samp{FINISHED_SUCCESS}, the backtrace is attached after +this line. Then the response body is closed. + +@node Task status +@section Task status + +A client might request a task status by sending a HTTP GET request to +the @indicateurl{https://someserver/@var{id}} URL, where @var{id} is the +numerical task id returned in the @var{X-Task-Id} field by +@indicateurl{https://someserver/create}. If the @var{id} is not in the +valid format, or the task @var{id} does not exist, the server must +return the "404 Not Found" HTTP error code. + +The client request must contain the @var{X-Task-Password} field, and its +content must match the password stored in the +@file{/var/spool/abrt-retrace/@var{id}/password} file. If the password is +not valid, the server must return the "403 Forbidden" HTTP error code. + +If the checks pass, the server returns the "200 OK" HTTP code, and +includes a field @var{X-Task-Status} containing one of the following +values: @samp{FINISHED_SUCCESS}, @samp{FINISHED_FAILURE}, +@samp{PENDING}. + +The field contains @samp{FINISHED_SUCCESS} if the file +@file{/var/spool/abrt-retrace/@var{id}/backtrace} exists. The client might +get the backtrace on the @indicateurl{https://someserver/@var{id}/backtrace} +URL. The log might be obtained on the +@indicateurl{https://someserver/@var{id}/log} URL, and it might contain +warnings about some missing debuginfos etc.
+ +The field contains @samp{FINISHED_FAILURE} if the file +@file{/var/spool/abrt-retrace/@var{id}/backtrace} does not exist, and file +@file{/var/spool/abrt-retrace/@var{id}/retrace-log} exists. The retrace-log +file containing error messages can be downloaded by the client from the +@indicateurl{https://someserver/@var{id}/log} URL. + +The field contains @samp{PENDING} if neither file exists. The client +should ask again after 10 seconds or later. + +@node Requesting a backtrace +@section Requesting a backtrace + +A client might request a backtrace by sending a HTTP GET request to the +@indicateurl{https://someserver/@var{id}/backtrace} URL, where @var{id} +is the numerical task id returned in the @var{X-Task-Id} field by +@indicateurl{https://someserver/create}. If the @var{id} is not in the +valid format, or the task @var{id} does not exist, the server must +return the "404 Not Found" HTTP error code. + +The client request must contain the @var{X-Task-Password} field, and its +content must match the password stored in the +@file{/var/spool/abrt-retrace/@var{id}/password} file. If the password +is not valid, the server must return the "403 Forbidden" HTTP error +code. + +If the file @file{/var/spool/abrt-retrace/@var{id}/backtrace} does not +exist, the server must return the "404 Not Found" HTTP error code. +Otherwise it returns the file contents, and the @var{Content-Type} field +must contain @samp{text/plain}. + +@node Requesting a log +@section Requesting a log + +A client might request a task log by sending a HTTP GET request to the +@indicateurl{https://someserver/@var{id}/log} URL, where @var{id} is the +numerical task id returned in the @var{X-Task-Id} field by +@indicateurl{https://someserver/create}. If the @var{id} is not in the +valid format, or the task @var{id} does not exist, the server must +return the "404 Not Found" HTTP error code. 
+ +The client request must contain the @var{X-Task-Password} field, and its +content must match the password stored in the +@file{/var/spool/abrt-retrace/@var{id}/password} file. If the password is +not valid, the server must return the "403 Forbidden" HTTP error code. + +If the file @file{/var/spool/abrt-retrace/@var{id}/retrace-log} does not +exist, the server must return the "404 Not Found" HTTP error code. +Otherwise it returns the file contents, and the @var{Content-Type} field +must contain @samp{text/plain}. + +@node Task cleanup +@section Task cleanup + +Tasks that were created more than 5 days ago must be deleted, because +tasks occupy disk space (not so much space, as the coredumps are deleted +after the retrace, and only backtraces and configuration remain). A +shell script @command{abrt-retrace-clean} must check the creation time +and delete the directories in @file{/var/spool/abrt-retrace/}. The +server administrator is expected to set @command{cron} to call the +script once a day. This assumption must be mentioned in the +@command{abrt-retrace-clean} manual page. + +The database containing packages and processing times should also +be regularly pruned to remain small and provide data quickly. The +cleanup script should delete some rows for packages with too many +entries: +@enumerate +@item +get a list of packages from the database: @code{SELECT DISTINCT +package, release FROM package_time} +@item +for every package, get the row count: @code{SELECT COUNT(*) FROM +package_time WHERE package == '??' AND release == '??'} +@item +for every package with a row count larger than 100, some rows +must be removed so that only the newest 100 rows remain in the +database: +@itemize + @item + to get the highest row id which should be deleted, + execute @code{SELECT id FROM package_time WHERE package == '??' AND + release == '??' 
ORDER BY id LIMIT 1 OFFSET ??}, where the + @code{OFFSET} is the total number of rows for that single + package minus 100 + @item + then all the old rows can be deleted by executing @code{DELETE + FROM package_time WHERE package == '??' AND release == '??' AND id + <= ??} +@end itemize +@end enumerate + +@node Limiting traffic +@section Limiting traffic + +The maximum number of simultaneously running tasks must be limited +to 20 by the server. The limit must be changeable from the server +configuration file. If a new request comes when the server is fully +occupied, the server must return the "503 Service Unavailable" HTTP +error code. + +The archive extraction, chroot preparation, and gdb analysis are +mostly limited by the hard drive size and speed. + +@node Retrace worker +@chapter Retrace worker + +The worker (@command{abrt-retrace-worker} binary) gets a +@file{/var/spool/abrt-retrace/@var{id}} directory as input. The worker +reads the operating system name and version, the coredump, and the list +of packages needed for retracing (a package containing the binary which +crashed, and packages with the libraries that are used by the binary). + +The worker prepares a new @file{chroot} subdirectory with the packages, +their debuginfo, and gdb installed. In other words, a new directory +@file{/var/spool/abrt-retrace/@var{id}/chroot} is created and +the packages are unpacked or installed into this directory, so for +example the gdb ends up as +@file{/var/.../@var{id}/chroot/usr/bin/gdb}. + +After the @file{chroot} subdirectory is prepared, the worker moves the +coredump there and changes the root of a child script to it (using the +chroot system function). The child script runs gdb on the coredump, and +gdb sees the crashing binary, all the debuginfo, and all +the proper versions of libraries in the right places.
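The fork-and-chroot step might look roughly like this in Python. This is a sketch only: the gdb arguments, the coredump path inside the chroot, and the lack of error handling are all illustrative, and @code{os.chroot} requires root privileges:

```python
import os

def gdb_argv(binary, coredump):
    """Build the gdb command line run inside the chroot; the batch-mode
    backtrace command here is an assumption, not the real worker's."""
    return ["gdb", "--batch",
            "-ex", "thread apply all backtrace full",
            binary, coredump]

def run_retrace(task_dir, crashed_binary):
    """Fork a child, chroot it into the prepared subdirectory, and run
    gdb on the coredump there; the parent waits for the result."""
    chroot_dir = os.path.join(task_dir, "chroot")
    pid = os.fork()
    if pid == 0:
        os.chroot(chroot_dir)   # child: from here on it sees only
        os.chdir("/")           # the prepared environment
        os.execvp("gdb", gdb_argv(crashed_binary, "/coredump"))
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Running gdb only after the chroot means the child cannot accidentally pick up libraries or debuginfo from the server's own filesystem.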
+ +When the gdb run is finished, the worker copies the resulting backtrace +to the @file{/var/spool/abrt-retrace/@var{id}/backtrace} file and stores a +log from the whole chroot process to the @file{retrace-log} file in the +same directory. Then it removes the @file{chroot} directory. + +The GDB installed into the chroot must: +@itemize +@item +run on the server (same architecture, or we can use +@uref{http://wiki.qemu.org/download/qemu-doc.html#QEMU-User-space-emulator, QEMU +user space emulation}) +@item +process the coredump (possibly from another architecture): that +means we need a special GDB for every supported architecture +@item +be able to handle coredumps created in an environment with prelink +enabled +(@uref{http://sourceware.org/ml/gdb/2009-05/msg00175.html, should +not} be a problem) +@item +use libc, zlib, readline, ncurses, expat and Python packages, +while the version numbers required by the coredump might be different +from what is required by the GDB +@end itemize + +The gdb might fail to run with certain combinations of package +dependencies. Nevertheless, we need to provide the libc/Python/* +package versions which are required by the coredump. If we did not +do that, the backtraces generated from such an environment would be of +lower quality. Consider a coredump which was caused by a crash of a +Python application on a client, and which we analyze on the retrace +server with a completely different version of Python because the +client's Python version is not compatible with our GDB. + +We can solve the issue by installing the GDB package dependencies first, +moving their binaries to some safe place (@file{/lib/gdb} in the chroot), +and creating the @file{/etc/ld.so.preload} file pointing to that place, or +setting @env{LD_LIBRARY_PATH}.
Then we can unpack libc binaries and +other packages and their versions as required by the coredump to the +common paths, and the GDB would run happily, using the libraries from +@file{/lib/gdb} and not those from @file{/lib} and @file{/usr/lib}. This +approach can use standard GDB builds with various target architectures: +gdb, gdb-i386, gdb-ppc64, gdb-s390 (nonexistent in Fedora/EPEL at the +time of writing this). + +The GDB and its dependencies are stored separately from the packages +used as data for coredump processing. A single combination of GDB and +its dependencies can be used across all supported OS releases to generate +backtraces. + +The retrace worker must be able to prepare a chroot-ready environment +for any supported operating system, which may differ from the +retrace server's operating system. It needs to fake the @file{/dev} +directory and create some basic files in @file{/etc} like @file{passwd} +and @file{hosts}. We can use the @uref{https://fedorahosted.org/mock/, +mock} library to do that, as it does almost what we need (but not +exactly as it has a strong focus on preparing the environment for +rpmbuild and running it), or we can come up with our own solution, while +stealing some code from the mock library. The @file{/usr/bin/mock} +executable is not useful for the retrace server, but the +underlying Python library can be used. So if we would like to use mock, an +ABRT-specific interface to the mock library must be written, or the +retrace worker must be written in Python and use the mock Python library +directly. + +We should save some time and disk space by extracting only binaries +and dynamic libraries from the packages for the coredump analysis, and +omitting all other files. We can save even more time and disk space by +extracting only the libraries and binaries really referenced by the +coredump (eu-unstrip tells us which).
Packages should not be +@emph{installed} into the chroot; they should only be @emph{extracted}, +because we use them as a data source, and we never run them. + +Another idea to be considered is that we can avoid the package +extraction if we can teach GDB to read the dynamic libraries, the +binary, and the debuginfo directly from the RPM packages. We would +provide a backend to GDB which can do that, and provide a tiny front-end +program which tells the backend which RPMs it should use and then runs +the GDB command loop. The result would be a GDB wrapper/extension we +need to maintain, but it should end up pretty small. We would use +Python to write our extension, as we do not want to (inelegantly) +maintain a patch against GDB core. We need to ask GDB people if the +Python interface is capable of handling this idea, and how much work +it would be to implement it. + +@node Package repository +@chapter Package repository + +We should support every Fedora release with all packages that ever +made it to the updates and updates-testing repositories. In order to +provide all those packages, a local repository is maintained for every +supported operating system. The debuginfos might be provided by a +debuginfo server in the future (it will save the server disk space). We +should support the usage of local debuginfo first, and add the +debuginfofs support later. + +A repository with Fedora packages must be maintained locally on the +server to provide good performance and to provide data from older +packages already removed from the official repositories. We need a +package downloader, which scans Fedora servers for new packages, and +downloads them so they are immediately available. + +Older versions of packages are regularly deleted from the updates +and updates-testing repositories.
We must support older versions of +packages, because that is one of two major pain-points that the +retrace server is supposed to solve (the other one is the slowness of +debuginfo download and debuginfo disk space requirements). + +A script abrt-reposync must download packages from Fedora +repositories, but it must not delete older versions of the +packages. The retrace server administrator is supposed to call this +script using cron every ~6 hours. This expectation must be documented +in the abrt-reposync manual page. The script can use the wget, rsync, +or reposync tool to get the packages. The remote yum source +repositories must be configured from a configuration file or files +(@file{/etc/yum.repos.d} might be used). + +When abrt-reposync is used to sync with the Rawhide repository, +unneeded packages (where a newer version exists) must be removed after +residing for one week alongside the newer package in the same repository. + +All the unneeded content from the newly downloaded packages should be +removed to save disk space and speed up chroot creation. We need just +the binaries and dynamic libraries, and that is a tiny part of package +contents. + +The packages should be downloaded to a local repository in +@file{/var/cache/abrt-repo/@{fedora12,fedora12-debuginfo,...@}}. + +@node Traffic and load estimation +@chapter Traffic and load estimation + +2500 bugs are reported from ABRT every month. Approximately 7.3% +of them are Python exceptions, which don't need a retrace +server. That means that 2315 bugs need a retrace server. That is 77 +bugs per day, or 3.3 bugs every hour on average. Occasional spikes +might be much higher (imagine a user who decided to report all his 8 +crashes from last month). + +We should probably not try to predict if the monthly bug count goes up +or down. New, untested versions of software are added to Fedora, but +on the other hand most software matures and becomes less crashy.
So +let's assume that the bug count stays approximately the same. + +Test crashes (see that we should probably use @code{`xz -2`} +to compress coredumps): +@itemize +@item +firefox with 7 tabs (random pages opened), coredump size 172 MB +@itemize +@item +xz compression +@itemize +@item +compression level 6 (default): compression took 32.5 sec, compressed +size 5.4 MB, decompression took 2.7 sec +@item +compression level 3: compression took 23.4 sec, compressed size 5.6 MB, +decompression took 1.6 sec +@item +compression level 2: compression took 6.8 sec, compressed size 6.1 MB, +decompression took 3.7 sec +@item +compression level 1: compression took 5.1 sec, compressed size 6.4 MB, +decompression took 2.4 sec +@end itemize +@item +gzip compression +@itemize +@item +compression level 9 (highest): compression took 7.6 sec, compressed size +7.9 MB, decompression took 1.5 sec +@item +compression level 6 (default): compression took 2.6 sec, compressed size +8 MB, decompression took 2.3 sec +@item +compression level 3: compression took 1.7 sec, compressed size 8.9 MB, +decompression took 1.7 sec +@end itemize +@end itemize +@item +thunderbird with thousands of emails opened, coredump size 218 MB +@itemize +@item +xz compression +@itemize +@item +compression level 6 (default): compression took 60 sec, compressed size +12 MB, decompression took 3.6 sec +@item +compression level 3: compression took 42 sec, compressed size 13 MB, +decompression took 3.0 sec +@item +compression level 2: compression took 10 sec, compressed size 14 MB, +decompression took 3.0 sec +@item +compression level 1: compression took 8.3 sec, compressed size 15 MB, +decompression took 3.2 sec +@end itemize +@item +gzip compression +@itemize +@item +compression level 9 (highest): compression took 14.9 sec, compressed +size 18 MB, decompression took 2.4 sec +@item +compression level 6 (default): compression took 4.4 sec, compressed size +18 MB, decompression took 2.2 sec +@item +compression level 3: 
compression took 2.7 sec, compressed size 20 MB, +decompression took 3 sec +@end itemize +@end itemize +@item +evince with 2 pdfs (1 and 42 pages) opened, coredump size 73 MB +@itemize +@item +xz compression +@itemize +@item +compression level 2: compression took 2.9 sec, compressed size 3.6 MB, +decompression took 0.7 sec +@item +compression level 1: compression took 2.5 sec, compressed size 3.9 MB, +decompression took 0.7 sec +@end itemize +@end itemize +@item +OpenOffice.org Impress with a 25-page presentation, coredump size 116 MB +@itemize +@item +xz compression +@itemize +@item +compression level 2: compression took 7.1 sec, compressed size 12 MB, +decompression took 2.3 sec +@end itemize +@end itemize +@end itemize + +So let's imagine there are some users that want to report their +crashes approximately at the same time. Here is what the retrace +server must handle: +@enumerate +@item +2 OpenOffice crashes +@item +2 evince crashes +@item +2 thunderbird crashes +@item +2 firefox crashes +@end enumerate + +We will use the xz archiver with compression level 2 on the ABRT +side to compress the coredumps. So the users spend 53.6 seconds in +total packaging the coredumps. + +The packaged coredumps total 71.4 MB, and the retrace server must +receive that data. + +The server unpacks the coredumps (perhaps at the same time), so they +need 1158 MB of disk space on the server. The decompression will take +19.4 seconds. + +Several hundred megabytes will be needed to install all the +required packages and debuginfos for every chroot (8 chroots, 1 GB each += 8 GB, but this seems like an extreme, maximal case). Some space will +be saved by using a debuginfofs. + +Note that most applications are not as heavyweight as OpenOffice and +Firefox. + +@node Security +@chapter Security + +The retrace server communicates with two other entities: it accepts +coredumps from users, and it downloads debuginfos and packages from +distribution repositories.
@menu
* Clients::
* Packages and debuginfo::
@end menu

General security from GDB flaws and malicious data is provided by
chroot. GDB accesses the debuginfos, packages, and the coredump from
within the chroot, unable to access the retrace server's environment.
We should consider setting a disk quota for every chroot directory,
and limiting GDB's access to resources using cgroups.

An SELinux policy should be written for both the retrace server's HTTP
interface and the retrace worker.

@node Clients
@section Clients

Clients using the retrace server and sending coredumps to it must
fully trust the retrace server administrator. The administrator must
not try to extract sensitive data from client coredumps. That seems to
be a major weak point of the retrace server idea. However, users of an
operating system already trust the OS provider in various important
matters, so when the retrace server is operated by the operating
system provider, it might be acceptable to users.

We cannot avoid sending clients' coredumps to the retrace server if we
want to generate quality backtraces containing the values of
variables. Minidumps are not an acceptable solution, as they lower the
quality of the resulting backtraces while not improving user security.

Can the retrace server trust clients? We must know what a malicious
client can achieve by crafting a nonstandard coredump that will be
processed by the server's GDB. We should ask GDB experts about this.

Another question is whether we can allow users to provide some
packages and debuginfo together with a coredump. That might be useful
for users who run the operating system with only minor modifications
and still want to use the retrace server. They would send a coredump
together with a few nonstandard packages, and the retrace server would
use the nonstandard packages together with the OS packages to generate
the backtrace. Is it safe?
We must know what a malicious client can achieve by crafting a special
binary and debuginfo that will be processed by the server's GDB.

@node Packages and debuginfo
@section Packages and debuginfo

We can safely download packages and debuginfo from the distribution,
as the packages are signed by the distribution and the package origin
can be verified.

When the debuginfo server is done, the retrace server can safely use
it, as its data will also be signed.

@node Future work
@chapter Future work

@enumerate
@item
Coredump stripping. Jan Kratochvil: With my test of an OpenOffice.org
presentation, the kernel core file has 181 MB; @code{xz -2} of it has
65 MB. According to @code{set target debug 1}, GDB reads only 131406
bytes of it (incl. the NOTE segment).

@item
Use gdbserver instead of uploading the whole coredump. GDB's gdbserver
cannot process coredumps, but Jan Kratochvil's can:
@example
git://git.fedorahosted.org/git/elfutils.git
  branch: jankratochvil/gdbserver
  src/gdbserver.c
  * Currently threading is not supported.
  * Currently only x86_64 is supported (the NOTE registers layout).
@end example

@item
User management for the HTTP interface. We need multiple
authentication sources (x509 for RHEL).

@item
Make the @file{architecture}, @file{release}, and @file{packages}
files, which must be included in the archive when creating a task,
optional. Allow uploading a coredump without involving tar: just
coredump, coredump.gz, or coredump.xz.

@item
Handle non-standard packages (provided by the user).
@end enumerate

@bye
