author     Karel Klic <kklic@redhat.com>  2010-11-16 17:06:13 +0100
committer  Karel Klic <kklic@redhat.com>  2010-12-02 16:13:42 +0100
commit     2a959ffa76358eef17c4563823a73d62b2a8929e (patch)
tree       31556675325e30bf1032d55f9dfe382b7220dfce /doc
parent     feeab473082eb31e7bf4ca17a624bc683a4b6fb8 (diff)
Added retrace server design document
Diffstat (limited to 'doc')
-rw-r--r--  doc/retrace-server        706
-rw-r--r--  doc/retrace-server.xhtml  860
2 files changed, 1566 insertions, 0 deletions
diff --git a/doc/retrace-server b/doc/retrace-server new file mode 100644 index 00000000..0c16a78e --- /dev/null +++ b/doc/retrace-server @@ -0,0 +1,706 @@ +====================================================================== +Retrace server design +====================================================================== + +The retrace server provides a coredump analysis and backtrace +generation service over a network using HTTP protocol. + +---------------------------------------------------------------------- +Contents +---------------------------------------------------------------------- + +1. Overview +2. HTTP interface + 2.1 Creating a new task + 2.2 Task status + 2.3 Requesting a backtrace + 2.4 Requesting a log file + 2.5 Task cleanup + 2.6 Limiting traffic +3. Retrace worker +4. Package repository +5. Traffic and load estimation +6. Security + 6.1 Clients + 6.2 Packages and debuginfo +7. Future work + +---------------------------------------------------------------------- +1. Overview +---------------------------------------------------------------------- + +A client sends a coredump (created by Linux kernel) together with some +additional information to the server, and gets a backtrace generation +task ID in response. Then the client, after some time, asks the server +for the task status, and when the task is done (backtrace has been +generated from the coredump), the client downloads the backtrace. If +the backtrace generation fails, the client gets an error code and +downloads a log indicating what happened. Alternatively, the client +sends a coredump, and keeps receiving the server response +message. Server then, via the response's body, periodically sends +status of the task, and delivers the resulting backtrace as soon as +it's ready. + +The retrace server must be able to support multiple operating systems +and their releases (Fedora N-1, N, Rawhide, Branched Rawhide, RHEL), +and multiple architectures within a single installation. 
+ +The retrace server consists of the following parts: +1. abrt-retrace-server: an HTTP interface script handling the + communication with clients, task creation and management +2. abrt-retrace-worker: a program doing the environment preparation + and coredump processing +3. package repository: a repository placed on the server containing + all the application binaries, libraries, and debuginfo necessary + for backtrace generation + +---------------------------------------------------------------------- +2. HTTP interface +---------------------------------------------------------------------- + +The HTTP interface application is a script written in Python. The +script is named abrt-retrace-server, and it uses the Python Web Server +Gateway Interface (WSGI, http://www.python.org/dev/peps/pep-0333/) to +interact with the web server. Administrators may use mod_wsgi +(http://code.google.com/p/modwsgi/) to run abrt-retrace-server on +Apache. mod_wsgi is part of both Fedora 12 and RHEL 6. The +Python language is a good choice for this application, because it +supports HTTP handling well, and it is already used in ABRT. + +Only secure (HTTPS) communication must be allowed for +communication with abrt-retrace-server, because coredumps and +backtraces are private data. Users may decide to publish their +backtraces in a bug tracker after reviewing them, but the retrace +server doesn't do that. The HTTPS requirement must be specified in the +server's man page. The server must support HTTP persistent connections +to avoid frequent SSL renegotiations. The server's manual page +should include a recommendation for administrators to check that +persistent connections are enabled. 
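The subsections that follow define four URL patterns (POST /create, GET /<id>, GET /<id>/backtrace, GET /<id>/log). A dispatch routine for them might look like the sketch below; the handler names are hypothetical and not part of the design:

```python
def route(method, path):
    """Map an HTTP method and URL path to a handler name.

    Sketch only: handler names are illustrative. Returns None for
    requests that should get a "404 Not Found" response.
    """
    parts = [p for p in path.split("/") if p]
    if method == "POST" and parts == ["create"]:
        return "create_task"
    if method == "GET" and parts and parts[0].isdigit():
        if len(parts) == 1:
            return "task_status"          # GET /<id>
        if len(parts) == 2 and parts[1] in ("backtrace", "log"):
            return "task_" + parts[1]     # GET /<id>/backtrace or /<id>/log
    return None
```

Checking that the first path component is numeric also implements the "if the <id> is not in the valid format, return 404" rule from sections 2.2 to 2.4.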
+ +---------------------------------------------------------------------- +2.1 Creating a new task +---------------------------------------------------------------------- + +A client might create a new task by sending an HTTP request to the +https://server/create URL, and providing an archive as the request +content. The archive must contain crash data files. The crash data +files are a subset of the local /var/spool/abrt/ccpp-time-pid/ +directory contents, so the client must only pack and upload them. + +The server must support uncompressed tar archives, and tar archives +compressed with gzip and xz. Uncompressed archives are the most +efficient way for local network delivery, and gzip can be used there +as well because of its good compression speed. + +The xz compression file format is well suited for public server setup +(slow network), as it provides a good compression ratio, which is +important for compressing large coredumps, and it provides reasonable +compress/decompress speed and memory consumption (see the chapter '5 +Traffic and load estimation' for the measurements). The XZ Utils +implementation with the compression level 2 should be used to compress +the data. + +The HTTP request for a new task must use the POST method. It must +contain proper 'Content-Length' and 'Content-Type' fields. If the +method is not POST, the server must return the "405 Method Not +Allowed" HTTP error code. If the 'Content-Length' field is missing, +the server must return the "411 Length Required" HTTP error code. If a +'Content-Type' other than 'application/x-tar', 'application/x-gzip', +'application/x-xz' is used, the server must return the "415 +Unsupported Media Type" HTTP error code. 
If the 'Content-Length' value +is greater than a limit set in the server configuration file (30 MB by +default), or the real HTTP request size gets larger than the limit + +10 KB for headers, then the server must return the "413 Request Entity +Too Large" HTTP error code, and provide an explanation, including the +limit, in the response body. The limit must be changeable from the +server configuration file. + +If there is less than 20 GB of free disk space in the +/var/spool/abrt-retrace directory, the server must return the "507 +Insufficient Storage" HTTP error code. The server must return the same +HTTP error code if decompressing the received archive would cause the +free disk space to become less than 20 GB. The 20 GB limit must be +changeable from the server configuration file. + +If the data from the received archive would take more than 500 MB of +disk space when uncompressed, the server must return the "413 Request +Entity Too Large" HTTP error code, and provide an explanation, +including the limit, in the response body. The size limit must be +changeable from the server configuration file. It can be set pretty +high because coredumps, which take most of the disk space, are stored +on the server only temporarily until the backtrace is generated. When +the backtrace is generated, the coredump is deleted by the +abrt-retrace-worker, so most disk space is released. + +The uncompressed data size for xz archives can be obtained by calling +`xz --list file.tar.xz`. The '--list' option has been implemented only +recently, so it might be necessary to implement a method to get the +uncompressed data size by extracting the archive to the stdout, +counting the extracted bytes, and to call this method if '--list' +doesn't work on the server. Likewise, the uncompressed data size for +gzip archives can be obtained by calling `gzip --list file.tar.gz`. 
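The size check could be sketched as follows. This assumes the `xz` and `gzip` command-line tools are available; the column layout of `xz --robot --list` (uncompressed size in the fifth field of the "file"/"totals" rows) is an assumption to verify against the deployed XZ Utils version:

```python
import subprocess

def uncompressed_size(path):
    """Return the uncompressed size in bytes of a .tar.xz or .tar.gz
    archive without extracting it to disk.

    Sketch: tries `xz --robot --list` first and falls back to streaming
    decompression while counting bytes, since `--list` is only
    available in recent XZ Utils releases.
    """
    if path.endswith(".xz"):
        try:
            out = subprocess.check_output(["xz", "--robot", "--list", path])
            for line in out.decode().splitlines():
                fields = line.split("\t")
                # assumed layout: uncompressed size is the fifth field
                if fields[0] in ("file", "totals"):
                    return int(fields[4])
        except (OSError, subprocess.CalledProcessError):
            pass
        decompress = ["xz", "--decompress", "--stdout", path]
    else:
        decompress = ["gzip", "--decompress", "--stdout", path]
    # fallback: extract to stdout and count the bytes
    proc = subprocess.Popen(decompress, stdout=subprocess.PIPE)
    size = 0
    while True:
        chunk = proc.stdout.read(1 << 16)
        if not chunk:
            break
        size += len(chunk)
    proc.wait()
    return size
```

The streaming fallback never writes decompressed data to disk, so the 500 MB check can run before any extraction into /var/spool/abrt-retrace/<id>.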
+ +If an upload from a client succeeds, the server creates a new +directory /var/spool/abrt-retrace/<id> and extracts the received +archive into it. Then it checks that the directory contains all the +required files, checks their sizes, and then sends an HTTP +response. After that it spawns a subprocess with abrt-retrace-worker +on that directory. + +To support multiple architectures, the retrace server needs a GDB +package compiled separately for every supported target architecture +(see the avr-gdb package in Fedora for an example). This is a +technically and economically better solution than using a standalone +machine for every supported architecture and re-sending coredumps +depending on the client's architecture. However, GDB's support for using a +target architecture different from the host architecture seems to be +fragile. If it doesn't work, the QEMU user mode emulation should be +tried as an alternative approach. + +The following files from the local crash directory are required to be +present in the archive: coredump, architecture, release, packages +(this one does not exist yet). If one or more files are not present in +the archive, or any other file is present in the archive, the server +must return the "403 Forbidden" HTTP error code. If the size of any +file except the coredump exceeds 100 KB, the server must return the +"413 Request Entity Too Large" HTTP error code, and provide an +explanation, including the limit, in the response body. The 100 KB +limit must be changeable from the server configuration file. + +If the file check succeeds, the server HTTP response must have the +"201 Created" HTTP code. 
The response must include the following HTTP +header fields: +- "X-Task-Id" containing a new server-unique numerical task id +- "X-Task-Password" containing a newly generated password, required to + access the result +- "X-Task-Est-Time" containing a number of seconds the server + estimates it will take to generate the backtrace + +The 'X-Task-Password' is a random alphanumeric ([a-zA-Z0-9]) sequence +22 characters long. 22 alphanumeric characters correspond to a 128-bit +password, because [a-zA-Z0-9] = 62 characters, and 2^128 < 62^22. The +source of randomness must be, directly or indirectly, +/dev/urandom. The rand() function from glibc and similar functions +from other libraries cannot be used because of their poor +characteristics (in several aspects). The password must be stored in +the /var/spool/abrt-retrace/<id>/password file, so passwords sent by a +client in subsequent requests can be verified. + +The task id is intentionally not used as a password, because it is +desirable to keep the id readable and memorable for +humans. Password-like ids would be a loss when a user authentication +mechanism is added, and server-generated passwords are no longer +necessary. + +The algorithm for the "X-Task-Est-Time" time estimation should take +the previous analyses of coredumps with the same corresponding package +name into account. The server should store a simple history in a SQLite +database to know how long it takes to generate a backtrace for a certain +package. 
It could be as simple as this: - initialization step one: +"CREATE TABLE package_time (id INTEGER PRIMARY KEY AUTOINCREMENT, +package, release, time)"; we need the 'id' for the database cleanup - +to know the insertion order of rows, so the "AUTOINCREMENT" is +important here; the 'package' is the package name without the version +and release numbers, the 'release' column stores the operating system, +and the 'time' is the number of seconds it took to generate the +backtrace - initialization step two: "CREATE INDEX package_release ON +package_time (package, release)"; we compute the time only for a single +package on a single supported OS release per query, so it makes sense to +create an index to speed it up - when a task is finished: "INSERT INTO +package_time (package, release, time) VALUES ('??', '??', '??')" - to +get the average time: "SELECT AVG(time) FROM package_time WHERE +package == '??' AND release == '??'"; the arithmetic mean seems to be +sufficient here + +So the server knows that crashes from an OpenOffice.org package take +5 minutes to process on average, and it can return the value 300 +(seconds) in the field. The client does not waste time asking about +that task every 20 seconds, but the first status request comes after +300 seconds. And even when the package changes (rebases etc.), the +database provides good estimations after some time (the '2.5 Task cleanup' +chapter describes how the data are pruned). + +The server response HTTP body is generated and sent gradually as the +task is performed. The client chooses either to receive the body, or to +terminate after getting all headers and ask for the status and backtrace +asynchronously. + +The server re-sends the output of abrt-retrace-worker (its stdout and +stderr) to the response body. In addition, a line with the task +status is added in the form `X-Task-Status: PENDING` to the body every +5 seconds. When the worker process ends, either a FINISHED_SUCCESS or a +FINISHED_FAILURE status line is sent. 
If it's FINISHED_SUCCESS, the +backtrace is attached after this line. Then the response body is +closed. + +---------------------------------------------------------------------- +2.2 Task status +---------------------------------------------------------------------- + +A client might request a task status by sending a HTTP GET request to +the https://someserver/<id> URL, where <id> is the numerical task id +returned in the "X-Task-Id" field by https://someserver/create. If the +<id> is not in the valid format, or the task <id> does not exist, the +server must return the "404 Not Found" HTTP error code. + +The client request must contain the "X-Task-Password" field, and its +content must match the password stored in the +/var/spool/abrt-retrace/<id>/password file. If the password is not +valid, the server must return the "403 Forbidden" HTTP error code. + +If the checks pass, the server returns the "200 OK" HTTP code, and +includes a field "X-Task-Status" containing one of the following +values: "FINISHED_SUCCESS", "FINISHED_FAILURE", "PENDING". + +The field contains "FINISHED_SUCCESS" if the file +/var/spool/abrt-retrace/<id>/backtrace exists. The client might get +the backtrace on the https://someserver/<id>/backtrace URL. The log +can be downloaded from the https://someserver/<id>/log URL, and it +might contain warnings about some missing debuginfos etc. + +The field contains "FINISHED_FAILURE" if the file +/var/spool/abrt-retrace/<id>/backtrace does not exist, but the file +/var/spool/abrt-retrace/<id>/retrace-log exists. The retrace-log file +containing error messages can be downloaded by the client from the +https://someserver/<id>/log URL. + +The field contains "PENDING" if neither file exists. The client should +ask again after 10 seconds or later. 
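The status rules above reduce to checking which files exist in the task directory, as the following sketch shows (it assumes the spool layout described in this document; the 404 check for a nonexistent <id> is assumed to happen before this function is called):

```python
import os

SPOOL = "/var/spool/abrt-retrace"  # spool directory from this document

def task_status(task_id, spool=SPOOL):
    """Determine the X-Task-Status value for an existing task.

    FINISHED_SUCCESS if the backtrace file exists, FINISHED_FAILURE if
    only the retrace-log exists, PENDING if neither exists yet.
    """
    task_dir = os.path.join(spool, str(task_id))
    if os.path.exists(os.path.join(task_dir, "backtrace")):
        return "FINISHED_SUCCESS"
    if os.path.exists(os.path.join(task_dir, "retrace-log")):
        return "FINISHED_FAILURE"
    return "PENDING"
```

Deriving the status from the filesystem keeps the HTTP script stateless: the worker communicates with it solely by creating files.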
+ +---------------------------------------------------------------------- +2.3 Requesting a backtrace +---------------------------------------------------------------------- + +A client might request a backtrace by sending a HTTP GET request to +the https://someserver/<id>/backtrace URL, where <id> is the numerical +task id returned in the "X-Task-Id" field by +https://someserver/create. If the <id> is not in the valid format, or +the task <id> does not exist, the server must return the "404 Not +Found" HTTP error code. + +The client request must contain the "X-Task-Password" field, and its +content must match the password stored in the +/var/spool/abrt-retrace/<id>/password file. If the password is not +valid, the server must return the "403 Forbidden" HTTP error code. + +If the file /var/spool/abrt-retrace/<id>/backtrace does not exist, the +server must return the "404 Not Found" HTTP error code. Otherwise it +returns the file contents, and the "Content-Type" field must contain +"text/plain". + +---------------------------------------------------------------------- +2.4 Requesting a log +---------------------------------------------------------------------- + +A client might request a task log by sending a HTTP GET request to the +https://someserver/<id>/log URL, where <id> is the numerical task id +returned in the "X-Task-Id" field by https://someserver/create. If the +<id> is not in the valid format, or the task <id> does not exist, the +server must return the "404 Not Found" HTTP error code. + +The client request must contain the "X-Task-Password" field, and its +content must match the password stored in the +/var/spool/abrt-retrace/<id>/password file. If the password is not +valid, the server must return the "403 Forbidden" HTTP error code. + +If the file /var/spool/abrt-retrace/<id>/retrace-log does not exist, +the server must return the "404 Not Found" HTTP error code. 
Otherwise +it returns the file contents, and the "Content-Type" field must +contain "text/plain". + +---------------------------------------------------------------------- +2.5 Task cleanup +---------------------------------------------------------------------- + +Tasks that were created more than 5 days ago must be deleted, because +tasks occupy disk space (not so much space, because the coredumps are +deleted after the retrace, and only backtraces and configuration +remain). A shell script "abrt-retrace-clean" must check the creation +time and delete the directories in /var/spool/abrt-retrace. It is +expected that the server administrator sets up cron to call the script +once a day. This assumption must be mentioned in the +abrt-retrace-clean manual page. + +The database containing packages and processing times should also be +regularly pruned to remain small and provide data quickly. The cleanup +script should delete some rows for packages with too many entries: +a. get a list of packages from the database: "SELECT DISTINCT package, + release FROM package_time" +b. for every package, get the row count: "SELECT COUNT(*) FROM + package_time WHERE package == '??' AND release == '??'" +c. for every package with a row count larger than 100, some rows + must be removed so that only the newest 100 rows remain in the + database: + - to get the highest row id which should be deleted, execute "SELECT id + FROM package_time WHERE package == '??' AND release == '??' ORDER + BY id LIMIT 1 OFFSET ??", where the OFFSET is the total number of + rows for that single package minus 100 + - then all the old rows can be deleted by executing "DELETE FROM + package_time WHERE package == '??' AND release == '??' AND id <= + ??" + +---------------------------------------------------------------------- +2.6 Limiting traffic +---------------------------------------------------------------------- + +The maximum number of simultaneously running tasks must be limited to +20 by the server. 
The limit must be changeable from the server +configuration file. If a new request comes when the server is fully +occupied, the server must return the "503 Service Unavailable" HTTP +error code. + +The archive extraction, chroot preparation, and gdb analysis are mostly +limited by the hard drive size and speed. + +---------------------------------------------------------------------- +3. Retrace worker +---------------------------------------------------------------------- + +The worker (abrt-retrace-worker binary) gets a +/var/spool/abrt-retrace/<id> directory as an input. The worker reads +the operating system name and version, the coredump, and the list of +packages needed for retracing (a package containing the binary which +crashed, and packages with the libraries that are used by the binary). + +The worker prepares a new "chroot" subdirectory with the packages, +their debuginfo, and gdb installed. In other words, a new directory +/var/spool/abrt-retrace/<id>/chroot is created and the packages are +unpacked or installed into this directory, so for example the gdb ends +up as /var/.../<id>/chroot/usr/bin/gdb. + +After the "chroot" subdirectory is prepared, the worker moves the +coredump there and changes the root directory of a child script there +(using the chroot system function). The child script runs the gdb on the +coredump, and the gdb sees the corresponding crashed binary, all the +debuginfo and all the proper versions of libraries in the right places. + +When the gdb run is finished, the worker copies the resulting +backtrace to the /var/spool/abrt-retrace/<id>/backtrace file and +stores a log from the whole chroot process to the retrace-log file in +the same directory. Then it removes the chroot directory. 
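The worker flow just described might be sketched as follows. This is a sketch only: the chroot preparation step and the in-chroot script path (`/usr/bin/retrace-gdb-wrapper`) are placeholders, and the `run` callable is injectable so the flow can be exercised without root privileges:

```python
import os
import shutil
import subprocess

def run_retrace(task_dir, prepare_chroot, run=subprocess.call):
    """Sketch of the abrt-retrace-worker flow for one task directory."""
    chroot = os.path.join(task_dir, "chroot")
    prepare_chroot(chroot)  # unpack packages, debuginfo and gdb into it
    # move the coredump inside so the chrooted gdb can open it
    shutil.move(os.path.join(task_dir, "coredump"),
                os.path.join(chroot, "coredump"))
    # the child runs with its root changed to the chroot directory,
    # so gdb sees the crashed binary and libraries at their usual paths;
    # stdout and stderr become the retrace-log file
    with open(os.path.join(task_dir, "retrace-log"), "w") as log:
        run(["chroot", chroot, "/usr/bin/retrace-gdb-wrapper"],
            stdout=log, stderr=subprocess.STDOUT)
    backtrace = os.path.join(chroot, "backtrace")
    if os.path.exists(backtrace):
        shutil.copy(backtrace, os.path.join(task_dir, "backtrace"))
    shutil.rmtree(chroot)  # coredump and chroot contents are discarded
```

Note that removing the chroot also deletes the coredump, which is what releases most of the disk space after a task finishes.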
+ +The GDB installed into the chroot must be able to: +- run on the server (same architecture, or we can use QEMU user space + emulation, see + http://wiki.qemu.org/download/qemu-doc.html#QEMU-User-space-emulator) +- process the coredump (possibly from another architecture): that + means we need a special GDB for every supported architecture +- be able to handle coredumps created in an environment with prelink + enabled (should not be a problem, see + http://sourceware.org/ml/gdb/2009-05/msg00175.html) +- use libc, zlib, readline, ncurses, expat and Python packages, while + the version numbers required by the coredump might be different from + what is required by the GDB + +The gdb might fail to run with certain combinations of package +dependencies. Nevertheless, we need to provide the libc/Python/* +package versions which are required by the coredump. If we did not +do that, the backtraces generated from such an environment would be of +lower quality. Consider a coredump which was caused by a crash of a +Python application on a client, and which we analyze on the retrace +server with a completely different version of Python because the +client's Python version is not compatible with our GDB. + +We can solve the issue by installing the GDB package dependencies +first, moving their binaries to some safe place (/lib/gdb in the +chroot), and creating the /etc/ld.so.preload file pointing to that +place, or setting LD_LIBRARY_PATH. Then we can unpack libc binaries and +other packages and their versions as required by the coredump to the +common paths, and the GDB would run happily, using the libraries from +/lib/gdb and not those from /lib and /usr/lib. This approach can use +standard GDB builds with various target architectures: gdb, gdb-i386, +gdb-ppc64, gdb-s390 (nonexistent in Fedora/EPEL at the time of writing +this). + +The GDB and its dependencies are stored separately from the packages +used as data for coredump processing. 
A single combination of GDB and +its dependencies can be used across all supported OS releases to generate +backtraces. + +The retrace worker must be able to prepare a chroot-ready environment +for a certain supported operating system, which is different from the +retrace server's operating system. It needs to fake the /dev directory +and create some basic files in /etc like passwd and hosts. We can use +the "mock" library (https://fedorahosted.org/mock/) to do that, as it +does almost what we need (but not exactly, as it has a strong focus on +preparing the environment for rpmbuild and running it), or we can come +up with our own solution, while borrowing some code from the mock +library. The /usr/bin/mock executable is of no use to the +retrace server, but the underlying Python library can be used. So if we +would like to use mock, an ABRT-specific interface to the mock library +must be written, or the retrace worker must be written in Python and +use the mock Python library directly. + +We should save time and disk space by extracting only binaries and +dynamic libraries from the packages for the coredump analysis, and +omitting all other files. We can save even more time and disk space by +extracting only the libraries and binaries really referenced by the +coredump (eu-unstrip tells us the list). Packages should not be +_installed_ to the chroot, they should be _extracted_ only, because we +use them as a data source, and we never run them. + +Another idea to be considered is that we can avoid the package +extraction if we can teach GDB to read the dynamic libraries, the +binary, and the debuginfo directly from the RPM packages. We would +provide a backend to GDB which can do that, and provide a tiny front-end +program which tells the backend which RPMs it should use and then runs +the GDB command loop. The result would be a GDB wrapper/extension we +need to maintain, but it should end up pretty small. 
We would use +Python to write our extension, as we do not want to (inelegantly) +maintain a patch against the GDB core. We need to ask GDB people if the +Python interface is capable of handling this idea, and how much work +it would be to implement it. + +---------------------------------------------------------------------- +4. Package repository +---------------------------------------------------------------------- + +We should support every Fedora release with all packages that ever +made it to the updates and updates-testing repositories. In order to +provide all those packages, a local repository is maintained for every +supported operating system. The debuginfos might be provided by a +debuginfo server in the future (it will save the server disk space). We +should support the usage of local debuginfo first, and add the +debuginfofs support later. + +A repository with Fedora packages must be maintained locally on the +server to provide good performance and to provide data from older +packages already removed from the official repositories. We need a +package downloader, which scans Fedora servers for new packages, and +downloads them so they are immediately available. + +Older versions of packages are regularly deleted from the updates and +updates-testing repositories. We must support older versions of +packages, because that is one of the two major pain points that the +retrace server is supposed to solve (the other one is the slowness of +debuginfo download and debuginfo disk space requirements). + +A script abrt-reposync must download packages from Fedora +repositories, but it must not delete older versions of the +packages. The retrace server administrator is supposed to call this +script using cron every ~6 hours. This expectation must be documented +in the abrt-reposync manual page. The script can use wget, rsync, +or the reposync tool to get the packages. 
The remote yum source +repositories must be configured from a configuration file or files +(/etc/yum.repos.d might be used). + +When the abrt-reposync is used to sync with the Rawhide repository, +unneeded packages (where a newer version exists) must be removed after +residing in the same repository with the newer package for one week. + +All the unneeded content from the newly downloaded packages should be +removed to save disk space and speed up chroot creation. We need just +the binaries and dynamic libraries, and that is a tiny part of the package +contents. + +The packages should be downloaded to a local repository in +/var/cache/abrt-repo/{fedora12,fedora12-debuginfo,...}. + +---------------------------------------------------------------------- +5. Traffic and load estimation +---------------------------------------------------------------------- + +2500 bugs are reported from ABRT every month. Approximately 7.3% of +those are Python exceptions, which don't need a retrace server. That +means that 2315 bugs need a retrace server. That is 77 bugs per day, +or about 3.2 bugs every hour on average. Occasional spikes might be much +higher (imagine a user who decided to report all his 8 crashes from +last month). + +We should probably not try to predict whether the monthly bug count goes up +or down. New, untested versions of software are added to Fedora, but +on the other hand most software matures and becomes less crashy. So +let's assume that the bug count stays approximately the same. 
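The daily and hourly rates follow directly from the monthly count; a quick check of the arithmetic (30-day months assumed):

```python
monthly = 2315               # retraceable bugs per month, per the text
per_day = monthly / 30.0     # ~77 bugs per day
per_hour = per_day / 24.0    # ~3.2 bugs per hour
print("%.0f bugs/day, %.1f bugs/hour" % (per_day, per_hour))
# prints "77 bugs/day, 3.2 bugs/hour"
```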
+ +Test crashes (see that we should probably use `xz -2` to compress +coredumps): +- firefox with 7 tabs with random pages opened + - coredump size: 172 MB + - xz: + - compression level 6 - default: + - compression time on my machine: 32.5 sec + - compressed coredump: 5.4 MB + - decompression time: 2.7 sec + - compression level 3: + - compression time on my machine: 23.4 sec + - compressed coredump: 5.6 MB + - decompression time: 1.6 sec + - compression level 2: + - compression time on my machine: 6.8 sec + - compressed coredump: 6.1 MB + - decompression time: 3.7 sec + - compression level 1: + - compression time on my machine: 5.1 sec + - compressed coredump: 6.4 MB + - decompression time: 2.4 sec + - gzip: + - compression level 9 - highest: + - compression time on my machine: 7.6 sec + - compressed coredump: 7.9 MB + - decompression time: 1.5 sec + - compression level 6 - default: + - compression time on my machine: 2.6 sec + - compressed coredump: 8 MB + - decompression time: 2.3 sec + - compression level 3: + - compression time on my machine: 1.7 sec + - compressed coredump: 8.9 MB + - decompression time: 1.7 sec +- thunderbird with thousands of emails opened + - coredump size: 218 MB + - xz: + - compression level 6 - default: + - compression time on my machine: 60 sec + - compressed coredump size: 12 MB + - decompression time: 3.6 sec + - compression level 3: + - compression time on my machine: 42 sec + - compressed coredump size: 13 MB + - decompression time: 3.0 sec + - compression level 2: + - compression time on my machine: 10 sec + - compressed coredump size: 14 MB + - decompression time: 3.0 sec + - compression level 1: + - compression time on my machine: 8.3 sec + - compressed coredump size: 15 MB + - decompression time: 3.2 sec + - gzip + - compression level 9 - highest: + - compression time on my machine: 14.9 sec + - compressed coredump size: 18 MB + - decompression time: 2.4 sec + - compression level 6 - default: + - compression time on my machine: 
4.4 sec + - compressed coredump size: 18 MB + - decompression time: 2.2 sec + - compression level 3: + - compression time on my machine: 2.7 sec + - compressed coredump size: 20 MB + - decompression time: 3 sec +- evince with 2 pdfs (1 and 42 pages) opened: + - coredump size: 73 MB + - xz: + - compression level 2: + - compression time on my machine: 2.9 sec + - compressed coredump size: 3.6 MB + - decompression time: 0.7 sec + - compression level 1: + - compression time on my machine: 2.5 sec + - compressed coredump size: 3.9 MB + - decompression time: 0.7 sec +- OpenOffice.org Impress with 25 pages presentation: + - coredump size: 116 MB + - xz: + - compression level 2: + - compression time on my machine: 7.1 sec + - compressed coredump size: 12 MB + - decompression time: 2.3 sec + +So let's imagine there are some users that want to report their +crashes approximately at the same time. Here is what the retrace +server must handle: +- 2 OpenOffice crashes +- 2 evince crashes +- 2 thunderbird crashes +- 2 firefox crashes + +We will use the xz archiver with the compression level 2 on the ABRT's +side to compress the coredumps. So the users spend 53.6 seconds in +total packaging the coredumps. + +The packaged coredumps have 71.4 MB, and the retrace server must +receive that data. + +The server unpacks the coredumps (perhaps in the same time), so they +need 1158 MB of disk space on the server. The decompression will take +19.4 seconds. + +Several hundred megabytes will be needed to install all the required +binaries and debuginfos for every chroot (8 chroots 1 GB each = 8 GB, +but this seems like an extreme, maximal case). Some space will be +saved by using a debuginfofs. + +Note that most applications are not as heavyweight as OpenOffice and +Firefox. + +---------------------------------------------------------------------- +6. 
Security +---------------------------------------------------------------------- + +The retrace server communicates with two other entities: it accepts +coredumps from users, and it downloads debuginfos and packages from +distribution repositories. + +General security from GDB flaws and malicious data is provided by the +chroot. The GDB accesses the debuginfos, packages, and the coredump +from within the chroot, unable to access the retrace server's +environment. We should consider setting a disk quota for every chroot +directory, and limiting the GDB access to resources using cgroups. + +An SELinux policy should be written for both the retrace server's HTTP +interface and the retrace worker. + +---------------------------------------------------------------------- +6.1 Clients +---------------------------------------------------------------------- + +The clients, which are using the retrace server and sending coredumps +to it, must fully trust the retrace server administrator. The server +administrator must not try to get sensitive data from client +coredumps. That seems to be a major weak point of the retrace server +idea. However, users of an operating system already trust the OS +provider in various important matters. So when the retrace server is +operated by the operating system provider, that might be acceptable to +users. + +We cannot avoid sending clients' coredumps to the retrace server, if +we want to generate quality backtraces containing the values of +variables. Minidumps are not an acceptable solution, as they lower the +quality of the resulting backtraces, while not improving user +security. + +Can the retrace server trust clients? We must know what a +malicious client can achieve by crafting a nonstandard coredump, which +will be processed by the server's GDB. We should ask GDB experts about +this. + +Another question is whether we can allow users to provide some packages +and debuginfo together with a coredump. 
That might be useful for
users who run the operating system with only some minor modifications
and still want to use the retrace server. They would send a coredump
together with a few nonstandard packages, and the retrace server would
use the nonstandard packages together with the OS packages to generate
the backtrace. Is it safe? We must know what a malicious client can
achieve by crafting a special binary and debuginfo that will be
processed by the server's GDB.

----------------------------------------------------------------------
6.2 Packages and debuginfo
----------------------------------------------------------------------

We can safely download packages and debuginfo from the distribution,
as the packages are signed by the distribution, and the package origin
can be verified.

When the debuginfo server is done, the retrace server can safely use
it, as the data will also be signed.

----------------------------------------------------------------------
7. Future work
----------------------------------------------------------------------

1. Coredump stripping. Jan Kratochvil: With my test of an
OpenOffice.org presentation, the kernel core file has 181 MB, and xz
-2 of it has 65 MB. According to `set target debug 1', GDB reads only
131406 bytes of it (incl. the NOTE segment).

2. User management for the HTTP interface. We need multiple
authentication sources (x509 for RHEL).

3. Make the architecture, release, and packages files, which must be
included in the archive when creating a task, optional. Allow
uploading a coredump without involving tar: just coredump,
coredump.gz, or coredump.xz.

4.
Handle non-standard packages (provided by the user)
diff --git a/doc/retrace-server.xhtml b/doc/retrace-server.xhtml
new file mode 100644
index 00000000..3c24ddaf
--- /dev/null
+++ b/doc/retrace-server.xhtml
@@ -0,0 +1,860 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN"
"http://www.w3.org/TR/MathML2/dtd/xhtml-math11-f.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Retrace server design</title>
</head>
<body>
<h1>Retrace server design</h1>

<p>The retrace server provides a coredump analysis and backtrace
generation service over a network using the HTTP protocol.</p>

<table id="toc" class="toc">
<tr>
<td>
<div id="toctitle">
  <h2>Contents</h2>
</div>
<ul>
<li><a href="#overview">1 Overview</a></li>
<li><a href="#http_interface">2 HTTP interface</a>
  <ul>
  <li><a href="#creating_a_new_task">2.1 Creating a new task</a></li>
  <li><a href="#task_status">2.2 Task status</a></li>
  <li><a href="#requesting_a_backtrace">2.3 Requesting a backtrace</a></li>
  <li><a href="#requesting_a_log">2.4 Requesting a log file</a></li>
  <li><a href="#task_cleanup">2.5 Task cleanup</a></li>
  <li><a href="#limiting_traffic">2.6 Limiting traffic</a></li>
  </ul>
</li>
<li><a href="#retrace_worker">3 Retrace worker</a></li>
<li><a href="#package_repository">4 Package repository</a></li>
<li><a href="#traffic_and_load_estimation">5 Traffic and load estimation</a></li>
<li><a href="#security">6 Security</a>
  <ul>
  <li><a href="#clients">6.1 Clients</a></li>
  <li><a href="#packages_and_debuginfo">6.2 Packages and debuginfo</a></li>
  </ul>
</li>
<li><a href="#future_work">7 Future work</a></li>
</ul>
</td>
</tr>
</table>

<h2><a name="overview">Overview</a></h2>

<p>A client sends a coredump (created by the Linux kernel) together
with some additional information to the server, and gets a backtrace
generation task ID in response.
Then the client, after some time, asks
the server for the task status, and when the task is done (a backtrace
has been generated from the coredump), the client downloads the
backtrace. If the backtrace generation fails, the client gets an error
code and downloads a log indicating what happened. Alternatively, the
client sends a coredump, and keeps receiving the server response
message. The server then, via the response's body, periodically sends
the status of the task, and delivers the resulting backtrace as soon
as it's ready.</p>

<p>The retrace server must be able to support multiple operating
systems and their releases (Fedora N-1, N, Rawhide, Branched Rawhide,
RHEL), and multiple architectures within a single installation.</p>

<p>The retrace server consists of the following parts:</p>
<ol>
<li>abrt-retrace-server: an HTTP interface script handling the
communication with clients, and task creation and management</li>
<li>abrt-retrace-worker: a program doing the environment preparation
and coredump processing</li>
<li>package repository: a repository placed on the server containing
all the application binaries, libraries, and debuginfo necessary for
backtrace generation</li>
</ol>

<h2><a name="http_interface">HTTP interface</a></h2>

<p>The HTTP interface application is a script written in Python. The
script is named <code>abrt-retrace-server</code>, and it uses the
<a href="http://www.python.org/dev/peps/pep-0333/">Python Web Server
Gateway Interface</a> (WSGI) to interact with the web server.
Administrators may use
<a href="http://code.google.com/p/modwsgi/">mod_wsgi</a> to run
<code>abrt-retrace-server</code> on Apache. mod_wsgi is part of
both Fedora 12 and RHEL 6.
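</p>

<p>In WSGI terms, the whole interface is a single callable that
receives the request environment and a <code>start_response</code>
function. The following minimal sketch (hypothetical code for
illustration; not the actual <code>abrt-retrace-server</code> source)
shows the shape of such a script:</p>

```python
# Minimal WSGI application sketch (hypothetical illustration, not the
# actual abrt-retrace-server source). mod_wsgi looks for a
# module-level callable named "application".
def application(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path == "/create":
        # Task creation is only allowed via the POST method.
        if environ.get("REQUEST_METHOD") == "POST":
            status = "201 Created"
        else:
            status = "405 Method Not Allowed"
    else:
        status = "404 Not Found"
    body = status.encode("ascii")
    start_response(status, [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
    return [body]
```

<p>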
The Python language is a good choice for
this application, because it supports HTTP handling well, and it is
already used in ABRT.</p>

<p>Only secure (HTTPS) communication must be allowed with
<code>abrt-retrace-server</code>, because coredumps and backtraces are
private data. Users may decide to publish their backtraces in a bug
tracker after reviewing them, but the retrace server doesn't do
that. The HTTPS requirement must be specified in the server's man
page. The server must support HTTP persistent connections to avoid
frequent SSL renegotiations. The server's manual page should include a
recommendation for the administrator to check that persistent
connections are enabled.</p>

<h3><a name="creating_a_new_task">Creating a new task</a></h3>

<p>A client may create a new task by sending an HTTP request to the
https://server/create URL, and providing an archive as the request
content. The archive must contain crash data files. The crash data
files are a subset of the local /var/spool/abrt/ccpp-time-pid/
directory contents, so the client only has to pack and upload
them.</p>

<p>The server must support uncompressed tar archives, and tar archives
compressed with gzip and xz. Uncompressed archives are the most
efficient way for local network delivery, and gzip can be used there
as well because of its good compression speed.</p>

<p>The xz compression file format is well suited for a public server
setup (slow network), as it provides a good compression ratio, which
is important for compressing large coredumps, and it provides
reasonable compression/decompression speed and memory consumption (see
the chapter
<a href="#traffic_and_load_estimation">Traffic and load estimation</a>
for the measurements). The <code>XZ Utils</code> implementation with
the compression level 2 should be used to compress the data.</p>

<p>The HTTP request for a new task must use the POST method.
It must
contain proper <code>Content-Length</code> and
<code>Content-Type</code> fields. If the method is not POST, the
server must return the "405 Method Not Allowed" HTTP error code. If
the <code>Content-Length</code> field is missing, the server must
return the "411 Length Required" HTTP error code. If a
<code>Content-Type</code> other than <code>application/x-tar</code>,
<code>application/x-gzip</code>, or <code>application/x-xz</code> is
used, the server must return the "415 Unsupported Media Type" HTTP
error code. If the <code>Content-Length</code> value is greater than a
limit set in the server configuration file (30 MB by default), or the
real HTTP request size gets larger than the limit + 10 KB for headers,
then the server must return the "413 Request Entity Too Large" HTTP
error code, and provide an explanation, including the limit, in the
response body. The limit must be changeable from the server
configuration file.</p>

<p>If there is less than 20 GB of free disk space in the
<code>/var/spool/abrt-retrace</code> directory, the server must return
the "507 Insufficient Storage" HTTP error code. The server must return
the same HTTP error code if decompressing the received archive would
cause the free disk space to become less than 20 GB. The 20 GB limit
must be changeable from the server configuration file.</p>

<p>If the data from the received archive would take more than 500 MB
of disk space when uncompressed, the server must return the "413
Request Entity Too Large" HTTP error code, and provide an explanation,
including the limit, in the response body. The size limit must be
changeable from the server configuration file. It can be set pretty
high because the coredumps, which take most of the disk space, are
stored on the server only temporarily until the backtrace is
generated.
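</p>

<p>The request checks just described can be sketched as a small
validation helper (hypothetical code mirroring the rules above, not
the actual server source):</p>

```python
# Hypothetical sketch of the /create request validation described
# above. Returns an HTTP status string on failure, or None when the
# request passes all header checks.
MAX_ARCHIVE_BYTES = 30 * 1024 * 1024   # 30 MB default, configurable
ALLOWED_TYPES = ("application/x-tar", "application/x-gzip",
                 "application/x-xz")

def validate_create_request(method, content_type, content_length):
    if method != "POST":
        return "405 Method Not Allowed"
    if content_length is None:
        return "411 Length Required"
    if content_type not in ALLOWED_TYPES:
        return "415 Unsupported Media Type"
    if int(content_length) > MAX_ARCHIVE_BYTES:
        return "413 Request Entity Too Large"
    return None   # the request is acceptable so far
```

<p>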
When the
backtrace has been generated, the coredump is deleted by
<code>abrt-retrace-worker</code>, so most of the disk space is
released.</p>

<p>The uncompressed data size for xz archives can be obtained by
calling <code>`xz --list file.tar.xz`</code>. The <code>--list</code>
option has been implemented only recently, so it might be necessary to
implement a method to get the uncompressed data size by extracting the
archive to the stdout and counting the extracted bytes, and to call
this method if <code>--list</code> doesn't work on the
server. Likewise, the uncompressed data size for gzip archives can be
obtained by calling <code>`gzip --list file.tar.gz`</code>.</p>

<p>If an upload from a client succeeds, the server creates a new
directory <code>/var/spool/abrt-retrace/<id></code> and extracts
the received archive into it. Then it checks that the directory
contains all the required files, checks their sizes, and then sends an
HTTP response. After that it spawns a subprocess running
<code>abrt-retrace-worker</code> on that directory.</p>

<p>To support multiple architectures, the retrace server needs a GDB
package compiled separately for every supported target architecture
(see the avr-gdb package in Fedora for an example). This is a
technically and economically better solution than using a standalone
machine for every supported architecture and resending coredumps
depending on the client's architecture. However, GDB's support for
using a target architecture different from the host architecture seems
to be fragile. If it doesn't work, QEMU user-mode emulation should be
tried as an alternative approach.</p>

<p>The following files from the local crash directory are required to
be present in the archive: <code>coredump</code>,
<code>architecture</code>, <code>release</code>, and
<code>packages</code> (this one does not exist yet).
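</p>

<p>Returning to the size check above, obtaining the uncompressed size
with a fallback to counting the extracted bytes might look like this
(hypothetical helper, not the actual server source; it assumes the
<code>xz</code> binary is in the PATH):</p>

```python
# Hypothetical sketch: uncompressed size of an .xz file, preferring
# "xz --robot --list" and falling back to streaming decompression.
import subprocess

def xz_uncompressed_size(path):
    try:
        # The machine-readable listing has a "totals" record whose
        # fifth column is the uncompressed size in bytes.
        out = subprocess.check_output(["xz", "--robot", "--list", path])
        for line in out.decode().splitlines():
            fields = line.split("\t")
            if fields[0] == "totals":
                return int(fields[4])
    except (subprocess.CalledProcessError, OSError):
        pass
    # Fallback for xz versions without --list: extract to stdout and
    # count the bytes.
    proc = subprocess.Popen(["xz", "--decompress", "--stdout", path],
                            stdout=subprocess.PIPE)
    size = 0
    while True:
        chunk = proc.stdout.read(65536)
        if not chunk:
            break
        size += len(chunk)
    proc.wait()
    return size
```

<p>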
If one or more files are not present in
the archive, or some other file is present in the archive, the server
must return the "403 Forbidden" HTTP error code. If the size of any
file except the coredump exceeds 100 KB, the server must return the
"413 Request Entity Too Large" HTTP error code, and provide an
explanation, including the limit, in the response body. The 100 KB
limit must be changeable from the server configuration file.</p>

<p>If the file check succeeds, the server HTTP response must have the
"201 Created" HTTP code. The response must include the following HTTP
header fields:</p>
<ul>
<li><code>X-Task-Id</code> containing a new server-unique numerical
task id</li>
<li><code>X-Task-Password</code> containing a newly generated
password, required to access the result</li>
<li><code>X-Task-Est-Time</code> containing the number of seconds the
server estimates it will take to generate the backtrace</li>
</ul>

<p>The <code>X-Task-Password</code> is a random alphanumeric
(<code>[a-zA-Z0-9]</code>) sequence 22 characters long. 22
alphanumeric characters correspond to a 128-bit password,
because <code>[a-zA-Z0-9]</code> = 62 characters, and 2<sup>128</sup>
< 62<sup>22</sup>. The source of randomness must be, directly or
indirectly, <code>/dev/urandom</code>. The
<code>rand()</code> function from glibc and similar functions from
other libraries cannot be used because of their poor characteristics
(in several aspects). The password must be stored in the
<code>/var/spool/abrt-retrace/<id>/password</code> file, so
passwords sent by a client in subsequent requests can be verified.</p>

<p>The task id is intentionally not used as a password, because it is
desirable to keep the id readable and memorable for
humans.
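</p>

<p>A sketch of the password generation (hypothetical helper;
<code>random.SystemRandom</code> is backed by the kernel's randomness,
i.e. <code>/dev/urandom</code>, satisfying the requirement above):</p>

```python
# Hypothetical sketch of generating the 22-character alphanumeric
# X-Task-Password. random.SystemRandom draws from os.urandom, i.e.
# /dev/urandom, rather than glibc's rand().
import random
import string

ALPHABET = string.ascii_letters + string.digits   # [a-zA-Z0-9], 62 chars

def generate_task_password(length=22):
    rng = random.SystemRandom()
    return "".join(rng.choice(ALPHABET) for _ in range(length))
```

<p>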
Password-like ids would be a loss when a user authentication
mechanism is added, and the server-generated password will no longer
be necessary.</p>

<p>The algorithm for the <code>X-Task-Est-Time</code> time estimation
should take the previous analyses of coredumps with the same
corresponding package name into account. The server should store a
simple history in an SQLite database to know how long it takes to
generate a backtrace for a certain package. It could be as simple as
this:</p>
<ul>
  <li>initialization step one: <code>CREATE TABLE package_time (id
  INTEGER PRIMARY KEY AUTOINCREMENT, package, release, time)</code>;
  we need the <code>id</code> for the database cleanup - to know the
  insertion order of rows, so the <code>AUTOINCREMENT</code> is
  important here; the <code>package</code> is the package name without
  the version and release numbers, the <code>release</code> column
  stores the operating system, and the <code>time</code> is the number
  of seconds it took to generate the backtrace</li>
  <li>initialization step two: <code>CREATE INDEX package_release ON
  package_time (package, release)</code>; we compute the time only for
  a single package on a single supported OS release per query, so it
  makes sense to create an index to speed it up</li>
  <li>when a task is finished: <code>INSERT INTO package_time
  (package, release, time) VALUES ('??', '??', '??')</code></li>
  <li>to get the average time: <code>SELECT AVG(time) FROM
  package_time WHERE package == '??' AND release == '??'</code>; the
  arithmetic mean seems to be sufficient here</li>
</ul>

<p>So the server knows that crashes from an OpenOffice.org package
take 5 minutes to process on average, and it can return the value 300
(seconds) in the field. The client then does not waste time asking
about that task every 20 seconds, but sends the first status request
after 300 seconds.
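</p>

<p>Wrapped in Python's <code>sqlite3</code> module, the schema and
queries above might look like this (hypothetical sketch;
<code>release</code> is quoted in the code because it is an SQL
keyword):</p>

```python
# Hypothetical sketch of the estimation history described above.
import sqlite3

def init_db(conn):
    conn.execute('CREATE TABLE IF NOT EXISTS package_time '
                 '(id INTEGER PRIMARY KEY AUTOINCREMENT, '
                 'package, "release", time)')
    conn.execute('CREATE INDEX IF NOT EXISTS package_release '
                 'ON package_time (package, "release")')

def record_time(conn, package, release, seconds):
    conn.execute('INSERT INTO package_time (package, "release", time) '
                 'VALUES (?, ?, ?)', (package, release, seconds))

def estimate_time(conn, package, release, default=60):
    # Arithmetic mean of the recorded times, or a default for a
    # package that has never been retraced before.
    row = conn.execute('SELECT AVG(time) FROM package_time '
                       'WHERE package = ? AND "release" = ?',
                       (package, release)).fetchone()
    return row[0] if row[0] is not None else default
```

<p>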
And even when the package changes (rebases etc.), the
database provides good estimations after some time anyway
(the <a href="#task_cleanup">Task cleanup</a> chapter describes how
the data are pruned).</p>

<p>The server response HTTP <i>body</i> is generated and sent
gradually as the task is performed. The client chooses either to
receive the body, or to terminate after getting all the headers and
ask the server for the status and backtrace asynchronously.</p>

<p>The server re-sends the output of abrt-retrace-worker (its stdout
and stderr) to the response body. In addition, a line with the
task status is added in the form <code>X-Task-Status: PENDING</code>
to the body every 5 seconds. When the worker process ends,
either a <code>FINISHED_SUCCESS</code> or a
<code>FINISHED_FAILURE</code> status line is sent. If it's
<code>FINISHED_SUCCESS</code>, the backtrace is attached after this
line. Then the response body is closed.</p>

<h3><a name="task_status">Task status</a></h3>

<p>A client may request a task status by sending an HTTP GET request
to the <code>https://someserver/<id></code> URL, where
<code><id></code> is the numerical task id returned in the
<code>X-Task-Id</code> field by
<code>https://someserver/create</code>. If the
<code><id></code> is not in a valid format, or the task
<code><id></code> does not exist, the server must return the
"404 Not Found" HTTP error code.</p>

<p>The client request must contain the "X-Task-Password" field, and
its content must match the password stored in the
<code>/var/spool/abrt-retrace/<id>/password</code> file.
If the
password is not valid, the server must return the "403 Forbidden" HTTP
error code.</p>

<p>If the checks pass, the server returns the "200 OK" HTTP code, and
includes a field "X-Task-Status" containing one of the following
values: <code>FINISHED_SUCCESS</code>,
<code>FINISHED_FAILURE</code>, <code>PENDING</code>.</p>

<p>The field contains <code>FINISHED_SUCCESS</code> if the file
<code>/var/spool/abrt-retrace/<id>/backtrace</code> exists. The
client can get the backtrace from the
<code>https://someserver/<id>/backtrace</code> URL. The log
can be obtained from the
<code>https://someserver/<id>/log</code> URL, and it might
contain warnings about missing debuginfos etc.</p>

<p>The field contains <code>FINISHED_FAILURE</code> if the file
<code>/var/spool/abrt-retrace/<id>/backtrace</code> does not
exist, and the file
<code>/var/spool/abrt-retrace/<id>/retrace-log</code>
exists. The retrace-log file containing the error messages can be
downloaded by the client from the
<code>https://someserver/<id>/log</code> URL.</p>

<p>The field contains <code>PENDING</code> if neither file exists. The
client should ask again after 10 seconds or later.</p>

<h3><a name="requesting_a_backtrace">Requesting a backtrace</a></h3>

<p>A client may request a backtrace by sending an HTTP GET request to
the <code>https://someserver/<id>/backtrace</code> URL, where
<code><id></code> is the numerical task id returned in the
"X-Task-Id" field by <code>https://someserver/create</code>. If the
<code><id></code> is not in a valid format, or the task
<code><id></code> does not exist, the server must return the
"404 Not Found" HTTP error code.</p>

<p>The client request must contain the "X-Task-Password" field, and
its content must match the password stored in the
<code>/var/spool/abrt-retrace/<id>/password</code> file.
If the
password is not valid, the server must return the "403 Forbidden" HTTP
error code.</p>

<p>If the file /var/spool/abrt-retrace/<id>/backtrace does not
exist, the server must return the "404 Not Found" HTTP error code.
Otherwise it returns the file contents, and the "Content-Type" field
must contain "text/plain".</p>

<h3><a name="requesting_a_log">Requesting a log</a></h3>

<p>A client may request a task log by sending an HTTP GET request to
the <code>https://someserver/<id>/log</code> URL, where
<code><id></code> is the numerical task id returned in the
"X-Task-Id" field by <code>https://someserver/create</code>. If the
<code><id></code> is not in a valid format, or the task
<code><id></code> does not exist, the server must return the
"404 Not Found" HTTP error code.</p>

<p>The client request must contain the "X-Task-Password" field, and
its content must match the password stored in the
<code>/var/spool/abrt-retrace/<id>/password</code> file. If the
password is not valid, the server must return the "403 Forbidden" HTTP
error code.</p>

<p>If the file
<code>/var/spool/abrt-retrace/<id>/retrace-log</code> does not
exist, the server must return the "404 Not Found" HTTP error code.
Otherwise it returns the file contents, and the "Content-Type" must
contain "text/plain".</p>

<h3><a name="task_cleanup">Task cleanup</a></h3>

<p>Tasks that were created more than 5 days ago must be deleted,
because tasks occupy disk space (not that much space, as the coredumps
are deleted after the retrace, and only the backtraces and
configuration remain). A shell script <code>abrt-retrace-clean</code>
must check the creation time and delete the directories
in <code>/var/spool/abrt-retrace</code>. The server administrator is
expected to set up <code>cron</code> to call the script once a
day.
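</p>

<p>The document specifies a shell script; the deletion logic it has to
implement can be sketched in Python as follows (hypothetical
illustration; the directory creation time is approximated by the
modification time):</p>

```python
# Hypothetical sketch of the abrt-retrace-clean logic: delete task
# directories older than 5 days from the spool directory.
import os
import shutil
import time

MAX_AGE_SECONDS = 5 * 24 * 3600

def clean_old_tasks(spool_dir="/var/spool/abrt-retrace",
                    max_age=MAX_AGE_SECONDS):
    now = time.time()
    removed = []
    for name in os.listdir(spool_dir):
        path = os.path.join(spool_dir, name)
        # mtime approximates the task creation time here.
        if os.path.isdir(path) and now - os.path.getmtime(path) > max_age:
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

<p>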
This assumption must be mentioned in
the <code>abrt-retrace-clean</code> manual page.</p>

<p>The database containing packages and processing times should also
be regularly pruned so that it remains small and provides data
quickly. The cleanup script should delete some rows for packages with
too many entries:</p>
<ol>
<li>get a list of packages from the database: <code>SELECT DISTINCT
package, release FROM package_time</code></li>
<li>for every package, get the row count: <code>SELECT COUNT(*) FROM
package_time WHERE package == '??' AND release == '??'</code></li>
<li>for every package with a row count larger than 100, some rows
must be removed so that only the newest 100 rows remain in the
database:
<ul>
  <li>to get the oldest row id which should be kept,
  execute <code>SELECT id FROM package_time WHERE package == '??' AND
  release == '??' ORDER BY id LIMIT 1 OFFSET ??</code>, where the
  <code>OFFSET</code> is the total number of rows for that single
  package minus 100</li>
  <li>then all the older rows can be deleted by executing <code>DELETE
  FROM package_time WHERE package == '??' AND release == '??' AND id
  < ??</code></li>
</ul>
</li>
</ol>

<h3><a name="limiting_traffic">Limiting traffic</a></h3>

<p>The maximum number of simultaneously running tasks must be limited
to 20 by the server. The limit must be changeable from the server
configuration file. If a new request comes when the server is fully
occupied, the server must return the "503 Service Unavailable" HTTP
error code.</p>

<p>The archive extraction, chroot preparation, and gdb analysis are
mostly limited by the hard drive size and speed.</p>

<h2><a name="retrace_worker">Retrace worker</a></h2>

<p>The worker (<code>abrt-retrace-worker</code> binary) gets a
<code>/var/spool/abrt-retrace/<id></code> directory as
input.
The worker reads the operating system name and version, the
coredump, and the list of packages needed for retracing (a package
containing the binary which crashed, and packages with the libraries
that are used by the binary).</p>

<p>The worker prepares a new "chroot" subdirectory with the packages,
their debuginfo, and gdb installed. In other words, a new directory
<code>/var/spool/abrt-retrace/<id>/chroot</code> is created and
the packages are unpacked or installed into this directory, so for
example the gdb ends up as
<code>/var/.../<id>/chroot/usr/bin/gdb</code>.</p>

<p>After the "chroot" subdirectory is prepared, the worker moves the
coredump there and changes the root directory (using the chroot system
function) of a child script there. The child script runs gdb on the
coredump, and gdb sees the corresponding crashed binary, all the
debuginfo, and all the proper versions of the libraries in the right
places.</p>

<p>When the gdb run is finished, the worker copies the resulting
backtrace to the
<code>/var/spool/abrt-retrace/<id>/backtrace</code> file and
stores a log from the whole chroot process in the
<code>retrace-log</code> file in the same directory.
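</p>

<p>The chroot-and-run step might be sketched as follows (hypothetical
code, not the actual worker source; entering a chroot requires root
privileges, and the gdb command line is illustrative):</p>

```python
# Hypothetical sketch of running gdb confined to the prepared chroot.
import os
import subprocess

def build_gdb_command(binary, coredump):
    # Batch-mode gdb printing a full backtrace of all threads.
    return ["/usr/bin/gdb", "--batch",
            "-ex", "thread apply all backtrace full",
            binary, coredump]

def run_gdb_in_chroot(chroot_dir, binary, coredump):
    def enter_chroot():
        os.chroot(chroot_dir)   # requires root
        os.chdir("/")
    # preexec_fn runs in the child between fork and exec, so only the
    # gdb process is confined to the chroot.
    return subprocess.check_output(build_gdb_command(binary, coredump),
                                   preexec_fn=enter_chroot,
                                   stderr=subprocess.STDOUT)
```

<p>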
Then it removes
the chroot directory.</p>

<p>The GDB installed into the chroot must:</p>
<ul>
<li>run on the server (same architecture, or we can use
<a href="http://wiki.qemu.org/download/qemu-doc.html#QEMU-User-space-emulator">QEMU
user space emulation</a>)</li>
<li>process the coredump (possibly from another architecture): that
means we need a special GDB for every supported architecture</li>
<li>be able to handle coredumps created in an environment with prelink
enabled
(<a href="http://sourceware.org/ml/gdb/2009-05/msg00175.html">should
not</a> be a problem)</li>
<li>use the libc, zlib, readline, ncurses, expat, and Python packages,
while the version numbers required by the coredump might be different
from what is required by the GDB</li>
</ul>

<p>The gdb might fail to run with certain combinations of package
dependencies. Nevertheless, we need to provide the libc/Python/*
package versions which are required by the coredump. If we did not do
that, the backtraces generated from such an environment would be of
lower quality. Consider a coredump which was caused by a crash of a
Python application on a client, and which we analyze on the retrace
server with a completely different version of Python because the
client's Python version is not compatible with our GDB.</p>

<p>We can solve the issue by installing the GDB package dependencies
first, moving their binaries to some safe place (<code>/lib/gdb</code>
in the chroot), and creating an <code>/etc/ld.so.preload</code> file
pointing to that place, or setting <code>LD_LIBRARY_PATH</code>. Then
we can unpack the libc binaries and other packages and their versions
as required by the coredump to the common paths, and GDB would run
happily, using the libraries from <code>/lib/gdb</code> and not those
from <code>/lib</code> and <code>/usr/lib</code>.
This approach can
use standard GDB builds with various target architectures: gdb,
gdb-i386, gdb-ppc64, gdb-s390 (nonexistent in Fedora/EPEL at the time
of writing this).</p>

<p>The GDB and its dependencies are stored separately from the
packages used as data for coredump processing. A single combination of
GDB and its dependencies can be used across all supported operating
systems to generate backtraces.</p>

<p>The retrace worker must be able to prepare a chroot-ready
environment for a certain supported operating system, which is
different from the retrace server's operating system. It needs to fake
the <code>/dev</code> directory and create some basic files in
<code>/etc</code> like passwd and hosts. We can use
the <a href="https://fedorahosted.org/mock/">mock</a> library to do
that, as it does almost what we need (but not exactly, as it has a
strong focus on preparing the environment for rpmbuild and running
it), or we can come up with our own solution, while borrowing some
code from the mock library. The <code>/usr/bin/mock</code> executable
is of no use to the retrace server, but the underlying Python library
can be used. So if we would like to use mock, an ABRT-specific
interface to the mock library must be written, or the retrace worker
must be written in Python and use the mock Python library
directly.</p>

<p>We should save some time and disk space by extracting only the
binaries and dynamic libraries from the packages for the coredump
analysis, and omit all the other files. We can save even more time and
disk space by extracting only the libraries and binaries really
referenced by the coredump (eu-unstrip can tell us which).
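</p>

<p>Listing the modules a coredump references might be sketched like
this (hypothetical helper; the parsing assumes the one-module-per-line
output format of <code>eu-unstrip -n</code>):</p>

```python
# Hypothetical sketch: collect the build-ids and file paths that
# "eu-unstrip -n --core=..." reports for a coredump. Each output line
# has the form: "ADDR+SIZE BUILDID[@ADDR] FILE DEBUGFILE MODULENAME".
import subprocess

def parse_unstrip_line(line):
    fields = line.split()
    if len(fields) < 5:
        return None
    addr_range, build_id, path, debug_path, module = fields[:5]
    return {"build_id": build_id.split("@")[0],
            "path": path,
            "module": module}

def coredump_modules(coredump):
    out = subprocess.check_output(
        ["eu-unstrip", "-n", "--core=" + coredump]).decode()
    return [m for m in map(parse_unstrip_line, out.splitlines()) if m]
```

<p>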
Packages should not be
<em>installed</em> into the chroot, they should only be
<em>extracted</em>, because we use them as a data source, and we never
run them.</p>

<p>Another idea to be considered is that we can avoid the package
extraction if we can teach GDB to read the dynamic libraries, the
binary, and the debuginfo directly from the RPM packages. We would
provide a backend to GDB which can do that, and provide a tiny
front-end program which tells the backend which RPMs it should use and
then runs the GDB command loop. The result would be a GDB
wrapper/extension we need to maintain, but it should end up pretty
small. We would use Python to write our extension, as we do not want
to (inelegantly) maintain a patch against the GDB core. We need to ask
the GDB people if the Python interface is capable of handling this
idea, and how much work it would be to implement it.</p>

<h2><a name="package_repository">Package repository</a></h2>

<p>We should support every Fedora release with all the packages that
ever made it to the updates and updates-testing repositories. In order
to provide all those packages, a local repository is maintained for
every supported operating system. The debuginfos might be provided by
a debuginfo server in the future (it will save the server disk
space). We should support the usage of local debuginfo first, and add
the debuginfofs support later.</p>

<p>A repository with Fedora packages must be maintained locally on the
server to provide good performance and to provide data from older
packages already removed from the official repositories. We need a
package downloader, which scans the Fedora servers for new packages
and downloads them so they are immediately available.</p>

<p>Older versions of packages are regularly deleted from the updates
and updates-testing repositories.
We must support older versions of
packages, because that is one of the two major pain points that the
retrace server is supposed to solve (the other one is the slowness of
debuginfo download and the debuginfo disk space requirements).</p>

<p>A script abrt-reposync must download packages from the Fedora
repositories, but it must not delete older versions of the
packages. The retrace server administrator is supposed to call this
script using cron every ~6 hours. This expectation must be documented
in the abrt-reposync manual page. The script can use wget, rsync, or
the reposync tool to get the packages. The remote yum source
repositories must be configured from a configuration file or files
(/etc/yum.repos.d might be used).</p>

<p>When abrt-reposync is used to sync with the Rawhide repository,
unneeded packages (where a newer version exists) must be removed after
they have resided for one week alongside the newer package in the same
repository.</p>

<p>All the unneeded content from the newly downloaded packages should
be removed to save disk space and speed up chroot creation. We need
just the binaries and dynamic libraries, and that is a tiny part of
the package contents.</p>

<p>The packages should be downloaded to a local repository in
/var/cache/abrt-repo/{fedora12,fedora12-debuginfo,...}.</p>

<h2><a name="traffic_and_load_estimation">Traffic and load estimation</a></h2>

<p>2500 bugs are reported from ABRT every month. Approximately 7.3% of
them are Python exceptions, which don't need a retrace server. That
means that 2315 bugs per month need a retrace server. That is 77 bugs
per day, or 3.3 bugs every hour on average. Occasional spikes might be
much higher (imagine a user that decided to report all his 8 crashes
from the last month).</p>

<p>We should probably not try to predict whether the monthly bug count
goes up or down. New, untested versions of software are added to
Fedora, but on the other hand most software matures and becomes less
crashy.
So
let's assume that the bug count stays approximately the same.</p>

<p>Test crashes (the measurements suggest that we should use
<code>`xz -2`</code> to compress coredumps):</p>
<table border="1">
<tr>
  <th colspan="3">application</th>
  <th>firefox with 7 tabs with random pages opened</th>
  <th>thunderbird with thousands of emails opened</th>
  <th>evince with 2 pdfs (1 and 42 pages) opened</th>
  <th>OpenOffice.org Impress with 25 pages presentation</th>
</tr>
<tr>
  <th colspan="3">coredump size</th>
  <td>172 MB</td>
  <td>218 MB</td>
  <td>73 MB</td>
  <td>116 MB</td>
</tr>
<tr>
  <th rowspan="17">xz compression</th>
</tr>
<tr>
  <th rowspan="4">level 6 (default)</th>
</tr>
<tr>
  <th>compression time</th>
  <td>32.5 sec</td>
  <td>60 sec</td>
  <td></td>
  <td></td>
</tr>
<tr>
  <th>compressed size</th>
  <td>5.4 MB</td>
  <td>12 MB</td>
  <td></td>
  <td></td>
</tr>
<tr>
  <th>decompression time</th>
  <td>2.7 sec</td>
  <td>3.6 sec</td>
  <td></td>
  <td></td>
</tr>
<tr>
  <th rowspan="4">level 3</th>
</tr>
<tr>
  <th>compression time</th>
  <td>23.4 sec</td>
  <td>42 sec</td>
  <td></td>
  <td></td>
</tr>
<tr>
  <th>compressed size</th>
  <td>5.6 MB</td>
  <td>13 MB</td>
  <td></td>
  <td></td>
</tr>
<tr>
  <th>decompression time</th>
  <td>1.6 sec</td>
  <td>3.0 sec</td>
  <td></td>
  <td></td>
</tr>
<tr>
  <th rowspan="4">level 2</th>
</tr>
<tr>
  <th>compression time</th>
  <td>6.8 sec</td>
  <td>10 sec</td>
  <td>2.9 sec</td>
  <td>7.1 sec</td>
</tr>
<tr>
  <th>compressed size</th>
  <td>6.1 MB</td>
  <td>14 MB</td>
  <td>3.6 MB</td>
  <td>12 MB</td>
</tr>
<tr>
  <th>decompression time</th>
  <td>3.7 sec</td>
  <td>3.0 sec</td>
  <td>0.7 sec</td>
  <td>2.3 sec</td>
</tr>
<tr>
  <th rowspan="4">level 1</th>
</tr>
<tr>
  <th>compression time</th>
  <td>5.1 sec</td>
  <td>8.3 sec</td>
  <td>2.5 sec</td>
  <td></td>
</tr>
<tr>
  <th>compressed size</th>
  <td>6.4 MB</td>
  <td>15 MB</td>
  <td>3.9 MB</td>
  <td></td>
</tr>
+<tr>
+ <th>decompression time</th>
+ <td>2.4 sec</td>
+ <td>3.2 sec</td>
+ <td>0.7 sec</td>
+ <td></td>
+</tr>
+<tr>
+ <th rowspan="13">gzip compression</th>
+</tr>
+<tr>
+ <th rowspan="4">level 9 (highest)</th>
+</tr>
+<tr>
+ <th>compression time</th>
+ <td>7.6 sec</td>
+ <td>14.9 sec</td>
+ <td></td>
+ <td></td>
+</tr>
+<tr>
+ <th>compressed size</th>
+ <td>7.9 MB</td>
+ <td>18 MB</td>
+ <td></td>
+ <td></td>
+</tr>
+<tr>
+ <th>decompression time</th>
+ <td>1.5 sec</td>
+ <td>2.4 sec</td>
+ <td></td>
+ <td></td>
+</tr>
+<tr>
+ <th rowspan="4">level 6 (default)</th>
+</tr>
+<tr>
+ <th>compression time</th>
+ <td>2.6 sec</td>
+ <td>4.4 sec</td>
+ <td></td>
+ <td></td>
+</tr>
+<tr>
+ <th>compressed size</th>
+ <td>8.0 MB</td>
+ <td>18 MB</td>
+ <td></td>
+ <td></td>
+</tr>
+<tr>
+ <th>decompression time</th>
+ <td>2.3 sec</td>
+ <td>2.2 sec</td>
+ <td></td>
+ <td></td>
+</tr>
+<tr>
+ <th rowspan="4">level 3</th>
+</tr>
+<tr>
+ <th>compression time</th>
+ <td>1.7 sec</td>
+ <td>2.7 sec</td>
+ <td></td>
+ <td></td>
+</tr>
+<tr>
+ <th>compressed size</th>
+ <td>8.9 MB</td>
+ <td>20 MB</td>
+ <td></td>
+ <td></td>
+</tr>
+<tr>
+ <th>decompression time</th>
+ <td>1.7 sec</td>
+ <td>3.0 sec</td>
+ <td></td>
+ <td></td>
+</tr>
+</table>
+
+<p>So let's imagine that several users want to report their crashes at
+approximately the same time. Here is what the retrace server must
+handle:</p>
+<ol>
+<li>2 OpenOffice crashes</li>
+<li>2 evince crashes</li>
+<li>2 thunderbird crashes</li>
+<li>2 firefox crashes</li>
+</ol>
+
+<p>We will use the xz archiver with compression level 2 on ABRT's side
+to compress the coredumps, so the users spend 53.6 seconds in total
+packaging the coredumps.</p>
+
+<p>The packaged coredumps total 71.4 MB, and the retrace server must
+receive that data.</p>
+
+<p>The server unpacks the coredumps (perhaps all at the same time), so
+1158 MB of disk space is needed on the server.
The decompression will take
+19.4 seconds.</p>
+
+<p>Several hundred megabytes will be needed to install all the
+required packages and debuginfos for every chroot (8 chroots of 1 GB
+each = 8 GB, but this seems like an extreme, maximal case). Some space
+will be saved by using a debuginfofs.</p>
+
+<p>Note that most applications are not as heavyweight as OpenOffice
+and Firefox.</p>
+
+<h2><a name="security">Security</a></h2>
+
+<p>The retrace server communicates with two other entities: it accepts
+coredumps from users, and it downloads debuginfos and packages from
+distribution repositories.</p>
+
+<p>General protection from GDB flaws and malicious data is provided by
+a chroot. GDB accesses the debuginfos, packages, and the coredump from
+within the chroot, and is unable to access the retrace server's
+environment. We should consider setting a disk quota for every chroot
+directory, and limiting GDB's access to resources using cgroups.</p>
+
+<p>An SELinux policy should be written for both the retrace server's
+HTTP interface and the retrace worker.</p>
+
+<h3><a name="clients">Clients</a></h3>
+
+<p>Clients that use the retrace server and send coredumps to it must
+fully trust the retrace server administrator, and the administrator
+must not try to extract sensitive data from client coredumps. That
+seems to be a major weakness of the retrace server idea. However,
+users of an operating system already trust the OS provider in various
+important matters, so when the retrace server is operated by the
+operating system provider, this might be acceptable to users.</p>
+
+<p>We cannot avoid sending clients' coredumps to the retrace server if
+we want to generate quality backtraces containing the values of
+variables. Minidumps are not an acceptable solution, as they lower the
+quality of the resulting backtraces while not improving user
+security.</p>
+
+<p>Can the retrace server trust clients?
We must know what a
+malicious client can achieve by crafting a nonstandard coredump that
+will be processed by the server's GDB. We should ask GDB experts about
+this.</p>
+
+<p>Another question is whether we can allow users to provide some
+packages and debuginfo together with a coredump. That might be useful
+for users who run the operating system with only minor modifications
+and still want to use the retrace server: they send a coredump
+together with a few nonstandard packages, and the retrace server uses
+the nonstandard packages together with the OS packages to generate the
+backtrace. Is it safe? We must know what a malicious client can
+achieve by crafting a special binary and debuginfo that will be
+processed by the server's GDB.</p>
+
+<h3><a name="packages_and_debuginfo">Packages and debuginfo</a></h3>
+
+<p>We can safely download packages and debuginfo from the
+distribution, as the packages are signed by the distribution and the
+package origin can be verified.</p>
+
+<p>When the debuginfo server is done, the retrace server can safely
+use it, as its data will also be signed.</p>
+
+<h2><a name="future_work">Future work</a></h2>
+
+<p>1. Coredump stripping. Jan Kratochvil: with my test of an
+OpenOffice.org presentation, the kernel core file has 181 MB, and
+xz -2 of it has 65 MB. According to `set target debug 1', GDB reads
+only 131406 bytes of it (including the NOTE segment).</p>
+
+<p>2. User management for the HTTP interface. We need multiple
+authentication sources (x509 for RHEL).</p>
+
+<p>3. Make the <code>architecture</code>, <code>release</code>, and
+<code>packages</code> files, which must be included in the package
+when creating a task, optional. Allow uploading a coredump without
+involving tar: just coredump, coredump.gz, or coredump.xz.</p>
+
+<p>4. Handle non-standard packages (provided by the user).</p>
+</body>
+</html> |
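
The load estimate in the traffic section (53.6 s of client-side compression, 71.4 MB received, 1158 MB of disk space, 19.4 s of decompression) can be sanity-checked with a short script. This sketch is not part of the design; the variable names are illustrative, and the per-application numbers are copied from the "xz level 2" rows of the compression table:

```python
# Sanity check for the load estimate: 2 crashes each of firefox,
# thunderbird, evince, and OpenOffice.org Impress, compressed with
# xz -2 on the client side.  Per-application columns:
# (coredump MB, compression s, compressed MB, decompression s)
XZ2 = {
    "firefox":     (172, 6.8, 6.1, 3.7),
    "thunderbird": (218, 10.0, 14.0, 3.0),
    "evince":      (73, 2.9, 3.6, 0.7),
    "ooimpress":   (116, 7.1, 12.0, 2.3),
}

CRASHES_PER_APP = 2

def totals(measurements, count):
    """Sum each column over all applications, `count` crashes of each."""
    raw = sum(m[0] for m in measurements.values()) * count
    compress_time = sum(m[1] for m in measurements.values()) * count
    compressed = sum(m[2] for m in measurements.values()) * count
    decompress_time = sum(m[3] for m in measurements.values()) * count
    return raw, compress_time, compressed, decompress_time

raw, ct, cs, dt = totals(XZ2, CRASHES_PER_APP)
print(f"clients spend {ct:.1f} s compressing")    # 53.6 s
print(f"server receives {cs:.1f} MB")             # 71.4 MB
print(f"server needs {raw} MB of disk space")     # 1158 MB
print(f"server spends {dt:.1f} s decompressing")  # 19.4 s
```

The same table makes it easy to see the design trade-off: moving from xz level 6 to level 2 roughly quadruples the compression speed on the client while growing the uploads by only about 10-15%.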