author | Karel Klic <kklic@redhat.com> | 2010-11-16 17:06:13 +0100 |
---|---|---|
committer | Karel Klic <kklic@redhat.com> | 2010-12-02 16:13:42 +0100 |
commit | 2a959ffa76358eef17c4563823a73d62b2a8929e (patch) | |
tree | 31556675325e30bf1032d55f9dfe382b7220dfce /doc/retrace-server.xhtml | |
parent | feeab473082eb31e7bf4ca17a624bc683a4b6fb8 (diff) | |
download | abrt-2a959ffa76358eef17c4563823a73d62b2a8929e.tar.gz abrt-2a959ffa76358eef17c4563823a73d62b2a8929e.tar.xz abrt-2a959ffa76358eef17c4563823a73d62b2a8929e.zip |
Added retrace server design document
Diffstat (limited to 'doc/retrace-server.xhtml')
-rw-r--r-- | doc/retrace-server.xhtml | 860 |
1 files changed, 860 insertions, 0 deletions
diff --git a/doc/retrace-server.xhtml b/doc/retrace-server.xhtml new file mode 100644 index 00000000..3c24ddaf --- /dev/null +++ b/doc/retrace-server.xhtml @@ -0,0 +1,860 @@ +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" +"http://www.w3.org/TR/MathML2/dtd/xhtml-math11-f.dtd"> +<html xmlns="http://www.w3.org/1999/xhtml"> +<head> + <title>Retrace server design</title> +</head> +<body> +<h1>Retrace server design</h1> + +<p>The retrace server provides a coredump analysis and backtrace +generation service over a network using HTTP protocol.</p> + +<table id="toc" class="toc"> +<tr> +<td> +<div id="toctitle"> + <h2>Contents</h2> +</div> +<ul> +<li><a href="#overview">1 Overview</a></li> +<li><a href="#http_interface">2 HTTP interface</a> + <ul> + <li><a href="#creating_a_new_task">2.1 Creating a new task</a></li> + <li><a href="#task_status">2.2 Task status</a></li> + <li><a href="#requesting_a_backtrace">2.3 Requesting a backtrace</a></li> + <li><a href="#requesting_a_log">2.4 Requesting a log file</a></li> + <li><a href="#task_cleanup">2.5 Task cleanup</a></li> + <li><a href="#limiting_traffic">2.6 Limiting traffic</a></li> + </ul> +</li> +<li><a href="#retrace_worker">3 Retrace worker</a></li> +<li><a href="#package_repository">4 Package repository</a></li> +<li><a href="#traffic_and_load_estimation">5 Traffic and load estimation</a></li> +<li><a href="#security">6 Security</a> + <ul> + <li><a href="#clients">6.1 Clients</a></li> + <li><a href="#packages_and_debuginfo">6.2 Packages and debuginfo</a></li> + </ul> +</li> +<li><a href="#future_work">7 Future work</a></li> +</ul> +</td> +</tr> +</table> + +<h2><a name="overview">Overview</a></h2> + +<p>A client sends a coredump (created by Linux kernel) together with +some additional information to the server, and gets a backtrace +generation task ID in response. 
Then the client, after some time, asks +the server for the task status, and when the task is done (the backtrace +has been generated from the coredump), the client downloads the +backtrace. If the backtrace generation fails, the client gets an error +code and downloads a log indicating what happened. Alternatively, the +client sends a coredump, and keeps receiving the server response +message. The server then periodically sends the status of the task +via the response body, and delivers the resulting backtrace as soon as +it's ready.</p> + +<p>The retrace server must be able to support multiple operating +systems and their releases (Fedora N-1, N, Rawhide, Branched Rawhide, +RHEL), and multiple architectures within a single installation.</p> + +<p>The retrace server consists of the following parts:</p> +<ol> +<li>abrt-retrace-server: a HTTP interface script handling the +communication with clients, task creation and management</li> +<li>abrt-retrace-worker: a program doing the environment preparation +and coredump processing</li> +<li>package repository: a repository placed on the server containing +all the application binaries, libraries, and debuginfo necessary for +backtrace generation</li> +</ol> + +<h2><a name="http_interface">HTTP interface</a></h2> + +<p>The HTTP interface application is a script written in Python. The +script is named <code>abrt-retrace-server</code>, and it uses the +<a href="http://www.python.org/dev/peps/pep-0333/">Python Web Server +Gateway Interface</a> (WSGI) to interact with the web server. +Administrators may use +<a href="http://code.google.com/p/modwsgi/">mod_wsgi</a> to run +<code>abrt-retrace-server</code> on Apache. The mod_wsgi module is part of +both Fedora 12 and RHEL 6. 
The Python language is a good choice for +this application, because it supports HTTP handling well, and it is +already used in ABRT.</p> + +<p>Only secure (HTTPS) connections must be allowed for +communication with <code>abrt-retrace-server</code>, because coredumps +and backtraces are private data. Users may decide to publish their +backtraces in a bug tracker after reviewing them, but the retrace +server doesn't do that. The HTTPS requirement must be specified in the +server's man page. The server must support HTTP persistent connections +to avoid frequent SSL renegotiations. The server's manual page +should include a recommendation for the administrator to check that +persistent connections are enabled.</p> + +<h3><a name="creating_a_new_task">Creating a new task</a></h3> + +<p>A client might create a new task by sending a HTTP request to the +https://server/create URL, and providing an archive as the +request content. The archive must contain crash data files. The crash +data files are a subset of the local /var/spool/abrt/ccpp-time-pid/ +directory contents, so the client only needs to pack and upload them.</p> + +<p>The server must support uncompressed tar archives, and tar archives +compressed with gzip and xz. Uncompressed archives are the most +efficient way for local network delivery, and gzip can be used there +as well because of its good compression speed.</p> + +<p>The xz compression file format is well suited for a public server +setup (slow network), as it provides a good compression ratio, which is +important for compressing large coredumps, and it provides reasonable +compress/decompress speed and memory consumption (see the chapter +<a href="#traffic_and_load_estimation">Traffic and load estimation</a> +for the measurements). The <code>XZ Utils</code> implementation with +the compression level 2 should be used to compress the data.</p> + +<p>The HTTP request for a new task must use the POST method. 
It must +contain proper <code>Content-Length</code> and +<code>Content-Type</code> fields. If the method is not POST, the +server must return the "405 Method Not Allowed" HTTP error code. If +the <code>Content-Length</code> field is missing, the server must +return the "411 Length Required" HTTP error code. If a +<code>Content-Type</code> other than <code>application/x-tar</code>, +<code>application/x-gzip</code>, or <code>application/x-xz</code> is +used, the server must return the "415 Unsupported Media Type" HTTP +error code. If the <code>Content-Length</code> value is greater than a +limit set in the server configuration file (30 MB by default), or the +real HTTP request size gets larger than the limit + 10 KB for headers, +then the server must return the "413 Request Entity Too Large" HTTP +error code, and provide an explanation, including the limit, in the +response body. The limit must be changeable from the server +configuration file.</p> + +<p>If there is less than 20 GB of free disk space in the +<code>/var/spool/abrt-retrace</code> directory, the server must return +the "507 Insufficient Storage" HTTP error code. The server must return +the same HTTP error code if decompressing the received archive would +cause the free disk space to become less than 20 GB. The 20 GB limit +must be changeable from the server configuration file.</p> + +<p>If the data from the received archive would take more than 500 +MB of disk space when uncompressed, the server must return the "413 +Request Entity Too Large" HTTP error code, and provide an explanation, +including the limit, in the response body. The size limit must be +changeable from the server configuration file. It can be set pretty +high because coredumps, which take the most disk space, are stored on the +server only temporarily until the backtrace is generated. 
When the +backtrace is generated, the coredump is deleted by the +<code>abrt-retrace-worker</code>, so most disk space is released.</p> + +<p>The uncompressed data size for xz archives can be obtained by +calling <code>`xz --list file.tar.xz`</code>. The <code>--list</code> +option has been implemented only recently, so it might be necessary to +implement a method to get the uncompressed data size by extracting the +archive to stdout and counting the extracted bytes, and to call this +method if the <code>--list</code> option doesn't work on the +server. Likewise, the uncompressed data size for gzip archives can be +obtained by calling <code>`gzip --list file.tar.gz`</code>.</p> + +<p>If an upload from a client succeeds, the server creates a new +directory <code>/var/spool/abrt-retrace/<id></code> and extracts +the received archive into it. Then it checks that the directory +contains all the required files, checks their sizes, and then sends a +HTTP response. After that it spawns a subprocess with +<code>abrt-retrace-worker</code> on that directory.</p> + +<p>To support multiple architectures, the retrace server needs a GDB +package compiled separately for every supported target architecture +(see the avr-gdb package in Fedora for an example). This is a +technically and economically better solution than using a standalone +machine for every supported architecture and resending coredumps +depending on the client's architecture. However, GDB's support for using a +target architecture different from the host architecture seems to be +fragile. If it doesn't work, the QEMU user mode emulation should be +tried as an alternative approach.</p> + +<p>The following files from the local crash directory are required to +be present in the archive: <code>coredump</code>, +<code>architecture</code>, <code>release</code>, <code>packages</code> +(this one does not exist yet). 
If one or more files are not present in +the archive, or some other file is present in the archive, the server +must return the "403 Forbidden" HTTP error code. If the size of any +file except the coredump exceeds 100 KB, the server must return the +"413 Request Entity Too Large" HTTP error code, and provide an +explanation, including the limit, in the response body. The 100 KB +limit must be changeable from the server configuration file.</p> + +<p>If the file check succeeds, the server HTTP response must have the +"201 Created" HTTP code. The response must include the following HTTP +header fields:</p> +<ul> +<li><code>X-Task-Id</code> containing a new server-unique numerical +task id</li> +<li><code>X-Task-Password</code> containing a newly generated +password, required to access the result</li> +<li><code>X-Task-Est-Time</code> containing the number of seconds the +server estimates it will take to generate the backtrace</li> +</ul> + +<p>The <code>X-Task-Password</code> is a random alphanumeric +(<code>[a-zA-Z0-9]</code>) sequence 22 characters long. 22 +alphanumeric characters correspond to a 128-bit password, +because <code>[a-zA-Z0-9]</code> contains 62 characters, and 2<sup>128</sup> +< 62<sup>22</sup>. The source of randomness must be, directly or +indirectly, <code>/dev/urandom</code>. The +<code>rand()</code> function from glibc and similar functions from +other libraries cannot be used because of their poor characteristics +(in several aspects). The password must be stored in the +<code>/var/spool/abrt-retrace/<id>/password</code> file, so +passwords sent by a client in subsequent requests can be verified.</p> + +<p>The task id is intentionally not used as a password, because it is +desirable to keep the id readable and memorable for +humans. 
Password-like ids would be a loss when a user authentication +mechanism is added and server-generated passwords are no longer +necessary.</p> + +<p>The algorithm for the <code>X-Task-Est-Time</code> time estimation +should take the previous analyses of coredumps with the same +corresponding package name into account. The server should store +a simple history in a SQLite database to know how long it takes to +generate a backtrace for a certain package. It could be as simple as +this:</p> +<ul> + <li>initialization step one: <code>CREATE TABLE package_time (id + INTEGER PRIMARY KEY AUTOINCREMENT, package, release, time)</code>; + we need the <code>id</code> for the database cleanup - to know the + insertion order of rows, so the <code>AUTOINCREMENT</code> is + important here; the <code>package</code> is the package name without + the version and release numbers, the <code>release</code> column + stores the operating system release, and the <code>time</code> is the number + of seconds it took to generate the backtrace</li> + <li>initialization step two: <code>CREATE INDEX package_release ON + package_time (package, release)</code>; we compute the time only for + a single package on a single supported OS release per query, so it makes + sense to create an index to speed it up</li> + <li>when a task is finished: <code>INSERT INTO package_time + (package, release, time) VALUES ('??', '??', '??')</code></li> + <li>to get the average time: <code>SELECT AVG(time) FROM + package_time WHERE package == '??' AND release == '??'</code>; the + arithmetic mean seems to be sufficient here</li> +</ul> + +<p>So the server knows that crashes from an OpenOffice.org package +take 5 minutes to process on average, and it can return the value 300 +(seconds) in the field. The client does not waste time asking about +that task every 20 seconds; instead, its first status request comes after +300 seconds. 
And even when the package changes (rebases etc.), the +database provides good estimations after some time anyway +(the <a href="#task_cleanup">Task cleanup</a> chapter describes how the +data are pruned).</p> + +<p>The server response HTTP <i>body</i> is generated and sent +gradually as the task is performed. The client chooses either to receive +the body, or to terminate after getting all headers and ask the server +for status and backtrace asynchronously.</p> + +<p>The server re-sends the output of abrt-retrace-worker (its stdout +and stderr) to the response body. In addition, a line with the +task status is added in the form <code>X-Task-Status: PENDING</code> +to the body every 5 seconds. When the worker process ends, +either a <code>FINISHED_SUCCESS</code> or a <code>FINISHED_FAILURE</code> +status line is sent. If it's <code>FINISHED_SUCCESS</code>, the +backtrace is attached after this line. Then the response body is +closed.</p> + +<h3><a name="task_status">Task status</a></h3> + +<p>A client might request a task status by sending a HTTP GET request +to the <code>https://someserver/<id></code> URL, where +<code><id></code> is the numerical task id returned in the +<code>X-Task-Id</code> field by +<code>https://someserver/create</code>. If the +<code><id></code> is not in the valid format, or the task +<code><id></code> does not exist, the server must return the +"404 Not Found" HTTP error code.</p> + +<p>The client request must contain the "X-Task-Password" field, and +its content must match the password stored in the +<code>/var/spool/abrt-retrace/<id>/password</code> file. 
If the +password is not valid, the server must return the "403 Forbidden" HTTP +error code.</p> + +<p>If the checks pass, the server returns the "200 OK" HTTP code, and +includes a field "X-Task-Status" containing one of the following +values: <code>FINISHED_SUCCESS</code>, +<code>FINISHED_FAILURE</code>, <code>PENDING</code>.</p> + +<p>The field contains <code>FINISHED_SUCCESS</code> if the file +<code>/var/spool/abrt-retrace/<id>/backtrace</code> exists. The +client might get the backtrace on the +<code>https://someserver/<id>/backtrace</code> URL. The log +might be obtained on the +<code>https://someserver/<id>/log</code> URL, and it might +contain warnings about some missing debuginfos etc.</p> + +<p>The field contains <code>FINISHED_FAILURE</code> if the file +<code>/var/spool/abrt-retrace/<id>/backtrace</code> does not +exist, and file +<code>/var/spool/abrt-retrace/<id>/retrace-log</code> +exists. The retrace-log file containing error messages can be +downloaded by the client from the +<code>https://someserver/<id>/log</code> URL.</p> + +<p>The field contains <code>PENDING</code> if neither file exists. The +client should ask again after 10 seconds or later.</p> + +<h3><a name="requesting_a_backtrace">Requesting a backtrace</a></h3> + +<p>A client might request a backtrace by sending a HTTP GET request to +the <code>https://someserver/<id>/backtrace</code> URL, where +<code><id></code> is the numerical task id returned in the +"X-Task-Id" field by <code>https://someserver/create</code>. If the +<code><id></code> is not in the valid format, or the task +<code><id></code> does not exist, the server must return the +"404 Not Found" HTTP error code.</p> + +<p>The client request must contain the "X-Task-Password" field, and +its content must match the password stored in the +<code>/var/spool/abrt-retrace/<id>/password</code> file. 
If the +password is not valid, the server must return the "403 Forbidden" HTTP +error code.</p> + +<p>If the file /var/spool/abrt-retrace/<id>/backtrace does not +exist, the server must return the "404 Not Found" HTTP error code. +Otherwise it returns the file contents, and the "Content-Type" field +must contain "text/plain".</p> + +<h3><a name="requesting_a_log">Requesting a log</a></h3> + +<p>A client might request a task log by sending a HTTP GET request to +the <code>https://someserver/<id>/log</code> URL, where +<code><id></code> is the numerical task id returned in the +"X-Task-Id" field by <code>https://someserver/create</code>. If the +<code><id></code> is not in the valid format, or the task +<code><id></code> does not exist, the server must return the +"404 Not Found" HTTP error code.</p> + +<p>The client request must contain the "X-Task-Password" field, and +its content must match the password stored in the +<code>/var/spool/abrt-retrace/<id>/password</code> file. If the +password is not valid, the server must return the "403 Forbidden" HTTP +error code.</p> + +<p>If the file +<code>/var/spool/abrt-retrace/<id>/retrace-log</code> does not +exist, the server must return the "404 Not Found" HTTP error code. +Otherwise it returns the file contents, and the "Content-Type" must +contain "text/plain".</p> + +<h3><a name="task_cleanup">Task cleanup</a></h3> + +<p>Tasks that were created more than 5 days ago must be deleted, +because tasks occupy disk space (not so much space, as the coredumps +are deleted after the retrace, and only backtraces and configuration +remain). A shell script <code>abrt-retrace-clean</code> must check the +creation time and delete the directories +in <code>/var/spool/abrt-retrace</code>. It is supposed that the +server administrator sets <code>cron</code> to call the script once a +day. 
This assumption must be mentioned in +the <code>abrt-retrace-clean</code> manual page.</p> + +<p>The database containing packages and processing times should also +be regularly pruned to remain small and provide data quickly. The +cleanup script should delete some rows for packages with too many +entries:</p> +<ol> +<li>get a list of packages from the database: <code>SELECT DISTINCT +package, release FROM package_time</code></li> +<li>for every package, get the row count: <code>SELECT COUNT(*) FROM +package_time WHERE package == '??' AND release == '??'</code></li> +<li>for every package with the row count larger than 100, some rows +must be removed so that only the newest 100 rows remain in the +database: +<ul> + <li>to get the highest row id which should be deleted, + execute <code>SELECT id FROM package_time WHERE package == '??' AND + release == '??' ORDER BY id LIMIT 1 OFFSET ??</code>, where the + <code>OFFSET</code> is the total number of rows for that single + package minus 101 (the offset counts the rows to skip, so the row + returned is the newest one to be deleted)</li> + <li>then all the old rows can be deleted by executing <code>DELETE + FROM package_time WHERE package == '??' AND release == '??' AND id + <= ??</code></li> +</ul> +</li> +</ol> + +<h3><a name="limiting_traffic">Limiting traffic</a></h3> + +<p>The maximum number of simultaneously running tasks must be limited +to 20 by the server. The limit must be changeable from the server +configuration file. If a new request comes when the server is fully +occupied, the server must return the "503 Service Unavailable" HTTP +error code.</p> + +<p>The archive extraction, chroot preparation, and gdb analysis are +mostly limited by the hard drive size and speed.</p> + +<h2><a name="retrace_worker">Retrace worker</a></h2> + +<p>The worker (<code>abrt-retrace-worker</code> binary) gets a +<code>/var/spool/abrt-retrace/<id></code> directory as an +input. 
The worker reads the operating system name and version, the +coredump, and the list of packages needed for retracing (a package +containing the binary which crashed, and packages with the libraries +that are used by the binary).</p> + +<p>The worker prepares a new "chroot" subdirectory with the packages, +their debuginfo, and gdb installed. In other words, a new directory +<code>/var/spool/abrt-retrace/<id>/chroot</code> is created and +the packages are unpacked or installed into this directory, so for +example the gdb ends up as +<code>/var/.../<id>/chroot/usr/bin/gdb</code>.</p> + +<p>After the "chroot" subdirectory is prepared, the worker moves the +coredump there and changes root (using the chroot system function) of +a child script there. The child script runs the gdb on the coredump, +and the gdb sees the corresponding crashed binary, all the debuginfo +and all the proper versions of libraries in the right places.</p> + +<p>When the gdb run is finished, the worker copies the resulting +backtrace to the +<code>/var/spool/abrt-retrace/<id>/backtrace</code> file and +stores a log from the whole chroot process to the +<code>retrace-log</code> file in the same directory. 
Then it removes +the chroot directory.</p> + +<p>The GDB installed into the chroot must:</p> +<ul> +<li>run on the server (same architecture, or we can use +<a href="http://wiki.qemu.org/download/qemu-doc.html#QEMU-User-space-emulator">QEMU +user space emulation</a>)</li> +<li>process the coredump (possibly from another architecture): that +means we need a special GDB for every supported architecture</li> +<li>be able to handle coredumps created in an environment with prelink +enabled +(<a href="http://sourceware.org/ml/gdb/2009-05/msg00175.html">should +not</a> be a problem)</li> +<li>use libc, zlib, readline, ncurses, expat and Python packages, +while the version numbers required by the coredump might be different +from what is required by the GDB</li> +</ul> + +<p>The gdb might fail to run with certain combinations of package +dependencies. Nevertheless, we need to provide the libc/Python/* +package versions which are required by the coredump. If we did not +do that, the backtraces generated from such an environment would be of +lower quality. Consider a coredump which was caused by a crash of +a Python application on a client, and which we analyze on the retrace +server with a completely different version of Python because the +client's Python version is not compatible with our GDB.</p> + +<p>We can solve the issue by installing the GDB package dependencies +first, moving their binaries to some safe place (<code>/lib/gdb</code> +in the chroot), and creating the <code>/etc/ld.so.preload</code> file +pointing to that place, or setting <code>LD_LIBRARY_PATH</code>. Then we +can unpack libc binaries and other packages and their versions as +required by the coredump to the common paths, and the GDB would run +happily, using the libraries from <code>/lib/gdb</code> and not those +from <code>/lib</code> and <code>/usr/lib</code>. 
This approach can +use standard GDB builds with various target architectures: gdb, +gdb-i386, gdb-ppc64, gdb-s390 (nonexistent in Fedora/EPEL at the time +of writing this).</p> + +<p>The GDB and its dependencies are stored separately from the packages +used as data for coredump processing. A single combination of GDB and +its dependencies can be used across all supported operating systems to generate +backtraces.</p> + +<p>The retrace worker must be able to prepare a chroot-ready +environment for a certain supported operating system, which is different +from the retrace server's operating system. It needs to fake +the <code>/dev</code> directory and create some basic files in +<code>/etc</code> like passwd and hosts. We can use +the <a href="https://fedorahosted.org/mock/">mock</a> library to do +that, as it does almost what we need (but not exactly as it has a +strong focus on preparing the environment for rpmbuild and running +it), or we can come up with our own solution, while stealing some code +from the mock library. The <code>/usr/bin/mock</code> executable is +of no use for the retrace server, but the underlying Python +library can be used. So if we would like to use mock, an ABRT-specific +interface to the mock library must be written or the retrace worker +must be written in Python and use the mock Python library +directly.</p> + +<p>We should save some time and disk space by extracting only binaries +and dynamic libraries from the packages for the coredump analysis, and +omitting all other files. We can save even more time and disk space by +extracting only the libraries and binaries really referenced by the +coredump (eu-unstrip tells us which ones). 
Packages should not be +<em>installed</em> into the chroot; they should be <em>extracted</em> +only, because we use them as a data source, and we never run them.</p> + +<p>Another idea to be considered is that we can avoid the package +extraction if we can teach GDB to read the dynamic libraries, the +binary, and the debuginfo directly from the RPM packages. We would +provide a backend to GDB which can do that, and provide a tiny front-end +program which tells the backend which RPMs it should use and then runs +the GDB command loop. The result would be a GDB wrapper/extension we +need to maintain, but it should end up pretty small. We would use +Python to write our extension, as we do not want to (inelegantly) +maintain a patch against GDB core. We need to ask GDB people if the +Python interface is capable of handling this idea, and how much work +it would be to implement it.</p> + +<h2><a name="package_repository">Package repository</a></h2> + +<p>We should support every Fedora release with all packages that ever +made it to the updates and updates-testing repositories. In order to +provide all those packages, a local repository is maintained for every +supported operating system. The debuginfos might be provided by a +debuginfo server in the future (it will save the server disk space). We +should support the usage of local debuginfo first, and add the +debuginfofs support later.</p> + +<p>A repository with Fedora packages must be maintained locally on the +server to provide good performance and to provide data from older +packages already removed from the official repositories. We need a +package downloader, which scans Fedora servers for new packages, and +downloads them so they are immediately available.</p> + +<p>Older versions of packages are regularly deleted from the updates +and updates-testing repositories. 
We must support older versions of +packages, because that is one of two major pain-points that the +retrace server is supposed to solve (the other one is the slowness of +debuginfo download and debuginfo disk space requirements).</p> + +<p>A script abrt-reposync must download packages from Fedora +repositories, but it must not delete older versions of the +packages. The retrace server administrator is supposed to call this +script using cron every ~6 hours. This expectation must be documented +in the abrt-reposync manual page. The script can use the wget, rsync, +or reposync tool to get the packages. The remote yum source +repositories must be configured from a configuration file or files +(/etc/yum.repos.d might be used).</p> + +<p>When the abrt-reposync is used to sync with the Rawhide repository, +unneeded packages (where a newer version exists) must be removed after +residing one week with the newer package in the same repository.</p> + +<p>All the unneeded content from the newly downloaded packages should be +removed to save disk space and speed up chroot creation. We need just +the binaries and dynamic libraries, and that is a tiny part of package +contents.</p> + +<p>The packages should be downloaded to a local repository in +/var/cache/abrt-repo/{fedora12,fedora12-debuginfo,...}.</p> + +<h2><a name="traffic_and_load_estimation">Traffic and load estimation</a></h2> + +<p>2500 bugs are reported from ABRT every month. Approximately 7.3% +of those are Python exceptions, which don't need a retrace +server. That means that 2315 bugs need a retrace server. That is 77 +bugs per day, or 3.2 bugs every hour on average. Occasional spikes +might be much higher (imagine a user that decided to report all his 8 +crashes from last month).</p> + +<p>We should probably not try to predict if the monthly bug count goes up +or down. New, untested versions of software are added to Fedora, but +on the other hand most software matures and becomes less crashy. 
So +let's assume that the bug count stays approximately the same.</p> + +<p>Test crashes (see that we should probably use <code>`xz -2`</code> +to compress coredumps):</p> +<table border="1"> +<tr> + <th colspan="3">application</th> + <th>firefox with 7 tabs with random pages opened</th> + <th>thunderbird with thousands of emails opened</th> + <th>evince with 2 pdfs (1 and 42 pages) opened</th> + <th>OpenOffice.org Impress with 25 pages presentation</th> +</tr> +<tr> + <th colspan="3">coredump size</th> + <td>172 MB</td> + <td>218 MB</td> + <td>73 MB</td> + <td>116 MB</td> +</tr> +<tr> + <th rowspan="17">xz compression</th> +</tr> +<tr> + <th rowspan="4">level 6 (default)</th> +</tr> +<tr> + <th>compression time</th> + <td>32.5 sec</td> + <td>60 sec</td> + <td></td> + <td></td> +</tr> +<tr> + <th>compressed size</th> + <td>5.4 MB</td> + <td>12 MB</td> + <td></td> + <td></td> +</tr> +<tr> + <th>decompression time</th> + <td>2.7 sec</td> + <td>3.6 sec</td> + <td></td> + <td></td> +</tr> +<tr> + <th rowspan="4">level 3</th> +</tr> +<tr> + <th>compression time</th> + <td>23.4 sec</td> + <td>42 sec</td> + <td></td> + <td></td> +</tr> +<tr> + <th>compressed size</th> + <td>5.6 MB</td> + <td>13 MB</td> + <td></td> + <td></td> +</tr> +<tr> + <th>decompression time</th> + <td>1.6 sec</td> + <td>3.0 sec</td> + <td></td> + <td></td> +</tr> +<tr> + <th rowspan="4">level 2</th> +</tr> +<tr> + <th>compression time</th> + <td>6.8 sec</td> + <td>10 sec</td> + <td>2.9 sec</td> + <td>7.1 sec</td> +</tr> +<tr> + <th>compressed size</th> + <td>6.1 MB</td> + <td>14 MB</td> + <td>3.6 MB</td> + <td>12 MB</td> +</tr> +<tr> + <th>decompression time</th> + <td>3.7 sec</td> + <td>3.0 sec</td> + <td>0.7 sec</td> + <td>2.3 sec</td> +</tr> +<tr> + <th rowspan="4">level 1</th> +</tr> +<tr> + <th>compression time</th> + <td>5.1 sec</td> + <td>8.3 sec</td> + <td>2.5 sec</td> + <td></td> +</tr> +<tr> + <th>compressed size</th> + <td>6.4 MB</td> + <td>15 MB</td> + <td>3.9 MB</td> + <td></td> +</tr> 
+<tr> + <th>decompression time</th> + <td>2.4 sec</td> + <td>3.2 sec</td> + <td>0.7 sec</td> + <td></td> +</tr> +<tr> + <th rowspan="13">gzip compression</th> +</tr> +<tr> + <th rowspan="4">level 9 (highest)</th> +</tr> +<tr> + <th>compression time</th> + <td>7.6 sec</td> + <td>14.9 sec</td> + <td></td> + <td></td> +</tr> +<tr> + <th>compressed size</th> + <td>7.9 MB</td> + <td>18 MB</td> + <td></td> + <td></td> +</tr> +<tr> + <th>decompression time</th> + <td>1.5 sec</td> + <td>2.4 sec</td> + <td></td> + <td></td> +</tr> +<tr> + <th rowspan="4">level 6 (default)</th> +</tr> +<tr> + <th>compression time</th> + <td>2.6 sec</td> + <td>4.4 sec</td> + <td></td> + <td></td> +</tr> +<tr> + <th>compressed size</th> + <td>8 MB</td> + <td>18 MB</td> + <td></td> + <td></td> +</tr> +<tr> + <th>decompression time</th> + <td>2.3 sec</td> + <td>2.2 sec</td> + <td></td> + <td></td> +</tr> +<tr> + <th rowspan="4">level 3</th> +</tr> +<tr> + <th>compression time</th> + <td>1.7 sec</td> + <td>2.7 sec</td> + <td></td> + <td></td> +</tr> +<tr> + <th>compressed size</th> + <td>8.9 MB</td> + <td>20 MB</td> + <td></td> + <td></td> +</tr> +<tr> + <th>decompression time</th> + <td>1.7 sec</td> + <td>3 sec</td> + <td></td> + <td></td> +</tr> +</table> + +<p>So let's imagine there are some users that want to report their +crashes approximately at the same time. Here is what the retrace +server must handle:</p> +<ol> +<li>2 OpenOffice crashes</li> +<li>2 evince crashes</li> +<li>2 thunderbird crashes</li> +<li>2 firefox crashes</li> +</ol> + +<p>We will use the xz archiver with the compression level 2 on the ABRT's +side to compress the coredumps. So the users spend 53.6 seconds in +total packaging the coredumps.</p> + +<p>The packaged coredumps have 71.4 MB, and the retrace server must +receive that data.</p> + +<p>The server unpacks the coredumps (perhaps in the same time), so they +need 1158 MB of disk space on the server. 
The decompression will take +19.4 seconds.</p> + +<p>Several hundred megabytes will be needed to install all the +required packages and debuginfos for every chroot (8 chroots at 1 GB each += 8 GB, but this seems like an extreme, maximal case). Some space will +be saved by using a debuginfofs.</p> + +<p>Note that most applications are not as heavyweight as OpenOffice and +Firefox.</p> + +<h2><a name="security">Security</a></h2> + +<p>The retrace server communicates with two other entities: it accepts +coredumps from users, and it downloads debuginfos and packages from +distribution repositories.</p> + +<p>General security from GDB flaws and malicious data is provided by +chroot. GDB accesses the debuginfos, packages, and the coredump +from within the chroot, unable to access the retrace server's +environment. We should consider setting a disk quota for every chroot +directory, and limiting GDB's access to resources using cgroups.</p> + +<p>An SELinux policy should be written for both the retrace server's HTTP +interface and the retrace worker.</p> + +<h3><a name="clients">Clients</a></h3> + +<p>The clients, which use the retrace server and send coredumps +to it, must fully trust the retrace server administrator. The server +administrator must not try to get sensitive data from client +coredumps. That seems to be a major drawback of the retrace server +idea. However, users of an operating system already trust the OS +provider in various important matters. So when the retrace server is +operated by the operating system provider, that might be acceptable to +users.</p> + +<p>We cannot avoid sending clients' coredumps to the retrace server if +we want to generate quality backtraces containing the values of +variables. Minidumps are not an acceptable solution, as they lower the +quality of the resulting backtraces, while not improving user +security.</p> + +<p>Can the retrace server trust clients?
We must know what a +malicious client can achieve by crafting a nonstandard coredump, which +will be processed by the server's GDB. We should ask GDB experts about +this.</p> + +<p>Another question is whether we can allow users to provide some packages +and debuginfo together with a coredump. That might be useful for +users who run the operating system with only minor +modifications and still want to use the retrace server. They would +send a coredump together with a few nonstandard packages. The retrace +server would use the nonstandard packages together with the OS packages to +generate the backtrace. Is it safe? We must know what a malicious +client can achieve by crafting a special binary and debuginfo, which will +be processed by the server's GDB.</p> + +<h3><a name="packages_and_debuginfo">Packages and debuginfo</a></h3> + +<p>We can safely download packages and debuginfo from the distribution, +as the packages are signed by the distribution, and the package origin +can be verified.</p> + +<p>When the debuginfo server is done, the retrace server can safely use +it, as the data will also be signed.</p> + +<h2><a name="future_work">Future work</a></h2> + +<p>1. Coredump stripping. Jan Kratochvil: In my test of an OpenOffice.org +presentation, the kernel core file is 181 MB and its xz -2 compressed form is 65 MB. +According to `set target debug 1', GDB reads only 131406 bytes of it +(incl. the NOTE segment).</p> + +<p>2. User management for the HTTP interface. We need multiple +authentication sources (X.509 for RHEL).</p> + +<p>3. Make the <code>architecture</code>, <code>release</code>, and +<code>packages</code> files, which must be included in the package +when creating a task, optional. Allow uploading a coredump without +involving tar: just coredump, coredump.gz, or coredump.xz.</p> + +<p>4. Handle non-standard packages (provided by the user).</p> +</body> +</html>