
Evaluation of CreditNew


Superficially, BOINC credit-related issues are often deemed of little scientific consequence by users and project staff alike. However, credit is a measure of work, and the interrelationship between task estimates and important backend and client functionality is critical. Correct work estimation is central to scheduling tasks at all levels, and therefore forms the backbone of making BOINC a valuable scientific instrument.

Here are some notes describing the key identified deficiencies. Overall, there is a temporal mismatch between the demands of user experience, expectations hardwired into the BOINC client software, and the needs of the BOINC server software.

These are primarily identified as basic control-systems issues to be addressed, with 'Open questions' intended to identify key related aspects (such as reliability and maintenance demands), so that project developer and end-user experience may be improved.


[quick and dirty dump of the CN flaw whitepaper - WS]

Reliance on inappropriate benchmark

BOINC's Whetstone implementation is double-precision FPU only, and does not include any parallelism support (SIMD vectorisation etc.). Whetstone is not designed as a peak device throughput measure, contrary to the CreditNew code and documentation. The short-term (band-aid) recommendation is to fudge in the average host and application parallelism. This will force the lowest underclaiming hosts (AVX, in the SETI multibeam V7 case) back into a semi-realistic ballpark (though still low if the suggested figures are used).
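As a sketch of that band-aid, the scalar Whetstone figure could be multiplied by assumed average parallelism and efficiency factors. All names and numbers below are illustrative assumptions, not BOINC code:

```python
# Hypothetical sketch: correcting the scalar double-precision Whetstone
# figure with an assumed average parallelism factor, so the scheduler's
# notion of peak FLOPS is closer to what a vectorised app actually achieves.

WHETSTONE_DP_FLOPS = 3.2e9   # example scalar benchmark result (host-reported)
SIMD_WIDTH = 4               # assumed average doubles per SIMD operation (e.g. AVX)
APP_EFFICIENCY = 0.6         # assumed fraction of peak an optimised app reaches

def fudged_peak_flops(bench_flops, simd_width, efficiency):
    """Scale the scalar benchmark toward realistic vectorised throughput."""
    return bench_flops * simd_width * efficiency

print(fudged_peak_flops(WHETSTONE_DP_FLOPS, SIMD_WIDTH, APP_EFFICIENCY))
```

The factors would in practice be per-platform averages supplied by the project, not constants.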

At SETI, two distinct credit drops were witnessed below the 'fair cobblestone scale'. The first came with CreditNew's introduction during multibeam V6, as the system scaled to stock SSE+ underclaiming applications; the second came with the simultaneous AVX/setiathome_V7 rollout.

By underclaiming, these CPU apps become the 'best app' and thus the scaling reference, usurping legitimate claims.

Possible long-term fixes involve uncoupling from CPU Whetstone and directly using stabilised (controlled) elapsed-time averages instead. Possible bonus advantages include vectorisation/parallelism/optimisation awareness that can cope with heterogeneous configurations, as well as potentially reduced server load and complexity.
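A minimal sketch of the 'stabilised elapsed-time average' idea, assuming a simple exponential moving average as the damping mechanism (class and parameter names are hypothetical, not CreditNew code):

```python
# Illustrative sketch: estimate runtime from a damped exponential moving
# average of observed elapsed times, rather than deriving it from the
# Whetstone benchmark.

class ElapsedTimeEstimator:
    def __init__(self, initial_estimate, alpha=0.1):
        self.estimate = initial_estimate  # seconds
        self.alpha = alpha                # damping: small alpha resists noise

    def observe(self, elapsed):
        """Fold one completed task's elapsed time into the estimate."""
        self.estimate += self.alpha * (elapsed - self.estimate)
        return self.estimate

est = ElapsedTimeEstimator(1000.0)
for t in [1200, 1150, 1250, 1180]:   # noisy observed runtimes
    est.observe(t)
print(round(est.estimate, 1))
```

Because alpha is fixed and small, a single outlier runtime moves the estimate only slightly, which is the 'controlled' property argued for above.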

BOINC's Whetstone is also vulnerable to collusion-based cheating.

Scaling averages undamped

The existing mechanism, as implemented, resembles a 'weighted sigma' type control loop, in turn resembling a PI controller. Being undamped/unstable, it is susceptible to noise-induced instabilities, such as those arising from machine usage patterns and system changes. There are temporal overlaps between the control points (scheduling and validation) that are mixed with respect to user-perceived ordering (topological mixing leading to 'chaotic looking' behaviour). The medium-term recommendation is to replace the sampled weighted-sigma configuration with a one-time-tuned PID control loop. This is simpler, less costly in server-side resources (no lookups/tracking of averages etc.), and less susceptible to instabilities.
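For illustration, the proposed replacement could look like a textbook discrete PID loop driving a scale factor toward its setpoint. The gains below are arbitrary examples, not tuned values:

```python
# Minimal PID controller sketch (illustrative, not BOINC code): the scale
# factor is nudged toward its setpoint with gains tuned once, rather than
# being re-derived from noisy sampled averages.

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measured, dt=1.0):
        """Return a correction driving 'measured' toward 'setpoint'."""
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Drive a scale factor toward 1.0 from a poor initial value of 0.2:
pid = PID(kp=0.5, ki=0.05, kd=0.1)
scale = 0.2
for _ in range(60):
    scale += pid.update(1.0, scale)
print(round(scale, 3))
```

The derivative term damps overshoot and the integral term removes steady-state offset, which is what the undamped weighted-sigma scheme lacks.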

Multi-threaded CPU applications omitted

Using more resources to achieve the same work in a shorter time should not divide the credit, but it does. A potentially viable recommendation here is to regard any 'claim' below the estimate as a function of parallelism of some sort. In certain respects this makes estimate formulation simpler: an estimate even marginally representative of the minimum operations required can be sufficient to identify and quantify parallelism, even for asymmetric heterogeneous arrangements. (MT apps are in design-prototype for SETI@home, to better handle the large tasks expected for multibeam and AP.)
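A sketch of that recommendation, assuming the ratio of the minimum-operations estimate to the claim is taken as the parallelism factor (the function and figures are hypothetical, not BOINC code):

```python
# Hypothetical heuristic: treat a claimed FLOP count below the
# minimum-operations estimate as evidence of parallelism, and recover an
# effective parallelism factor instead of dividing credit.

def effective_parallelism(rsc_fpops_est, claimed_fpops):
    """If the claim is below the minimum-ops estimate, the ratio
    approximates the parallelism (threads / SIMD lanes) actually used."""
    if claimed_fpops >= rsc_fpops_est:
        return 1.0
    return rsc_fpops_est / claimed_fpops

# A 4-thread MT app finishing in a quarter of the serial time would
# otherwise appear to claim a quarter of the FLOPs:
print(effective_parallelism(4.0e13, 1.0e13))
```

Credit could then be granted on the estimate rather than the divided claim, so MT apps are not penalised for finishing sooner.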

Over-reliance on accurate rsc_fpops_est

host_scale, derived from CreditNew's unstable pfc claim samples/averages, is in effect a 'server-side per-application DCF' given a new name. Disabling the old project-wide DCF client-side introduces time-domain discrepancies and usage/change tracking difficulties. Coupled with CreditNew's scaling issues and instabilities, imprecise initial task estimates will tend to destabilise the system and be overly sensitive to initial hard estimate tolerances. The short-term recommendation is to ensure host_scale is properly damped, as with CreditNew's other mechanisms. For the longer term, implement per-application DCF in the client. This has been done before (sources available on request) and is effective at compensating dynamically for usage and system changes.
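A per-application DCF could be sketched as below, mirroring the old project-wide DCF's asymmetric behaviour (react quickly to underestimates, decay slowly otherwise). The class, field names, and the 10% decay rate are illustrative assumptions, not the BOINC client's actual code:

```python
# Sketch of a client-side per-application duration correction factor (DCF).

class AppDCF:
    def __init__(self):
        self.dcf = 1.0  # multiplier applied to server-supplied estimates

    def adjust(self, estimated, actual):
        """Raise the DCF immediately on underestimates, lower it slowly on
        overestimates; the asymmetric damping avoids deadline misses."""
        ratio = actual / estimated
        if ratio > self.dcf:
            self.dcf = ratio                      # react fast: task overran
        else:
            self.dcf += 0.1 * (ratio - self.dcf)  # decay slowly otherwise
        return self.dcf

    def corrected_estimate(self, estimated):
        return estimated * self.dcf
```

Keeping one such instance per application (rather than per project) is the change proposed above; the update rule itself stays simple and local to the client.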

Observations from live run on Albert

Albert live run started on the 4th of June 2014.

Day Zero observations

Placeholder for Cross App normalisation issues

- normalisation of pfc_scale makes no sense with respect to time estimates, unless host scales are active and unbound

- An inactive host scale at onramp of a new app or host forces use of the normalised pfc scale for time estimates, with no other scaling compensation --> ~2-80x or more underestimates, before other factors

- GPU-only normalisation to the 10% default (pfc_scale 0.1) guarantees underestimates before host scales become active, and high host scales after a long convergence

- app version pfc scales of <1 break the Flops definition; the most efficient app is supposed to have a normalised pfc_scale of 1, maintaining that definition, with everything else greater

Assorted detail referring to the baseline (unmodified) BOINC CreditNew system, as at 24th June 2014 (see resources)

- The top portion depicts the runtime estimate mechanism 'as-is' (at a very abstract, high level), not my solutions ;) That's indeed the first part that needs coarse correction.

- Let's be clear that as of initial application onramp, the existing scaling errors outweigh any slop in the project-supplied estimates.

- don't_use_dcf (as in old Project DCF) is already active, and has been for a very long time, with clients >= 7.0.28 (according to code)

- The two cascaded scales (with orange feedback average/gain triangles) are the two points that adjust your base estimate: to an application-wide one first, and a host-specific one second.

- The coarse scaling error on each scale for CPU SSE2 apps is ~2.25x, so the potential initial time estimate undershoot for such a CPU app is in the region of (1/2.25)^2 = ~0.2, i.e. about five times too short, before any host scale takes effect.

- The coarse scaling error on each scale for a new GPU app is from 2x to 20x, so the potential initial time estimate undershoot for such a GPU app is in the region of (1/20)^2 = ~0.0025 (~400 times too short, before any host scaling takes effect; it can be worse with multiple tasks and other factors, and has been observed up to 3000x too short during the recent onramp).

- As far as the high level system design is concerned, the 'intent' makes clear engineering sense, but the implementation is naive with respect to industrial control systems engineering practices. That's mostly to do with choices of defaults, bounds/safeties, and stability metrics.
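The undershoot arithmetic in the bullets above can be checked directly (figures are the ones quoted in this document):

```python
# Numeric check of the cascaded-scale undershoot: two cascaded scales each
# in error by a factor e give an initial time estimate scaled by (1/e)^2.

def estimate_undershoot(per_scale_error):
    """Fraction of the true runtime the initial estimate lands at when the
    same coarse error appears on both cascaded scales."""
    return (1.0 / per_scale_error) ** 2

print(round(1 / estimate_undershoot(2.25)))  # CPU SSE2 case: ~5x too short
print(round(1 / estimate_undershoot(20.0)))  # worst GPU case: ~400x too short
```

The squaring is the key point: a tolerable error on one scale becomes severe once the two scales compound.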

The basic patches, to be detailed stepwise, involve:

- Coarse scale corrections (feedback to those estimates, and removal of the improper host scale on credit for non-outliers)

- 'Proper' damping of those (noisy) averages

- Calibration of the GPU default efficiency (scale) to something better than the current 0.1 (it needs to be lower to avoid TIME_EXCEEDED)

- Further examination of MT and potential future heterogeneous arrangements is warranted, though we would need to see what level of stability is achieved with the prior cases

Issues requiring attention

The gross, public-facing issues which cause dissatisfaction on the message boards (and for some project administrators): a checklist of the things which should ideally be fixed and working before we go for a wider beta test on a more public project.

... I hope that this could capture/reduce at least some of the elements I have listed in 'Open Questions', and try to draw links to other dedicated topics where needed. IMO we will need to compartmentalise a fair bit, although for the moment things need to be pretty holistic. JG
  • Initial runtime estimates for new hosts: either too high (causing EDF) or too low (risk of exit 197).

    (continues throughout the life of the project. Primary cause seems to be poor initial speed estimates, especially for coprocessors)

    [Tech Sidenote: the initial estimate relies on a snapshot of scale, which is 'unstable' in formal senses, therefore a poor initial speed estimate is a symptom too. JG]
  • Initial runtime estimates for new app_versions

    (rarer, but more drastic when it goes wrong)
    [Tech Sidenote: As above, with unconverged anchoring pfc_scale; validation and work estimates overlap in time on reissue etc. JG]

  • Overall average credit granted
    [Tech Sidenote: A possible control point to examine. Median may be better JG]
  • Variability of credit from task to task

    (low credit gets noticed and complained about, high credit is silently accepted. The imbalance of reporting leads to an insidious inflationary pressure on project administrators)
    [Tech Sidenote: ROFL. Density of periodic orbits in the scaling parameters around stochastic inputs, topologically mixed in the temporal domain JG]

  • Automatic detection and isolation of runtime outliers, which can skew the averages - especially in the early stages of a new app_ver

    (applies to projects like LHC and SETI. Can be solved with custom validators, but that's undocumented in the BOINC Wiki)
    [Tech Sidenote: stable systems don't have 'runtime outliers', and resist such noisy inputs. They are damped by using feedback mechanisms, largely eliminating the need to
    detect all but the most extreme conditions beyond pre-established operational failsafe limits. JG]

An Android device does a Whetstone run, giving either a 'normal', 'neon' or 'vfp' result into host p_fpops (SIMD vectorisation aware).

The Android device [initially] requests tasks with a scheduler request, which includes host p_fpops.

The scheduler [not currently SIMD vectorisation aware] selects tasks and app versions, and sets the estimate and bound using peak_flops (which was initialised to host.p_fpops, or 1E9 if that was <= 0).

Android receives tasks & begins processing

Assuming the project's unscaled estimate was 'reasonable': if the app is NOT vfp or neon [i.e. not SIMD vectorised], the wrong Whetstone variant is applied on the initial request, effectively dividing the [effective] bound [duration] by the vectorisation efficiency of the specialised benchmark, in the region of 3x. This would seem to match the SIMAP time-exceeded scenario, where a bound of ~3x the estimate converges with the expected actual elapsed time.

Bandaid suggestions become:

- extend bound to more than 3 times

- vectorise/optimise the application(s)

- don't let people use their phones to make calls.

In addition: subsequent validation of host app version peak_flops is subject to the same averaging/quantisation instabilities as above, so convergence behaviour is non-deterministic and stochastically driven.

  1. Do not develop new features, such as Android's SIMD vectorisation aware whetstone bench, in the master branch. These are dependent on scheduler, client and possibly validator changes to be made simultaneously.
  2. Test these new features prior to integration, on a systemwide level.
  3. Avoid changes that break prior (even faulty) assumptions without impact analysis and some sort of migration plan.

Open questions

  • Relationship of estimates to project & application setup overheads
  • Relationship of estimates and credit to end-user experience and retention, client and application reliability (e.g. boincapi)
  • Realistic short, medium and long term goals.


Topic attachments

CreditNew_baseline_24thJun2014.png (68 K, 26 Jun 2014, JasonGroothuis): BOINC CreditNew high level information flow, baseline (unmodified) reference, 24th June 2014