MLC@home [TWIM Notes] Nov 17 2020

News from the various BOINC projects, sniffed out for you by our lovable wee mutt.

Post by Newshound »

This Week in MLC@Home Notes for Nov 17 2020
A weekly summary of news and notes for MLC@Home

Summary
GPU week(s), part 4: The saga continues.

This coming week we're starting to pivot back to writing and analysis of existing data. There's still a lot to discuss related to GPUs, but we also need to take time to write a bit; and frankly, the issues with the Linux/CUDA client have us stumped for the moment, so taking a week or two to focus on some of the science that's been piling up should help clear our heads so we can come back to it with a fresh perspective.
This past week we enabled the release track ("mlds-gpu" application) of the GPU client for Windows and Linux, and the good news is that at least the Windows client is working fairly well*, and it's chewing through WUs wonderfully, complementing the CPU crunching that continues unabated. Having a separate app allows us to capture the wildly different RAM requirements for CUDA WUs without penalizing CPU crunchers. The GPU queue also allows us to send out longer Dataset 1+2 WUs, which has led to a nice boost in the number of complete Parity* WUs, meaning we finally have over 1000 examples of each network type for Dataset 1 and are on our way to the same in Dataset 2. It would be nice to wrap those two up sooner rather than later. We'll continue to release WUs in parallel to both the CPU and GPU queues to keep both sets of users fed.
Not all is well in GPU land, though. We released the Linux/CUDA client, but after several days not a single WU had completed without error, so we've pulled it back from production and will try again. This is incredibly frustrating, as it works on our test machine, but on volunteer machines it fails with CUDA errors indicating userspace/driver incompatibilities. Clearly we're not bundling it up correctly. In addition, there have been some strange results around CPU utilization with the Windows CUDA client: users have reported better performance and utilization if they assign two CPUs to the WU instead of one, even though one core remains idle the entire time. There's some speculation in the linked thread, but we should track that down soon as well.
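
For anyone who wants to poke at this with us, here's a minimal diagnostic sketch using standard PyTorch calls (this is not our client code, just a generic check): it prints the CUDA runtime the bundled PyTorch was built against and whether the host can actually use it, which is exactly the kind of userspace/driver mismatch we appear to be hitting.

[code]
# Generic PyTorch/CUDA sanity check (illustrative only; not the MLC@Home client).
import torch

print("PyTorch version:", torch.__version__)
print("Built against CUDA runtime:", torch.version.cuda)   # None for CPU-only builds
print("cuDNN version:", torch.backends.cudnn.version())

if torch.cuda.is_available():
    # The host's NVIDIA driver must be at least as new as this runtime requires.
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
else:
    # Typical symptom of a userspace/driver mismatch: is_available() returns False,
    # or CUDA calls raise errors complaining about an insufficient driver version.
    print("CUDA is not usable on this host")
[/code]

Running something like this on a failing volunteer host versus our test machine should make it obvious whether the bundled runtime is newer than what the host driver supports.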

All that's to say we're really excited that GPU support is at least partially live and giving us a nice performance boost, but it's also been more of a drain on resources than anticipated, and we need to turn focus back a bit before tackling Linux/CUDA again. If any experienced Linux/CUDA devs would like to offer help deploying our pytorch/cuda app combination, we'd love for you to contact us and help us troubleshoot.

More specific news below; some of it is even non-GPU related!

News:
  • You can follow GPU client progress on several forum threads like this one: https://www.mlcathome.org/mlcathome/forum_thread.php?id=111
  • We fired up our ARM-based test systems that had fallen off the network to make sure the current ARM app continues to run. We were able to verify that all three of our arm32/arm64 test systems running Debian 10 are crunching fine with the latest client; these include an RPi3 (32-bit), an RPi4 (64-bit), and a CuBox-i4 (32-bit).
  • The Dataset 1+2 WUs we release in the GPU queue have a larger epoch limit than those in the CPU queue, and a proportional increase in credit awarded (see the short, made-up worked example after this list). We may make a similar change in the CPU queue, but it would mean much longer runtimes, so for now we're seeing how it goes in the GPU queue and will make a determination in the future.
  • We spent some time this week researching how to drop the AppImage (FUSE) requirement on Linux. It's definitely possible, but we're loath to roll out that change, even to the test queue; overall, AppImage hasn't caused too many issues and we don't want to do anything unnecessary at the moment. We thought it might help with the Linux/CUDA issues, but we no longer think that's true.
  • Datasets 1, 2, and 3 continue crunching away. GREAT progress so far!
  • We know some of the web pages are out of date, and we hope to address that this week. Updates queued include a complete update/redo of the MLDS Dataset page and an update to the "system requirements" section of the main page to better list minimum software requirements.
  • If we divide each of the three datasets into three releases based on the number of examples in each release (100, 1000, 10000), then we're ready to package up Dataset 1 (100, 1000), Dataset 2 (100), and Dataset 3 (100).
  • If you aren't aware of the BOINC Network Podcast, the MLC@Home devs lurk there and sometimes contribute. Be sure to check it out if you're interested: https://www.boinc.network/.
  • We hope to get back to preparing Dataset 4, and to writing a tech report/paper to go along with the dataset releases, this week.
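
As a quick illustration of the proportional-credit point in the list above, here's a made-up worked example (the epoch limits and credit values are invented for illustration, not our actual settings):

[code]
# Made-up numbers, purely to illustrate "larger epoch limit => proportionally more credit".
cpu_epochs, cpu_credit = 100, 200.0     # hypothetical CPU-queue epoch limit and credit per WU
gpu_epochs = 500                        # hypothetical larger GPU-queue epoch limit
gpu_credit = cpu_credit * (gpu_epochs / cpu_epochs)
print(gpu_credit)                       # 1000.0 -- credit scales with the epoch limit
[/code]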


Project status snapshot:
(note these numbers are approximations)

Tasks ready to send: 48470
Tasks in progress: 24464
Users with credit: 1190
Users registered in past 24 hours: 47
Hosts with recent credit: 2129
Hosts registered in past 24 hours: 25
Current GigaFLOPS: 33798.72

Dataset 1 and 2 progress:
SingleDirectMachine 10002/10004
EightBitMachine 10001/10006
SingleInvertMachine 10001/10003
SimpleXORMachine 10000/10002
ParityMachine 1005/10005
ParityModified 336/10005
EightBitModified 6803/10006
SimpleXORModified 10005/10005
SingleDirectModified 10004/10004
SingleInvertModified 10002/10002
Dataset 3 progress:
Overall (so far): 64502/84376
Milestone 1, 100x100: 10000/10000
Milestone 2, 100x1000: 64502/100000
Milestone 3, 100x10000: 64502/1000000

Last week's TWIM Notes: Nov 9 2020

Thanks again to all our volunteers!

-- The MLC@Home Admins
Homepage: https://www.mlcathome.org/
Twitter: @MLCHome2

Source: https://www.mlcathome.org/mlcathome/for ... php?id=124
TSBT's update on all the news from the BOINC projects.
