Infinite DistRTgen WU
- Janos (retired)
- Still a Newbie
- Posts: 1919
- Joined: Thu Feb 23, 2012 8:58 am
- Location: Aberdeenshire, Scotland
#1 Infinite DistRTgen WU
Anyone else having problems with infinitely running DistRTgen WU's?
On ALL of my machines crunching DistRTgen I get a WU once or twice a DAY which just runs and runs. 100% usage for the duration and I have caught tasks running for over 8 hours (I should never sleep!).
I've tried all the normal stuff like a big hammer and a strong telling off but no resolution as yet. Project reset, reinstall of BOINC, under clocking, default clocking. I've even tried a complete OS reinstall.
All the machines are working fine otherwise and it is just DistRTgen WU's which give me bother. There is nothing on the DistRTgen forums which tends to make me think it is something I am doing wrong.
Even a healthy dose of single malt has not solved things. Any ideas? Think I should try some more 18 year old Macallan?
#2 Re: Infinite DistRTgen WU
Janos wrote: Anyone else having problems with infinitely running DistRTgen WU's?
It happens from time to time. Just check on it every now and then and delete the work unit if it is overrunning excessively. Having said that, it has not happened to me for some time, but I have done nothing in particular to try to resolve it.
Just cos I have a lot of credits does not mean I know what I am doing.
- Janos (retired)
#3
I am going to have to find a fix as it is happening way too often. Like you, Gary, I have had the odd one in the past, but the last few days have been crazy. Each machine is currently getting two or three a day. I just killed one WU which had been going for 2h 21m.
#4
Janos wrote: I am going to have to find a fix as it is happening way too often. Like you, Gary, I have had the odd one in the past, but the last few days have been crazy. Each machine is currently getting two or three a day. I just killed one WU which had been going for 2h 21m.
My ATI 7970 machine is running BOINC 7.0.27 (x64).
I am running the latest version of Windows 7, with automatic updates on.
Driver information (whatever this means):
Driver Packaging Version 8.961-120405a-137813C-ATI
Provider Advanced Micro Devices, Inc.
2D Driver Version 8.01.01.1243
2D Driver File Path /REGISTRY/MACHINE/SYSTEM/ControlSet001/Control/CLASS/{4D36E968-E325-11CE-BFC1-08002BE10318}/0002
Direct3D Version 7.14.10.0903
OpenGL Version 6.14.10.11631
AMD VISION Engine Control Center Version 2012.0405.2205.37728
AMD Audio Driver Version 7.12.0.7706
Best of luck. As I said, I usually just wait for these problems to resolve themselves.
#5
I found this http://www.freerainbowtables.com/phpBB3 ... fd2e117741
Don't know if it helps, but this is from the last post:
Mikey: I'm running stock settings. But 2 of the 4 cards involved came from the factory overclocked. I've got some additional info.
1. The dual GPU machine does get hanging WUs on both Device 0 and 1, it's just that Device 0 hangs are more common.
2. It appears that there is a driver reset each time the hang up appears. (This is probably the causative factor)
3. All cards involved come from Zotac.
Based on the above information, I don't know if the manufacturer is to blame or not. My other GPU machines (an older Pentium DELL with a GTX560 and a laptop with a GTX660M) don't seem to ever have a problem. I do think that the problem most likely occurs when there is a switch between DistrRTgen and another project (Collatz or PrimeGrid). What I do know is that all GTX560 / Ti are running the same driver version and Windows 7.
#6
see here
http://www.setiusa.us/showthread.php?41 ... Fail/page2
scroll down to post #15
here http://www.xtremesystems.org/forums/sho ... og+reg+fix
scroll down to post #13
and here http://msdn.microsoft.com/en-us/library ... 85%29.aspx
Has this just started since you got the new 79xx's?
- Janos (retired)
- Still a Newbie
- Posts: 1919
- Joined: Thu Feb 23, 2012 8:58 am
- Location: Aberdeenshire, Scotland
#7
Ah, nice work! It certainly seems logical that a GPU thread could cause a timeout.
I will give the registry settings a whirl and report back.
Cheers
- Janos (retired)
#8
Just one hour after installing the new registry settings (with reboot) and I have an infinite WU.
- Janos (retired)
#9
And another. This is nuts!
- Janos (retired)
#10
Happened now 4 times but all on the same machine. It looks like the other two are fixed (famous last words).
I am going to reinstall drivers, Windows updates, the registry hack, etc. on the "failing machine" and see if that resolves things.
#11
Are you running multiple units on the card or a single instance? Might it be that that machine is memory bound? If you're not already doing so, try running 1 unit per GPU with a whole CPU core spare for each card.
I have a similar problem with POEM and my 3 GPUs. Units run fine on 1 GPU but simply stall out and keep resetting on the other two. I have to exclude POEM on these two to run it on 1 GPU.
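For anyone wanting to try the same exclusion: as far as I know, the BOINC 7.0.x clients read `<exclude_gpu>` entries from the `<options>` section of cc_config.xml. Below is a rough Python sketch that just builds such a file as text. The project URL and the device numbers are placeholders (assumptions, not taken from this thread); the URL must match the one the client shows for the project.

```python
import xml.etree.ElementTree as ET


def build_cc_config(project_url, excluded_devices):
    """Return cc_config.xml text that excludes the given GPU device
    numbers for one project. URL and device numbers are caller-supplied."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for device in excluded_devices:
        entry = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(entry, "url").text = project_url
        ET.SubElement(entry, "device_num").text = str(device)
    # ET.indent only exists on newer Python versions; skip pretty-printing otherwise.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical values: keep the project off devices 1 and 2, leaving device 0.
print(build_cc_config("http://example.invalid/poem/", [1, 2]))
```

The output would go into cc_config.xml in the BOINC data directory, and the client needs to re-read its config files (or be restarted) before it takes effect. Check the device numbers against what the client reports at startup rather than trusting the ones in this sketch.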
- Janos (retired)
#13
I am running a default config with a single WU running on a single 7970. There are no other tasks running (during this testing phase).
The PC is using less than 25% memory.
Touch wood, the other two machines are still working well.
- Janos (retired)
#14
alezevo1 wrote: What version of Boinc are you running?
Windows 7 - 7.0.28 64bit
#15
I'm running 7.0.47. So far it's running very well. You could try upgrading. If it doesn't work you can always just reinstall 7.0.28.
I was running 7.0.45 and that eventually caused the whole of Boinc to go into a reset loop both on the CPU's and the GPU's when it tried to start a second Poem on the same GPU.
- Janos (retired)
#16
Yeah, I might try that tomorrow. I was also thinking about swapping the 7970 in the machine which keeps failing with a card on another machine which seems to now be working - to test out any hardware issues.
#17
Two other thoughts...
Check that the power management options haven't reverted to sleeping the monitor or turning off the GPU. Or could your new KVM switch be causing the card to sense no monitor load and sleep the GPU?
And if I remember right, do the ATI cards not have a turbo mode or similar? I use Afterburner to check the core clocks etc. on my cards, as my 610 has a tendency (why, I don't know) to overclock itself from 810 MHz. At 870 MHz it runs SETI fine, but at 910 the SETI units stall on the GPU and sit with a scheduler wait message. The error messages are better and the config setup for GPUs is far better on 7.0.47. Do your cards not have the ability to increase the core clock speed as the demand on the GPU goes up?
- Janos (retired)
#18
Power checked
Sleep checked
Same clock settings as the other two cards
Same driver versions
Same windows setup (well almost this one is Ultimate and the other two are Pro)
I am going to test out the hardware tomorrow. Maybe flash the motherboard bios...
#19
Knew I'd read something about this when looking at ATI cards (all mine are NVIDIA):
http://forums.anandtech.com/archive/ind ... 44769.html
Apparently the ZeroCore tech can sometimes idle your cards by itself. Think you need to set the cards into high performance mode or something. Not sure if this is of any help or not.
- Janos (retired)
#20
I *think* I have fixed it.
After much use of big hammer and single malt, I swapped the power feed to the card and it now seems to be working perfectly.
Credits be incoming :)
Thanks guys for the help with the debug process.
The registry settings were a superb tip and definitely fixed the mission of Microsoft to protect the user from his own stupidity, at all cost, because Windows knows better.
- Janos (retired)
#22
Janos wrote: I *think* I have fixed it.
Hmm, not quite. I just had to kill another task on the same machine.
- Janos (retired)
#23
Different machine this time. The rate of infinite WU's has dropped dramatically but still happening all too often.
I will try plan D and see what happens next...
- Janos (retired)
#24
Out of my three crunchers, two had infinite units this morning: one of 6:03:01 and the other 5:47:41 of utter wasted electric. Not happy.
- Janos (retired)
#25
Came home to find more locked up WU's. That is more than 27 hours of GPU time lost today. Going to faff with more settings this evening but if it continues I will use a way bigger hammer.
- Janos (retired)
#26
I sat down tonight and thought I can't find the cause, how do I cure the symptom?
So, I have written some code that checks how long the active WU has been running. If it is more than 20% over the average of the last 5 WU completion times, it auto-suspends the active WU. I can then decide, at my leisure, whether the suspended WU should be restarted or aborted.
Worst case is half a dozen suspended work units per day rather than 2 or 3 crunchers locked up achieving not very much for hours at a time.
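The actual code was not posted, so what follows is only a minimal Python sketch of the rule as described: suspend a WU once its run time exceeds the average of the last 5 completion times by more than 20%. It assumes the suspend is done with `boinccmd --task <project URL> <task name> suspend`. The function names, the task name and the URL are made up for illustration, and how the elapsed and completion times are collected is left out.

```python
import subprocess
from statistics import mean


def is_overrunning(elapsed_s, completed_s, window=5, tolerance=0.20):
    """True if elapsed_s is more than `tolerance` (0.20 = 20%) above the
    mean of the last `window` completion times, all in seconds."""
    recent = completed_s[-window:]
    if len(recent) < window:
        return False  # not enough history yet to judge
    return elapsed_s > mean(recent) * (1.0 + tolerance)


def suspend_command(project_url, task_name):
    """Build the boinccmd call that suspends a single task."""
    return ["boinccmd", "--task", project_url, task_name, "suspend"]


def check_task(project_url, task_name, elapsed_s, completed_s,
               run=subprocess.run):
    """Suspend the task if it is overrunning; return True if it was suspended."""
    if is_overrunning(elapsed_s, completed_s):
        run(suspend_command(project_url, task_name), check=True)
        return True
    return False
```

A scheduled job could call `check_task` every few minutes for the running GPU task. Suspending rather than aborting keeps the decision manual, at the cost of the suspended WU sitting in the queue until someone looks at it.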