Viewing Issue Advanced Details
ID 04709
Category Core
Severity Minor
Reproducibility Random
Date Submitted Feb 27, 2012, 22:24
Last Update Mar 24, 2015, 10:36
Tester Firewave
View Status Public
Platform MAME (Self-compiled)
Assigned To
Resolution Open
OS Windows Vista/7 (32-bit)
Status Acknowledged
Driver
Version 0.145u3
Fixed in Version
Build
Fixed in Git Commit
Github Pull Request #
Summary 04709: chdman: random deadlocks
Description Since the addition of the lzma and FLAC compression there have been several reports of chdman hanging.

Now I encountered it myself, and it hung right at the end of the operation using "createcd". I was using a Visual Studio 2010 64-bit debug compile. When I attached the debugger to it, it said that the application had encountered a deadlock. chdman hung in the progress() function in the fflush(stderr) call.

So far I have not been able to reproduce this, but it matches a report by another user with a GCC compile, who also had a hang very near the end of a conversion of arctthnd.

On a side note - all other threads were in WaitForSingleObject(thread->wakeevent).
Steps To Reproduce
Additional Information
Github Commit
Flags
Regression Version
Affected Sets / Systems
Attached Files
lockedup.jpg (167,998 bytes) Feb 29, 2012, 14:39 Uploaded by Fujix
Relationships
related to 05891 (Acknowledged) chdman: ThreadSanitizer: data race
Notes
45
No.08287
Haze
Senior Tester
Feb 27, 2012, 23:18
edited on: Feb 27, 2012, 23:18
I have a feeling it's more related to the CPP rewrite, possibly a thread safety issue somewhere?

I processed, converted, and verified over 900gb of CHD images with the FLAC code in old CHDMAN with no issues, but I've seen fairly frequent lockups in the new one, interestingly only ever when processing HDD images tho (and sometimes with compression *off*)

Making the host IO 'busy' seems to exaggerate the problem, for example if I'm doing other things with the PC at the same time it seems to be more frequent.
No.08288
Fujix
Administrator
Feb 28, 2012, 05:10
I have encountered this problem many many times.. I gave up converting CHDS for now.
And as I reported previously (and deleted), when it starts converting a CHD, chdman ignores "-np 1" and eats up all CPUs. My Core 2 Duo shows 100% usage even if I restrict the number of processors.
No.08289
Firewave
Senior Tester
Feb 28, 2012, 14:24
My hang had nothing to do with threads. They were also finished and waiting for the main loop to wake them. The loop was stuck in the fflush(stderr) call in the progress() function.

When I did the big conversion test I had not a single lockup whatsoever. The test ran over two days.

Fujix, could you use a chdman build with symbols, attach gdb to it when it hangs, and get a backtrace of all threads? I have never had this issue with a GCC compile, and with Visual Studio it only happened once.
No.08293
Fujix
Administrator
Feb 29, 2012, 02:37
chdman under gdb behaves very well for me: I converted 3 CHDs with no lockup so far.

And I found it creates three threads if I don't restrict the number of processors; it appears to work properly and CPU usage is acceptable under gdb.
No.08297
Fujix
Administrator
Feb 29, 2012, 14:40
It locked up when I was copying the carnevil chd; I uploaded a screenshot of how cmd.exe looks.
In fact chdman is still working and uses CPU.
No.08298
Haze
Senior Tester
Feb 29, 2012, 16:30
yeah, it's usually hung at one core of activity (50% here)
No.08393
Darkfalz
Tester
Mar 19, 2012, 15:04
Mine crashed (hung) once, after I paused and restarted.
No.08522
Iaspis
Tester
May 8, 2012, 08:09
is this fixed, after the u8 update?
No.08523
Tafoid
Administrator
May 8, 2012, 09:52
I haven't heard that it has been, so I assume it hasn't been. Someone with 64-bit CHDMAN needs to do some extensive converting to find out, I guess.
No.08524
Haze
Senior Tester
May 8, 2012, 15:44
Nothing has been checked in which is liable to fix it, and I was still seeing it very recently, as have others trying to use the frontend posted on MW.

I need to give it another run to make sure the compile tool switch hasn't changed the behavior, but if it has, I'd be very suspicious that the bug is just being hidden.
No.08876
Iaspis
Tester
Aug 21, 2012, 07:06
Sorry for asking again. I see there have been some fixes since 0.146u1, especially the fix for a memory leak in libflac/libflac/md5.c. Is this bug still active?
No.08877
Haze
Senior Tester
Aug 21, 2012, 19:30
nothing specific has been done to fix it.. why are you asking? are you still encountering it, or do you think it's fixed?
No.08880
Iaspis
Tester
Aug 26, 2012, 21:10
No, I haven't encountered it, however it's random so I couldn't be sure. I thought that those changes might have affected it eventually but I was wrong, apparently.
No.08881
Haze
Senior Tester
Aug 26, 2012, 21:41
edited on: Aug 26, 2012, 21:42
the os etc. seems to play a role..

my 32-bit Windows XP Athlon 3000+ (single core) box converts fine 100% of the time.

All 3 Win 7 64-bit C2D Intel machines I've tested it on get stuck say 80% of the time on the largest Beatmania CHD.

I can test a current version but I haven't seen a change go in specifically to fix it.

I did try creating a 100% reproduction test case, but a blank 40gb file seems to convert fine every time, even if it's the blank areas it seems to get stuck on in normal cases, it's an annoying bug.
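One possible middle ground between an entirely blank file and a real drive dump is a zero-filled image with a few small blocks of random data scattered through it. The following Python sketch only illustrates that idea: the function name `make_mostly_blank_image`, the island count and the island size are hypothetical choices, and nothing in this report shows that such an image triggers the hang.

```python
import random


def make_mostly_blank_image(path, size_bytes, islands=16, island_bytes=4096, seed=0):
    """Write a zero-filled file of size_bytes containing a few random-data blocks.

    Hypothetical helper for building a reproduction candidate; not part of chdman.
    Requires size_bytes to be larger than island_bytes + 1.
    """
    rng = random.Random(seed)  # fixed seed so the same image can be rebuilt
    with open(path, "wb") as f:
        # Writing a single zero byte at the end sets the file size;
        # everything before it that is never written reads back as zeros.
        f.seek(size_bytes - 1)
        f.write(b"\0")
        for _ in range(islands):
            # Pick an offset so the island stays inside the file.
            offset = rng.randrange(0, size_bytes - island_bytes)
            f.seek(offset)
            f.write(bytes(rng.getrandbits(8) for _ in range(island_bytes)))
    return path
```

The resulting file could then be used as the input of a normal HDD conversion. Islands may overlap, so the amount of non-zero data is an upper bound rather than an exact figure.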
No.08979
Ki3r
Tester
Sep 27, 2012, 19:06
I was struggling with this bug for days, with chdman from SVN rev 17996 and 18126. One in every 2 or 3 CHDs was causing chdman to fall asleep while converting from v4 to v5.

Then I tried turning off Hyper-threading in BIOS. The bug is gone! I could convert every CHD that I couldn't before.

The computer is an old P4 @ 3 GHz running Windows XP SP3. I hope this info gives the developers some clues to fix this PITA bug.
No.08982
NekoEd
Senior Tester
Sep 27, 2012, 22:45
If it works just fine after disabling hyperthreading, then it would appear that this bug is only present on multiprocessor systems (hyperthreading gives you what is essentially two CPUs, after all.)
No.08994
Ki3r
Tester
Oct 1, 2012, 09:45
From what I've read most people managed to convert the CHDs without problems. Assuming that most of today's CPUs are multi-core and just a few have HT I suspect that the problem lies only in HT.

Googling for "hyper threading deadlock" also returns many results.
No.08995
NekoEd
Senior Tester
Oct 1, 2012, 16:22
Haze reported hangups on three machines with Core 2 Duo processors. These machines do not have hyperthreading, they're pure multicore. So the issue is with any machine on which there is more than one core, whether the extra cores are logical (hyperthreading, mostly later Pentium 4s), physical (Core 2 Duo/Quad) or a combination of both (multiple physical cores each with hyperthreading, most mobile Core ix processors).

Does chdman decide to multithread based on the number of processors? If there's a way to force it to use multithreading under a single core processor without multithreading, it should be checked whether that causes it to deadlock or not. Either way, it looks like someone is going to have to trace each thread of operation, watch for it to stall, and then find out why.
No.08998
Ki3r
Tester
Oct 3, 2012, 09:57
You're right, just checked C2D specs. Don't have a clue then.

I noticed the problem occurred in highly compressible CHDs. I remember some of those with 20GB of logical size which end with just a few MB after compression but can't pinpoint a name.
No.09000
Haze
Senior Tester
Oct 4, 2012, 18:36
yes, that's pretty much what we've noticed.

Only seems to happen with HDDs, almost always on areas which are mostly blank space
Doesn't happen on a specifically made test case HDD which is always blank space tho.

Definitely threading related, can't recreate the lockup on a single core machine, at all. The only traces we have point to some kind of flush IO deadlock.
No.09090
jumper
Tester
Nov 10, 2012, 01:53
Joined to chime in...hope this helps at least a little...

I've compiled 147u2 chdman on both 32bit Windows XP(dual core) (using the tools from mamedev), and Ubuntu 12.04 LTS (64bit)(quad core).
Both exhibit the same deadlocks with the same HDDs. My debugging skills are lacking, so, I cannot point to anything specific.

Eventually, after endless retries, the HDDs seem to convert, but, there seems to be no rhyme or reason as why chdman deadlocks...it seems to always be random, but, problem files are problem files.(Files that exhibit deadlocks repeatedly give deadlocks until EVENTUALLY converting)

On Windows, using a third party tool to limit the processor affinity to one processor seems to help a little, but, deadlocks still occur.
I'm unsure yet if anything helps on Ubuntu. Since this is really meant for Windows debugging, I'll shy away from any more on Ubuntu, but, I believe I read somewhere that there was a thought that a Windows compiling tool may have introduced the problem. Since the problem exists on Ubuntu also, the problem is quite possibly in the source code.
No.09091
NekoEd
Senior Tester
Nov 12, 2012, 13:55
I think the best course of action at this point is to pepper the code with debug statements anywhere a deadlock might occur, report every lock obtained, print everything that's going on, etc. Just make the whole thing very verbose about the underlying details then let it run; eventually we may find where it's tripping over itself and hanging up....
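As an illustration of the "report every lock obtained" idea, here is a minimal Python sketch of a lock wrapper that logs when a thread starts waiting, acquires, and releases. It is not chdman code (chdman is C++ and uses its own OSD primitives); the class name `TracedLock` and the log format are assumptions made for this example.

```python
import sys
import threading
import time


class TracedLock:
    """Wrap a threading.Lock and report every wait/acquire/release.

    Illustration only; the name and log format are hypothetical.
    """

    def __init__(self, name, log=None):
        self._name = name
        self._lock = threading.Lock()
        # The default sink writes one line per event and flushes at once,
        # so the last line before a hang is not left sitting in a buffer.
        self._log = log or self._stderr_log

    @staticmethod
    def _stderr_log(line):
        sys.stderr.write(line + "\n")
        sys.stderr.flush()

    def _emit(self, event):
        thread = threading.current_thread().name
        self._log(f"{time.monotonic():.6f} {thread} {event} {self._name}")

    def __enter__(self):
        self._emit("waiting")
        self._lock.acquire()
        self._emit("acquired")
        return self

    def __exit__(self, exc_type, exc, tb):
        # Logged while the lock is still held, then released.
        self._emit("released")
        self._lock.release()
        return False
```

With `with TracedLock("queue"):` around each critical section, a hang shows up as a "waiting" line with no matching "acquired". Keep in mind that the extra output changes timing, so a timing-dependent hang may become rarer once the tracing is in place.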
No.09092
M.A.S.H.
Senior Tester
Nov 12, 2012, 18:48
I used Portable-VirtualBox with XP SP2 and have no problems with chdman hanging or anything else.

Portable-VirtualBox: http://www.vbox.me/ (...This program loads all necessary files and installs them for you!)
Operating System: Microsoft Windows
Version Windows XP
Memory: 512 MB
Shared Clipboard: Bidirectional
Extended Features: None enabled
Video Memory: 32 MB
Shared Folders: D:\Temp + Full Access for the MAME/MESS HDDs

NOTE:
- Copy chdman and HDDs into D:\Temp, start Portable-VirtualBox/XP and then command prompt.
- Type d: ...for the Shared Folder
- Type chdman createhd -i j.raw -o new.chd READY! ...or do other operation with your HDDs images :)
No.09097
NekoEd
Senior Tester
Nov 19, 2012, 14:00
M.A.S.H.: We've already determined that the issue doesn't occur on uniprocessor systems, only multiprocessor.
No.09145
haynor666
Tester
Dec 20, 2012, 20:58
Self-compiled 147u4 x64 works on Windows 7 Professional x64, Core i5 (at work). So far I have managed to convert half of my CHDs from v4 to v5.

Also I've recompressed carnevil on windows 7 home premium Core2Duo 3,20 GHz (at home).
No.09237
Iaspis
Tester
Jan 15, 2013, 16:42
Same as haynor, no problems here as well.
No.09249
NekoEd
Senior Tester
Jan 19, 2013, 01:55
This was a pretty pervasive, nasty bug; I'd like a few more reports that it's fixed before we close it out.
No.09297
Haze
Senior Tester
Jan 29, 2013, 08:13
edited on: Jan 29, 2013, 08:16
nope, not fixed.

just had it with a CD image for the first time too, and it again seems to be related to putting any extra strain on the I/O while CHDMAN works, in this case I'd left it converting PSX CDs all night, and it hung at the very point I went back to the PC in the morning and started trying to use it for something else at the same time causing heavy disk access while it restored everything.

This really seems to be something to do with the i/o threading or i/o in general.

It's worth noting that -c none was being used at the time, so COMPRESSION WAS TURNED OFF

this again causes CHDMAN to write data more quickly, and is probably why I managed to trigger it on a CD image rather than the usual 0 fill areas of HDDs.
No.09299
NekoEd
Senior Tester
Feb 1, 2013, 03:43
Got it, still exists. Wow, this is a tough bug to squash. Keeping open since it's still there.
No.09300
Firewave
Senior Tester
Feb 1, 2013, 14:49
Could I please have some very detailed information on the problem?

What compiler did you use?
What build flags did you use?
What parameters did you use chdman with?
How many CPUs has your machine?
What was the exact input file it happened with?
No.09301
Haze
Senior Tester
Feb 1, 2013, 18:52
edited on: Feb 1, 2013, 18:54
1) the official Windows ones, both old and new
2) the default ones, both 32-bit and 64-bit
3) default createhdd options, copy operations on HDD images, with or without compression, you name it, nothing special.
4) all machines I've seen it on are
5) it's random, but most typically it occurs on large HDD images with a significant amount of blank space. I've had it *once* on a cd using -c none (out of around 6TB of CD conversions) - I couldn't reproduce it on an entirely blank image tho bizarrely, but maybe that's just a bit too 'clean' to trigger it.

it seems most connected to doing things that might cause heavy i/o or slow down one of the threads (which I guess heavy i/o would)

it does NOT happen on a single core Win XP machine, not even once, as others have pointed out, it's a multi-core issue.

i've seen it happen with -c none when creating CHDs for MESS (as well as the one time it's done it on a CD) so it isn't tied to any specific compression library.

There are 5 different machines I've seen this on, it isn't a heating problem or RAM problem or HDD problem, the most flaky machine of the lot (the 32-bit XP box) is the one where it works!
No.09312
Firewave
Senior Tester
Feb 8, 2013, 14:10
5) Could you give me exact names of images it happened with that can be found somewhere on the web?
No.09542
Iaspis
Tester
May 24, 2013, 20:29
I can also confirm this is not happening on older single core & XP machines (tested a few, already)
No.09614
Firewave
Senior Tester
Jun 21, 2013, 17:29
r23841 fixes a hang when chd_file_compressor::async_read() fails and also improves the error reporting a bit. Maybe this helps with this issue.
No.09621
Haze
Senior Tester
Jun 26, 2013, 19:33
edited on: Jun 26, 2013, 19:34
still happens with latest code as of now.

was trying to convert a taito type x hard drive (WDC WD400EB-11CPF0.img from Taisen Hot Gimmick Mix Party) and it hung at ~ 29.2 % on one machine and ~ 40 % on another.

CPU usage was 50% (rather than 100%) indicating one core has probably stalled.

no error messages given.
No.09622
Haze
Senior Tester
Jun 26, 2013, 22:03
edited on: Jun 28, 2013, 19:35
ok, so far I've attempted this conversion 15 times across 5 different PCs and it's frozen at random points each and every time.

I'm going to have to dig out the single core XP box at this rate!

*edit* add another 5 attempts to that count..

all in all it took over 40 attempts across 5 pcs to actually get 1 conversion that worked without stalling. none of the machines have heat issues or drive problems, they can run stress test software night and day. If other people aren't still seeing this I'm amazed.
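Counting attempts like this by hand is tedious, so a small harness that reruns a command with a wall-clock timeout can put a number on the hang rate. The Python sketch below assumes that "did not exit within the timeout" is an acceptable definition of a hang; the function name `count_hangs` and the choice of timeout are hypothetical.

```python
import subprocess


def count_hangs(cmd, attempts, timeout_s):
    """Run cmd repeatedly and return (ok, failed, hung) counts.

    ok:     the process exited with status 0
    failed: the process exited with a non-zero status
    hung:   the process was still running after timeout_s seconds;
            subprocess.run kills it before raising TimeoutExpired
    """
    ok = failed = hung = 0
    for _ in range(attempts):
        try:
            proc = subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                check=False,
            )
        except subprocess.TimeoutExpired:
            hung += 1
            continue
        if proc.returncode == 0:
            ok += 1
        else:
            failed += 1
    return ok, failed, hung
```

A call might look like `count_hangs(["chdman", "createhd", "-i", "input.img", "-o", "output.chd"], attempts=20, timeout_s=3600)`, where the file names and the one-hour limit are placeholders. The output file from a previous attempt probably has to be removed between runs, which this sketch does not do.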
No.09624
Firewave
Senior Tester
Jun 28, 2013, 15:43
Worked for me on the first try. I did three attempts in total, each with a build compiled with a different compiler, and they all came through. They were all unoptimized DEBUG=1 builds though.
No.09625
Haze
Senior Tester
Jun 28, 2013, 19:39
right, it seems to be less frequent, if at all happening with debug compiles. (converted 10/10 times with no problem here)

with a regular build it still happens even if I turn *all* compression off so it isn't one of the compression routines either.

we know it doesn't occur on single core machines either, however it DOES still occur on a dual core setup if you use the task manager to set 'affinity' to one core after CHDMAN has started; that makes no difference other than making it slower.
No.09626
NekoEd
Senior Tester
Jun 29, 2013, 00:02
I'd say sprinkle it with debug printfs all over the place. Anywhere you think it could be hanging up, any time it takes a mutex or some other resource, make it print the grab and the release. Eventually we should be able to track down to around where in the code it's losing track of itself and crapping out.... hopefully.
No.09642
Firewave
Senior Tester
Jul 8, 2013, 09:16
I was able to reproduce this on an older machine with fewer CPUs: it hangs waiting for compression results, but all the threads are waiting to be woken up. It seems more prone to happen with fewer threads, and it looks like the relation between livethreads and the active flag of the thread might be the issue. It should be reproducible on modern machines with -np 2.
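A state where the main loop waits for results while every worker waits to be woken is what a lost wakeup looks like: a worker decides to sleep based on a stale view of the queue, and the signal meant for it has already come and gone. The Python sketch below shows the usual defence in general terms, with the "is there work?" check and the sleep done under the same lock that guards the queue. It is not the MAME work queue code; `WorkQueue` and everything in it are hypothetical names for illustration.

```python
import threading
from collections import deque


class WorkQueue:
    """Tiny worker pool where checking for work and sleeping share one lock."""

    def __init__(self, workers):
        self._cond = threading.Condition()
        self._items = deque()
        self._closed = False
        self.results = []
        self._threads = [threading.Thread(target=self._run) for _ in range(workers)]
        for t in self._threads:
            t.start()

    def submit(self, item):
        with self._cond:
            self._items.append(item)
            # Notify while holding the lock: a worker is either already
            # waiting (and gets woken) or has not checked the queue yet
            # (and will see the item), so the wakeup cannot be lost.
            self._cond.notify()

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()
        for t in self._threads:
            t.join()

    def _run(self):
        while True:
            with self._cond:
                # Re-check the predicate in a loop, under the lock.
                while not self._items and not self._closed:
                    self._cond.wait()
                if not self._items:
                    return  # closed and fully drained
                item = self._items.popleft()
            result = item * item  # stand-in for compressing one hunk
            with self._cond:
                self.results.append(result)
```

If the emptiness check and the wait were separate steps without a shared lock, an item could be queued and signalled in between, leaving the worker asleep with work pending. Fewer threads make that more visible because there is no other worker left to pick the item up.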
No.09644
Haze
Senior Tester
Jul 9, 2013, 13:58
it happens even with compression turned off tho? (-c none or whatever the current option for that is)
No.10075
Gherry
Tester
Dec 17, 2013, 07:50
Hi,
I can confirm that chdman still randomly hangs.

First of all my hardware specs:
Core 2 Duo E8200, 6GB RAM, OS Win7 64bit

On my computer the -np option doesn't work: CPU usage is always pegged at 100%, even with -np 1.


That said, I still had to convert old .chd from v4 to v5 and since I finally had some spare time I started doing it
(I used chdman from .149 and .151 precompiled packages from mamedev.org) same behavior with both versions.

Most of the CHDs converted without any problems, but a few hung.
Basically the prompt hangs and cpu utilization goes from 100% (both cores) to only 100% on one core.

Since I read that it was (probably) a problem with multi-core I tried running chdman in a virtual machine on the same pc (virtualbox guest: win7) with only one core and chdman finished the operation without any problem.


Behavior is really weird, the same chd hangs one time at 30% and the next time at 97%.
These 4 chds gave me more trouble than most (I couldn't convert them even in multiple tries)
jnero - jn010108.chd
bmiidx - c44jaa03.chd
sf2049se - sf2049se.chd
sf2049te - sf2049te.chd

Since all of the 4 converted on the first go on the VM with one core, I'd say it's 90% related with multi-core code.

Sorry I cannot be more specific; if needed I can run some tests, just let me know.
No.10076
NekoEd
Senior Tester
Dec 17, 2013, 18:58
Thank you, Gherry. It's always good to have updates on outstanding bugs, and your information may prove to be valuable in helping to track down the issue.
No.10083
Haze
Senior Tester
Dec 20, 2013, 08:25
smf added a hack / some code so that if you set an environment variable it forces MAME to only use a single thread for i/o, which prevents the hangs on dual core systems. It doesn't really solve the issue because it's an undocumented workaround, but it points in the same direction as everything else: the MAME I/O code isn't threadsafe.
No.11340
Firewave
Senior Tester
Jan 2, 2015, 13:58
Several data race "fixes" based on ThreadSanitizer warnings have been applied to the sdlwork.c/winwork.c code for 0.158. There are still some data races in the CHD code though.

We should add a separate issue though if there are still issues with the -np parameter for chdman instead of discussing it in here.