After a crash, X68000 XVI will no longer boot

Started by MrKsoft, August 07, 2017, 12:55:50 AM


MrKsoft

Hi, I am seeking advice about what I can do about my XVI...

Before we start, I want to note that I have had this machine for about 3 months and it has been fully functional during that time (used maybe 1-2 times a week).  Got it off Yahoo Japan, and the seller stated that the system and PSU were fully recapped and the battery replaced with a CR2032 holder.  As far as I can tell from examining the machine, that is all true.

Yesterday I was just running MMDSP when the X68000 randomly hung.  After resetting it, I receive the message shown on the SRAM Issues wiki page. ("Error. Please reset the computer.")  So of course I go and check the battery... yep, it's on the borderline between alive and dead, with a low voltage.  It's conceivable it just happened to die right now.  So I swapped in a new battery and reset the SRAM by holding CLR while booting.

However... the error persists.  And unlike the page states, I cannot bypass it in any way.  Resetting with a disk in the drive either brings the same message instantaneously, or the system will access the floppy for a second or two and then stop on a black screen.  If I boot without a floppy, I can get the normal "please insert a disk" message.  Inserting a disk at this point has the same result: either a black screen or the error message.  Last night I got a few disks to do something different-- they asked for human.sys instead, basically saying the disk wasn't bootable... but they were bootable disks so I dunno what the deal is there.  After resetting a few times it went back to the usual error message and I haven't seen it again.

Repeated SRAM clears also do not help-- in fact, sometimes the error message is overlaid on the SRAM clear dialog and the system locks up, keeping me from doing a reset!  That sounds really bad to me...

Also of note is that 16mhz mode is now completely broken.  Booting at 16mhz causes the error to be printed on the left side of the screen, split across several lines.  Same with the SRAM clear dialog.  I've tried a lot of things now, and I am completely stuck at this point, and absolutely livid that so much money is down the drain if I can't fix this.  :(  What the heck happened!?

Here is everything I have tried:
-Replaced SRAM battery
-Cleared SRAM multiple times
-Attempted to boot the C Compiler disk mentioned on the wiki page and reset SRAM that way (obviously, can't even get the disk to boot...)
-Removed SRAM battery, unplugged machine, left overnight in case the SRAM was badly corrupted
-Re-seated the expansion card riser based on some advice I received
-Re-seated all cables
-Swapped floppy drives

Does anyone have some insight, or do I now have a giant doorstop?

Pinwizkid

Has the rest of the unit been capped? There are a few caps on the motherboard and a whole bunch on the interconnect board in the base. I'm not sure if your issue is directly related, but it's a good idea to do them as those caps are hitting the age where they will die and possibly leak. I just capped mine recently, it's a bit time consuming, but they are all gonna need to get done sooner or later.

Good luck and don't give up!

MrKsoft

Crap.  Good thing I saved the translated version of the auction text... turns out my memory was wrong, JUST the power supply was recapped.  Bottom board was cleaned and the battery holder added but not recapped.  So that leaves everything else needing some attention then.  A quick inspection of the boards didn't show anything obviously leaking, but I know that means nothing in the whole scheme of things.  Unfortunately, I don't trust myself enough with a soldering iron to try recapping this guy, considering I can barely solder wires to battery terminals on the old 486s I've done without making a mess, so I'll have to find someone to help me out.

I'm hoping that isn't the immediate issue though, considering that it was fully functional 60 seconds beforehand.  I find it hard to believe that caps would cause a catastrophic failure like this; that's usually a more gradual issue.

MrKsoft

I'm bumping this topic with an update.

Today I finally recapped the entire XVI, every cap I found.  On the plus side, no signs of leakage anywhere.  Unfortunately, the recap has resulted in absolutely zero improvement to the problem.  Not even a slight change in the behavior.  At this point I am inclined to believe it is beyond saving.  Probably an actual chip somewhere is no longer working, but I have no idea how to possibly troubleshoot it any further.

If anyone has any additional ideas, I'd love to hear them, but otherwise I'm probably going to throw up a want ad for a new system unit or something, maybe use the PSU from this one as a donor since it's recapped.  Not sure I'm ready to deal with Yahoo Auctions again.

Pinwizkid

Quote from: MrKsoft on July 05, 2018, 11:14:46 AM

That's a bummer :( I went through a similar situation when my audio died. I figured it was a cap somewhere and rather than just diagnosing which one and replacing just one, I decided to just do all of them since it would need to be done eventually anyway. When I had everything back together, my audio was still dead and it turned out to just be a flaky connection between the power supply and bottom board. In the end, I'm glad I recapped the board, but what a frustrating moment it was when I first reinstalled the re-capped board to see no improvement.

Hopefully you will be able to get this working eventually or find a replacement. Good luck!

DMR

Did you end up teaching yourself soldering skills to recap? Or did you outsource it? In any regard, that's awesome,
because after you fix the real issue, you won't have to worry about the caps for another 20 years.

Though don't give up just yet on this computer. It'd be a shame if you did given that they're so rare!

I'm not that knowledgeable on the x68k yet,
but I wonder if one could find a dumped XVI IPL (Initial Program Loader) that's been disassembled or one could disassemble it.

The error message "エラーが発生しました。リセットしてください。" probably exists in there in another encoding,
and you could see the logic that results in the error and work from that angle.

Another angle might be to try to find a custom one that prints out better error info.

In fact, this led me to this thread where someone has already disassembled and made their own version:
https://nfggames.com/forum2/index.php?topic=5503.0

And in the same thread someone posts similar issues to yours,
and lydux's advice seems to be:

"I'm strongly suspecting a bad (corroded) trace or cable around IOSC chip on I/O board."

But that doesn't seem to be the case for the person in the thread,
though it looks like the custom IPL solution would have been the winner for diagnosing a bunch of x68k issues.

Though it looks like there's been no word from lydux since 2015... I hope they're alright.
But it also doesn't look like they open sourced their work, which means anyone new has to start from scratch. Bummer.



Spitko

You might want to spend some quality time with a meter checking all of the traces around the damaged part of the PCB.

Powering a circuit with corrosion damage can accelerate the process, destroying traces that had previously worked. Cleaning the PCB helps prevent this, but we don't know how well the cleaning was done, and how badly the board was originally damaged.

Check every trace that goes near or through the affected area. Also look for other affected parts of the PCB; any place a cap is (was) should be considered suspect. You might even get lucky and just find a corroded solder joint that can simply be reflowed.

If you've done all that and still no dice, time to start probing.

The SRAM in an XVI should be a pair of Panasonic MN4464S chips. It's surprisingly hard to find datasheets on these old things, but we have schematics for the 68k, which is good enough.

Check that pin 28 is connected to vbatt (i.e., the positive terminal of your battery holder), and pins 14 and 20 are grounded. Then inspect the remaining traces to ensure none are corroded.

If the chip legs have corrosion on them, it's possible the SRAM itself bit the dust. Unfortunately, there's no easy way to manually clear the SRAM on these chips; the 68k itself needs to do it. Pin 1 is the reset pin for a few variants of this chip, but the Panasonic models were not of that persuasion, and left it unconnected.

Unfortunately, past that it starts getting into logic analyzer territory, so hopefully it's either a simple broken trace, or someone else chimes in.

MrKsoft

Dunno where you got the idea that there's PCB damage.  All the boards are in good shape!  However, I think I am going to start going at it with the ol' multimeter and see if there is anywhere that might have a corroded trace.  I think it may be possible, as the metal shielding at the bottom of the right-side tower arrived with some rust on it.  I didn't see any damage on the board, but it could be something less obvious.  I think I might also reflow the solder joints just to be sure there aren't any cold joints.

Reading that older thread it did give me the idea to try accessing the serial debugger.  Booting up the system with OPT2 held down did allow me to see serial output!  I am not really sure what the expected behavior is, though.  Loading the serial console with no disk inserted gets me to this:

ROM debugger Ver 1.0
Copyright Hudson soft 1987

        NMI break at 00FF0010
PC=00FF0010 USP=FFFFFFF7 SSP=00002000 SR=2000 X:0  N:0  Z:0  V:0  C:0
D  003A0010 00000024 00000000 00000400  00000000 FFFFDFFF 0000FFFF FFFFFFFF
A  00ED0100 00FF08E8 10FE3BD8 00FC006C  00001120 FFFDFEFF 00001FFC 00002000
move    #$2700,SR
+


There is a message on the screen different than usual.  Translating it with the Google Translate app on my phone it says roughly "Cannot boot disk, terminal should be closed".  I think this is normal as I didn't give it a boot device.

However, booting it WITH a disk inserted, the disk does its normal reading sound ("click", "thud", whatever you want to call it) four times before this appears in the console:

ROM debugger Ver 1.0
Copyright Hudson soft 1987

Exceptional Abort By undefined instruction at 000068A8
PC=000068A8 USP=FFFFFFF7 SSP=10006800 SR=2019 X:1  N:1  Z:0  V:0  C:1
D  00000000 FFFFFFFF 0000DC7A FFFFFCBA  BBFF00BF FFFFDFFF F7F7EFFB 00009070
A  0000DC7A 00006800 0001447A FFFFFFFF  00002136 00002128 000021FC 10006800
undefined instruction $71CA
+


I should note, the four reading "clicks" before displaying the error message is the exact same behavior I have been describing up to this point without the console.  So I think this is an equivalent to the generic error message I've been getting, as it occurs at the exact same spot in the loading process.  I did try to see if I could step past it via the debugger, but I could not.  If I run the system at 16mhz instead of 10mhz, I get the ROM debugger's copyright message, but no further lines appear unless I press the interrupt button.  The system ends up at some random RAM address further up (00FF9D42 last time I tried) and after a while of poking around it starts receiving random letters instead of what I type.  Possibly an issue with the timings?  Do the oscillators on these systems go bad?

I can't make heads or tails of this, as I don't know enough about the X68000's architecture, let alone 68K assembly, but maybe someone more knowledgeable can use this information to help identify the problem.  In the meantime I am going to try what I described above to see if a bad trace or cold solder joint might be confusing the system and making it receive a bad instruction like this.
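For anyone else trying to read these dumps: the SR value and the X/N/Z/V/C flags printed beside it are the same information, since the flags are simply the low bits of the 68000 status register. The sketch below decodes the three SR values that appear in this thread using the standard 68000 bit layout; the function name and dictionary keys are only illustrative.

```python
def decode_sr(sr: int) -> dict:
    """Split a 68000 status register value into its standard fields."""
    return {
        "trace": (sr >> 15) & 1,       # T bit
        "supervisor": (sr >> 13) & 1,  # S bit
        "irq_mask": (sr >> 8) & 0b111, # interrupt priority mask I2..I0
        "X": (sr >> 4) & 1,
        "N": (sr >> 3) & 1,
        "Z": (sr >> 2) & 1,
        "V": (sr >> 1) & 1,
        "C": sr & 1,
    }

# SR values taken from the three register dumps posted in this thread.
for sr in (0x2000, 0x2019, 0x2014):
    print(f"SR={sr:04X}", decode_sr(sr))
```

All three values have the supervisor bit set, and the decoded condition codes agree with the X/N/Z/V/C columns the debugger printed, so the register display itself looks trustworthy even though the fetched instruction is not.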

Quote from: DMR on July 06, 2018, 12:42:52 PM
Did you end up teaching yourself soldering skills to recap? Or did you outsource it? In any regard, that's awesome,
because after you fix the real issue, you won't have to worry about the caps for another 20 years.

I did it myself!  I practiced on some old motherboards that needed done, and while I'm not going to say I'm a professional at it, I've really improved to the point that it doesn't feel like I'm going to damage anything when I try.  And considering this XVI still turns on, I can't be doing too poorly!

EDIT:  Just got this one while trying to boot from disk.  Bus error this time.  Still sounded the same hardware wise, but a different message.  Just adding it here for informational purposes.

ROM debugger Ver 1.0
Copyright Hudson soft 1987

Exceptional Abort By bus error
By Memory Access of FFFF9598
at   00007924  move.w  #$0000,$9598.w

system status =  I/N=I R/W=W FC=5
PC=00007924 USP=FFFFFFF7 SSP=000067FC SR=2014 X:1  N:0  Z:1  V:0  C:0
D  000000AF FFFFFFFF 0000DC7A FFFFFCBA  BBFF00BF FFFFDFFF F7F7EFFB 00009070
A  00002000 00006C00 0001447A FFFFFFFF  00002136 00002128 000021FC 000067FC
move.w  #$0000,$9598.w
+
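
A note on why the dump reports an access to FFFF9598 when the instruction only names $9598: the `.w` suffix is the 68000's absolute short addressing mode, where the 16-bit operand is sign-extended to a full address. The sketch below reproduces that arithmetic. The function name is made up for illustration, and the final masking step assumes a plain 68000 with 24 address lines.

```python
def abs_short_to_address(operand: int) -> int:
    """Sign-extend a 16-bit absolute-short operand to a 32-bit address."""
    operand &= 0xFFFF
    if operand & 0x8000:        # top bit set, so the upper 16 bits become ones
        operand |= 0xFFFF0000
    return operand

logical = abs_short_to_address(0x9598)
physical = logical & 0x00FFFFFF  # a 68000 only drives 24 address bits
print(f"{logical:08X} {physical:06X}")
```

So the address in the bus error message follows directly from the instruction as fetched. Whether that instruction is what the disk actually contained is a separate question.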


Also going to include the "help screen" in case someone has an idea of commands to try.  I am at least trying to read out some memory locations using the wiki's memory map as a guide.  I may also try clearing the SRAM space this way instead of the CLR key just to see if it's severe corruption.

A                :assemble                R address,drive,sector,length
B                :display breakpoint                            :read device
B[bp][address]   :set breakpoint          S[size]<range> data   :search memory
BC[<bp>]         :clear breakpoint        T[=address][count]    :trace
BE[<bp>]         :enable breakpoint       U[=address][count]    :untrace
BD[<bp>]         :disable breakpoint      W address,drive,sector,length
BR               :reset break count                             :write device
D[size][<range>]        :dump memory      X        :display register
E[size][address]        :edit memory      X[reg]   :register change
F[size] <range> data    :fill memory      Y/N      :yes no ask
G[=address][adress]     :go               Z[num=exp]  :system variable
H                       :display this     ?[exp]   :print expression(hex)
HI                      :trace history    ??[exp]  :print expression(dec)
L[<range>]              :list             \        :loop command line
M <rang> address        :move memory

neko68k

Maybe your floppy drives or your RAM. The first boot is pretty normal looking. The second one with the error suggests it either got the wrong data back from the floppy drive or it is corrupted in RAM. Since it's booting the debug monitor I'd guess it's your floppy drives or cables. Did you recap the floppy drives? Some of them need it. I've heard some people have re-greased the drive and re-aligned the heads but I don't know the details.

MrKsoft

I don't think it's the floppy drives as the same issue occurs if I try to boot an external SCSI2SD which has X68000_V4.HDS written to it.  (Although I get the translated version "Error.  So reset please.")

RAM is a possibility though.  I did check around all the chips and they do not appear to have any damage or corrosion.  I also found that using the debugger I couldn't actually overwrite the memory addresses for the SRAM with 0's as I wanted to try-- nothing would happen.  I compared the values that were in memory to an SRAM file from an emulator and they looked okay for the first few bytes and then the rest appeared to be garbage.

However I found a couple interesting areas of the main board that appear to be bodged in various ways...

https://i.imgur.com/e1RKw4V.jpg - The ALS373 latch at IC90 has pins 6 and 7 ("3Q" and "3D" signals according to the datasheet) detached from the chip and soldered together.

https://i.imgur.com/ff7MhNW.jpg - Two ceramic caps/resistors? (not sure, they are only marked with "47") on the backside of the board.  One is soldered from C150 to pin 6 on IC16 (which is a SN74AS1004AN hex inverting driver... the signal on that pin is labelled 3Y).  The other one is soldered from pin 4 of that same SN74AS1004AN (signal labelled 2Y) to pin 10 of IC19 (which is a SN74AS245N octal bus transceiver... and pin 10 is ground).

Anyone know if those are normal?  I know there are often fixes done to PCBs at the factory, but these don't look quite as professional to me.  According to the schematics they also seem to be within the paths that, to me, might relate to the issues (namely the ALS373 is hooked up to a few more ALS373s which are connected to the DMA controller, and the bodges on the bottom side to a bus driver which is connected to the main RAM, VRAM, etc).

MrKsoft

Updating again with further findings.

Screwing around with the ROM debugger I went back to the Exceptional Abort at 68A8 and compared it to a boot from the same image in XM6 Pro-68k, setting up a break when the PC register hit that address.  The instruction it SHOULD be loading is $61CA but on the XVI it is reading $71CA.  Comparing the binary values of this, I found that the difference was a single bit... so it looks like the bit is flipped.  (0110 0001 1100 1010 for the correct instruction vs 0111 0001 1100 1010 for what the XVI loads into memory)  So great... looks like a memory problem.

So I downloaded memtest68k since I noticed that it did not rely on Human68k to boot.  I figured as a small program it might be able to get past this spot okay.  It didn't boot, but it seems to have confirmed my suspicions regardless.  The system halts and gives the message "Checksum failed; program is corrupt or didn't load properly".  Except that it actually said: "Checksum fqiled; program is corrupt or@did~'t load properly".  That seems to point very strongly towards a memory problem, or at least something that is messing up the data from the drive as it gets put into memory.  I'll be double checking the drives (same thing happens with both drives which both previously worked, so I assume it is not the drives themselves, but possibly the cable is damaged?) and also looking back at that thread DMR linked and investigating the IOSC chip more closely as was suggested in that case.
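
The garbled memtest68k message can be compared against the intended one byte by byte to see which bits changed. This is a minimal sketch using only the two strings quoted above; the helper name is illustrative, and it assumes the message is plain one-byte-per-character text.

```python
expected = "Checksum failed; program is corrupt or didn't load properly"
observed = "Checksum fqiled; program is corrupt or@did~'t load properly"

def byte_diffs(good: str, bad: str):
    """Return (offset, good char, bad char, XOR mask) for each differing byte."""
    return [
        (i, g, b, ord(g) ^ ord(b))
        for i, (g, b) in enumerate(zip(good, bad))
        if g != b
    ]

for offset, g, b, mask in byte_diffs(expected, observed):
    print(f"offset {offset:2d}: {g!r} -> {b!r}  xor=0x{mask:02X}")
```

Two of the three differences ('a' to 'q' and 'n' to '~') are the same single bit, 0x10 within the byte, and they sit 32 bytes apart. The space that became '@' differs in two bits, so it may be a different fault or just a quirk of how the message was copied down.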

I think I'm leaning towards an I/O related problem as if I change memory addresses using the debugger, they do seem to stick even at the problematic addresses.  I suspect if I were really patient I could probably use the debugger to overwrite the entire memory with a snapshot from an already booted system and it would work, but until I think of a good way to automate that I'm not going to try.   :P

DMR

Awesome progress! It’s likely others will run into a similar issue and your findings will be very helpful!

I was wondering if there might be any alignment in the erroneous bits (like the error repeats every x bits).
From a quick glance it looks like no, but it might be that the bit only corrupts if one of the neighbors is 1.

If it was a floppy cable issue, I’d expect to see alignment.

MrKsoft

Well, I found the pattern.  Looks like we have a completely stuck bit.  Using the breakpoint at 68A8 I further compared the memory from the XVI versus XM6 Pro-68k. 

Using $68A8 as a start point, I found that every 8th word (or every 128 bits), the 4th bit of that word is stuck as a 1.  An example in the code block below illustrates this (note it only shows those specific words to keep things brief)

bad x68000
71ca (0111 0001 1100 1010)
15b8 (0001 0101 1011 1000)
d139 (1101 0001 0011 1001)
1304 (0001 0011 0000 0100)
1254 (0001 0010 0101 0100)
1ec8 (0001 1110 1100 1000)
3f3c (0011 1111 0011 1100)
7100 (0111 0001 0000 0000)
5ff9 (0101 1111 1111 1001)
7021 (0111 0000 0010 0001)
70f6 (0111 0000 1111 0110)
7100 (0111 0001 0000 0000)
1219 (0001 0010 0001 1001)
101d (0001 0000 0001 1101)
1c15 (0001 1100 0001 0101)
1018 (0001 0000 0001 1000)
1018 (0001 0000 0001 1000)
325f (0011 0010 0101 1111)
10c0 (0001 0000 1100 0000)
7220 (0111 0010 0010 0000)

emulator
61ca (0110 0001 1100 1010)
15b8 (0001 0101 1011 1000)
d139 (1101 0001 0011 1001)
0304 (0000 0011 0000 0100)
0254 (0000 0010 0101 0100)
0ec8 (0000 1110 1100 1000)
3f3c (0011 1111 0011 1100)
6100 (0110 0001 0000 0000)
4ff9 (0100 1111 1111 1001)
7021 (0111 0000 0010 0001)
60f6 (0110 0000 1111 0110)
6100 (0110 0001 0000 0000)
1219 (0001 0010 0001 1001)
101d (0001 0000 0001 1101)
0c15 (0000 1100 0001 0101)
1018 (0001 0000 0001 1000)
1018 (0001 0000 0001 1000)
225f (0010 0010 0101 1111)
10c0 (0001 0000 1100 0000)
7220 (0111 0010 0010 0000)
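
The two lists above can be checked mechanically rather than by eye. The sketch below XORs each pair of words and also ANDs all of the words read from the real machine together; the values are copied from the lists, and the variable names are just for illustration.

```python
# Words read from the real XVI, in the order listed above.
bad = [0x71CA, 0x15B8, 0xD139, 0x1304, 0x1254, 0x1EC8, 0x3F3C, 0x7100, 0x5FF9, 0x7021,
       0x70F6, 0x7100, 0x1219, 0x101D, 0x1C15, 0x1018, 0x1018, 0x325F, 0x10C0, 0x7220]
# The same words as read in the emulator.
good = [0x61CA, 0x15B8, 0xD139, 0x0304, 0x0254, 0x0EC8, 0x3F3C, 0x6100, 0x4FF9, 0x7021,
        0x60F6, 0x6100, 0x1219, 0x101D, 0x0C15, 0x1018, 0x1018, 0x225F, 0x10C0, 0x7220]

# Every distinct XOR mask among the words that differ.
diff_masks = {b ^ g for b, g in zip(bad, good) if b != g}

# Bits that are 1 in every single word read from the real machine.
always_one = 0xFFFF
for word in bad:
    always_one &= word

corrupted = sum(b != g for b, g in zip(bad, good))

print([f"0x{m:04X}" for m in diff_masks], f"0x{always_one:04X}", corrupted)
```

The only mask that shows up is 0x1000, which is bit 12 counting from the least significant bit, or the 4th bit from the left as counted in this post. The words that match the emulator are the ones where that bit was already supposed to be 1, which is what a stuck-at-1 bit would look like.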


And well, I guess that's that.  I am not sure what is causing the stuck bit, as I see various differences across the memory space.  Manually editing the memory in the debugger to match the emulator, I can get the "wrong" data starting at $68A8 to correct itself.  Then setting the PC register back to $68A8 I can actually get it to move a bit further.  Unfortunately it hits more bad instructions later on in the $78B4 range, and these ones are not on the same word boundary that I had identified earlier.  It's still the same 4th bit that's stuck though.  And on these ones editing the memory in the debugger does not change that bit.  I can change the value but that 4th bit sticks to a 1.  So I can't patch the memory over and over and force it to boot.

So I guess the next question is what to do next.  It could be as "simple" as bad RAM, but there isn't really much I can do about that (I doubt I can get a new compatible chip and the RAM chips are surface mounted which I don't have the kit to handle).  If we want to look at other causes, I was thinking it could be something with the PEDEC (XVI equivalent of IOSC) garbling data that is sent through it, but I am not sure as the schematic is complicated for me to read and I can't tell if the SCSI signals pass through it or go around it (which is important as this affects both floppy and SCSI booting).  I did try testing the continuity of the pins between the PEDEC and the cable where its signals leave the I/O board but they all checked out fine.  It doesn't look like the weird bodge on IC90 is the problem either - it's wired up to the VICON and CYNTHIA chips which worked completely fine when the system was functioning.  I also checked the continuity on all the cables from the I/O board to the main board, floppy drives, SCSI, etc... and none of this explains why the 16mhz mode is all glitched up.

Whatever the cause, it's definitely something deep down in the hardware, and it's probably not something I can repair.  I think I've exhausted just about everything I can do with my current experience and equipment.

DMR

Awesome!

Type of fault you're looking at is something along the lines of:
OUT_D_3=(!A_6 & !A_5 & !A_4 & !A_3 & !A_2 & !A_1 & !A_0) | MEM_D_3(A_ALL)

The RAM chips (HM514402) are 1,048,576 "words" x 4 bits each, and there's only 4 of them.
Each 16-bit word request hits each RAM chip.

Given the pattern is every 128bits... it might not be the RAM chip, but it's possible the address decoder inside the RAM chip is busted...
If there is a fault, it'd be IC24 (if I'm reading the schematics right), since it's the 4th bit.
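
To make the "one chip per group of bits" reasoning concrete, here is a rough sketch. It assumes each x4 chip supplies one contiguous 4-bit slice of the 16-bit data bus, which has not been checked against the schematic. The lane numbers are hypothetical and are not IC designators, and the per-chip size is the 1,048,576 x 4 figure quoted elsewhere in this thread.

```python
WORDS_PER_CHIP = 1_048_576  # addresses per chip, per the 1,048,576 x 4 figure in this thread
BITS_PER_CHIP = 4
CHIPS = 4

def nibble_lane(bit: int) -> int:
    """Which 4-bit lane (0 = lowest nibble) a data bit falls in.
    Assumes contiguous nibbles per chip; the real wiring may differ."""
    return bit // BITS_PER_CHIP

total_bytes = WORDS_PER_CHIP * BITS_PER_CHIP * CHIPS // 8

print(nibble_lane(12), total_bytes)
```

Under that assumption, a fault confined to bit 12 would sit in the top nibble, so only one of the four chips (or the path to it) would be involved. The schematic has to say which physical IC that actually is.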

Can you put a scope on its I/O4 pin (there's got to be a data sheet for it somewhere), and just keep reading every 16th word?
See if it stays a solid 1.

Also have you tried reading unassigned memory areas? To see if they return a 1?
How about if you read the SRAM?

MrKsoft

I don't have a scope, unfortunately.  Despite knowing the basics, I have never really dug this deep into electronics troubleshooting.  For now I'll have to make do with what I have...

I started testing random areas of RAM and I am getting inconsistent results.  I used a RAM dump from XM6 Pro-68k to find areas that should be unassigned and started trying to write F's to the area and then overwriting those with 0's using the debugger.  I have yet to find a working range, but what is really illuminating here is that there are different areas with different issues.  They all revolve around the 4th bit, but it's often different words with the problems.  And some of them are stuck to 0 instead of 1!!

Examples:
$112000-$1120ff:
every 8 words, 4th bit stuck as 0 on 1st and 7th words

$1c0000-$1c00ff:
every 8 words, 4th bit stuck as 1 on 3rd and 5th words

$1f0000-$1f00ff:
Crazy inconsistent, but revolving around 4th bit stuck as 0 instead of 1.  Might as well just lay it out every 8 words:
$1f0000: 1st and 7th word, 4th bit stuck as 0.
$1f0010: GOOD
$1f0020: 1st word, 4th bit stuck as 0.
$1f0030: 1st word, 4th bit stuck as 0.
$1f0040: GOOD
$1f0050: 1st word, 4th bit stuck as 0.
$1f0060: GOOD
$1f0070: GOOD
$1f0080: 7th word, 4th bit stuck as 0.
$1f0090: 7th word, 4th bit stuck as 0.
$1f00a0: 1st and 7th word, 4th bit stuck as 0.
$1f00b0: 7th word, 4th bit stuck as 0.
$1f00c0: GOOD
$1f00d0: 1st word, 4th bit stuck as 0.
$1f00e0: GOOD
$1f00f0: GOOD

Actually, I just dumped that range again after typing this up and it's changed with additional 7th word, 4th bit errors.  Then I dumped it again and it changed again, with some of those disappearing.  It seems the memory is not staying stable either.  What a mess!
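
The write-F's-then-0's procedure described above is a basic stuck-at test. The sketch below runs the same idea against a simulated memory, since there is no way to script the real machine from here. `FaultyMemory` is a toy model invented for this example, and the fault positions in it are made up.

```python
class FaultyMemory:
    """Toy 16-bit memory whose cells can have bits forced to 1 or 0 (hypothetical model)."""

    def __init__(self, stuck_high=None, stuck_low=None):
        self.cells = {}
        self.stuck_high = stuck_high or {}  # word index -> mask of bits forced to 1
        self.stuck_low = stuck_low or {}    # word index -> mask of bits forced to 0

    def write(self, index, value):
        self.cells[index] = value & 0xFFFF

    def read(self, index):
        value = self.cells.get(index, 0)
        value |= self.stuck_high.get(index, 0)
        value &= ~self.stuck_low.get(index, 0) & 0xFFFF
        return value

def find_stuck_bits(mem, indices):
    """Write all ones, then all zeros, and report bits that did not follow."""
    report = {}
    for i in indices:
        mem.write(i, 0xFFFF)
        stuck_low = ~mem.read(i) & 0xFFFF  # bits that refused to become 1
        mem.write(i, 0x0000)
        stuck_high = mem.read(i)           # bits that refused to become 0
        if stuck_low or stuck_high:
            report[i] = (stuck_high, stuck_low)
    return report

# Invented faults: bit 12 stuck high in words 2 and 4, stuck low in word 8.
mem = FaultyMemory(stuck_high={2: 0x1000, 4: 0x1000}, stuck_low={8: 0x1000})
print(find_stuck_bits(mem, range(16)))
```

A single pass like this only catches bits that are wrong at the moment of the read. Faults that come and go between dumps, like the ones described above, would need the same test repeated many times.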

I should note, I have yet to see any issues on even numbered words.  That seems very important.

Regarding the SRAM, I am able to read out portions of it and they appear OK with all 0's.  I cannot edit them to check whether any bits are stuck -- not sure if I should be able to via the debugger or not.  I can do the command (for instance e ed0000 ffff ffff ffff ffff ffff ffff ffff ffff) and it takes it but dumping the same address brings back exactly what it was before.

JulBS0

Hi,

About the SRAM, the behavior you are experiencing is normal.

The system controller write-protects the SRAM most of the time (probably to prevent nasty things that might happen when the computer is unplugged).

To enable writes, write 0x31 as a byte at 0xe8e00d.

MrKsoft

Hm, I couldn't get that to work, nothing happened when I tried to write $e8e00d.  I wasn't able to write the SRAM at all.  However, it does appear to look okay after being cleared (comparing vs an emulator SRAM image) so seems it is returning to a "good" state.

I dumped the memory range for the DOSA chip where that SRAM flag should be written and it appears the stuck bits propagate through here too.  It looks like in an emulator it's mostly going to appear as FFFFs with some FFF8s in there but there are some FFFCs instead.  Probably nothing really interesting to report there.

Anyway, unless I hear otherwise I think it's most likely a RAM problem.  The existing chips are Hitachi HM514400AS7 1,048,576 word x 4bit chips.  The specific chip doesn't seem to be very common-- places to buy some were sketchy at best (I doubt they even had the chip for real), and I didn't even find an exact datasheet.  I found one for the HM514400A series.  Not sure if there is a difference between an HM514400 and HM514402.  They look similar.  I believe the -AS7 ending signifies a 70ns part based on this datasheet.  Just based on comparing the datasheet to the 514402's on the board the pinouts seem to match so I think they're the same chip, maybe the 2 at the end just signifies that they were made for Sharp or something?

Anyway I thought about seeking out a compatible replacement.  I went digging through my parts bin and found a 72-pin PC SIMM that has similar looking RAM chips with the same size and same spread out pin layout.  They are NEC 424400-70 chips.  Looking up the datasheets these are also 1,048,576 word x 4bit.  They seem to have the same refresh cycle speeds as well.  And best of all, the pinout is identical!  I think they are basically the exact same thing but from different manufacturers.

So let's say I found someone that could help me get the old chips off the board (they are surface mounted, it appears, so I don't have the kit to do that cleanly) and put these NEC chips in their place.  Does anyone see a reason that these chips wouldn't work?  It might not actually solve the problem but it might help rule it out in that case.  Then again, it may fix everything.  Who knows?  I just want a bit of assurance that I won't make the problem worse if I go in this more destructive direction.

DMR

So I took my x68k XVI apart completely and I was looking at the motherboard.

I don't have any of the bodges you have:




MrKsoft

Well that's interesting!  I wonder why mine has them at all.  I couldn't see them on any XVI motherboards I could find on Google Images either.

I might try undoing them just to see if anything changes.  The ALS373 seems suspicious since if it's hooked up the way it is then it's completely bypassing the latch logic.

EDIT: Successfully removed all these bodges and absolutely nothing changed.  But hey, that at least means I didn't break anything!

DMR

You said, "Actually, I just dumped that range again after typing this up and it's changed with additional 7th word, 4th bit errors.  Then I dumped it again and it changed again, with some of those disappearing.  It seems the memory is not staying stable either. "

I wonder if the RAM's stable, but the transmission of the bits from the RAM to the CPU is getting corrupted. 

If the bit is supposed to be 0, and 75% of reads show 0,
and then you do a write of 0xFFFF (assuming writes are not corrupted),
and the bit is now supposed to be 1, and 75% of reads show 1,
then I think that might mean it's the transmission. 
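
Spelled out as a rough sketch, the test would be: read the same address many times after writing zeros, again after writing ones, and compare how often the suspect bit comes back as 1. The read values below are invented sample data, not measurements from the XVI, and the category names are only illustrative.

```python
def ones_fraction(reads, bit=12):
    """Fraction of reads in which the given bit came back as 1."""
    return sum((r >> bit) & 1 for r in reads) / len(reads)

def classify(reads_after_zero, reads_after_ones, bit=12):
    """Rough call on how one bit behaves across repeated reads."""
    lo = ones_fraction(reads_after_zero, bit)
    hi = ones_fraction(reads_after_ones, bit)
    if lo == 0.0 and hi == 1.0:
        return "healthy"
    if lo == hi and lo in (0.0, 1.0):
        return "stuck"  # same answer no matter what was written
    if hi > lo:
        return "noisy but follows the written value"
    return "inconclusive"

# Invented samples: one bad read out of four in each direction.
print(classify([0x0000, 0x1000, 0x0000, 0x0000],
               [0xFFFF, 0xFFFF, 0xEFFF, 0xFFFF]))
```

If the bit mostly follows what was written but is occasionally wrong in both directions, that fits the transmission theory better than a dead cell. A bit that reads the same regardless of what was written fits a stuck cell or line.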

Before risking work on the RAM chip, I'd probably get a scope.

Or I think you might not need a scope... I think that if you were to write a while loop that reads a bunch of bad addresses that all return a 1, you wouldn't need a scope since the RAM chip would hold a 1 until the next read (I think).
In which case a simple LED bulb might possibly do.


MrKsoft

You have a valid point.  It could be that...

I'll look into getting a scope or seeing if I can borrow one from someone.  I have no idea what exactly I need or how to use it... but hey, it's a great learning opportunity.

Edit: Yeah, I'm in over my head on this.  I can't even figure out if I should buy a used full-size analyzer on eBay (with the benefit of having good equipment for future work) or if one of those handheld probes would do the job.  Then the whole deal on how to use it to see if the memory is working.  It may take a while before I have anything more to report.

Also you might be on to something about the memory transmission.  I decided to boot it up and poke around again and it actually gave me an invalid instruction error on a different address than usual.  I went to read out 68A8 as I usually do and it was actually correct for once... of course a few resets and it went back to the usual, but that seems to show some level of variance.  At this point, though, I've checked the connections between everything between the memory and CPU many times.  What could cause that?  Plus, I still have no idea how the broken 16mhz mode plays into all this.  I don't see how it would be memory related for that.  So the questions continue...

MrKsoft

Posting a final update: This issue has been resolved!

It was, indeed, just bad RAM.  I had the chips replaced with the ones from the SIMM I mentioned earlier, and sure enough, it works fine now.  I was able to run Memtest68K and it checked everything successfully, games run, Human68k runs, and 16mhz mode even works again.  So if anyone has similar problems... it could be the RAM.  Getting it replaced is a bit difficult since you need to find compatible chips and the ability to do fine soldering work (or find someone who can do it for you!), but it looks like it's a completely viable, if challenging, repair option. :)

DMR