Re: Re: HELP!!!! Attempting to restore a SBS2000 System by Jeff
Jeff
Mon Dec 15 18:41:01 CST 2003
I'm going to speak to a narrow part of your list of curiosities.
In the past 12 months, on two occasions with servers that have no
relationship to each other at all, I observed a RAID5 volume crash "that
can't happen". In both cases, the drive controller was an Adaptec 2100S, and
in both cases the attached SCSI drives included three drives in RAID5 plus a
hot spare in good condition. In both cases, when I arrived to investigate the
server, I found that the Adaptec controller had declared at least one drive
Dead, and one or more drives failed. Essentially, a triple drive failure,
but not of the three in the RAID at once; rather, two in the RAID and the
spare too!
I'm not going to try to explain this and justify it because...it's not
supposed to happen. In both cases, I called Adaptec and debated with them
how this did happen. I was told that it's exceedingly rare, but that it's
possible, they see it from time to time, and they don't have a technical
culprit for it.
The practical explanation is that the RAID controller is detecting a
condition where it "believes" that more than one drive "in sequence" needs
to come off-line. Another possibility is that when one drive fails, the
controller begins to rebuild on the spare and then the system suffers a
second event (like a power surge) and the rebuild drive kicks out.
In both cases, my servers were behind full power-correcting (voltage
regulator type) UPS with full battery charged conditions. In both cases, I
was able to contact Adaptec and with their guiding instructions, I "forced"
the array to go back online with drives it had rejected. Essentially, this
is telling the controller to ignore the status marking on the drives and
simply put them back online as a volume with the assumption that the data
space is both intact and that the sequence-order of the drives is correct.
Reassembling the drives into a volume with the members marked in the wrong
order would hash the contents. In both cases I was able to again read the
drives (even boot the drives in one case) and recover information.
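To illustrate why the member order matters, here is a minimal Python sketch
of a generic RAID5 layout with rotating parity. It is not Adaptec's on-disk
format: the function names (xor_blocks, write_raid5, read_raid5), the chunk
size and the parity rotation are all assumptions made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def write_raid5(data, n_drives, chunk):
    """Stripe data over n_drives, one rotating parity chunk per stripe.

    Hypothetical layout: the parity chunk of stripe s sits on member
    s % n_drives. Real controllers use their own rotation and metadata.
    """
    drives = [[] for _ in range(n_drives)]
    per_stripe = (n_drives - 1) * chunk
    # Pad the data with zero bytes up to a whole number of stripes.
    padded_len = -(-len(data) // per_stripe) * per_stripe
    data = data.ljust(padded_len, b"\0")
    for stripe, off in enumerate(range(0, len(data), per_stripe)):
        chunks = [data[off + i * chunk: off + (i + 1) * chunk]
                  for i in range(n_drives - 1)]
        chunks.insert(stripe % n_drives, xor_blocks(chunks))
        for drive, c in zip(drives, chunks):
            drive.append(c)
    return drives


def read_raid5(drives):
    """Reassemble the data, trusting that drives is in member order."""
    out = []
    for stripe in range(len(drives[0])):
        parity_slot = stripe % len(drives)
        out += [d[stripe] for slot, d in enumerate(drives)
                if slot != parity_slot]
    return b"".join(out)


data = b"ABCDEFGHIJKL"
members = write_raid5(data, n_drives=3, chunk=2)
right_order = read_raid5(members)
wrong_order = read_raid5([members[1], members[0], members[2]])
```

Read back in the original member order, the sketch returns the data intact.
With the first two members swapped, parity chunks land where data chunks are
expected, so the result no longer matches what was written.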
I guess the moral of the story I'm providing is that we all typically think
of RAID sets as supposedly infallible...and honestly, a RAID5 with a hot
spare is practically never going to go down without fair opportunity to save
it entirely. However, Adaptec is not designing their controller to keep the
machine running at all costs; they are prioritizing keeping the drive
contents intact, even if that means they kill the system to do it. If the
controller hits a condition where it believes it has no good option to
recover or repair on the fly, it drops the volume out completely. As such,
if the error is a false condition, the volume can be brought back online
with intervention.
With a mirror RAID, you have less room for optimism because you are simply
writing the same thing to two drives with no error correction options, so
you can't trap a bad drive operation as easily through error correction
comparisons. With a mirrored system, it's just possible that you can have a
reversal condition, because there is no third device to check the two copies
against and flag one of them as flipped.
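As a rough illustration of the mirror case, the Python sketch below compares
two mirror members block by block. The function name mirror_mismatches and
the 512-byte block size are assumptions for the example, not anything a
particular controller exposes.

```python
def mirror_mismatches(copy_a, copy_b, block=512):
    """Return the block numbers where two mirror members disagree."""
    if len(copy_a) != len(copy_b):
        raise ValueError("mirror members differ in size")
    return [i // block
            for i in range(0, len(copy_a), block)
            if copy_a[i:i + block] != copy_b[i:i + block]]


# Two 4-block members that differ by one byte inside block 2.
member_a = bytes(2048)
damaged = bytearray(member_a)
damaged[1500] = 0x01
member_b = bytes(damaged)

bad_blocks = mirror_mismatches(member_a, member_b)
```

The comparison shows where the two copies differ, but nothing in the two
copies themselves says which one is the good one. That decision has to come
from somewhere else, such as the drive's own error reporting.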
"Tim" <Tim@NoSpam> wrote in message
news:ONUlXNvwDHA.2712@tk2msftngp13.phx.gbl...
> I gave up waiting for MS to call back ...
>
> So I decided to follow a process so long as I had at least 1 regression
> step as follows:
>
> 1. So, I retrieved the original "good" (1 bad sector supposedly) sata disc
> from the hw vendor,
> 2. used drive image to secure copies of the 3 partitions on the one "good"
> disc on another machine (thankfully intel ICH5R raid discs will work on
> other Intel ICH5R mobos if the intel driver is the correct version),
> 3. removed the second disc from the mirror to secure the current situation
> as another possible regression step,
> 4. then on the remaining disc in the now broken raid 1, shot the
> partitions,
> 5. shot the ext dos partition that Win98 fdisk had created,
> 6. recreated the partitions,
> 7. used drive image to put the contents back,
> 8. booted (failed with missing ntoskrnl.exe),
> 9. ran a repair install,
> 10. reapplied W2K SP4
> 11. after further testing I will finally plug the second raid 1 disc back
> in and the resynch of the mirrored drive pair ran automatically.
>
> All is hunky dory. I am concerned about the impact of the stuffed sector
> that drive image never complained about. I have all data etc so am happy.
> AD, Exchange etc. runs OK. Exchange did have a stuffed E00.log file,
> fixed.
>
> Stupid thing about this is that all the fixing above is obvious, yet the
> events that caused the whole mess have me baffled. How do 2 discs in a
> raid 1 fail at the same time? Why / how did the partitions come up with
> the wrong letters? Why didn't fdisk /mbr work - lack of driver, int13 not
> good enough? Why did ntoskrnl.exe go walk about?
>
> The MS Support Fellow was reassuring but also baffled and advised that SP4
> should be all that is needed along with a security patch top up.
>
> The customer has now after pressure agreed he does need better backups
> than disc to disc and to CD-R.
>
> - Tim
>
>
>
>
>
>
> "Jeff Middleton [SBS-MVP]" <jeff@cfisolutions.com> wrote in message
> news:uEtZehCwDHA.2148@TK2MSFTNGP12.phx.gbl...
> > I would first of all agree with Jim on this....get technical help on the
> > phone.
> >
> > On the specific problem, it's not clear to me if you are indicating that
> > you received back a pair of mirrored SATA drives or if you received each
> > partition on a separate drive.
> >
> > If it's the latter, the reverse order of the partition assignments is
> > commonly caused by the connection of the devices into the system in the
> > wrong order. Simply reversing the cables between the two drives might
> > resolve such a condition.
> >
> > However, more importantly, if you were not using SATA drives before, you
> > may find that your system won't boot normally anyway....though it's also
> > not clear to me if you are even getting into Windows, or making the
> > judgement about drive order using FDISK or some other tool.
> >
> > I would strongly encourage you to not keep fiddling with the drives if
> > you value the contents, and instead get someone directly involved to get
> > your fallback condition protected and the boot condition restored without
> > risk involved.
> >
> >
> > "Jim Behning" <jimbehningmvp@atl.mindspring.com> wrote in message
> > news:brpgtvcqvk2d6s72serha90m4evdd69ki3@4ax.com...
> > > Do what I do. Call Microsoft and try to use one of your free calls if
> > > you have SBS 2000. Or paying is worthwhile.
> > >
> > > "Tim" <Tim@NoSpam> wrote:
> > >
> > > >It gets worse....
> > > >I tried the method in article 249321.
> > > >There are 5 methods & I discounted 1 - 4 as they were either NA or
> > > >not possible (no network access, recovery console does not allow
> > > >access to the current C: drive) so I tried method 5.
> > > >
> > > >IE
> > > >
> > > >with a Win98 boot disk, fdisk /mbr
> > > >
> > > >The system now won't boot at all "Missing Operating System".
> > > >I tried doing a fresh install of W2K into the original W2K partition,
> > > >but when it gets to rebooting to resume the install, "Missing
> > > >Operating System".
> > > >
> > > >I tried using fdisk to set the first partition as active - now a
> > > >black screen with a flashing cursor... oh the joy!
> > > >
> > > >The first partition also comes up as Ext something as the partition
> > > >type, should be NTFS same as the rest.
> > > >
> > > >Since the disc(s) are SATA mirror, I am hesitant to attach either or
> > > >both of them to another system and copy all the files off and really
> > > >fdisk the machine... as given progress so far I am sure they will be
> > > >corrupted even further.
> > > >
> > > >- Tim
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >"Tim" <Tim@NoSpam> wrote in message
> > > >news:eFaryD8vDHA.536@tk2msftngp13.phx.gbl...
> > > >> Hi All,
> > > >>
> > > >> Horrible problem. This system suffered a mirrored disc volume
> > > >> crash - both drives came up with disc errors & had to be replaced.
> > > >>
> > > >> The first drive in the raid was well and truly gone, the second had
> > > >> 1 bad sector in a bad location. The hw vendor retrieved the data
> > > >> using Ghost and put it back onto two new HDD's (SATA). Nice, but C
> > > >> is now D and D is now C.
> > > >> There is also an E partition with *lots* of valuable stuff on.
> > > >>
> > > >> How do I fix the drive letter assignments?
> > > >>
> > > >> I have looked at various KB articles and they all relate to running
> > > >> systems. This isn't running - I have to restore on to C as C
> > > >> otherwise the system won't log on.
> > > >>
> > > >> Is there a way to fix the drive letter assignments?
> > > >>
> > > >>
> > > >> If I do a reinstall of the system onto what was C and is now D,
> > > >> then windows won't let me do an AD restore. If I follow the KB
> > > >> article to reassign the drive letters, it won't let me log on.
> > > >>
> > > >> Thanks in advance...
> > > >>
> > > >> - Tim
> > > >>
> > > >> PS the system backups are on the E partition. There are other
> > > >> backups, but the system went down during a backup and left the most
> > > >> recent corrupt.
> > > >>
> > > >>
> > > >>
> > > >
> > >
> > > Jim B. SBS MVP
> > > remove the mvp to send email
> >
> >
>
>
>