[buug] Solaris 9 SPARC disk "mystery" (what's using those disks?)

Michael Paoli Michael.Paoli at cal.berkeley.edu
Thu Feb 21 13:24:51 PST 2008


Bumped into this "mystery" a while ago.  Still don't have "the answer"
yet ... but haven't exactly spent lots of time trying to track down the
answer either.  Anyway, at least at the time (those two days or so) it
certainly did stump me and my co-workers (a rather small team).

Curious if someone has "the answer" or ideas that may point that way.
The system is actively used in/for production, so it's not like we
can tear it down or shut it down any time we might want to run various
tests on it ... but we can do safe non-destructive tests that can be run
in normal multi-user mode operation.

Below is some of the communications and background on the "mystery"
(redacted and reformatted slightly), and with a bit of epilog notes
added at the end:
______________________________________________
From:   Michael Paoli
Sent:   Thursday, January 17, 2008 12:20 AM
To:     <a co-worker>
Subject: RE: <specific_host> - c8 disk I/O "mystery" (what's using them?)

Ah, ... swap, forgot to check that, ... but alas, nothing there from the
c8 disks either.
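
(For reference, the swap check on Solaris is along the lines of:
# swap -l
... which lists the configured swap devices/files - and no c8 devices
show up there.)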

Maybe I'll let it percolate in our brains a bit longer, then start
asking around some more.

On Linux, fuser is somewhat different in behavior.
From fuser(1) on Linux (e.g. Red Hat) we have:
"
 -m     name specifies a file on a mounted file system or a block device
        that is mounted. All processes accessing files on that file
        system are listed. If a directory file is specified, it is
        automatically changed to name/. to use any file system that
        might be mounted on that directory.
"
... just one of those few zillion "*nix is *nix" (yeah, right) details
to help continue to ensure job market for skilled, experienced *nix
folks - especially multi-platform/flavor folks.
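
So on Linux, e.g. (path here purely illustrative):
# fuser -m /var
... lists the PIDs of all processes with open files anywhere on the
filesystem containing /var.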

The typical Linux fuser(1) has (at least?) that as a (comparative)
downside, ... but it's got its upsides too, e.g. stuff like:
# fuser -n tcp 80
can be quite handy ... don't have anything quite like that with UNIX
(though it can probably be done via lsof ... if one has that compiled
and installed - so, whether lsof comes with the operating system
depends on which *nix operating system).
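
With lsof, the rough equivalent would be something like:
# lsof -i TCP:80
... listing the processes that have TCP port 80 open.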

ZFS had also crossed my mind, but that didn't come out until Solaris 10
- and even then it wasn't in the first general availability release of
Solaris 10.

So, ... thus far, ... still a bit of a mystery (and Sun didn't release
dtrace until Solaris 10).
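
(On Solaris 10 one could sketch out something like the following with
dtrace - the dev_statname value here is assumed to match the busy disk:
# dtrace -n 'io:::start /args[1]->dev_statname == "ssd6"/ { @[execname, pid] = count(); }'
... counting I/O starts against ssd6 by process name and PID.)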

And something continues to actively use at least one of the two c8 disks:
# iostat -n -D -l 2 ssd6 ssd7 5 9999
   c8t2d1        c8t2d0
rps wps util  rps wps util
129  67 28.0    0   0  0.0
 52   2 19.0    0   0  0.0
 53   2 19.3    0   0  0.0
...
Peeking at the partition table and some disk data contents, it seems
likely it is or was configured under Veritas Volume Manager - at least
at some point in time.  Maybe there's some multi-pathing going on that
isn't immediately obvious.
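
(Such peeking can be done along these lines - device name illustrative:
# prtvtoc /dev/rdsk/c8t2d1s2
# dd if=/dev/rdsk/c8t2d1s2 count=512 2>/dev/null | strings | more
... prtvtoc to examine the VTOC/slice layout, dd piped to strings to
eyeball the early blocks for recognizable signatures - e.g. Veritas
private region data.)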

_____________________________________________
From:   <a co-worker>
Sent:   Wednesday, January 16, 2008 5:58 PM
To:     Michael Paoli
Subject:        RE: <specific_host>

Sounds worthy of posting in a Solaris newsgroup.  I'm wondering if
there might have been some swap active on the c8 disks?
Alas, fuser /dev/sda1 and lsof /dev/sda1 don't return anything on
Linux either, even when the device is mounted as /.

Perhaps it's not the content of the disks that the OS and running
processes care about, but rather the noisy state of the fiber connection
might be causing the FC driver to retry way too often?  I'd love to
know how this mystery ends.  : )


_____________________________________________
From: Michael Paoli
Sent: Wednesday, January 16, 2008 4:43 PM
To: <a co-worker>
Subject: RE: <specific_host>

The server is <specific_host>.<FQDN>

An interesting remaining (thus far) mystery is:

And what is using the c8 disks?

Solaris 9 SPARC
c8 disks neither in /etc/vfstab nor output of mount -p
c8 disks neither in output of vxdisk list nor output of vxprint -thf
c8 disks not in output of metastat
fuser shows nothing on the /dev/*dsk/c8* disks/slices
DBAs claim no use of raw disk devices (and disk devices have customary
ownerships and permissions - nothing owned by oracle or the like)
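
Roughly, those checks amounted to something like the following
(slice/device names illustrative):
# grep c8 /etc/vfstab
# mount -p | grep c8
# vxdisk list | grep c8
# vxprint -thf | grep c8
# metastat | grep c8
# fuser /dev/rdsk/c8t2d0s* /dev/rdsk/c8t2d1s*
... all of which came up empty for the c8 disks.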

But once the c8 disks were accessible again, the system - and database -
all seem much happier again.
And in /var/adm/messages - only a total of about 25 lines related to
errors from the c8 disk devices - which seems rather low to me if
something was trying to access them until they were accessible again.
And when they became accessible again, nothing to /var/adm/messages
noting the change (e.g. no reconnect/renegotiation/FC/WWID/LUN/etc.
messages at all).
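
(A quick tally of such messages can be had with something like - the
ssd instance names here assumed to be the ones corresponding to the c8
disks, per the iostat above:
# egrep -c 'ssd[67]' /var/adm/messages
... and similarly, minus the -c, to eyeball the matching lines
themselves.)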
And the fibre link lights never went out - but some gentle laying on of
hands on the fibre cables restored our access to the fibre attached c8
drives.

_____________________________________________
From: <a DBA>
Sent: Wednesday, January 16, 2008 11:18 AM
To: Michael Paoli; <more coworkers>
Cc: NOC
Subject: RE: <specific_host>

oracle on <specific_host> uses veritas qio for file access.

_____________________________________________
From:   Michael Paoli
Sent:   Wednesday, January 16, 2008 12:16 PM
To:     <more coworkers>
Cc:     NOC
Subject:        RE: <specific_host>

Not quite yet ... still trying to determine what the system is using
those disks for - apparently not used directly for filesystem or via
Solaris Volume Manager or Veritas Volume Manager.

DBAs - any use of raw devices possibly going on there (e.g. Oracle
using raw disk storage, rather than via filesystem)?
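
(One quick check on the system side - device names illustrative - is
simply ownership/permissions of the device nodes:
# ls -lL /dev/rdsk/c8t2d0s* /dev/rdsk/c8t2d1s*
... looking for anything owned by oracle/dba rather than the customary
root/sys.)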

========================================================================
incomplete epilog notes:

The "laying on of hands" (minor repositioning of cables) was done
2008-01-16 ... it "worked" ... temporarily - the problem was back by
2008-01-17 and I went on-site that day to deal with the hardware
(replaced the fibre cables, repositioned them and the
dressing/routing/hanging of the SCSI cables to resolve any potentially
problematic bending/stressing of the fibre cables and/or SCSI cables -
that has thus far (at least into 2008-02-21, and hoping it continues)
corrected the hardware problem).


