Archive for the ‘cluster’ Category

PowerHA failure scenario when dealing with SAN-booted LPARs – part III

Tuesday, April 12th, 2011

Ok, this is part III of a rather long story. As shown in the previous parts, the problem is really tricky: we cannot run anything in the event of losing storage (you cannot read binaries if you don’t have storage). Ok, so how does HACMP/PowerHA deal with it?

If you lose all Virtual FibreChannel storage connections, this is going to be reported as a loss of quorum for a VG in AIX’s error reporting facility. In theory this error should be propagated to the HACMP/PowerHA failover mechanism so that the RG is evacuated from the affected node. This behaviour is sometimes referred to as “selective failover behavior”. The technical details are pretty interesting, as they show the lack of real/solid integration between the AIX kernel and HACMP/PowerHA. The primary source of information is the AIX RAS reporting facility (see “man errpt” for more info) and the errdemon process. Errdemon takes proper actions based on the configuration of “errnotify” ODM objects. The installation of PowerHA adds special hooks so that any special messages about failing hardware/OS/probes (in this case, losing quorum on VGs) are propagated to the HACMP/PowerHA scripts. This can be verified as follows:

root@jkwha001d : /home/root :# odmget errnotify | tail -17

errnotify:
        en_pid = 0
        en_name = "clSF1"
        en_persistenceflg = 1
        en_label = "LVM_SA_QUORCLOSE"
        en_crcid = 0
        en_class = "H"
        en_type = "UNKN"
        en_alertflg = ""
        en_resource = "LVDD"
        en_rtype = "NONE"
        en_rclass = "NONE"
        en_symptom = ""
        en_err64 = ""
        en_dup = ""
        en_method = "/usr/es/sbin/cluster/diag/clreserror $9 $1 $6 $8"
root@jkwha001d : /home/root :#
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
--------------------------------------------
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      32768 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#

.. so the method of rescue after losing SAN storage (also on SAN-booted hosts) seems to be the script /usr/es/sbin/cluster/diag/clreserror. Of course, in this particular case, this script is also located on the SAN…

root@jkwha001d:# ls -l /usr/es/sbin/cluster/diag/clreserror
-r-x------    1 root     system        13813 Feb 24 2009  /usr/es/sbin/cluster/diag/clreserror
root@jkwha001d:# file /usr/es/sbin/cluster/diag/clreserror
/usr/es/sbin/cluster/diag/clreserror: shell script  - ksh (Korn shell)
root@jkwha001d:# head -45  /usr/es/sbin/cluster/diag/clreserror
#!/bin/ksh93
[..]
#
#  Name:        clreserror
#
#   Notify method in errnotify stanzas that are configured for Selective
#   Fallover triggered by (AIX) Error Notification.
#   This function is merely a wrapper for clRMupdate.
#
#   Argument validation is performed in the clRMapi.
#
#  Arguments:
#
#       $1      Error Label, corresponds to errnotify.en_label
#       $2      Sequence number of the AIX error log entry
#       $3      Resource Name, corresponds to errnotify.en_resource
#       $4      Resource class, corresponds to errnotify.en_class
#
#       Argument validation is performed in the clRMapi.
#
#       Environment:  PATH
#
root@jkwha001d:#

And rest assured that this script calls a lot of other scripts, which of course can also be unavailable if rootvg is on the same physical storage as the affected VG.

There are actually two good findings here. The first one is that if you lose all SAN-based hdisks, you are going to be flooded with thousands of entries in the errpt facility. Some of those can go undetected by errdemon because of overflowing the in-memory log. The workaround for this seems trivial: just enlarge the error log buffer. This is documented here:

root@jkwha001d : /home/root :# /usr/lib/errdemon -B 1048576
0315-175 The error log memory buffer size you supplied will be rounded up
to a multiple of 4096 bytes.
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
--------------------------------------------
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      1048576 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#

Additionally, it seems to be of some value to mirror some of the LVs on the affected VGs. This might add some stability to the detection of losing LVM quorum, i.e. this shows the properly mirrored LVM log LV loglv04 across 2 hdisks (PVs)…

root@jkwha001d:# lsvg -p jw1data1vg
jw1data1vg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk5            active            479         30          00..00..00..00..30
hdisk3            active            479         30          00..00..00..00..30
root@jkwha001d:# lsvg -l jw1data1vg
jw1data1vg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
jw1data1            jfs2       896     896     2    open/syncd    /oracle/JW1/sapdata1
loglv04             jfs2log    1       2       2    open/syncd    N/A
root@jkwha001d:# lslv loglv04
LOGICAL VOLUME:     loglv04                VOLUME GROUP:   jw1data1vg
LV IDENTIFIER:      0001d2c20000d9000000012b86756f99.2 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2log                WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        32 megabyte(s)
COPIES:             2                      SCHED POLICY:   parallel
LPs:                1                      PPs:            2
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    128
MOUNT POINT:        N/A                    LABEL:          None
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     NO
root@jkwha001d:#
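
For reference, such a mirrored copy can be created with mklvcopy and then synchronized with syncvg. A minimal sketch, using the LV and PV names from this VG (verify the target PV before running anything like this):

root@jkwha001d:# mklvcopy loglv04 2 hdisk5
root@jkwha001d:# syncvg -l loglv04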

The second finding is that even with all those changes there is a very high probability of failure (i.e. the PowerHA RG move won’t work). Personally, the risk is so high that for me it is nearly a guarantee. The only proper solution to this problem that I am able to see is to add a special handler to the err_method in the errdemon code, something like err_method = KILL_THE_NODE. This KILL_THE_NODE would have to be implemented internally by the always-running errdemon process, and that process should run with its memory protected from swapping (something like mlockall())… because currently it does not run that way:

root@jkwha001d:# svmon -P 143470
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
  143470 errdemon         24851    11028        0    24700      N     N     N

     PageSize                Inuse        Pin       Pgsp    Virtual
     s    4 KB                 323          4          0        172
     m   64 KB                1533        689          0       1533

    Vsid      Esid Type Description              PSize  Inuse   Pin Pgsp Virtual
   7c05d         d work fork tree                    m   1041   233    0    1041
                   children=4d6cdc, 0
    8002         0 work fork tree                    m    492   456    0     492
                   children=802760, 0
   4d0f7         2 work process private             sm    131     4    0     131
   35309         3 clnt mapped file,/dev/hd9var:69   s    107     0    -       -
   110c0         f work shared library data         sm     41     0    0      41
   1d023         1 clnt code,/dev/hd2:44058          s     25     0    -       -
   45115         - clnt /dev/hd2:63900               s     11     0    -       -
   3902a         - clnt /dev/hd9var:1015             s      8     0    -       -
root@jkwha001d:#

Hope that IBM fixes that. There is no APAR because I’m lazy… I don’t even want to fight with the 1st line of support… BTW: I’m not the first one who noticed this problem; please see here for a blog post about the same issue from Ricardo Gameiro.

-Jakub.

PowerHA failure scenario when dealing with SAN-booted LPARs – part II

Monday, January 24th, 2011

Ok, this is part II of the post PowerHA failure scenario when dealing with SAN-booted LPARs – part I.

The first scenario we performed was to disable 100% of the MPIO storage paths to the active HACMP node by un-mapping the Virtual FibreChannel adapters (vadapters) on both VIOSes protecting the active node (LPAR). On both VIOSes we performed the following command:

$ vfcmap -vadapter vfchostXYZ -fcp

where vfchostXYZ was the server-side (or better, VIOS-side) vadapter handling FC I/O traffic for this LPAR. The result? The LPAR with the active HACMP/PowerHA Resource Groups on it (jkwha001d) after some time evicted itself from the HACMP cluster, and the old passive node jkwha002d became the active one (Oracle started there). The root cause of the jkwha001d LPAR crash is the following:

root@jkwha001d : /home/root :# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
AFA89905   0103071611 I O grpsvcs        Group Services daemon started
97419D60   0103071611 I O topsvcs        Topology Services daemon started
A6DF45AA   0103071111 I O RMCdaemon      The daemon is started.
67145A39   0103070911 U S SYSDUMP        SYSTEM DUMP
F48137AC   0103070911 U O minidump       COMPRESSED MINIMAL DUMP
DC73C03A   0103070911 T S fscsi0         SOFTWARE PROGRAM ERROR
9DBCFDEE   0103070911 T O errdemon       ERROR LOGGING TURNED ON
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070511 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
(..)
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk4         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk8         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk6         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk3         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk2         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk0         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DE3B8540   0103070411 P H hdisk5         PATH HAS FAILED
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
(..)
A39F8A49   1210235210 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk6         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi0         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk1         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
(..)

As you can see, AIX 6.1 generated a “SYSTEM DUMP”, which is actually a system panic indicator. Normally AIX saves the memory image of its kernel at runtime to the Logical Volumes configured with the “sysdump” type and then reboots. This allows investigating why the machine crashed, but here you won’t see anything: you’ve lost FC storage connectivity (even rootvg), so the dump couldn’t even be saved. So how does AIX know about it after the reboot? Probably the state of the LPAR crash is saved somewhere in the POWERVM firmware area. It’s one of the RAS things in AIX/POWER. So far, so good: HACMP/PowerHA was able to recover the services…
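
(As mentioned above, AIX writes the kernel image to LVs of the “sysdump” type; where the dump would go can be checked with sysdumpdev. The output below shows typical AIX 6.1 defaults, not this cluster’s actual devices:)

root@jkwha001d : /home/root :# sysdumpdev -l
primary              /dev/lg_dumplv
secondary            /dev/sysdumpnull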

OK, but we wanted real proof, so we performed a double-check. We suspected that the VIOS might have some magic way to communicate with the LPAR. We wanted to exclude that factor, so we simulated disconnecting the FC at the storage level. The initial state was a stable HACMP cluster, with the RG active on jkwha002d (and jkwha001d being passive), and all MPIO paths to the Netapp cluster (storage1, storage2) reported as “Enabled” by “lspath”. The igroup term in the Netapp world is short for “initiator group” and is responsible for things like LUN masking. If you remove access for some FC WWPNs on the LUN, you end up in a situation in which AIX has hdisks pointing to non-existing SCSI targets, and the AIX SCSI stack gets a lot of errors (FC is just a wrapper around SCSI).

On the 2nd storage controller (storage2), the igroup jkwha002d_boot controlled access to the OS-level LUNs (rootvg, etc.):

storage2> lun show -g jkwha002d_boot
        /vol/jkwha002d_boot/boot/rootvg   40.0g (42953867264)   (r/w, online, mapped)
storage2>

Also on the 2nd storage controller there was an igroup responsible for the SAP and Oracle LUNs:

storage2> lun show -g jkwha002d_cluster
        /vol/sap_jw1_mirrlog/mirrlog/mirrlog01.lun      1g (1073741824)    (r/w, online, mapped)
        /vol/sap_jw1_oraarch/oraarch/oraarch01.lun     51g (54760833024)   (r/w, online, mapped)
        /vol/sap_jw1_sapdata/sapdata2/sapdata02.lun     15g (16106127360)   (r/w, online, mapped)
storage2>

The same igroup was present on the 1st storage system, controlling the remaining LUNs (yes, this is an active-active configuration, where reads/writes are managed by AIX LVM, with VGs consisting of LUNs on the two controllers):

storage1> lun show -g jkwha002d_cluster
        /vol/clus_001d_002d_hb/hb/hbvg     20m (20971520)      (r/w, online, mapped)
        /vol/sap_jw1_origlog/origlog/origlog01.lun      1g (1073741824)    (r/w, online, mapped)
        /vol/sap_jw1_sapbin/sapbin/sapbin01.lun    100g (107374182400)  (r/w, online, mapped)
        /vol/sap_jw1_sapdata/sapdata1/sapdata01.lun     15g (16106127360)   (r/w, online, mapped)
storage1>

Let’s do it:

storage1> igroup remove
usage:
igroup remove [ -f ] <initiator_group> <node> ...
  - removes node(s) from an initiator group. The node may also
    specified as its alias. See "fcp wwpn-alias" for help with aliases
    By default, if lun maps exist for the initiator group, the node is
    not removed from the group. The -f option can be used to force the
    removal.
For more information, try 'man na_igroup'
storage1>
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:e0

.. and HACMP failover won’t work. The active LPAR node (jkwha002d) is going to end up in a zombie state. If you had opened SSH sessions before, everything would indicate it is still working (ls on /etc works, some command-line utilities too), but only because some things are cached in AIX’s filesystem cache. Everything else is going to fail with I/O errors, and lspath will cry too (it won’t display all MPIO paths as failed, but that is a story for another post). Examples:

root@jkwha002d:# topas
bash: /usr/bin/topas: There is an input or output error.
root@jkwha002d:# nmon
bash: /usr/bin/nmon: There is an input or output error.
root@jkwha002d:# lspath
Failed  hdisk0 fscsi1
Failed  hdisk1 fscsi1
Enabled hdisk0 fscsi1
Failed  hdisk1 fscsi1
Failed  hdisk0 fscsi0
Failed  hdisk1 fscsi0
Failed  hdisk0 fscsi0
Enabled hdisk1 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Enabled hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Enabled hdisk5 fscsi1
Enabled hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Enabled hdisk3 fscsi1
Enabled hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Enabled hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Enabled hdisk8 fscsi1
Missing hdisk9 fscsi0
Missing hdisk9 fscsi0
Missing hdisk9 fscsi1
Missing hdisk9 fscsi1
root@jkwha002d:# errpt | more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
EA88F829   0103074511 I O SYSJ2          USER DATA I/O ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
00B984B3   0103074511 U H hdisk0         UNDETERMINED ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW

Oracle is going to panic (which is good behaviour in my opinion):

root@jkwha002d:# tail alert_JW1.log
Mon Jan  3 07:39:21 2011
Errors in file /oracle/JW1/saptrace/background/jw1_lgwr_831500.trc:
ORA-00345: redo log write error block 7 count 1
ORA-00312: online log 13 thread 1: '/oracle/JW1/origlogA/log_g13m1.dbf'
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: -1
Additional information: 512
Mon Jan  3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan  3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan  3 07:39:23 2011
CKPT: terminating instance due to error 221
Instance terminated by CKPT, pid = 368764
root@jkwha002d:#

But what’s most interesting is that Oracle asks AIX to write this to the alert log file, and it is available to read with commands like “tail”, while the “cat” command won’t work (you won’t be able to read the whole alert log file, because you don’t have I/O!). What’s even more interesting is that you won’t see those messages after rebooting (once kernel memory is gone!). If you don’t have I/O, how are you going to write/fsync this file?

Another thing is that the active HACMP node will still be active, handling Resource Groups, etc. Failover won’t happen. A possible solution to this problem would be a kernel-based check that verifies that at least /etc is accessible. Why kernel-based? Because you have to have some form of heartbeat in memory (like an AIX kernel module, or an uncachable binary always present in RAM, running with Real-Time scheduling priority) that would continuously probe the storage. If the probe failed several times in a row, it should reboot the node (and that would trigger a Resource Group failover to the 2nd node in the cluster).

Note: typical HACMP scripting, at least in theory, is not enough: even if it would force running /sbin/halt, how can you be sure that /sbin/halt and all the required libc contents are in memory?
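
Just to illustrate the idea (and only the idea: per the note above, a user-space script like this is exactly the thing that may stop working once storage is gone, because its own binaries and libc pages may become unreadable), a minimal sketch could look like the code below. The device name, thresholds and timings are made up:

#!/bin/ksh93
# Hypothetical storage-probe watchdog - an illustration only, NOT reliable
# (see the note above about binaries/libc disappearing with the storage).
integer failures=0
while true; do
    # try to read one block straight from the raw rootvg disk (name assumed)
    dd if=/dev/rhdisk0 of=/dev/null bs=4096 count=1 >/dev/null 2>&1 &
    pid=$!
    sleep 5                                # crude I/O timeout
    if kill -0 $pid 2>/dev/null; then
        kill -9 $pid                       # dd hung on I/O -> probe failed
        wait $pid 2>/dev/null
        (( failures += 1 ))
    elif wait $pid; then
        failures=0                         # probe succeeded
    else
        (( failures += 1 ))                # dd finished, but with an error
    fi
    # several consecutive failures -> halt the node to force an RG takeover
    (( failures >= 3 )) && /usr/sbin/halt -q
    sleep 10
done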

Perhaps I’ll write a part III of this article :)

-Jakub.

PowerHA failure scenario when dealing with SAN-booted LPARs – part I

Sunday, January 9th, 2011

Together with Jedrzej we’ve exposed a rather interesting weakness in the IBM PowerHA 5.5 solution (in the old days it was called HACMP). Normally you would assume that in case of a major cataclysm, such as the *complete* disappearance of storage on the active node, PowerHA or AIX has an internal mechanism to prevent downtime by switching the services to the next active node (as defined in the PowerHA policies/configuration). This starts to be really interesting when we talk about SAN-booted AIX LPARs. As everybody knows, any form of assumption is bad (this is starting to be my mantra), and as we show here, avoiding this pitfall requires SOME PLANNING AND CONFIGURATION to avoid ending up in a long downtime…

The environment consisted of:

  • Two LPARs (jkwha001d = active; jkwha002d = passive) running on separate Power frames.
  • Each LPAR on each frame was protected by a separate pair of VIOSes (to provide redundancy for the NPIV-enabled FC fabrics and to enable Shared Ethernet Failover).
  • AIX 6.1 TL3
  • Oracle 10gR2
  • PowerHA 5.5SP2
  • Netapp DataOntap 7.3 storage (Netapp cluster, 2 controllers) – yes, AIX LPARs are also SAN BOOT-ed from this storage
  • 1 network-based (IPv4) heartbeat
  • 1 disk-based (on FC LUN) heartbeat
  • Properly configured & running scripts for starting and stopping Oracle
  • No application monitors configured

We’ve performed two experiments:

  1. Simulating total storage loss only on jkwha001d (the active one, hosting Oracle) by unmapping the Virtual Fibre Channel (NPIV) host adapters (vadapters) on both VIOSes protecting jkwha001d.
  2. Simulating total storage loss only on jkwha002d (the active one, hosting Oracle, after the failover) by forcing disconnection of jkwha002d’s Fibre Channel initiators (WWPNs) on the Netapp cluster.

Both of which simulated the same thing: a virtual LPAR running without any storage at all. More on the results in the next post…

-Jakub.

MAA challenge lab, part #4

Friday, April 30th, 2010

No posts on my blog for a long time, need to change that. So I was trying to get the MAA (Maximum Availability Architecture by Oracle) lab in shape again for writing my Master of Science thesis…

Somewhere near January/February this year:

  • Primary VM RAC nodes prac1, prac2, prac3 are working again, but database db3 is not (unable to archive to db3dg on srac1, srac2). The main root cause was that experiments in April 2009 with the log_archive_min_succeed_dest=2 setting caused losing sync between the primary and the standby.
  • Standby VM RAC nodes srac1, srac2 are up again, but without the db3dg database and ASM instances.

So some actions were performed (February/March):

  • ASM +DATA1 recreated on both new srac1 & srac2 VMs (on a dedicated VG/LV: vg01/db3dg on the physical synapse server, on a dedicated 15k RPM SCSI drive [4th in the system] named sdb [sda is RAID5 for OS/XEN/VMs]).
  • Fixed some XEN configuration files for VMs with “w!” (disabled shared-locks mode) & the losetup bug in /etc/xen/scripts/block.
  • Standby rebuilt (from an RMAN duplicate backup + FAL gap fetching), so finally I’ve got a working MAA again :)

The problematic thing is that after a failover/switchover, due to the differences between the primary (64-bit) and the standby (32-bit), I have to invalidate & recompile all PL/SQL packages (very time consuming on old hardware! and it seems that Broker is unable to handle that case).
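
The standard word-size conversion step is to invalidate all PL/SQL with utlirp.sql and then recompile it with utlrp.sql (both shipped in $ORACLE_HOME/rdbms/admin). A rough sketch of that step, to be double-checked against the docs for your exact version:

sqlplus / as sysdba <<'EOF'
STARTUP UPGRADE
@?/rdbms/admin/utlirp.sql
SHUTDOWN IMMEDIATE
STARTUP
@?/rdbms/admin/utlrp.sql
EOF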

So last week:

  • prac7, prac8 32-bit VMs were launched.
  • I’ve switched to using a FreeNX server instead of typical X11 SSH forwarding (much more interactive over slow OpenVPN!).
  • Also I’ve upgraded Grid Control (OMS) to 10.2.0.5 (the next step will be 11g) and the OMS database repository to 11.1.0.6 (from 10.1.0.4).
  • Next I deployed 32-bit Clusterware (11.1) on those nodes and played a little bit with OCR corruptions [Metalink note ID 399482.1] after hitting mysterious listener outages (OCR corruption was not the cause of it; it was a permission issue on a single directory – doh!)
  • Created clustered ASM, and created a 32-bit RAC database named “db4.lab1”.
  • I’ve also deployed OEM/GC agents on prac7, prac8 (10.2.0.5 versions) directly from GC (awesome for mass deployments!).
  • Yesterday I exported one schema from the db3.lab1 database (64-bit RAC) using expdp and imported it into “db4.lab1”…

Planned soon:

  • Upgrading GC to 11g and upgrading the agents to 11g too.
  • Building MAA for “db4.lab1”; the plan is to create a DataGuard standby for it on the srac1, srac2 VMs (32-bit too; they already host the DataGuard standby for “db3.lab1”). This one is going to use DataGuard Broker to get Fast-Start Failover working (2-node primary RAC with 2-node standby RAC).
  • Extending the primary RAC to prac9 (to be created), so as to have a 3-node primary RAC for “db4.lab1” protected by DataGuard Broker with a 2-node standby RAC.
  • Fun with Snapshot Standby and Real Application Testing.
  • Detailed measurements of failover times with SYNC DataGuard and Broker – orajdbcstat (my little utility for testing JDBC/FCF failover times; perhaps I’ll uplift it to UCP from 11.2).

MAA challenge lab, part #3

Wednesday, April 15th, 2009

11/04/2009:
1) Finally got some time for cleaning up Grid Control (dropping ora2 and ora3). Secured all agents (on VMs: gc, prac1, prac2). I’ve also cleaned up XEN dom0 (on quadvm). These VMs are not needed anymore. db3.lab (RAC on prac1, prac2) is in GC. Installed the 10.2.0.5 32-bit agent on srac1 (single-node standby).
2) Testing the application of the single-node RAC standby for differences in Standby Redo Logs processing (verification performed by using read-only mode).
3) LNS (ASYNC=buffers_number in the LOG_ARCHIVE_DEST_2 parameter) performance fun (see the sketch after this list).
Prepared srac2 for a future RAC extension (to two nodes: srac1, srac2). Also installed the GC agent on srac2 (10.2.0.5).
4) prac3: cloning it and adding it into the Clusterware prac_cluster (addNode.sh run from prac2). Deployed the GC 10.2.0.5 agent on this node (prac1 and prac2 are both on 10.2.0.4; in the future I’ll try to upgrade them via GC). Later manually created the +ASM3 and db33 instances (redo, undo, srvctl, etc.). It means that I have a 3-node primary RAC :)
5) srac2: the plan is to add it to the srac_cluster and make it a 2-node standby RAC. +ASM2 was running, but more work is needed (mainly registrations in CRS/OCR).
6) The Flash Recovery Area on the standby ASM’s diskgroup +DATA1 was exhausted (thus MRP0 died), so I performed a full RMAN backup with archivelogs to QUADVM dom0’s NFS and afterwards deleted the archivelogs to reclaim some space. On the SRAC standby I changed the archivelog deletion policy (in RMAN) and then restarted MRP0.
Unfortunately I’ve lost my RAID5 array on synapse (the dom0 hosting srac_cluster: srac1, srac2; it’s an old HP LH 6000R server): 2 drives have failed, so my standby RAC is doomed until I rebuild synapse on new SCSI drives (to be ordered) :(
UPDATE: I’ve verified the backups of my srac1 and srac2 VMs, but the backups for ASM diskgroup +DATA1 failed. Also my OCR and voting disks are lost. It will be real fun & a challenge to recover this standby RAC environment (this will also be pretty much like restoring a non-DataGuarded RAC environment after a site crash). I believe I won’t have to rebuild my standby from the primary, because I backed up this standby earlier. The OCR hopefully can be restored from the Clusterware auto-backup location.
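
Regarding item 3 above: in 10gR2 the LNS buffer count is given inline in the destination definition. A hypothetical example (the service name and buffer count are made up):

SQL> ALTER SYSTEM SET log_archive_dest_2='SERVICE=db3dg LGWR ASYNC=61440 VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=db3dg' SCOPE=BOTH;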

15/03/2009:
Finally the two-node RAC {prac1, prac2} is being protected by a DataGuard single-node standby RAC {srac1}.

21/02/2009:
Set up XEN shared storage and Oracle software storage on synapse for srac1, configured as /u01 on srac1.
Clusterware, ASM +DATA1 and database (RAC) installation on srac1 (x86_32).

MAA challenge lab, part #2

Tuesday, February 3rd, 2009

25/01/2009: Cloning prac2 and adding it to the Clusterware cluster (with prac1). Debugging the Clusterware bug blogged about earlier.

26/01/2009: Reading about migration of single instance ASM to full clustered ASM/RAC. Experiments with NOLOGGING and RMAN recovery on xeno workstation (db TEST).

27/01/2009: I’ve managed to migrate to a fully working RAC for db3.lab1 {nodes prac1 and prac2} with ASM storage (ASM migration done using DBCA; RAC migration performed using rconfig). Deployed the GC agent on prac2.
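
For the record, the rconfig part boils down to editing the sample XML and running the tool against it. A sketch, assuming standard Oracle home paths (the XML must be edited for your nodes, homes and ASM diskgroup):

$ cp $ORACLE_HOME/assistants/rconfig/sampleXMLs/ConvertToRAC.xml /tmp/db3_to_rac.xml
$ vi /tmp/db3_to_rac.xml     # set Convert verify="ONLY" first for a dry run, then "YES"
$ $ORACLE_HOME/bin/rconfig /tmp/db3_to_rac.xml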

29/01/2009: Experiments with DBMS_REPAIR, NOLOGGING and RMAN recovery on xeno workstation (db TEST).

Brief description of work on my MAA challenge lab… #1

Saturday, January 24th, 2009

The picture below (click to enlarge) shows what I’m planning to deploy in my home lab in order to prepare better for OCP certification. It can be summarized as a full Maximum Availability Architecture implementation… Grid Control is being used to increase productivity, but I don’t want to integrate Oracle VM into the stack, just systems and databases:

MAA challenge lab

17/01/2009: Installation of and fight with Grid Control on VM gc. Preparing a VM Linux template named 01_prac1 from which other machines are going to be cloned (as simple as a recursive “cp” in dom0).

18/01/2009: Installation of & fight with Grid Control after I dropped the “emrep” database (the main GC repository database). This happened while I was playing with the database “misc1” cloned from “emrep”. I didn’t read the message while running “DROP DATABASE ..” from RMAN and I’ve sent both to /dev/null, because the DBID was the same for the original one and the “misc1” clone (the primary reason was that I wanted misc1 cloned from emrep, but it failed :) ). Did I say that I’ve also deleted the backups? ;) After a new, successful fresh installation of GC, I’ve taken a full backup (from XEN dom0) of the 01_gc VM for future purposes. I’m starting to regret that I haven’t used LVM in dom0 for snapshot/fast-backup purposes…

20/01/2009: Setting up the 09_ora1 virtual machine from the VM template 01_prac1. Installing a single 11g database named “db1.lab1” with a dedicated tablespace & user for sysbench version 0.4.8 (0.4.10 doesn’t work with Oracle).

23/01/2009: Cloning ora2 from ora1. Changing hostnames, IPs (trick: the same Ethernet MACs but on different XEN bridges; changes performed from the console :) ). Uninstalling Agent10g, vanishing the db on ora2. Setting up a DNS server on quadvm (in XEN dom0) for the whole lab. Installing the GC agent on ora2 – something is wrong… the GC console doesn’t catch up with new targets (primarily I’m looking for the “Host” component). The agent itself is discovered by GC… starting from the beginning (rm -rf 08_ora2 :o ) and so on…
Finally I got ora3 up and running instead of ora2. Then it turned out that Laurent Schneider had blogged about the Metalink note in which the procedure for agent removal is described. So finally I’ve got ora2 back in GC (with gc, ora1 and ora3).

The next step was setting up host prac1 for a single-instance non-RAC ASM database “db3.lab1”. First, Clusterware was installed. I wanted the 32-bit version, because my future standby RAC hardware is only 32-bit capable, but it appeared that I would have to install 32-bit userland RPMs, so I decided to try, in the long term, a 64-bit primary RAC with a 32-bit standby RAC… Also the Grid Control agent was installed on prac1.

24/01/2009: Raised RAM for 01_prac1 to 1.6GB from 1.2GB (due to frequent swapping occurring; I want my 11g memory_target for that fresh db3.lab1 database to be left at 450MB). Successful migration of ASM storage /dev/hde to /dev/hdi (simulating storage migration – thanks to ASM it is straightforward; I’m going to blog about it in my next post).
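
For the curious: the usual online approach to such a migration is a single add/drop rebalance operation on the diskgroup. A sketch, assuming diskgroup +DATA1 and that the old disk got the default ASM name DATA1_0000 (check V$ASM_DISK first):

SQL> ALTER DISKGROUP DATA1 ADD DISK '/dev/hdi' DROP DISK DATA1_0000 REBALANCE POWER 4;
SQL> SELECT operation, state, est_minutes FROM v$asm_operation;

The rebalance is finished when the second query returns no rows.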

My article about Extended RAC is finally public (on OTN)

Tuesday, November 11th, 2008

In case you would like to experiment with an Extended RAC cluster, my article on OTN should be helpful. Enjoy!

B.Sc.

Thursday, February 28th, 2008

It’s time to sum up several things:

  • I definitely need a rest(!). It’s my priority one. The problem is that I’m addicted to DOING something…
  • On 08.02.2008 I successfully got my Bachelor of Science. Basically we implemented a cluster using Solaris, Solaris Cluster, Oracle, Linux Virtual Servers, Apache2 and JBoss (I had to use Oracle DataGuard instead of RAC… as all of it was implemented at home; see below for a diagram). I’ll probably release the web panel for Solaris Jumpstart+FLARs (MySQL, PHP) some day under the GPL. It was one of the add-on projects for that engineering work.
  • Since about 15.02.2008 I’ve been studying for a Master of Science in Computer Science, on the Data Processing Technologies specialty track (emphasis is put on all database-related stuff).
  • Currently I’m preparing for the Oracle Certified Associate (Database Administrator) exam…
  • It’s time to refresh my site after some period of inactivity.

Final architecture of the cluster (it was my playground for testing software and features previously unknown to me; it is NOT a real architecture I would suggest anyone use ;) ):

Final architecture of cluster


Dilbert Engineer series:

Have fun! :)

Automatic failover with Oracle DataGuard (Fast-Start Failover)

Wednesday, December 5th, 2007

This post will demonstrate beautiful software… Oracle DataGuard :)

Quick intro for non-Oracle people out there… Oracle DataGuard is a High Availability solution for the Oracle Database. For thousands of pages of documentation, concept guides, troubleshooting guides and HOWTOs about it, please visit docs.oracle.com ;)

Some facts:
1) Max Availability mode (a requirement of Fast-Start Failover)
2) Flashback Database is on (also required)
3) Physical standby
4) All configured from the CLI (sqlplus and dgmgrl; no OEM)
5) Oracle version: 10gR2 EE (10.2.0.1)
6) All done on a single host (2 instances + 1 observer + 1 listener)
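
For reference, the Broker-side enablement looked roughly like this (a reconstruction, not the original transcript; the prompts indicate the tool, and Flashback Database must be enabled on both the primary and the standby beforehand):

SQL> ALTER DATABASE FLASHBACK ON;
DGMGRL> EDIT CONFIGURATION SET PROTECTION MODE AS MaxAvailability;
DGMGRL> ENABLE FAST_START FAILOVER;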

Ok, let’s start the observer (the element which monitors both databases and decides when to fail over to the standby):

[oracle@xeno ~]$ export ORACLE_SID=xeno1
[oracle@xeno ~]$ dgmgrl
DGMGRL for Linux: Version 10.2.0.1.0 - Production
Copyright (c) 2000, 2005, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/abc
Connected.
DGMGRL> show configuration verbose;
Configuration
Name: DGxeno
Enabled: YES
Protection Mode: MaxAvailability
Fast-Start Failover: ENABLED
Databases:
xeno1 - Primary database
xeno3 - Physical standby database
- Fast-Start Failover target
Fast-Start Failover
Threshold: 30 seconds
Observer: xeno.localdomain
Current status for "DGxeno":
SUCCESS
DGMGRL> START OBSERVER
<-- session hangs...

From another terminal we will insert some data:

[oracle@xeno ~]$ echo $ORACLE_SID
xeno1
[oracle@xeno ~]$ sqlplus vnull/abc
SQL*Plus: Release 10.2.0.1.0 - Production on Tue Dec 4 18:48:39 2007
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning and Data Mining options
SQL> insert into dgtest values ('30000000');
1 row created.
SQL> commit;
Commit complete.
SQL>

Great, now we are going to prepare a small holocaust for our xeno1 database… we simply SIGKILL all processes of the xeno1 instance at once:

[root@xeno ~]$ for P in `ps auxw | awk '/ora_[0-9a-z]+_xeno1/ { print $2 }' `; do kill -9 $P; done

Simple and efficient! ;)

Screenshot from DGMGRL observer:
DGMGRL observer failover

Now we can see the following in alert_xeno3.log:

[..]
<-- simulated failure of xeno1!
Tue Dec 4 19:01:19 2007
RFS[14]: Possible network disconnect with primary database
Tue Dec 4 19:01:19 2007
RFS[15]: Possible network disconnect with primary database
Tue Dec 4 19:01:19 2007
RFS[13]: Possible network disconnect with primary database
<-- failover starts!
Tue Dec 4 19:02:45 2007
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
Tue Dec 4 19:02:45 2007
Terminal Recovery: Stopping real time apply
Tue Dec 4 19:02:45 2007
MRP0: Background Media Recovery cancelled with status 16037
Tue Dec 4 19:02:45 2007
Errors in file /u01/app/oracle/admin/xeno3/bdump/xeno3_mrp0_7206.trc:
ORA-16037: user requested cancel of managed recovery operation
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
Recovered data files to a consistent state at change 2316040
Tue Dec 4 19:02:45 2007
Errors in file /u01/app/oracle/admin/xeno3/bdump/xeno3_mrp0_7206.trc:
ORA-16037: user requested cancel of managed recovery operation
Tue Dec 4 19:02:45 2007
MRP0: Background Media Recovery process shutdown (xeno3)
Tue Dec 4 19:02:46 2007
Terminal Recovery: Stopped real time apply
Tue Dec 4 19:02:46 2007
Attempt to do a Terminal Recovery (xeno3)
Tue Dec 4 19:02:46 2007
Media Recovery Start: Managed Standby Recovery (xeno3)
Managed Standby Recovery not using Real Time Apply
Terminal Recovery timestamp is '12/04/2007 19:02:46'
Terminal Recovery: applying standby redo logs.
Terminal Recovery: thread 1 seq# 95 redo required
Terminal Recovery: /u04/oracle/xeno3/standby_redo01.log
Identified End-Of-Redo for thread 1 sequence 95
Tue Dec 4 19:02:46 2007
Incomplete recovery applied all redo ever generated.
Recovery completed through change 2316041
Tue Dec 4 19:02:46 2007
Media Recovery Complete (xeno3)
Terminal Recovery: successful completion
Begin: Standby Redo Logfile archival
End: Standby Redo Logfile archival
Resetting standby activation ID 3036789101 (0xb501b96d)
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
Tue Dec 4 19:02:51 2007
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
Tue Dec 4 19:02:51 2007
ALTER DATABASE SWITCHOVER TO PRIMARY (xeno3)
If media recovery active, switchover will wait 900 seconds
Standby terminal recovery start SCN: 2316040
SwitchOver after complete recovery through change 2316041
Online log /u04/oracle/xeno3/redo01.log: Thread 1 Group 1 was previously cleared
Online log /u04/oracle/xeno3/redo02.log: Thread 1 Group 2 was previously cleared
Standby became primary SCN: 2316039
Converting standby mount to primary mount.
Tue Dec 4 19:02:51 2007
Switchover: Complete - Database mounted as primary (xeno3)
Completed: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
Tue Dec 4 19:02:51 2007
ARC6: STARTING ARCH PROCESSES
Tue Dec 4 19:02:51 2007
ALTER SYSTEM SET standby_archive_dest='' SCOPE=BOTH SID='xeno3';
Tue Dec 4 19:02:51 2007
ARC7: Becoming the 'no SRL' ARCH
Tue Dec 4 19:02:51 2007
ALTER SYSTEM SET log_archive_dest_1='location="/u04/oracle/xeno3/arch"','valid_for=(ONLINE_LOGFILE,ALL_ROLES)' SCOPE=BOTH SID='xeno3';
Tue Dec 4 19:02:51 2007
ALTER SYSTEM SET log_archive_dest_state_1='ENABLE' SCOPE=BOTH SID='xeno3';
Tue Dec 4 19:02:51 2007
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
Tue Dec 4 19:02:51 2007
Completed: ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
Tue Dec 4 19:02:51 2007
ALTER DATABASE OPEN
Tue Dec 4 19:02:51 2007
Assigning activation ID 3036795877 (0xb501d3e5)
LGWR: Primary database is in MAXIMUM AVAILABILITY mode
Tue Dec 4 19:02:51 2007
Destination LOG_ARCHIVE_DEST_2 is SYNCHRONIZED
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
Tue Dec 4 19:02:51 2007
ARCa: Archival started
ARC6: STARTING ARCH PROCESSES COMPLETE
ARCa started with pid=13, OS id=7969
LNSb started with pid=29, OS id=7971
Error 12514 received logging on to the standby
Tue Dec 4 19:02:58 2007
Errors in file /u01/app/oracle/admin/xeno3/bdump/xeno3_lgwr_6565.trc:
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
Tue Dec 4 19:02:58 2007
LGWR: Error 12514 verifying archivelog destination LOG_ARCHIVE_DEST_2
Tue Dec 4 19:02:58 2007
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Continuing...
Thread 1 advanced to log sequence 97
LGWR: Waiting for ORLs to be archived...
LGWR: ORLs successfully archived
Thread 1 opened at log sequence 97
Current log# 2 seq# 97 mem# 0: /u04/oracle/xeno3/redo02.log
Successful open of redo thread 1
Tue Dec 4 19:03:01 2007
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Tue Dec 4 19:03:01 2007
SMON: enabling cache recovery
Tue Dec 4 19:03:02 2007
Successfully onlined Undo Tablespace 1.
Dictionary check beginning
Dictionary check complete
Tue Dec 4 19:03:02 2007
SMON: enabling tx recovery
Tue Dec 4 19:03:02 2007
Database Characterset is WE8ISO8859P1
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=29, OS id=7973
Tue Dec 4 19:03:03 2007
LOGSTDBY: Validating controlfile with logical metadata
Tue Dec 4 19:03:03 2007
LOGSTDBY: Validation complete
Tue Dec 4 19:03:04 2007
Completed: ALTER DATABASE OPEN
[..]


[oracle@xeno ~]$ export ORACLE_SID=xeno3
[oracle@xeno ~]$ dgmgrl
DGMGRL for Linux: Version 10.2.0.1.0 - Production
Copyright (c) 2000, 2005, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/abc
Connected.
DGMGRL> show configuration;
Configuration
Name: DGxeno
Enabled: YES
Protection Mode: MaxAvailability
Fast-Start Failover: ENABLED
Databases:
xeno1 - Physical standby database (disabled)
- Fast-Start Failover target
xeno3 - Primary database
Current status for "DGxeno":
Warning: ORA-16608: one or more databases have warnings
DGMGRL> show database xeno3;
Database
Name: xeno3
Role: PRIMARY
Enabled: YES
Intended State: ONLINE
Instance(s):
xeno3
Current status for "xeno3":
Warning: ORA-16817: unsynchronized Fast-Start Failover configuration
DGMGRL> show database xeno1;
Database
Name: xeno1
Role: PHYSICAL STANDBY
Enabled: NO
Intended State: ONLINE
Instance(s):
xeno1
Current status for "xeno1":
Error: ORA-16661: the standby database needs to be reinstated
DGMGRL>

Now we should check our data:

[oracle@xeno ~]$ export ORACLE_SID=xeno3
[oracle@xeno ~]$ sqlplus vnull/abc
SQL*Plus: Release 10.2.0.1.0 - Production on Tue Dec 4 19:16:17 2007
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning and Data Mining options
SQL> select * from dgtest;
ID
----------
[..]
777
30000000 <--- our data
[..]

DataGuard saved the day :)

My engineering work…

Wednesday, August 15th, 2007

As of May I’ve been very busy architecting & implementing a cluster for Java Enterprise Edition on commodity hardware (mainly x86_32 based) for my engineering work, to obtain the BEng title. Our subject is:
“Web service based on a scalable and highly available J2EE application cluster”. We have a team of 4 people, in which I’m responsible for all kinds of systems/hardware scaling/clusters/load balancing/databases/networking/tuning everything :) . What kind of portal we are creating is to be decided by the developers (it will likely be some kind of Web 2.0 portal).
The rest of the team is dedicated to J2EE programming. We are mainly playing with technology.
Currently the rock-solid base core cluster architecture looks like this:

Cluster architecture

We are utilizing:

  • Load balancers: Linux Virtual Servers with DirectRouting on CentOS5 (configured as a part of the Redhat Cluster Suite); see the ipvsadm sketch after this list
  • Database: Oracle 10g R2
  • Middleware: JBOSS 4.2.0 (EJB3) running in a cluster based on JGroups + Hibernate (JPA) + JBOSS Cache
  • Frontend: Apache2 webservers with Solaris Network Cache Accelerator and an AJP proxy to the JBOSS servers
  • Solaris Jumpstart to set up new systems really fast, with our self-written PHP application for maintaining systems
  • NFS for providing static content for the web servers from the Oracle server (yay! a dedicated NetApp would be great! ;) )
  • LDAP to synchronize admin accounts inside the cluster
  • SNMPv2 (LVS, OSes, JBOSS, Oracle) to monitor everything with a single (self-written) Java application which graphs everything in real time
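
Under the hood (the Redhat Cluster Suite tooling manages this for us), the LVS-DR part of the first entry boils down to something like the following sketch; the VIP and real-server IPs are made up:

ipvsadm -A -t 192.168.0.100:80 -s wlc                  # virtual HTTP service, weighted least-connection
ipvsadm -a -t 192.168.0.100:80 -r 192.168.0.11:80 -g   # real server 1, direct routing
ipvsadm -a -t 192.168.0.100:80 -r 192.168.0.12:80 -g   # real server 2, direct routing
# each real server must also carry the VIP on a non-ARPing interface:
ifconfig lo:0 192.168.0.100 netmask 255.255.255.255 up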

As this is a basic configuration with the database as a single point of failure, in September I’m going to set up DataGuard for Oracle. Also I’m testing more advanced scale-up. Currently I’m in the process of setting up Solaris Cluster with Oracle RAC 10gR2 implemented on iSCSI storage provided by a third node based on Solaris Nevada with an iSCSI target, to test Transparent Application Failover. I’ve been scratching my head over this one for a while now. Yeah, it is real hardcore… moreover, that’s not the end of the story: Disaster Recovery with some other interesting bits of technology is going to be implemented later on… all on x86_32 commodity hardware :) Also we are going to put C-JDBC (the Sequoia project) under stress…

(Sun) Solaris Cluster 3.2 on x86-32bit … on VMware – screenshot

Monday, May 21st, 2007

booting_vulcan1_32bit.png

Nice site: benchmarks of GFS, GNBD, OCFS2

Friday, March 23rd, 2007

A mostly English site dedicated to benchmarking clustered filesystems: DistributedMassStorage

A few interesting links (Clustered Samba, an interesting CCIE.PL post, expensive SAN devices ;) )

Sunday, March 18th, 2007

Samba never interested me much (because I wasn’t interested in exporting storage to Windows ;) ), although clustered Samba looks promising.

On the CCIE.PL forum there is a very interesting post (author: pjeter) which describes, more or less, grabbing the picture from a cable TV tuner and throwing it onto multicast in real time (with a codec) on an ordinary Linux PC, thanks to which the TV picture is available to other computers on the local network…

SAN network architectures interest me (e.g. this link)… but unfortunately they are beyond my budget’s reach:

  • an MDS 9216 costs about 120000 PLN
  • better not to even write about the MDS 95xx ;)

MySpace scaling – link

Sunday, March 18th, 2007

A nice article about MySpace and their “scaling evolution”.