Installation Steps for raid1_data_integrity_patch


Issue(s) addressed by this patch (PL-9146):

An issue was introduced into the md/raid1 kernel driver that causes failed writes not to be reported.
As a result, DataKeeper was unable to rewrite the failed data blocks when a mirror was resynchronized,
causing the target to be out of sync with the source.  This patch installs a fix developed by SIOS to
the md/raid1 kernel driver on SPS for Linux versions 9.3.2 - 9.5.1.  The SIOS fix has been submitted to
and accepted by the maintainer of the md/raid1 kernel driver.  Bug reports have been opened with the
Linux OS vendors to incorporate this fix into future kernels.

NOTE: If you are installing SPS for Linux on a new cluster with any of the affected kernels, you will
need to perform steps 1 - 5, 9, and 10 after installing SPS for Linux and BEFORE creating your DataKeeper resources.


Patch Description:

This patch provides an updated md/raid1 kernel module that will properly report all writes that fail.
LifeKeeper will verify that the proper md/raid1 module with the SIOS fix is loaded.

NOTE: The updated md/raid1 kernel module requires Secure Boot to be disabled.


Data Integrity check:

It is important to verify that the resource is in-service on the server with the correct or best data before
doing the full resync in step 13 of the instructions below.  If there is corrupted data on all servers then
a restore from backup may be necessary.


Getting the patch:

This patch can be found on the SIOS FTP site at the following location:
http://ftp.us.sios.com/pickup/HOTFIX-PL-9146-raid1_data_integrity_patch/

To download the patch and associated files on Linux, run the following commands:
wget http://ftp.us.sios.com/pickup/HOTFIX-PL-9146-raid1_data_integrity_patch/raid1_data_integrity_patch
wget http://ftp.us.sios.com/pickup/HOTFIX-PL-9146-raid1_data_integrity_patch/raid1_data_integrity_patch.md5sum
wget http://ftp.us.sios.com/pickup/HOTFIX-PL-9146-raid1_data_integrity_patch/readme.txt

NOTE: Alternative download methods can be used but must include all files.
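
Whichever download method is used, the checksum verification and chmod described in steps 2 and 3
below can be combined into one helper.  The following is a minimal sketch, not part of the patch:
prepare_patch is a hypothetical function name, and it assumes the .md5sum file refers to the patch
file by its bare file name in the same directory.

```shell
# prepare_patch [DIR]: verify the patch against its .md5sum file, then mark it executable.
# Hypothetical helper; DIR defaults to the current directory.
prepare_patch() {
    dir="${1:-.}"
    (
        cd "$dir" || exit 1
        # md5sum -c exits non-zero if any listed file fails its checksum.
        md5sum -c raid1_data_integrity_patch.md5sum || exit 1
        chmod +x raid1_data_integrity_patch
    )
}
```

Running the work in a subshell keeps the caller's working directory unchanged, and the patch is
only made executable when the checksum matches.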


IMPORTANT NOTE: Do NOT perform a rolling upgrade during initial installation of this patch.  The instructions
below require LifeKeeper to be running on all servers and DataKeeper resources out-of-service on all servers to
install the patch.  Do not bring DataKeeper resources in-service until all servers in the cluster have installed the patch.


Patch Installation:

On each cluster node perform the following steps:

1) Download the patch file and md5sum file:

   raid1_data_integrity_patch
   raid1_data_integrity_patch.md5sum

2) Verify the download by running the following command:

   # md5sum -c raid1_data_integrity_patch.md5sum

3) The raid1_data_integrity_patch must be executable:

   # chmod +x raid1_data_integrity_patch

4) Verify that you have the correct DataKeeper version that requires this patch:

   # rpm -q steeleye-lkDR

   steeleye-lkDR-9.3.2-6863.x86_64
   steeleye-lkDR-9.4.0-6959.x86_64
   steeleye-lkDR-9.4.1-6983.x86_64
   steeleye-lkDR-9.5.0-7075.x86_64
   steeleye-lkDR-9.5.1-7154.x86_64

   The output should match one of the packages listed above.  The patch is not intended or needed for any other version of LifeKeeper.
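
The comparison in step 4 can be scripted.  This is a sketch only: is_patch_target is a hypothetical
helper name, and the list simply repeats the five package names given above.

```shell
# is_patch_target PKG: succeed only if PKG is one of the DataKeeper builds listed in step 4.
is_patch_target() {
    case "$1" in
        steeleye-lkDR-9.3.2-6863.x86_64|\
        steeleye-lkDR-9.4.0-6959.x86_64|\
        steeleye-lkDR-9.4.1-6983.x86_64|\
        steeleye-lkDR-9.5.0-7075.x86_64|\
        steeleye-lkDR-9.5.1-7154.x86_64) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a cluster node:
#   is_patch_target "$(rpm -q steeleye-lkDR)" && echo "patch applies to this DataKeeper version"
```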

5) Verify that you are running an affected kernel version:
   
   # uname -r

   Distribution: Affected Kernels
   ---------------------------------------
   RHEL/CentOS/OEL 8.2: All
   RHEL 8.3: All
   OEL 7.x UEK 5: 4.14.35-2025.400.8 <= Kernel <= 4.14.35-2047.504.1
   SUSE 12 SP 4: Kernel >= 4.12.14-95.51
   SUSE 12 SP 5: 4.12.14-122.20 <= Kernel <= 4.12.14-122.74.0
   SUSE 15 SP 1: Kernel >= 4.12.14-197.37.1
   SUSE 15 SP 2: 5.3.18-22.2 <= Kernel <= 5.3.18-24.67.1

   Note: Please check the following documentation page for the most up-to-date list of affected kernel ranges:
   https://docs.us.sios.com/Linux/current/LK4L/important-raid1-kernel-issue

   If the kernel version you are running is not affected, then this patch is not needed.
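
For the distributions with a bounded range, the check in step 5 is a version-order comparison.
The sketch below assumes GNU sort with the -V option is available; ver_le and kernel_in_range are
hypothetical helper names.  Any flavor or architecture suffix that "uname -r" appends (for example
"-default" on SUSE) must be stripped before comparing, and the documentation page above remains
the authoritative list of ranges.

```shell
# ver_le A B: succeed when A <= B in version order (relies on GNU sort -V).
ver_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# kernel_in_range KERNEL LOW HIGH: succeed when LOW <= KERNEL <= HIGH (bounds inclusive).
kernel_in_range() {
    ver_le "$2" "$1" && ver_le "$1" "$3"
}

# Usage for SUSE 12 SP 5, stripping the "-default" flavor suffix first:
#   k=$(uname -r); k=${k%-default}
#   kernel_in_range "$k" 4.12.14-122.20 4.12.14-122.74.0 && echo "affected kernel"
```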


6) Resume any paused DataKeeper mirrors:

   # /opt/LifeKeeper/bin/mirror_action <datakeeper tag> resume

7) Stop all DataKeeper resources in the cluster.  On each server run:

   # /opt/LifeKeeper/bin/perform_action -a remove -t <datakeeper tag>

   If there are DataKeeper resources in-service the patch will fail with the following error:

   ERROR: All DataKeeper resources must be out of service before applying the patch. Please refer to the patch procedure in the documentation.

8) Verify that LifeKeeper is running on all nodes in the cluster before installing the patch:

   # /opt/LifeKeeper/bin/lcdstatus -q

   Output should show the list of resources.  The DataKeeper resources should be OSU.

9) Disable Secure Boot by taking one of the following actions:

  a) Disable Secure Boot in the UEFI configuration
  b) Disable signature verification with the “mokutil --disable-validation” command.  See mokutil documentation for details.
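
To confirm the result of step 9, the state reported by "mokutil --sb-state" can be inspected.
The sketch below is an assumption-laden example: secure_boot_enforcing is a hypothetical helper
name, and the exact wording printed by mokutil ("SecureBoot enabled", "SecureBoot disabled",
and a "validation is disabled" line when shim validation is off) may differ between mokutil versions.

```shell
# secure_boot_enforcing TEXT: succeed when TEXT (the output of `mokutil --sb-state`)
# indicates that Secure Boot is still enforcing module signatures.
# The matched strings are assumptions about mokutil's output wording.
secure_boot_enforcing() {
    case "$1" in
        *"validation is disabled"*) return 1 ;;
        *"SecureBoot enabled"*)     return 0 ;;
        *)                          return 1 ;;
    esac
}

# Usage:
#   if secure_boot_enforcing "$(mokutil --sb-state 2>/dev/null)"; then
#       echo "Secure Boot is still enforced; complete step 9 before installing the patch"
#   fi
```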

10) Install the patch (self extracting binary) on all servers:

  a) Install using the default HADR packages delivered in the patch

     # ./raid1_data_integrity_patch

     If the default HADR packages do not support the currently loaded kernel the following error will occur:

     ERROR: Unable to locate a kernel module package for running kernel <kernel>. Please contact SIOS Customer Support (support@us.sios.com)

     Use the command provided in step 10b to install the patch using the SIOS-provided HADR package.

  b) Execute the following only if you encountered an error in step 10a.

     Install using a custom HADR package provided by SIOS
     
     # ./raid1_data_integrity_patch --addHADR <hadr-rpm-file>

     The patch installs a patched raid1 kernel module, an nbd kernel module (on RHEL, CentOS, and OEL), and LifeKeeper changes to verify the proper md/raid1 module is loaded. 
    
     The following rpm packages are installed:
     
     steeleye-lkHOTFIX-DR-PL-9510-9.5.1-7154.x86_64
     HADR-generic-9.5.2-7273.x86_64
     HADR-<VENDOR>-<KERNEL>-9.5.2-<REVISION>.x86_64

     NOTE: <VENDOR> is RHAS, SuSE, etc, <KERNEL> is the version that the HADR modules are built for, and <REVISION> is the LifeKeeper HADR revision.
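
After step 10, the package list can be checked against the three packages named above.  This is a
sketch under stated assumptions: patch_packages_present is a hypothetical helper name, it reads
"rpm -qa" style output on standard input, and it matches on package-name prefixes only rather than
exact versions.

```shell
# patch_packages_present: read a package list on stdin and succeed only if it contains
# the PL-9510 hotfix package, HADR-generic, and at least one kernel-specific HADR package.
patch_packages_present() {
    pkgs=$(cat)
    printf '%s\n' "$pkgs" | grep -q '^steeleye-lkHOTFIX-DR-PL-9510-' || return 1
    printf '%s\n' "$pkgs" | grep -q '^HADR-generic-' || return 1
    # Any HADR-* package other than HADR-generic is treated as the kernel-specific one.
    printf '%s\n' "$pkgs" | grep '^HADR-' | grep -qv '^HADR-generic-' || return 1
}

# Usage:
#   rpm -qa | patch_packages_present && echo "patch packages installed"
```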

11) Bring the DataKeeper resources in-service on the server where the data is correct.  This is most likely the server where the DataKeeper resource was last in service.

    # /opt/LifeKeeper/bin/perform_action -a restore -t <datakeeper tag>

    NOTE: This will bring all resources in the hierarchy in-service.  Include the ‘-b’ option to bring only the DataKeeper resource in-service, if you do not want all resources active.

12) It is important to verify that the data is correct.  If there have been partial resyncs and switchovers then the data may be corrupt on both servers. 
   
    If there is corrupted data on all servers then a restore from backup may be necessary.

13) Force a full resync on the server where each DataKeeper resource is in-service:

    # /opt/LifeKeeper/bin/mirror_action <datakeeper tag> pause
    # /opt/LifeKeeper/bin/mirror_action <datakeeper tag> fullresync

14) When the full resync is complete any inconsistencies between the source and target will be resolved.
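
When a server hosts several DataKeeper resources, the two commands in step 13 have to be repeated
for each tag.  The loop below is a minimal sketch: full_resync_all is a hypothetical helper name,
and its first argument is a command prefix so the sequence can be printed with "echo" before it is
run for real with an empty prefix.

```shell
# full_resync_all RUNNER TAG...: pause, then force a full resync of, each DataKeeper tag.
# RUNNER is a command prefix: "echo" prints the commands, "" executes them.
full_resync_all() {
    runner="$1"
    shift
    for tag in "$@"; do
        # $runner is intentionally unquoted so an empty prefix expands to nothing.
        $runner /opt/LifeKeeper/bin/mirror_action "$tag" pause || return 1
        $runner /opt/LifeKeeper/bin/mirror_action "$tag" fullresync || return 1
    done
}

# Dry run (tag names are placeholders):
#   full_resync_all echo datarep-a datarep-b
# Real run:
#   full_resync_all "" datarep-a datarep-b
```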


Upgrading SIOS Protection Suite for Linux after the patch is applied:

  Note: If upgrading both the Linux kernel and SIOS Protection Suite
        for Linux within the same maintenance window, please follow
        the steps in the “Performing a planned kernel upgrade after
        the patch has been installed” section below, then follow the
        steps in this section to perform the upgrade of SPS-L.

  To upgrade SIOS Protection Suite for Linux after applying the patch,
  perform the following steps.

  1) Resume any paused DataKeeper mirrors.
     
     # /opt/LifeKeeper/bin/mirror_action <datakeeper tag> resume

  2) Either take all mirror resources out of service on all servers or bring
     all resources in-service on a single cluster server (the "primary server").

  3) Perform the following steps on each backup server. To avoid potential
     issues when using quorum functionality, only one backup server may be
     upgraded at a time.

     a) Uninstall the steeleye-lkHOTFIX-DR-PL-9510-9.5.1-7154.x86_64 package
        and delete the DR-PL-9146 directory.
       
        # rpm -e steeleye-lkHOTFIX-DR-PL-9510-9.5.1-7154.x86_64
        # rm -rf /opt/LifeKeeper/SIOS_Hotfixes/DR-PL-9146

     b) Mount the SIOS Protection Suite for Linux installation image and run
        the setup script, ensuring that all required Application Recovery Kits
        are selected for installation.
       
        # mkdir -p /media
        # mount sps.img /media -t iso9660 -o loop
        # /media/setup

     c) IMPORTANT: If upgrading to SIOS Protection Suite for Linux 9.5.1 or
        earlier, the PL-9146 patch (raid1_data_integrity_patch) must be
        reapplied. This step is not required when upgrading to a version of
        SIOS Protection Suite for Linux later than 9.5.1.
       
        # chmod +x raid1_data_integrity_patch
        # ./raid1_data_integrity_patch

  4) If performing a rolling upgrade, perform a manual switchover of all
     protected resources to one of the upgraded backup servers. Repeat steps
     (a)-(c) given in step 3 to upgrade SIOS Protection Suite for Linux on
     the original primary server.

  5) Resources may now be brought in-service on any desired server.


Uninstalling the patch:

    To uninstall the patch, perform the following steps on each cluster node:

1) Resume any paused DataKeeper mirrors.
# /opt/LifeKeeper/bin/mirror_action <datakeeper tag> resume

2) Take all DataKeeper mirrors out of service.
# /opt/LifeKeeper/bin/perform_action -a remove -t <datakeeper tag>

3) Remove the LifeKeeper HOTFIX (LifeKeeper startup raid1 check):
# rpm -e steeleye-lkHOTFIX-DR-PL-9510-9.5.1-7154.x86_64

4) Find and remove the HADR package with the SIOS patched md/raid1 module:
# rpm -qa | grep HADR-
# rpm -e HADR-<VENDOR>-<KERNEL>-9.5.2-<REVISION>.x86_64

NOTE: Please be aware that uninstalling the patch while running on an affected
      kernel will expose you to potential data corruption.  The <VENDOR>, <KERNEL>,
      and <REVISION> are specific to the HADR package that was installed and should
      match the relevant package found in the ‘rpm -qa | grep HADR-’ output.

5) Remove PL-9146 README:
# rm -f /opt/LifeKeeper/SIOS_Hotfixes/DR-PL-9146/README.txt
# rmdir /opt/LifeKeeper/SIOS_Hotfixes/DR-PL-9146

6) For RHEL, CentOS, and OEL, reinstall the currently installed version of
   SIOS Protection Suite for Linux. This step is required in order to reinstall
   the distribution-specific HADR package included with the particular SPS-L release.

Resources may now be brought in-service on any desired server.
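
The <VENDOR>-<KERNEL>-<REVISION> package name needed in step 4 can be picked out of the package
list with a filter.  This sketch assumes, as the planned kernel upgrade procedure in this document
does, that every HADR package other than HADR-generic is a kernel-specific one; kernel_hadr_packages
is a hypothetical helper name.

```shell
# kernel_hadr_packages: read a package list on stdin and print the HADR packages
# that are not HADR-generic (assumed to be the kernel-specific ones).
kernel_hadr_packages() {
    grep '^HADR-' | grep -v '^HADR-generic-'
}

# Usage: review the output before passing any name to "rpm -e".
#   rpm -qa | kernel_hadr_packages
```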


Performing a planned kernel upgrade after the patch has been installed:

To upgrade the running Linux kernel after applying the patch, perform the following steps.

1) Resume any paused DataKeeper mirrors.

   # /opt/LifeKeeper/bin/mirror_action <datakeeper tag> resume

2) Either take all mirror resources out of service on all servers or
   bring all resources in-service on a single cluster server (the “primary server”).

3) Perform the following steps on each backup server. To avoid
   potential issues when using quorum functionality, only one backup
   server may be upgraded at a time.

   a) Delete the DR-PL-9146 directory and uninstall the
      HADR-<VENDOR>-<KERNEL>-9.5.2-<REVISION>.x86_64 and
      steeleye-lkHOTFIX-DR-PL-9510-9.5.1-7154.x86_64 packages.

      # rm -rf /opt/LifeKeeper/SIOS_Hotfixes/DR-PL-9146
      # rpm -e $(rpm -qa | grep HADR- | grep -v generic)
      # rpm -e steeleye-lkHOTFIX-DR-PL-9510-9.5.1-7154.x86_64

   b) Upgrade the kernel and reboot.

   c) Important: If the upgraded kernel version is still in the range
      affected by PL-9146, then the PL-9146 patch must be reapplied.

      # ./raid1_data_integrity_patch

      If the raid1 kernel module provided by the upgraded kernel
      package is no longer affected by the issue described in PL-9146
      (see https://docs.us.sios.com/Linux/current/LK4L/important-raid1-kernel-issue),
      then reinstallation of the PL-9146 patch is not required.
      However, users running RHEL, CentOS, or OEL RHCK must re-run the
      SIOS Protection Suite for Linux setup script for their currently
      installed SPS-L version to reinstall the required
      distribution-specific HADR package. This step is not required on
      SLES or OEL UEK.

      # mkdir -p /media
      # mount sps.img /media -t iso9660 -o loop
      # /media/setup

4) If performing a rolling kernel upgrade, perform a manual switchover
   of all protected resources to one of the upgraded backup servers.
   Repeat steps (a)-(c) given in step 3 to upgrade the kernel on the
   original primary server.

5) Resources may now be brought in-service on any desired server in the cluster.


Recovering from an unplanned kernel upgrade:

An inadvertent kernel upgrade (i.e., one performed without following the
steps in the “Performing a planned kernel upgrade after the patch has
been installed” section) will cause the OS vendor-provided raid1
kernel module to be reloaded. The PL-9510 hotfix (installed as part of
the PL-9146 patch) will perform a check during LifeKeeper startup to
ensure that the SIOS-provided patched raid1 module is still loaded.
When this check fails, LifeKeeper startup will fail until the issue is
corrected. In this situation, the user must either:

1) roll back to the previous kernel,
2) reinstall the PL-9146 patch (if the upgraded kernel version is
   still within the range of kernel versions affected by PL-9146), or
3) uninstall the PL-9146 patch (if the upgraded kernel version is no
   longer affected by PL-9146). See the “Uninstalling the patch”
   section above for more details.

Note: Please check the following documentation page for the most up-to-date
      list of affected kernel ranges:
      https://docs.us.sios.com/Linux/current/LK4L/important-raid1-kernel-issue

If necessary, please refer to your Operating System documentation for
steps to restrict automatic kernel updates. Also, you may refer to
Solution 995 in the SIOS Self Service portal for steps to restrict
automatic kernel updates on RHEL (log into the Customer Portal first
to access).
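
As an illustration only, one common way to hold back kernel packages on yum or dnf based systems is
an "exclude=kernel*" line in the [main] section of the package manager configuration.  The sketch
below is not taken from Solution 995: exclude_kernel_updates is a hypothetical helper name, and it
assumes [main] is the last (or only) section of the file it is given, such as /etc/yum.conf.

```shell
# exclude_kernel_updates CONF: append "exclude=kernel*" to CONF unless an exclude= line
# already exists, in which case the file is left alone and 2 is returned.
# Assumes the [main] section is the last section in CONF.
exclude_kernel_updates() {
    conf="$1"
    if grep -q '^exclude=' "$conf"; then
        echo "exclude= is already set in $conf; edit it by hand" >&2
        return 2
    fi
    printf 'exclude=kernel*\n' >> "$conf"
}

# Usage (as root), after backing up the file:
#   exclude_kernel_updates /etc/yum.conf
```

The exclusion must be lifted again when a planned kernel upgrade is performed using the procedure
described earlier in this document.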