Navigating the Rescue Mode for Linux
This document will take you through the process of booting your Linux server into rescue mode to identify and fix the problem(s) that may be causing it to be unresponsive.
This guide will instruct you on how to:
- Log into rescue mode
- Identify disk partitions
- Detect physical disk problems
- Detect and fix file system errors
- Access and recover data
Logging into rescue mode
If your Linux dedicated server is unresponsive and fails to come online after a reboot, you can boot the server into rescue mode from the Simply Cloud control panel to identify and fix the problem.
- Once rescue mode has been started on your dedicated server, log into the system via SSH using your server's usual IP address and the root password that was set when the system was first installed (you can find this in your Simply Cloud control panel). You can also access the server in graphical mode using VNC if you have a VNC client installed.
Please be aware that the rescue mode system will have a different SSH host key to your normal server. If you are using PuTTY you will see a warning like Screen 1:
- Accept the warning by clicking the 'Yes' button, then log in. If you are using SSH from a Linux or Mac shell, you may need to remove the old version of the SSH key from your known_hosts file before logging in. Once you have finished with rescue mode and booted your server normally, it will return to using its usual SSH host key and you will see a similar warning again.
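On a Linux or Mac shell, the stored key can be removed with the -R option of ssh-keygen rather than by editing the file by hand. The following is a minimal sketch: forget_host_key is a hypothetical helper name, and 203.0.113.10 is a placeholder for your server's IP address.

```shell
# forget_host_key is a hypothetical helper name, not a standard command.
# It removes every stored host key for the given address from the
# current user's known_hosts file, so SSH can record the rescue key.
forget_host_key() {
    ssh-keygen -R "$1"
}

# Example (203.0.113.10 is a placeholder address):
#   forget_host_key 203.0.113.10
#   ssh root@203.0.113.10
```

The same helper can be run again after the server has been booted normally, when the usual host key returns.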
You should see a window similar to Screen 2 once you are logged in:
Identifying your disk partitions
- Identify your disk partitions before recovering your system. Get a list of all of the disks connected to the system and their partitions by running the command 'fdisk -l' as noted in Screen 3:
- The exact output from this will vary depending on the number of disks in your server, the number of partitions on each disk, and whether or not your system uses software RAID. Screen 3 shows one disk (/dev/sda) that contains four partitions (numbered 1, 2, 5 and 6). The first partition (/dev/sda1) is marked as bootable, so this would be the partition mounted under /boot.
The second partition (/dev/sda2) is an extended partition and is only used as a container for the other two partitions. It is not mountable. The third partition (/dev/sda5) is the swap space, and the fourth partition (/dev/sda6) is the root partition, normally mounted as /. If your server has two disks the output will look something like Screen 4:
If your system uses software RAID, it will look something like Screen 5:
- If your system uses software RAID, there are additional steps you will need to take before attempting to fix disk issues or access your data. Please refer to the separate software RAID instructions in the following sections.
If no disks are displayed (or an incorrect number of disks is displayed) then the disk(s) may have already suffered a catastrophic failure. In such an event, you will need to ask Simply Cloud Support to arrange for a replacement disk or server and then restore any backups.
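The listing step can be sketched as a small shell function. The name list_disks is hypothetical, and reading /proc/mdstat is an addition not shown in the screens above: on Linux systems with software RAID, that file lists each md array and its member partitions.

```shell
# list_disks is a hypothetical helper name used only for this sketch.
list_disks() {
    # Every disk and partition the rescue system can see (run as root).
    fdisk -l

    # Software RAID arrays, if any, are listed in /proc/mdstat
    # together with their member partitions.
    if [ -r /proc/mdstat ]; then
        cat /proc/mdstat
    fi
}

# Example:
#   list_disks
```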
Detecting physical disk problems
- Your disk(s) may have physical errors that cannot be corrected and would require a disk replacement. You can use the smartctl program to test the disk to see if this is the case. First, check that the disk has its SMART capability enabled with the command 'smartctl -i /dev/diskname', swapping diskname for the correct device as shown in Screen 6.
This command should be successful as all Simply Cloud disks have SMART enabled. If this command does not return information about the disk(s), a catastrophic failure may have occurred and the disk(s) will need to be replaced.
- Run a test on the disk using 'smartctl -t short /dev/diskname'. Further options are available (use 'man smartctl' to see them). You will see a message that the test will take around one minute to complete as shown in Screen 7:
- After waiting a minute, use 'smartctl -a /dev/diskname' to see the results displayed as a table with the number of disk failures that have occurred over the disk's lifetime. The example in Screen 8 does not show any major errors:
- Look out for a high error count next to any of the errors with the type 'Pre-fail' as these may be an indication that the disk is going to fail soon. If any of your disks have this type of error, please contact Simply Cloud Support.
- Smartctl can be used on systems with multiple disks by running the above sequence of commands for each disk (not each partition).
No separate software RAID instructions are required for this section.
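The sequence above can be sketched as one function that takes each whole disk in turn. The name smart_check is hypothetical, and the 60 second wait simply mirrors the estimate given in this guide; a disk may report a different duration when the test starts.

```shell
# smart_check is a hypothetical helper name used only for this sketch.
# Pass whole disks (e.g. /dev/sda /dev/sdb), not partitions.
smart_check() {
    for disk in "$@"; do
        # Show identity information and whether SMART is enabled.
        smartctl -i "$disk"
        # Start the short self-test; it runs in the background on the disk.
        smartctl -t short "$disk"
    done

    # Give the self-tests time to finish (assumed to be about one minute).
    sleep 60

    for disk in "$@"; do
        # Print the full report, including the attribute table.
        smartctl -a "$disk"
    done
}

# Example:
#   smart_check /dev/sda /dev/sdb
```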
Detecting and fixing file system errors
- Your server may fail to boot if there are errors with the file system. You can identify and correct these errors using the fsck tool. For example, if you have seen errors in the system logs indicating partition problems on the root disk (/dev/sda6 as shown in Screen 9), you can try to correct this by running the command 'fsck /dev/sda6'. This must be done before the disk has been mounted.
- In Screen 9, there are a few minor errors that fsck has fixed. For more severe errors, fsck may ask if you would like to fix them through a prompt. To avoid being prompted and simply accept the default options, run the fsck command with the -a flag. Further details are available from the fsck manual (type 'man fsck').
- If you fix any disk errors, exit rescue mode and attempt to boot the system normally. If the system still fails to boot, or fsck cannot fix the disk errors, you may need to recover any data that you did not back up (see the section on recovering data).
If your system uses software RAID, perform fsck on the RAID device rather than on the member partitions so that the file system on both disks is checked at once. The RAID device will likely be either /dev/md0 or /dev/md1, whichever is the larger (the smaller RAID device will be swap space). In Screen 10, minor errors have been corrected.
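Because fsck must only be run on an unmounted file system, the check can be wrapped in a small guard. This is a sketch under stated assumptions: check_fs is a hypothetical helper name, and it relies on /proc/mounts listing mounted devices in its first column, as it does on Linux.

```shell
# check_fs is a hypothetical helper name used only for this sketch.
check_fs() {
    # Refuse to check a file system that is currently mounted.
    if grep -qs "^$1 " /proc/mounts; then
        echo "$1 is mounted; unmount it before running fsck" >&2
        return 1
    fi
    # -a repairs automatically without prompting.
    fsck -a "$1"
}

# Examples:
#   check_fs /dev/sda6   # single disk
#   check_fs /dev/md0    # software RAID device
```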
Access and recover data
- If your disks did not show any errors, or you know your system failed to boot for reasons unrelated to the disks (e.g., an incorrectly enabled firewall or an incorrectly modified grub configuration), you will need to access your disk(s) to either correct the problem or recover the data before reimaging. To do this, the disk(s) need to be mounted.
- From earlier steps, you should have already established the root partition. In our one disk example shown in Screen 11, it is /dev/sda6 and in our RAID example in Screen 12 it is /dev/md0. For servers with multiple disks, you may want to access the partition on the second disk, although problems that prevent a server booting will normally be on the partition mounted at /.
- To access the data on the root partition, create a mount point for the partition. For our one disk system, it will be created at /mnt/sda6. We then mount the disk on this mount point, and cd into the directory to view our system as shown below in Screen 11:
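The mount step might look like the following sketch. The function name mount_partition is hypothetical; the device and mount point in the example are the ones used in this guide.

```shell
# mount_partition is a hypothetical helper name used only for this sketch.
# $1 is the partition or RAID device, $2 is the mount point to create.
mount_partition() {
    # Create the mount point if it does not exist, then mount the device.
    mkdir -p "$2" && mount "$1" "$2"
}

# Example for the one disk system:
#   mount_partition /dev/sda6 /mnt/sda6
#   cd /mnt/sda6 && ls
```

For the RAID example, the same call would use /dev/md0 and /mnt/md0.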
- Use the chroot command to change the root of the rescue system to the root on the disk. This is needed if you want to use the 'passwd' program to reset one of your system passwords.
- Then use 'chroot mountpoint' to change the root to the partition you have mounted. In Screen 13 we used 'chroot /mnt/sda6' or 'chroot /mnt/md0'. You may see an error such as:
chroot: failed to run command `/bin/zsh': No such file or directory
This indicates that the zsh shell used by the rescue system is not available to run (i.e., it is not installed) on your dedicated server. In this case, modify the command to run the bash shell:
'chroot mountpoint bash'
- Finally, run any remaining commands (such as passwd), and use exit to come out of the chroot.
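For a single task such as a password reset, chroot can also be given the command to run directly, which avoids starting a shell inside the installed system at all. A minimal sketch, assuming the disk is already mounted; reset_password is a hypothetical helper name.

```shell
# reset_password is a hypothetical helper name used only for this sketch.
# $1 is the mount point (e.g. /mnt/sda6); $2 is the account, default root.
reset_password() {
    # Run passwd from the installed system, not from the rescue system.
    chroot "$1" passwd "${2:-root}"
}

# Examples:
#   reset_password /mnt/sda6
#   reset_password /mnt/md0 root

# For an interactive shell inside the installed system instead:
#   chroot /mnt/sda6 bash
```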
Recovering your data
- If you are unable to fix your server, you will need to copy any data that is not backed up before requesting a reimage from the Simply Cloud control panel. If you have access to another server that runs FTP or SSH, use the command line FTP or SCP tools to upload your data to that server. Otherwise, you can connect an SCP client (such as WinSCP for Windows) to the rescue mode server, navigate to the point where you mounted the disk and download the data to your local system.
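Uploading to another server over SSH could be sketched as follows. The helper name backup_dir, the host backup.example.com, the user name and the destination path are all placeholders, not details of any Simply Cloud service.

```shell
# backup_dir is a hypothetical helper name used only for this sketch.
# $1 is a directory on the mounted disk; $2 is an scp destination.
backup_dir() {
    # -r copies directories recursively; -p preserves times and modes.
    scp -rp "$1" "$2"
}

# Example (host, user and path are placeholders):
#   backup_dir /mnt/sda6/home user@backup.example.com:/backups/
```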
- When you have finished making changes, unmount the disk, and end rescue mode by rebooting the server from the control panel as shown in Screen 15. If necessary, reimage the server via the Simply Cloud control panel.
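Unmounting fails while any shell is still working inside the mount point, so it helps to leave the directory first. A minimal sketch, with unmount_partition as a hypothetical helper name:

```shell
# unmount_partition is a hypothetical helper name used only for this sketch.
# $1 is the mount point that was created earlier.
unmount_partition() {
    # Leave the mount point first; a busy file system cannot be unmounted.
    cd / && umount "$1"
}

# Example:
#   unmount_partition /mnt/sda6
```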