In Real Application Clusters recovery, the amount of recovery processing required after node failures is proportional to the number of failed nodes. In general, data blocks become available immediately after they are recovered.
When an instance fails and the failure is detected by another Oracle instance, Oracle performs the following recovery steps:
- During the first phase of recovery, which is the GES reconfiguration, Oracle first reconfigures the GES enqueues. Then Oracle reconfigures the GCS resources. During this time, all GCS resource requests and write requests are temporarily suspended. However, processes and transactions can continue to modify data blocks as long as these processes and transactions have already acquired the necessary enqueues.
- After the reconfiguration of enqueues that the GES controlled, a log read and the remastering of GCS resources occur in parallel. At the end of this step the block resources that need to be recovered have been identified.
- Buffer space for recovery is allocated, and the resources that were identified during the log read in the previous step are claimed as recovery resources. Then, assuming that past images (PIs) of the blocks to be recovered exist in other caches in the cluster database, resource buffers are requested from other instances. The resource buffers are the starting point of recovery for a particular block.
- Once all resources and enqueues required for subsequent processing have been acquired, the Global Resource Directory is unfrozen. Any data blocks that are not in recovery can now be accessed, so the system is already partially available at this point.
- The cache layer recovers and writes each block identified during the log read, releasing the recovery resources immediately after block recovery so that more blocks become available as cache recovery proceeds.
- After all blocks have been recovered and the recovery resources have been released, the system is again fully available. Each recovered block becomes available as soon as its own recovery completes, not only when the whole recovery sequence ends.
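
The way availability grows across these steps can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not Oracle's implementation: blocks are plain identifiers, "available" means that new resource requests for a block can be granted, and the function name `simulate_recovery` and its parameters are hypothetical.

```python
def simulate_recovery(all_blocks, redo_block_ids):
    """Return (phase, blocks open to new requests) snapshots.

    Hypothetical model of the recovery steps: it tracks only which
    blocks can be newly requested, not enqueues or buffer contents.
    """
    all_blocks = set(all_blocks)
    timeline = []

    # Reconfiguration: the Global Resource Directory is frozen, so no
    # new resource requests are granted for any block.
    timeline.append(("reconfiguration (directory frozen)", set()))

    # Log read: identify the blocks that need recovery.
    recovery_set = set(redo_block_ids) & all_blocks

    # Directory unfrozen: every block not in recovery is accessible.
    timeline.append(("directory unfrozen", all_blocks - recovery_set))

    # Cache recovery: release each block's recovery resource as soon
    # as that block is recovered, so availability grows step by step.
    for block in sorted(recovery_set):
        recovery_set.remove(block)
        timeline.append((f"recovered block {block}", all_blocks - recovery_set))

    # All recovery resources released: fully available again.
    timeline.append(("recovery complete", set(all_blocks)))
    return timeline


if __name__ == "__main__":
    for phase, blocks in simulate_recovery({1, 2, 3, 4, 5}, [2, 4, 4]):
        print(f"{phase}: {sorted(blocks)}")
```

In this model, blocks 1, 3, and 5 are accessible as soon as the directory is unfrozen, and blocks 2 and 4 rejoin the available set one at a time while cache recovery is still running.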
In summary, recovered portions of the database become available before the entire recovery sequence completes. This makes the system available sooner and makes recovery more scalable.
If neither a PI buffer nor the current buffer for a data block is in any surviving instance's cache, then Oracle performs a log merge of the failed instances' redo logs. As with recovery in general, the performance overhead of a log merge is proportional to the number of failed instances and to the size of the redo logs for each instance. You can, however, control the size of the log with Oracle's checkpoint features.
With its advanced design, Real Application Clusters recovery can manage multiple simultaneous failures and sequential failures. The shared server feature is also resilient to instance failures during recovery.
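
The log merge described above can be pictured as a k-way merge of the failed instances' redo streams. The following sketch is illustrative only and makes several assumptions that are not taken from this text: each redo record is a tuple `(scn, block_id, delta)`, each instance's log is already sorted by its sequence number, and applying a record means adding `delta` to a numeric block value. The names `merge_redo_logs` and `recover_block` are hypothetical.

```python
import heapq


def merge_redo_logs(logs):
    """Merge per-instance redo logs into one stream ordered by sequence number.

    `logs` maps an instance (thread) id to a list of
    (scn, block_id, delta) records that is already sorted by scn.
    This record layout is an assumption made for illustration.
    """
    streams = [
        [(scn, thread, block, delta) for scn, block, delta in records]
        for thread, records in sorted(logs.items())
    ]
    # heapq.merge reads each sorted stream once, so the work grows with
    # the number of failed instances and the size of each log.
    return list(heapq.merge(*streams, key=lambda rec: (rec[0], rec[1])))


def recover_block(merged, block_id, start_value=0):
    """Apply, in merged order, every change recorded for one block."""
    value = start_value
    for _scn, _thread, block, delta in merged:
        if block == block_id:
            value += delta
    return value


if __name__ == "__main__":
    logs = {
        1: [(10, "A", 5), (30, "B", 1)],
        2: [(20, "A", -2), (40, "A", 7)],
    }
    merged = merge_redo_logs(logs)
    print(merged)
    print(recover_block(merged, "A"))
```

Because every record from every failed instance passes through the merge, shorter logs (for example, after more frequent checkpoints) directly reduce the work in this model.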