timerMsgQueueId : An integer variable that stores the id of the message
queue used for communicating with the timer process.
msgQueueId : An integer variable that stores the id of the message queue
used for communicating with the user programs.
monitorNode : Contains the id of the coordinator that is being monitored.
pidArr : A global array, defined externally, consisting of the pids of all
the user processes.
Here we discuss the case when a coordinator fails.
When the coordinator fails (through host-machine failure, coordinator process failure, or link failure, all assumed to be fail-stop), the following sequence of operations leads to the discovery of its failure:
Let N be the co-ordinator that has failed and let M be the monitor
process that is monitoring N. Note that N and M are running on different
machines.
Following is the RPC service:
void* broadcast_fail( int* id)
Argument: The id of the node that has failed.
On the client side :
The client is the coordinator that has been informed of the node failure
by the monitor running on the same machine as that coordinator.
On the server side :
Each coordinator Q, on receiving this function call, does the following:
1) In the node array, mark the entry corresponding to id as -2, which
indicates that the coordinator with that id has failed.
2) If the failed node was the last one in the ring, then the coordinator
Q checks whether it is now the last node among the remaining coordinators.
3) If Q is the last node, then set the flag globLastMe = true in Q. Thus
Q becomes the publisher.
4) All the user processes that have submitted their computation to the
coordinator must be informed of this node failure. For this, the id of
the failed coordinator is sent to the user message queue.
5) If the monitor R present on the same machine as Q was monitoring the
failed node N, then R must now start monitoring the node that the monitor
at the site of coordinator N was monitoring.
For this, the list L is searched to find the next coordinator in the ARC
ring, and the monitor R is given its id. This is done by placing the
information in the message queue and sending a signal to R.
6) Monitor R, upon receiving the signal, reads the message, learns of the
failure of the node, and starts monitoring the new node.
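Steps 1 to 3 above can be sketched in C as follows. This is only an
illustration under stated assumptions, not the actual implementation: it
assumes the node array is indexed by coordinator id, that ring order follows
increasing index, and that any entry other than -2 means the node is alive.
The names NODE_FAILED, mark_failed, last_alive and handle_failure_locally are
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NODE_FAILED (-2)   /* marker value taken from the text */

/* Step 1: mark the entry for `id` as failed. The entry is overwritten,
 * not deleted. Returns false if `id` is out of range. */
bool mark_failed(int *nodes, size_t n, int id)
{
    if (id < 0 || (size_t)id >= n)
        return false;
    nodes[id] = NODE_FAILED;
    return true;
}

/* Index of the last surviving entry, or -1 if every node has failed.
 * Assumption: "last in the ring" means the highest surviving index. */
int last_alive(const int *nodes, size_t n)
{
    for (size_t i = n; i-- > 0; )
        if (nodes[i] != NODE_FAILED)
            return (int)i;
    return -1;
}

/* Steps 2-3: after marking, report whether coordinator `self` is now the
 * last node. The caller would store the result in its last-node flag. */
bool handle_failure_locally(int *nodes, size_t n, int failed, int self)
{
    if (!mark_failed(nodes, n, failed))
        return false;
    return last_alive(nodes, n) == self;
}
```

With four coordinators 0..3, a call such as
handle_failure_locally(nodes, 4, 3, 2) marks entry 3 as -2 and reports that
coordinator 2 is now the last node. The sketch rescans the array on every
failure rather than only when the failed node was last; the result is the
same and the array is assumed to be small.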
Following is the helper function :
void* node_failed( int* id)
Argument: The id of the node that has failed.
On the client side :
The monitor process calls the function HandleNodeFailure(int id), which
in turn calls this function locally on the coordinator. This function is
called when the monitor process detects that the coordinator it was
monitoring has failed; it is how the failure is brought to the notice of
the coordinator.
On the server side :
The call is received by the coordinator, say P, at the same node as the
monitor. In this function:
1) Set the entry corresponding to id to -2. The entry -2 indicates
that the node has failed. Note that the entry for the failed node is not
deleted.
2) If the failed node was the last one in the ring, then the local
coordinator P checks whether it is now the last node among the remaining
coordinators.
3) If P is the last node, then set the flag globLstMe = true in P.
4) At this stage only the local coordinator knows that some coordinator
has failed. The other coordinators must also be given this information.
For this, P calls the RPC function broadcast_fail(int* id) on all the
other coordinators.
5) The broadcast_fail call returns a null value.
6) All the user processes that have submitted their computation to
the coordinator must be informed of this node failure. For this, the id
of the failed coordinator is sent to the user message queue.
7) The monitor M must now start monitoring the node that the monitor
at the site of coordinator N was monitoring. For this, the list L is
searched to find the next coordinator in the ARC ring, and the monitor M
is given its id (by placing the information in the message queue and
sending a signal to M).
8) Monitor M starts monitoring the new node.
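The search for the new monitoring target in step 7 might look like the
following C sketch. It rests on assumptions not stated in the text: the list
L is represented as an array indexed by coordinator id, the ring is walked
in increasing index order with wraparound, and -2 marks a failed entry. The
function name next_to_monitor is hypothetical.

```c
#include <stddef.h>

#define NODE_FAILED (-2)   /* marker value taken from the text */

/* Returns the index of the first surviving coordinator after `failed`,
 * walking the ring forward with wraparound and skipping failed entries.
 * Returns -1 if the walk comes back to `self` first (nothing left to
 * monitor) or if the arguments are out of range. */
int next_to_monitor(const int *nodes, size_t n, int failed, int self)
{
    if (n == 0 || failed < 0 || (size_t)failed >= n)
        return -1;
    for (size_t step = 1; step < n; step++) {
        size_t i = ((size_t)failed + step) % n;
        if (nodes[i] == NODE_FAILED)
            continue;               /* skip coordinators that failed earlier */
        return ((int)i == self) ? -1 : (int)i;
    }
    return -1;                      /* every other entry has failed */
}
```

For example, with five coordinators where entry 2 has failed and coordinator
1 hosts the monitor, next_to_monitor(nodes, 5, 2, 1) yields 3. The id found
this way would then be placed in the message queue and the monitor signalled,
as described in step 7.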