Caution

Under construction.

The idea

The idea behind this problem is, how could we optimally partition code on a HPC distributed system, in a fully automated manner, by the compiler, with complete knowledge of how the entire system interacts.

Definitions

We have to start off with a couple definitions first:

The basics

  • We define as the set of all the functions in the system.
  • We define as the set of all the nodes we want to partition the functionality to.
  • We define a calling pair as an ordered pair of functions , where function directly calls .
  • We define as the set of all calling pairs.
  • We define a partitioning as a function .
  • We define a node boundary as , where , given that is the chosen partitioning.
  • We define as the set of all the node boundaries, given the partitioning .
  • We define as the average number of times function has to call .
  • We define as the cost of calling from , given partitioning . We can further expand on this, as follows:

Here, , , and are the marshalling and un-marshalling costs associated with the RPC, for the call and response respectively. is the transport cost of a marshalled message of length between nodes and , which has to be added twice: once for the call and once for the response.

  • We define as the total number of times that function is called, during a normal, as-close-to-the-real-use scenario as possible, in a given time frame. We assume that it is possible to run instrumented binaries in a staging state, routing a statistically representative fraction of the real traffic to those instrumented binaries, and then estimating the total number of times those functions would be called from that.

Introducing distributed computing

As of right now, if we were to optimise for the cost function defined above, we would find quite a simple answer: . This works out well for small systems where the load is small, so a single node can handle all the functionality of the app, but we are going for distributed systems, so we will also introduce the following 5 additional concepts:

  • We define a load criterion as some computing resource that would be required for running that function. This does not include the load incurred by function calls within that function, either to other functions or any form of recursion. Example of load criterions could include busy CPU time, memory usage, external storage usage (for example, persistent storage through disks). The list of criterions considered is predefined and static.
  • We define as the set of all load criterions.
  • We define as the average load that node would incur if we were to put function on it, from a single function call, with regards to the load criterion ; we are going to assume that function might take up different loads if put on different nodes, for example, because of certain specialised hardware that would allow heavy computations to run faster on a node when they are present as opposed to another node where they aren’t.
  • We define as the capacity of node , that is, the total load, with regards to load criterion , that it can handle before it gets “overwhelmed,” and performance starts to suffer drastically.
  • We define as the total load incurred if we were to put function on node , with regards to load criterion . is for a single function call, whereas this would be for all calls to the function , so we expect it to be about .

The actual problem statement

Given all the definitions outlined above, we are finally able to give a formal definition of the problem:

Find a set of nodes and a partitioning of the functions in the system such that the following conditions all hold simultaneously:

No unused nodes

Formally,

No overload

We can roughly describe a no-overload condition as:

Or, the total load of a node always stays below, or is at most equal to the capacity of the node, with regards to every load criterion.

For now, we assume that the following holds:

There are functions that could overload a single node by themselves if we were to put them on only one, therefore this doesn’t always hold, and so we will make some minor adjustments in the partitioning algorithm, as well as introducing load balancing, in a later chapter, but for now we will stick with this.

Minimize function call cost

Formally, find partitioning :