Writing code to be explicitly concurrent is hard and, I'm convinced, almost always the wrong model. We all know threading is hard and leads to all sorts of subtle errors, and I'm pretty sure that no amount of dressing it up nicely will really get rid of the fundamental problems.
My particular bugbear is composability of concurrent code: how do I write a component that executes concurrently, in a way that suits the concurrency needs of any arbitrary caller, while maintaining the encapsulation of its functionality? Say my component has its concurrency implemented properly, and in particular can perform its work concurrently to make best use of a multicore or multi-machine environment - how does it know how much concurrency to use? I don't know whether, to the caller, the execution of my component is on the critical path and should use as many resources as it can to complete in the absolute shortest time possible, or whether it's not at all critical and should execute using the minimum of concurrent resources, because those resources are intended to be dedicated to other components which are the critical path.
With explicit threading, people tend simply to make the component split its work into as many threads as it can and then let the machine scheduler decide which threads run. But even if thread creation and admin overhead were zero (which they aren't), the machine scheduler has no way of knowing which threads are critical and which aren't, so the more threads (and IPC calls and communication resources etc.) each component creates, the more resources it effectively steals from the truly critical components.
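To make that concrete, here's a minimal Python sketch of the anti-pattern (the names are mine, purely for illustration):

```python
import threading

def process_all(items, handle):
    # The 'grab everything' pattern: one thread per work item, then let
    # the OS scheduler fight it out. Nothing here knows whether this work
    # is on the caller's critical path or not.
    threads = [threading.Thread(target=handle, args=(item,)) for item in items]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

process_all(range(8), lambda n: print(n * n))
```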
So one suggested approach is to write components that, in addition to their inputs, take some sort of control block specifying what resources they can use and how they should use them ("maxthreads" and "thread priority" values, say). But this is a very poor composability model: if my component needs to use a number of other components as part of its execution, without knowing precisely how those components work, execute, and consume resources, then there's no good way to take the control block I've been given and work out what control blocks to hand to those subcomponents.
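A small Python sketch of why this composes so badly - the ControlBlock shape, the stub subcomponents and the even split are all hypothetical, there just to show the dilemma:

```python
from dataclasses import dataclass

@dataclass
class ControlBlock:
    max_threads: int
    priority: int

# Stand-in subcomponents: from the outside I can't tell how either one
# uses its budget, or how much work each actually has to do.
def sub_a(data, ctl): return sum(data)
def sub_b(data, ctl): return max(data)

def my_component(data, ctl):
    # All I can do with the budget I was handed is guess at a split for
    # subcomponents whose internals I can't see - any fixed split bakes
    # in assumptions that break when either subcomponent changes.
    half = ctl.max_threads // 2
    return (sub_a(data, ControlBlock(half, ctl.priority)),
            sub_b(data, ControlBlock(ctl.max_threads - half, ctl.priority)))

print(my_component([3, 1, 4], ControlBlock(max_threads=8, priority=1)))
```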
And of course, if the concurrency requirements and performance are influenced by some non-trivial characteristic of the input data - say they depend not just on the number of items in a list but on the values involved, so that determining how much concurrency can be achieved actually requires a partial solution of the problem itself - then composing an overall solution from encapsulated components becomes horribly more complex.
Another way to address this is for components to examine their environment and adjust their execution depending on what they find, but the critical-path issue arises again. Just because there are execution units available and idle (assuming I can even determine this) doesn't mean that, for optimal execution of the system overall, I should go ahead and use them... those units may well be being 'held' for some other component that isn't ready to execute yet (it's waiting for its inputs to arrive) but that will be the critical calculation once it can run.
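About the best such a component can do is a heuristic along these lines (entirely illustrative Python), and it's blind in exactly the way just described:

```python
import os

def probe_workers():
    # Probe-the-environment heuristic: use whatever looks idle right now.
    cores = os.cpu_count() or 1
    try:
        busy = int(os.getloadavg()[0])  # 1-minute load average; Unix only
    except (AttributeError, OSError):
        busy = 0
    # 'Idle now' is all this can see - it can't know that another
    # component, still waiting on its inputs, will need these cores and
    # will be the critical calculation once it becomes runnable.
    return max(1, cores - busy)

print(probe_workers())
```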
Without composition, I have no way to build components that can be used to construct large ad-hoc programs with truly efficient concurrent execution. Anything more complex than an embarrassingly parallel model becomes a one-off special-case construct, or gets shoehorned into some concurrency framework that cripples my ability to express the problem and delivers only a fraction of the performance achievable for a particular problem on a particular set of resources.
So I devised a little model to avoid all this - a functional concurrent language model where everything is composable. There's no explicit concurrency, because the model itself can determine the concurrency capability of every single statement (much as a modern compiler and CPU spot chances to boost performance with out-of-order execution) and then assemble larger concurrent blocks. Every component is written simply as a statement of the calculations required, including use of other components, and the concurrency is determined dynamically, in advance of execution but only when a complete composition is executed. The same component can be used at multiple points in such a solution, but the concurrency demanded of each use is independent of the others. A program written this way dynamically changes its behaviour depending on the execution environment - it balances the load across the given resources, aiming to find an execution schedule that completes in the minimum time possible, that time being the time required for the critical path as determined for those execution resources (including the cost of communicating between execution units where required). Running the same program (with the same inputs etc.) on two execution environments that differ in available resources will result in different critical paths.
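To illustrate the principle (this is a toy Python sketch written for this post, not the model's own notation - the graph encoding and scheduler are purely illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def run_graph(graph, workers=4):
    # graph: {name: (fn, [dependency names])}, all fns pure. Repeatedly
    # launches every node whose dependencies are done, so the available
    # parallelism is discovered from the data flow, never declared.
    done, pending = {}, dict(graph)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            ready = {n: v for n, v in pending.items()
                     if all(d in done for d in v[1])}
            futures = {n: pool.submit(fn, *[done[d] for d in deps])
                       for n, (fn, deps) in ready.items()}
            for n, f in futures.items():
                done[n] = f.result()
                del pending[n]
    return done

# 'b' and 'c' depend only on 'a', so they run concurrently without the
# program ever mentioning a thread.
demo = {
    "a": (lambda: 2, []),
    "b": (lambda a: a + 1, ["a"]),
    "c": (lambda a: a * 10, ["a"]),
    "d": (lambda b, c: b + c, ["b", "c"]),
}
print(run_graph(demo)["d"])  # 23
```

(This toy version schedules in waves rather than truly eagerly, and knows nothing of communication costs - it shows only the 'implicit concurrency from data flow' idea.)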
It's not a perfect solution, of course; as it stands it makes some simplifying assumptions, and as a small but working model it has certain limitations - in particular, it would need expanding to deal with arbitrary dimensionality of the problem expression. And while functions are composable, it imposes certain limitations on operations such as passing functions themselves as parameters or return values.
But it has some fascinating characteristics. Certain whole-program optimisations become available, and as it balances the loads it takes communication times into account. So it doesn't just blindly assign chunks of work to arbitrary execution units; it can exploit the fact that you typically have a grid of multicore machines, and that making data such as inputs, outputs and other artefacts (task completion and related admin info, memoisation caches etc.) available between cores is orders of magnitude quicker than communicating the same information between machines. It can decide when, for example, it's worth doing local (per-machine) or global memoisation of certain operations. And given its whole-program optimisation, it knows the point at which no further resources will boost execution any further - given an excess of execution resources, it will determine a minimal subset of those resources that meets the calculation's demands. This is, admittedly, possibly a sub-optimal minimal set (a better or exhaustive scheduler may be able to do better), but it at least knows that it has no way to use further resources, and will in fact organise itself to load the chosen execution units in an optimal way and leave the other resources completely spare rather than spread itself too thinly.
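The 'no further resources will help' point falls out of standard critical-path reasoning; a minimal sketch with hypothetical node costs:

```python
def earliest_finish(graph, cost):
    # graph: {node: [deps]}, cost: {node: time}. With unlimited workers,
    # a node can finish no earlier than its own cost plus the latest-
    # finishing dependency; the maximum over all nodes is the critical-
    # path time, which no schedule can beat.
    finish = {}
    def f(n):
        if n not in finish:
            finish[n] = cost[n] + max((f(d) for d in graph[n]), default=0.0)
        return finish[n]
    for n in graph:
        f(n)
    return finish

g = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {"a": 1.0, "b": 2.0, "c": 5.0, "d": 1.0}
span = max(earliest_finish(g, cost).values())  # a -> c -> d = 7.0
work = sum(cost.values())                      # 9.0 units of total work
print(span, work / span)  # nothing beats 7.0; ~1.3 workers saturate it
```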
And given its strongly functional constraints, it's also ideally suited to repeated calculations with perturbations to the inputs - the calculation and memoisation model can be tuned so that it performs only the minimal partial recalculations required for any change. Those repeated partial recalculations can take account of, but are in no way constrained to, the resources used for the original calculation, and each will have its own critical path depending on the data changed. This has, needless to say, huge benefits for certain classes of problems.
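The minimal-recalculation part amounts to dirty-marking over the same dependency graph; a small illustrative sketch (node names hypothetical):

```python
def dirty_set(graph, changed):
    # graph: {node: [deps]}. Anything downstream of a changed input must
    # be recalculated; everything else comes straight from the
    # memoisation cache.
    dirty = set(changed)
    grew = True
    while grew:
        grew = False
        for node, deps in graph.items():
            if node not in dirty and any(d in dirty for d in deps):
                dirty.add(node)
                grew = True
    return dirty

g = {"x": [], "y": [], "p": ["x"], "q": ["y"], "r": ["p", "q"]}
print(dirty_set(g, {"x"}))  # {'x', 'p', 'r'} - 'q' keeps its cached value
```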
But where next? It's an intriguing prospect, but it's the sort of thing that I should probably investigate in a more academic environment. It's only a small model, a prototype to illustrate the principles and show what could be done... anyone fancy sponsoring a PhD for a truly concurrent language and execution model?
No... thought not... oh well, back to re-inventing the way we organise ourselves, and maybe I'll publish the language and code sometime.
Archived Comments
Back when this article was on Posterous, it provoked the following comment and response.
IvorHewitt (Twitter) commented:
I think if the tasks you are running concurrently are large enough that it's worth them having knowledge of how many threads/resources they can use, then the tasks simply need subdividing into smaller work units until that's no longer a consideration. On the other hand, if you break the decisions about task scheduling down to the language level, the question is whether that's still viable when you are running concurrently on multiple machines - i.e. do you get killed by the overhead of allocating work and collecting results?
meercat responded:
Breaking them down is the composition problem - how do I write my individual routines so that it's not my routine that decides whether its task is short enough (with respect to what the caller considers significant) to run sequentially or whether it should run concurrently... and if so, to what degree.
That's why you end up with loads of threads, typically many more than you have cores, with no idea in advance (other than hard-coding on a per-problem basis) which are worth running on other machines, and with the threads all competing at certain times for no real overall benefit - you find noticeable periods where there's only a single runnable thread, which is itself taking longer to run because it was starved of resources. This isn't a problem for whole classes of concurrent programs (Monte Carlo simulations, say, where you simply know you have hordes of independent calculations of broadly the same duration), but solving so-called embarrassingly parallel problems is pretty well understood (hence the name).
The approach I'm suggesting instead effectively assembles individual threads to optimally pack the available processing time across all the execution units available on a dynamic basis.
To put it another way, breaking it down to the language level doesn't mean you're forced to go extremely fine-grained with synchronisation primitives... I started to see my mini language as a tree-based (which it then resolves to a graph) configuration language for assembling individual units which need no intrinsic knowledge of threading or synchronisation - they simply look like functions, in that they start execution when invoked and return values when done. (All languages could arguably be viewed as configuring base calculations like this, but this particular approach made it even more so.)
So your base units could be much chunkier than simple operations - the language then becomes an automatic concurrency-scheduling composition mechanism that just happens to be expressed in code (which sort of subverts the whole data-driven coding style - you're still doing data-driven composition, but the data format just happens to look like code).
As for allocating work and collecting results - the first is an issue only as far as you want to take it; the second is no worse than, and I'd say better than, the other options.
First off, the problem as expressed can be optimised at "compile" time (it's not a true compilation, as it reduces the problem to a partially concrete graph - essentially the same form as used for building intermediary 'libraries' of components). At the moment, when the 'program' is executed, I then do a second 'compilation' stage to further collapse and optimise the structures as far as possible - this isn't strictly necessary, but it's nice to get any further gains it brings.
Once you have a fully concrete graph, I then have the option, with appropriate meta-info, to build an execution plan of what will be done on which worker units, or to just blindly fire off threads. The first could be further improved by performing code generation and compilation to hook components together directly.
Sort of thing best debated over whiteboards rather than blogs really :)