A discussion of the new dependency code, version 0.2
by Michael Meeks <mmeeks@gnu.org>


The dependency code is a comparatively conceptualy simple part of the gnumeric
code. The code is designed to determine which objects depend on which cells.
The main use of this is triggering recomputation for things that depend on
something that has changed.

1.     Overview of the Dependencies

       The majority of the code related to dependencies can be found in module
eval.c, and this should be the first reference for functions.

1.1    Data structures and their meaning

       The main dependency data is anchored on a per-sheet basis using the
structure DependencyData. This stores all the dependencies for the sheet
in two hash tables.

       There are two types of dependencies, single and range. Loosely these
describe single (ie. = A1 ) cell references vs. large ( ie. = SUM (A1:Z500) )
range references. The two hash tables store DependencyRange structures in the
'range_hash' member, and DependencySingle structures in the 'single_hash'
member.

       The DependencyRange structure defines a range reference. This lists the
Dependencies ( not neccessarily in this sheet ), that depend on the range
specified.  Hence to find the cells that depend on a cell you have just altered
you must search all the range structures in the range_hash for that sheet.

       The DependencySingle structure mapping stores the degenerate case of
a DependencyRange. Essentialy it stores the cells that depend on a unit range.
This allows for extremely fast constant time hashed lookup. This contrasts with
the Range hash, all of which has to be traversed per dependency calculation.
NB. the DependencySingle has to use CellPos' since there is no garentee that
a cell will exist at a given position in the sheet that is depended on.


2.     Dependent Evaluation

       The routine dependent_eval will evaluate a dependent and if its value
changes it will queue cells that depend on it, these are obtained via
cell_forech_dep, which looks up in the single_hash and tranverses the
range_hash to obtain its result. Often cell recalculation happens through
workbook_recalc which works through the workbook's eval_queue re-evaluating
the cells there.

2.1    Evaluation queue vs. recursion

       There are two ways in which a cell's recalculation can occur. Firstly
simply traversing the ExprTree (expr.h) will result in many cells being
evaluated. Essentialy each dereference in the ExprTree ( eg. =A1 ) will cause
the re-calculation of A1's ExprTree before we can continue ( recursively ).
This is actually fairly expensive when the expression contains a reference like
    =sum(a1:a65535)
Each dependent can be in two states


3.     Dependencies the bottleneck

       Since dependencies tend to fan out, and massively increase the area that
needs to be recomputed, it is clearly advantagous to spend time culling the
dependency tree as intelligently as possible. Furthermore since for each cell
that changes it is neccessary to determine its dependencies the lookup of
a cell's dependencies needs to be fast.

3.1    Why two methods

       First, consider the case where only range dependencies are used, this
is a fairly simple first implementation. If we have N cells that have
random dependencies on one other cell, then we will have approx N ranges in
the range hash. For each cell we re-calculate we need to iterate over the
entire range hash to determine its dependencies. Hence we have a fundamentally
O(N^2) algorithem, this is very bad news. This scheme spends almost all of its
time in search_cell_deps.

       To overcome this problem we partition dependencies into single cell
dependencies and range dependencies. This way for the common =A1 case, we don't
add an entry in the range_hash, we simply add an entry in the simple_hash.
Hence for the cell_forech_dep we have one less entry in the range hash to
iterate over, which saves N iterations of search_cell_deps.

	Another common case is having a lot of formulae sharing the same range
in the sheet as a dependency ( eg. cutting and pasting = SUM($A$1:$A$100) ). In
this case there is a single depedency range entry with many cells listed as its
dependancies.

3.2    Inter-sheet dependencies

       Inter sheet dependencies are managed simply by inserting the dependency
information into the sheet in which the cells that are dependended on reside.
This is essentialy exactly what is expected, given that cell's are linked to
the cell they depend on. Removing inter-sheet dependencies is also identical
to normal dependencies, excepting that it is more likely to throw up formulae
that have not been correctly unlinked in a sheet destroy.

3.3    What is hashed

       Whilst the two hashes ( range_hash, simple_hash ) are both GHashTables
what is hashed is quite different.

3.3.1  What does the range hash do ?

       The hashing on the range_hash is merely used to determine if there is
already a range in the hash with the same dimensions as a new dependency range
being added. This accelerates insertion of dependencies, the hash is traversed
as a simple un-ordered list at all other times.

3.3.2  Why not a direct Cell * -> GList * mapping for DependencySingle ?

       This is not done since there is no garentee that cells that have
dependencies are in existance yet. Hence it is quite valid for A1 to be '=A2'
before A2 exists. If A2 does not exist then A2 has no Cell structure yet. This
could be obviated by creating depended cells, but this would be inelegant.


4.     How dependencies are generated and removed

       The dependencies are both generated and mostly removed by the
handle_tree_deps function. This traverses the ExprTree either adding or
removing dependencies on cells as they are met.

4.1    Handling ExprTrees

       The ExprTree is recursively traversed by handle_tree_deps and may
terminate with handle_value_deps terminating in either adding or removing
dependencies, according to the 'add' parameter.

4.2    Removal of dependencies

       The removal of single dependencies is performed by traversing the
ExprTree again, this saves a search of every cell in the dependency hash
looking for this Cell's position. This relies on sheet_cell_remove_from_hash,
and sheet_cell_add_to_hash dropping and adding dependencies, and
formula_unlink / link when a cell's formula is changed, since the original
ExprTree is needed to remove its dependencies correctly. The current code
implements this correctly internaly, but this needs bearing in mind if
extensive work is done to cell.c.

4.3    Special cases

4.3.1  Implicit intersection

       This is as yet unimplemented, but will further reduce the number of
ranges to clip against. Essentialy an implicit intersection reduces a range
to an adjacent single reference under certain circumstances.

4.3.2  Array Formula

       These luckily have a simple dependency structure, since the formula
stored is identical in each cell, the cells may all depend on the corner cell
using a fast single mapping.

4.3.3  The INDIRECT function

       This is rather a special case; this function returns a value that
references a different cell, hence the dependency has to be treated rather
differently. This is yet to be implemented.


5.1     Future Expansion

       There are several avenues open for future expansion. Clearly further
accelerating the range search will give big speedups on large sheets. This
could be done by clipping the dependency ranges several times against smaller
ranges homing in on the cell of interest, and storing the results for future
reference. Clearly many of the MStyle range related optimizations would be
useful here as well.

5.2    Multi-threading,

       With the current structure, it might well be possible to add multi-
threading support to the evaluation engine. The structure of this would take
advantage of the partitioning already provided by the sheet boundary. To do
this it would be neccessary to move the eval_queue to a per-sheet entity, and
putting a locking / signaling mechanism on the queue such that inter-sheet
dependencies could be pre-pended to the queue ( thus ensuring rapid
evaluation ), and waited on safely. Since each cell is evaluated but once
per re-calc, it would then be safe to read the Cell's value when it dissapeared
from the eval_queue.

5.3    ExprTree recursion

       Whether it is always entirely neccessary to re-evaluate a cell solely
on the basis that it is in the ExprTree is non-obvious to me. Clearly if this
cell is in the dependency queue it would make perfect sense, however if there
is as yet no chance that this cell has been changed, it makes little sense
to re-calculate it ( and its tree'd dependencies ). The only problem here is
determining whether any of the currently queued dependencies would alter this
cell's dependencies.
