What is source code structure?

How tediously low can you go?


When is an egg not an egg?

This blog will talk about source code structure.

A lot.

So let's try to clarify our terms. Countless professors have already greyed elegantly in such pursuit but we want to start from a blank page and baby-step our way gently towards this thing called, "Structure." If academic literature at some point taps our shoulder to point out a misstep we shall thank it and adjust our clumsy gait.

Even before considering source code we should ask: what is the structure of anything?

What, for example, is the structure of a carton of six eggs?

One thing that strikes us about a carton of eggs is that it is a composite object. It can be broken into component parts. A carton of six eggs is composed of a carton and of six eggs.

Whatever structure this carton of six eggs has we feel that it is derived from properties of that carton of six eggs. If the carton of six eggs persists over time, we expect its structure to persist. If the carton changes, we expect that the structure may change, though the degree of structural change depends on the actual change that took place.

If we remove an egg, for example, we certainly expect the structure to change for now there is one less component contributing properties from which the structure is derived.

This is, of course, all rather philosophical and bordering on let's-watch-YouTube-instead. We are, however, toddling so let us just try an assumption for size: structure preserves change. We now find ourselves transitioned from, "What is structure?" to, "What is change?" - which is good because we know what change is.

Set theory in motion.

A basic appeal to set theory tells us that a set is just a collection of elements. If you want to discover whether set A is the same as set B, simply enumerate the elements of both sets and if the enumerations are the same then the sets are the same. If set A={cheese, window} and we remove the window element then set A changes to {cheese} and the enumerations of its elements before and after the change are different.
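A rough sketch of that cheese-and-window comparison in Java, assuming we model elements as plain strings (the class name SetChange and the helper sameSet are ours, invented for illustration):

```java
import java.util.HashSet;
import java.util.Set;

class SetChange {
    /** Two sets are the same exactly when they contain the same elements. */
    static boolean sameSet(Set<String> before, Set<String> after) {
        return before.equals(after);
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Set.of("cheese", "window"));
        Set<String> snapshot = Set.copyOf(a); // remember A before the change

        a.remove("window"); // the change itself

        System.out.println(sameSet(snapshot, a));        // false
        System.out.println(sameSet(a, Set.of("cheese"))); // true
    }
}
```

Order of enumeration plays no part here: Java's Set.equals compares membership only, which is all our baby-step definition asks for.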

Our first catch of the day: the structure of at least composite objects entails an enumeration of the component parts. So the structure of our carton of eggs is that it is a carton and six eggs.

Is that it? Is structure just a set of component parts?

Well, not quite. If we take our six eggs, crack them all open into a bowl, whisk them, add the broken shells for extra crispiness and pour the resulting sludge back into the carton, has the carton of six eggs changed?

Of course it has. But the enumeration of components has not changed. It is still a carton of six eggs but it's wetter. Runnier. Does this wetness and runniness constitute a structural change? Baby-steps again: we certainly think so. How can we most fundamentally describe the change that has occurred? "Someone been'n got medieval on them eggs," is a valiant attempt but we need something a little more white-coat.

One thing we can say is that the relationship between the eggs has changed. Before, there was one; now, not so sure. Before, we could point at egg 1 and say that it was to the left of egg 2; after, no deal. The eggs' physical mixing has certainly changed how the eggs relate to one another.

So is this relationship between component elements an integral part of structure? Sounds reasonable.

Thus we arrive at: the structure of a composite is the set of its parts and the set of their inter-relationships.
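Here is a tentative sketch of that definition, assuming a shrunken carton of two eggs to keep things short. The names Relation, Structure, "holds" and "leftOf" are all invented for illustration:

```java
import java.util.Set;

class EggStructure {
    /** A relationship between two parts, e.g. egg1 is left of egg2. */
    record Relation(String from, String kind, String to) {}

    /** Structure of a composite: its parts plus the relationships among them. */
    record Structure(Set<String> parts, Set<Relation> relations) {}

    static Structure intactCarton() {
        return new Structure(
            Set.of("carton", "egg1", "egg2"),
            Set.of(new Relation("carton", "holds", "egg1"),
                   new Relation("carton", "holds", "egg2"),
                   new Relation("egg1", "leftOf", "egg2")));
    }

    static Structure whiskedCarton() {
        // Same parts as before, but no egg sits to the left of any other.
        return new Structure(
            Set.of("carton", "egg1", "egg2"),
            Set.of(new Relation("carton", "holds", "egg1"),
                   new Relation("carton", "holds", "egg2")));
    }

    public static void main(String[] args) {
        Structure before = intactCarton();
        Structure after = whiskedCarton();
        System.out.println(before.parts().equals(after.parts())); // true
        System.out.println(before.equals(after));                 // false
    }
}
```

The whisking leaves the set of parts untouched yet the two structures compare as different, which is the behaviour we were after.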

We feel we could go on. We could include a description of the innards of each of our six eggs in the structure of the whole. A hierarchy of scale seems to present itself for exploration. Where might it stop? But wait: let's instead pause here and use our rather simplified sets as a basis on which to build. If we need to expand our definition later, we shall do it then.

Can we apply these sets to source code and answer the question of the title, "What is source code structure?"

Back to the source.

When we click over to GitHub and suck down a project we find our hard disc fills with a program containing thousands of functions. (Call 'em methods, if you like.) Our program, thus, is a composite object. So the structure of this program would be the set of its functions and the set of relationships between those functions. These two sets might look something like the following.

Set of functions:

  • a()
  • b()
  • c()
  • ...

Set of relationships:

  • a() -> b()
  • a() -> t()
  • a() -> x()
  • b()
  • c() -> x()
  • c() -> a()
  • ...

We'll allow the nature of the elements to dictate the nature of the, "Relationship," between them. For functions, this relationship is one of invocation. Thus, a() -> b() means a() invokes b(). (Admittedly, some fuzziness persists, here; we can return to this later.)
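The two sets above might be sketched in Java as follows. We include only the functions that appear in the relationship list, leave the elided entries elided, and invent the names Call and callees for illustration:

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class FunctionStructure {
    /** One invocation relationship: caller -> callee. */
    record Call(String caller, String callee) {}

    static final Set<String> FUNCTIONS = Set.of("a", "b", "c", "t", "x");

    static final Set<Call> CALLS = Set.of(
        new Call("a", "b"),
        new Call("a", "t"),
        new Call("a", "x"),
        new Call("c", "x"),
        new Call("c", "a"));

    /** Names of the functions that the given function invokes, sorted. */
    static Set<String> callees(String caller) {
        return CALLS.stream()
            .filter(call -> call.caller().equals(caller))
            .map(Call::callee)
            .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        System.out.println(callees("a")); // [b, t, x]
        System.out.println(callees("b")); // []
    }
}
```

A function such as b(), which invokes nothing, still belongs to the set of functions; it simply contributes no element to the set of relationships.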

But wait!

We happen to have downloaded an object-oriented program. Our structure so far, however, makes no reference to the classes over which our functions find themselves distributed. We presumed that structure preserves change yet our program changes when we rearrange our functions over our classes. Thus we see that we have not one but two sets of elements that both contribute to our program's structure: our program's structure is a combination of its function structure and its class structure.

This presents us with little problem, though. We simply apply our definition of structure to the class-level:

Set of classes:

  • A
  • B
  • C
  • ...

Set of relationships:

  • A -> B
  • A -> T
  • A -> X
  • B
  • C -> X
  • C -> A
  • ...

But wait! The program we downloaded was Java. Which means, well, you guessed it ...

Set of packages:

  • a
  • b
  • c
  • ...

Set of relationships:

  • a -> f
  • a -> t
  • b
  • c -> g
  • c -> h
  • ...

Are we there yet? Are we through the maze?

Well, we have made two rather glaring omissions.

Firstly, we have not looked into the details of the functions themselves. Our egg experiment failed to highlight this problem because all eggs are quite similar on the inside. Not so with functions. Two programs could have all the same names and relationships on function-, class- and package-level yet might be distinguishable by having wildly different function implementations. Here, however, we again take a pragmatic step back.

It is possible that two such programs exist and differ based only on function implementation but in practice it's unlikely, certainly for enterprise-grade programs. So we shall acknowledge that we are building a model of structure with the goal of pragmatism rather than fine accuracy. We shall ignore structure below that of the function in the hope that, in most practical cases, it will not matter. (Or, where it does matter, we shall introduce a new level just as we did with the class- and package-levels.)

Our second omission is somewhat more troublesome.

It's (not) all semantics.

We have really only addressed the syntactic here and have ignored the semantic. We have identified our functions by name but we have not considered the connection between those functions and any real-world meaning they represent.

We have casually noted the dependency a() -> b(). Within the text of that statement, "a()" denotes the function a() in our source code, though this denotation is so obvious that we seldom pause to think of it. The source code function a() apparently denotes nothing else: it is abstract for this reason. If we change these names to carry semantic meaning, however, we enter another realm of structure which invites further denotations which can be problematic.

Consider the function dependency chicken() -> Jupiter(). Syntactically, this is identical to our dependency above: "chicken()" denotes a function chicken() in our source code. We know, however, that the English word, "Chicken," means a bird so semantically our chicken() function denotes that bird. But then what is the relationship implied by ->?

Chickens depend on air, water, food. Few chickens eat gas giants.

The semantic structure of a program offers more information than the syntactic. Indeed we can derive the syntactic structure from the semantic simply by replacing all the meaning-carrying names - like chicken() - with abstract, meaningless ones - like a().
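One way this name-scrubbing might look, as a sketch only: we assume meaningless labels of the form f0, f1, ... handed out in order of first appearance, and the names NameScrubber, Call and scrub are our own inventions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NameScrubber {
    /** One invocation relationship: caller -> callee. */
    record Call(String caller, String callee) {}

    /** Returns the label already given to a name, or assigns the next free one. */
    private static String labelFor(Map<String, String> labels, String name) {
        String label = labels.get(name);
        if (label == null) {
            label = "f" + labels.size();
            labels.put(name, label);
        }
        return label;
    }

    /** Replaces every meaning-carrying name with a meaningless label. */
    static List<Call> scrub(List<Call> calls) {
        Map<String, String> labels = new LinkedHashMap<>();
        List<Call> scrubbed = new ArrayList<>();
        for (Call call : calls) {
            String caller = labelFor(labels, call.caller());
            String callee = labelFor(labels, call.callee());
            scrubbed.add(new Call(caller, callee));
        }
        return scrubbed;
    }

    public static void main(String[] args) {
        List<Call> semantic = List.of(new Call("chicken", "Jupiter"));
        List<Call> syntactic = List.of(new Call("a", "b"));
        System.out.println(scrub(semantic));                         // [Call[caller=f0, callee=f1]]
        System.out.println(scrub(semantic).equals(scrub(syntactic))); // true
    }
}
```

Once scrubbed, chicken() -> Jupiter() and a() -> b() are indistinguishable. The scrubbing runs in one direction only: nothing in f0 -> f1 lets us recover the chicken.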

But we are going to ignore this realm of semantic structure. We shall ignore it not because it is useless - almost all business logic is semantic in nature - but because the semantic information is usually specific to a problem domain. If we were to include the semantic structure of a program as a necessary part of its general structure then we would be forced to abandon the notion of comparing the structure of one program with another outside the same problem domain. We would only be able to compare the structure of a washing machine program with that of other washing machine programs. This seems too restrictive as we would ultimately like to devise a means of analyzing program structures as generally as possible.

It might appear that we can go crazy with this semantic-scrubbing and consider that, "Function," "Class," and, "Package," are merely meaning-carrying denotations which should be rejected but there is a problem with this reach. If we dump our distinctions between function, class and package then we again lose the ability to distinguish between two Java programs by comparing like with like.

We would like to ensure that the structures of two Java programs are identical only if the function structure of one matches the function structure of the other, and the class structure of one matches the class structure of the other, etc. That is, these concepts cannot be eliminated without compromising distinguishability.
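A hedged sketch of that like-with-like comparison, assuming each program is handed to us as a map from level to structure. The names Level, Dependency, Structure and sameStructure, and the two toy programs, are hypothetical:

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class LevelComparison {
    enum Level { FUNCTION, CLASS, PACKAGE }

    record Dependency(String from, String to) {}

    record Structure(Set<String> parts, Set<Dependency> dependencies) {}

    /** Programs match only if every level of one matches the same level of the other. */
    static boolean sameStructure(Map<Level, Structure> p, Map<Level, Structure> q) {
        for (Level level : Level.values()) {
            if (!Objects.equals(p.get(level), q.get(level))) {
                return false;
            }
        }
        return true;
    }

    /** Two functions spread over two classes. */
    static Map<Level, Structure> spreadOut() {
        return Map.of(
            Level.FUNCTION, new Structure(Set.of("a", "b"), Set.of(new Dependency("a", "b"))),
            Level.CLASS, new Structure(Set.of("A", "B"), Set.of(new Dependency("A", "B"))));
    }

    /** The same two functions, both living in a single class. */
    static Map<Level, Structure> bunchedUp() {
        return Map.of(
            Level.FUNCTION, new Structure(Set.of("a", "b"), Set.of(new Dependency("a", "b"))),
            Level.CLASS, new Structure(Set.of("A"), Set.of()));
    }

    public static void main(String[] args) {
        System.out.println(sameStructure(spreadOut(), spreadOut())); // true
        System.out.println(sameStructure(spreadOut(), bunchedUp())); // false
    }
}
```

The two toy programs share a function structure yet differ at the class level, so they count as structurally different. Were we to collapse the levels into one undifferentiated pile, that comparison would no longer be possible.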

So, crawling under the barbed wire of these caveats, have we reached our goal?

If so then we have stumbled upon yet another conclusion. Unable to reject language-specific elements as semantic, we find the structure of source code dependent on the language in which it is written. Source code with only functions would not have the class- and package-level structure shown above. Source code with some other element, say Cheese-Murbanks, would have a Cheese-Murbank-structure which finds no counterpart in our Java program.


So did we answer our question with an affirmative? Can we now state what, "Source code structure," is?

Well, not quite. We cannot state the structure of source code in general. Structure appears, to some degree, language-specific. We can say, however, that the structure of a composite is the set of its parts and the set of their inter-relationships. If your source code is composed of parts at some level, therefore, we can identify the structure of that level.

If your source code has functions, we can give its function-level structure by listing the functions and their inter-relationships; if your source code has classes, we list the classes and their inter-relationships, and so on. Given enough levels, we'll eventually arrive at your entire program's structure by giving the structure of all its levels.

And this is probably good enough. It's a start. Because now that we have identified structure we can proceed to investigate what's good structure and what's bad.

Photo credit attribution.

CC Image Eggs courtesy of REL Waldman on Flickr.