algonote(en)

There's More Than One Way To Do It

Software Bugs Often Stem from Organizational Structure Rather Than the Code Itself

Can a Tech Lead Ignore the Organization?

What “The Mythical Man-Month” Says About Organizational Structure and Bugs

Achieving a stable, bug-free system in software development is an eternal challenge. We do code reviews, write test code, and also carry out monkey tests or dogfooding to prevent bugs.

However, I get the impression that the people in the organization and its topology are often overlooked. In The Mythical Man-Month, Fred Brooks argues that delays and defects in software systems frequently arise from miscommunication between teams. It’s not uncommon to hear stories about high staff turnover leading to unshared knowledge and bugs, about teams siloed like government offices, or about companies with vaguely defined boundaries that assume everyone is an expert in everything, which creates high cognitive load.

Microsoft Research took a statistical approach to this topic in their paper, The Influence of Organizational Structure On Software Quality: An Empirical Case Study, analyzing how organizational structure impacts quality. The research was published in 2008.

The Analyzed Code Base: Windows Vista

This paper unfortunately does not cover every software project in the world. It’s a challenge to obtain both the code and the organizational structure of commercial projects, so this particular research focused on the organization and code of Windows Vista.

They studied around 3,400 binaries generated from a 50-million-line codebase. Bugs discovered after release were taken into account.

Vista Had a Higher Ratio of Managers with Windows Development Experience

Regarding Vista’s organizational structure:

  • 33% of engineers had development experience on Windows Server 2003 or XP.
  • 61% of engineers had managers with Windows Server 2003 or XP development experience.
  • 37% of managers’ managers (director level) had Windows development experience.
  • On average, 31 people worked on changes to a single binary, of whom:
    • An average of 2 out of those 31 also worked on the same binary in Windows Server 2003.
    • An average of 15 out of those 31 worked on the previous version of Windows.
    • An average of 14 out of those 31 had experience in other Microsoft product development.

It is sometimes argued that managers don’t necessarily need to understand code, but Vista’s development had a noticeable share of managers with experience in the same area (Windows). Given the recent tendency at companies like Twitter and Meta to require managers to code, I personally suspect that as the “newness” factor in technology fades, more engineering managers will end up working as reviewer-managers.

At first glance, 31 people working on a single binary exceeds the “two-pizza” rule of 5 to 10 people. However, there are likely further subdivisions and modular structures within that group.

The development group has thousands of developers.

Six Prediction Models: Organization or Code?

To analyze which factor contributes more to bugs—organizational structure or the code itself—the paper built six models:

  • Organizational-Structure Model

    • Number of Engineers (NOE): Fewer engineers touching the code is better.
    • Number of Ex-Engineers (NOEE): Fewer engineers who have left is better.
    • Edit Frequency (EF): Fewer code modifications are better.
    • Depth of Master Ownership (DMO): Greater delegation of authority is better.
    • Percentage of Org contributing to development (PO): More cohesive team ownership is better.
    • Level of Organizational Code Ownership (OCO): It is better when changes stay within the same team/department.
    • Overall Organization Ownership (OOW): It is better when edits are made by the people responsible for the code.
    • Organization Intersection Factor (OIF): Fewer outside-department changes are better.
  • Code-Change Model

    • Lines of Code Changed
    • Frequency of Code Changes
    • Consecutive Code Change Frequency
  • Code Complexity Model

    • A combination of 19 metrics including cyclomatic complexity, LOC, number of global variables, etc.
  • Dependency Model
  • Code Coverage Model
  • Pre-release Defect Model

Code changes are really time-series data, but the paper appears to split the data randomly for analysis. Since the predicted outcome is bugs found after release, there is presumably no data leakage.
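To make the idea of a random split concrete, here is a minimal sketch of repeated random train/test splitting. The function name `random_splits`, the 2/3 training fraction, the number of repeats, and the binary names are all my own illustrative assumptions, not details taken from the paper.

```python
import random

def random_splits(items, train_fraction=2 / 3, repeats=3, seed=0):
    """Yield (train, test) lists from repeated random shuffles of items."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    cut = round(len(items) * train_fraction)
    for _ in range(repeats):
        shuffled = list(items)  # copy, so the caller's list is untouched
        rng.shuffle(shuffled)
        yield shuffled[:cut], shuffled[cut:]

# Hypothetical binary names, purely for illustration.
binaries = [f"binary_{i}" for i in range(9)]
splits = list(random_splits(binaries))
for train, test in splits:
    print(len(train), "train /", len(test), "test")
```

A time-ordered split would instead sort the records by date and cut once, so that the model is always tested on data newer than what it was trained on.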

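The “Code-Change Model” implies that fewer code changes may mean fewer bugs. If the data range were chosen badly, for example so that it included the post-release fixes themselves, the metric could end up measuring nearly the same thing as the bugs it is meant to predict.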

The “Pre-release Defect Model” only becomes usable late in the development process, so its practical utility for preventing bugs might be limited.

Results: Organizational Structure is the Primary Cause of Bugs

The paper compares Precision and Recall across the six models, and the organizational-structure model predicts bug occurrence best. On that basis, it’s fair to say that software bugs often stem more from organizational structure than from the code itself.

The code coverage model shows high Precision but quite low Recall. You should still write tests, but we could also say that having tests doesn’t necessarily prevent bugs. The code complexity and dependency models have higher Recall than the coverage model.
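For readers less familiar with the two measures, here is a small sketch of how Precision and Recall behave for a bug-prediction model. The file names and the two example predictors are made up for illustration and have nothing to do with the paper’s data.

```python
def precision_recall(predicted, actual):
    """predicted: binaries a model flags as bug-prone; actual: binaries that really had bugs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Made-up example: four binaries actually had post-release bugs.
actual = {"a.dll", "b.dll", "c.dll", "d.dll"}
cautious = {"a.dll"}                                   # flags little, rarely wrong
broad = {"a.dll", "b.dll", "c.dll", "x.dll", "y.dll"}  # flags a lot, some wrong

print(precision_recall(cautious, actual))  # (1.0, 0.25)
print(precision_recall(broad, actual))     # (0.6, 0.75)
```

A model with high Precision but low Recall, like `cautious` here, is usually right when it flags something but misses most of the buggy binaries.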

As I understand it, Architecture Decision Records serve a similar purpose: they fill in what tests alone can’t cover by making design deliberations explicit, rather than serving as mere documentation.

Code Ownership Is Key

You might be curious which factors in the organizational structure make the biggest difference. The paper doesn’t provide direct weights, only Spearman’s correlation coefficients between the metrics.

Focusing on those above 0.70:

  • Edit Frequency (EF) and Number of Ex-Engineers (NOEE) can be approximated by Number of Engineers (NOE).
  • Depth of Master Ownership (DMO), Overall Organization Ownership (OOW), and Percentage of Org contributing to development (PO) can be approximated by Level of Organizational Code Ownership (OCO).
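As a reminder of what a Spearman coefficient measures, here is a minimal pure-Python sketch: rank both series, then take the Pearson correlation of the ranks. The helper names and the NOE/EF numbers below are invented for illustration; they are not values from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Invented per-binary values: number of engineers and edit frequency.
noe = [3, 10, 25, 40, 8]
ef = [12, 50, 400, 380, 30]
print(spearman(noe, ef))  # close to 0.9: the two metrics rank binaries almost identically
```

When two metrics correlate this strongly, tracking one of them tells you most of what the other would.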

Thus, if you had to monitor just a few metrics:

  • Number of Engineers (NOE)
  • Level of Organizational Code Ownership (OCO)
  • Organization Intersection Factor (OIF)

Number of Engineers (NOE) refers to the number of engineers who have worked on a given binary and have not left the company. Fewer people touching the code is better.

Level of Organizational Code Ownership (OCO) is the percentage of code changes made by the team that owns the code. It’s preferable for changes to be completed within the same department.

Organization Intersection Factor (OIF) is the number of organizations that have made 10% or more of the code changes. Fewer or no changes from outside departments is better.

OCO and OIF are quite similar in concept. Taken together with NOE, they suggest that having a small number of engineers from a single team consistently own the relevant code leads to fewer bugs.
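If you wanted to track these three metrics on your own codebase, a rough sketch could look like the following. It assumes you can export, per binary or module, one record per edit with the engineer, their organization, and whether they are still employed. The record layout, the function name `org_metrics`, and the sample data are my assumptions; the paper’s exact computation (for example, at which level of the hierarchy an “organization” is counted) may differ.

```python
from collections import Counter

def org_metrics(edits, owner_org, oif_threshold=0.10):
    """edits: list of (engineer, org, still_employed) tuples, one per edit."""
    if not edits:
        raise ValueError("need at least one edit")
    total = len(edits)
    edits_per_org = Counter(org for _, org, _ in edits)
    # NOE: distinct engineers who touched the code and are still at the company.
    noe = len({engineer for engineer, _, employed in edits if employed})
    # OCO: share of all edits made by the owning organization.
    oco = edits_per_org[owner_org] / total
    # OIF: number of organizations contributing at least 10% of the edits.
    oif = sum(1 for count in edits_per_org.values() if count / total >= oif_threshold)
    return {"NOE": noe, "OCO": oco, "OIF": oif}

# Invented edit history for one binary owned by the "kernel" organization.
edits = (
    [("alice", "kernel", True)] * 10
    + [("bob", "kernel", False)] * 4   # bob has left the company
    + [("carol", "shell", True)] * 5
    + [("dave", "media", True)] * 1
)
metrics = org_metrics(edits, owner_org="kernel")
print(metrics)  # {'NOE': 3, 'OCO': 0.7, 'OIF': 2}
```

Here three current engineers touched the binary, 70% of the edits came from the owning organization, and two organizations each contributed at least 10% of the changes.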

Impressions

Developers who don’t want to become managers often say that technology and management are different, but in team-based software development, they can’t be separated so cleanly. On the other hand, for a small startup with just one team in its early stages, the influence of organizational structure might be lower.

Some engineers prefer always working on new things to avoid boredom, so perhaps it’s a manager’s skill to balance the ratio of “stretch assignments” versus stable ownership.