algonote(en)

There's More Than One Way To Do It

Automatically Splitting Domains by Clustering the Active Record Relationship Graph

How to automatically determine domains without relying on humans

Once you have many teams, you want a modular monolith

There are several terms used to refer to dividing a program into modules:

  • Multimodularization
  • Co-location
  • Modular monolith

"Multimodularization” is mainly used for native mobile apps, "co-location” is used primarily in web front-ends, and "modular monolith” is the term you’ll most often hear on the backend.

In general, discussions of splitting into modules tend to come up sooner in client-side app development than on the server side. The motivations for splitting into modules include:

  • (Relatively) common in client-side app development
    • Whether the language is statically typed
    • Whether the language requires a build step
  • Common to both sides
    • Whether there are two or more teams
    • Wanting to protect code from LLM-based CLIs (e.g., Claude Code) (new!)

From the typing/build perspective, client-side tends to use statically typed or compiled languages (Swift for iOS, Kotlin for Android), whereas the server side often uses dynamically typed, interpreted languages (Ruby, PHP, Python). Longer-lived client apps accumulate more complex dependencies over time, and server-side work—where you need to return an HTTP response in a few seconds for UX—often stays manageable without static types. That said, recent additions like Python’s Type Hints and Ruby’s RBS bring optional typing to scripting languages.

Regarding builds, client-side platforms sometimes demand high performance (e.g., Unity games in C#), so build steps are common. Build waiting can become a bottleneck, motivating multimodularization earlier on. On the server side, languages like Go still have builds, but they’re often used in microservice contexts where each service is its own build unit from the start.

Another motivation for multimodularization is team size—if Team A and Team B are changing the same code, merge conflicts become frequent. In script-language backends, build bottlenecks are rare, so the demand for a modular monolith usually only arises once you have two or more teams.

Recently, LLM-based CLIs like Claude Code have become powerful enough that they can accidentally break code, so in the AI era there’s even more demand to split modules on the backend and localize the impact of changes.

Is the domain boundary really something only humans can decide?

Suppose you’ve decided to split into modules. There are two main ways to think about directory structure:

  • Technology-first
    • app/models/feature_a
    • app/models/feature_b
    • app/models/feature_c
  • Feature-first
    • app/feature_a/controllers
    • app/feature_a/models
    • app/feature_a/views

Technology-first means cutting by technical layer (e.g., Rails’ default app/controllers, app/models, app/views). To split further by feature, you might do app/models/feature_aaa, app/models/feature_bbb.

Feature-first means structuring by feature at the top level: app/feature_aaa/, app/feature_bbb/, etc., each containing its own controllers, models, and views (app/feature_aaa/controllers, app/feature_aaa/models, app/feature_aaa/views).

Technology-first is straightforward once you’ve decided on technical concepts, so it’s well-suited to 0→1 phases where you don’t yet know how the app will grow. As Android’s recommended architecture notes, the Domain Layer is often optional.

By contrast, feature-first can speed up CI by running only the tests for the feature directories that changed, making it better suited to the 10→100 scaling phase.
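As a minimal sketch of that idea, the snippet below maps changed file paths to per-feature test directories. The app/&lt;feature&gt;/ layout and the test/&lt;feature&gt; mapping are hypothetical, assumed for illustration:

```python
# Sketch: select only the affected features' tests in CI.
# The app/<feature>/... layout and test/<feature> mapping are hypothetical.
from pathlib import PurePosixPath

def affected_feature_tests(changed_files):
    """Map changed file paths to the feature test directories to run."""
    features = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Files under app/<feature>/... belong to that feature
        if len(parts) >= 2 and parts[0] == "app":
            features.add(parts[1])
    return sorted(f"test/{feature}" for feature in features)

changed = [
    "app/feature_a/models/invoice.rb",
    "app/feature_a/controllers/invoices_controller.rb",
    "app/feature_b/views/index.html.erb",
]
print(affected_feature_tests(changed))  # ['test/feature_a', 'test/feature_b']
```

A CI job would feed it the output of git diff --name-only and pass the result to the test runner.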

Technology-first has clear rules once the layering is established; feature-first can leave gray areas about which feature a given piece of code belongs to. Per the Pareto principle, only about 20% of code is clearly feature-driven, while the other 80% doesn’t map neatly. Some companies handle this by keeping a common (or others) folder alongside the specific feature folders.

While there are well-established theories for technology-first (e.g., MVC), feature-first lacks mature theory. Some argue that only humans can set domain boundaries. Yet treating domain division as a human-only “sacred” task without applying scientific methods feels like a missed opportunity.

Could machines really decide feature (domain) boundaries? One approach is to cluster the ORM (Active Record) relationship graph to automatically split domains, which seems underexplored—so let’s try it.

The Active Record relations form a graph

Game backends sometimes use NoSQL for performance, but for most web services MySQL or PostgreSQL are still the main data stores. NewSQL systems that offer RDB interfaces with better scalability are also emerging.

In an RDB design, tables are connected by relations. In Active Record, you declare has_one, has_many, etc. These relations can be gleaned from database schemas (foreign key constraints) or conventions (an aaa_id column conventionally points to the aaas table). Active Record also supports more complex relations (single-table inheritance, polymorphic associations, delegated types), and thanks to its metaprogramming nature it provides reflection APIs to retrieve these relations at runtime.

Debates over ORM vs. raw SQL often center on security (e.g., SQL injection protection). But by defining relations in the ORM you also accumulate a graph structure that’s easy to analyze when considering architecture.

Extracting graph hubs (God classes) from Active Record

We’ll experiment on a codebase with enough models to analyze—Redmine (commit 068a2868ae8d5316a7c4cf9a3d1452dfab8e43a5).

First, let’s identify graph hubs (God classes) via a Rake task that lists each model’s association count:

# lib/tasks/relation_report.rake
# Rake task to list all ActiveRecord associations
# and output all models sorted by association count (descending).
# Excludes internal HABTM join classes from output.

namespace :relations do
  desc "List model relations sorted by association count"
  task report: :environment do
    # Eager load application models
    Rails.application.eager_load!

    # Collect all AR models, excluding HABTM join classes
    models = ActiveRecord::Base.descendants
                     .select { |m| m.name.present? && !m.name.start_with?("HABTM_") }

    # Map each model class to its association count
    association_counts = models.each_with_object({}) do |model, hash|
      count = model.reflect_on_all_associations.size
      hash[model] = count if count.positive?
    end

    # Sort models by association count descending
    sorted = association_counts.sort_by { |model, count| -count }

    puts "Model associations sorted by count (descending):\n"

    sorted.each do |model, count|
      puts "Model: #{model.name} (#{count} associations)"
      model.reflect_on_all_associations.each do |assoc|
        puts "  #{assoc.macro} :#{assoc.name} -> #{assoc.class_name}"
      end
      puts
    end
  end
end

Running bundle exec rake relations:report yields:

Model associations sorted by count (descending):
Model: Project (26 associations)
  belongs_to :parent -> Project
  has_many :memberships -> Member
  has_many :members -> Member
  has_many :users -> User
  has_many :enabled_modules -> EnabledModule
  has_and_belongs_to_many :trackers -> Tracker
  has_many :issues -> Issue
  has_many :issue_changes -> Journal
  has_many :versions -> Version
  belongs_to :default_version -> Version
  belongs_to :default_assigned_to -> Principal
  has_many :time_entries -> TimeEntry
  has_many :time_entry_activities -> TimeEntryActivity
  has_many :queries -> Query
  has_many :documents -> Document
  has_many :news -> News
  has_many :issue_categories -> IssueCategory
  has_many :boards -> Board
  has_one :repository -> Repository
  has_many :repositories -> Repository
  has_many :changesets -> Changeset
  has_one :wiki -> Wiki
  has_and_belongs_to_many :issue_custom_fields -> IssueCustomField
  belongs_to :default_issue_query -> IssueQuery
  has_many :attachments -> Attachment
  has_many :custom_values -> CustomValue

Model: Issue (19 associations)
  belongs_to :parent -> Issue
  has_many :reactions -> Reaction
  belongs_to :project -> Project
  belongs_to :tracker -> Tracker
  belongs_to :status -> IssueStatus
  belongs_to :author -> User
  belongs_to :assigned_to -> Principal
  belongs_to :fixed_version -> Version
  belongs_to :priority -> IssuePriority
  belongs_to :category -> IssueCategory
  has_many :journals -> Journal
  has_many :time_entries -> TimeEntry
  has_and_belongs_to_many :changesets -> Changeset
  has_many :relations_from -> IssueRelation
  has_many :relations_to -> IssueRelation
  has_many :attachments -> Attachment
  has_many :custom_values -> CustomValue
  has_many :watchers -> Watcher
  has_many :watcher_users -> Principal

Model: User (14 associations)
  has_many :members -> Member
  has_many :memberships -> Member
  has_many :projects -> Project
  has_many :issue_categories -> IssueCategory
  has_one :email_address -> EmailAddress
  has_and_belongs_to_many :groups -> Group
  has_many :changesets -> Changeset
  has_one :preference -> UserPreference
  has_one :atom_token -> Token
  has_one :api_token -> Token
  has_many :email_addresses -> EmailAddress
  has_many :reactions -> Reaction
  belongs_to :auth_source -> AuthSource
  has_many :custom_values -> CustomValue

Model: AnonymousUser (14 associations)
  has_many :members -> Member
  has_many :memberships -> Member
  has_many :projects -> Project
  has_many :issue_categories -> IssueCategory
  has_one :email_address -> EmailAddress
  has_and_belongs_to_many :groups -> Group
  has_many :changesets -> Changeset
  has_one :preference -> UserPreference
  has_one :atom_token -> Token
  has_one :api_token -> Token
  has_many :email_addresses -> EmailAddress
  has_many :reactions -> Reaction
  belongs_to :auth_source -> AuthSource
  has_many :custom_values -> CustomValue

...

Project, Issue, User, and AnonymousUser emerge as hubs.

User and Company often become hubs

In data science contests like Kaggle, you start with Exploratory Data Analysis (EDA); architecture design benefits from the same habit. One common classification of business models is B2B vs. B2C vs. C2C (and hybrids): excluding sole proprietorships, a company-mediated service usually falls into one of those three.

From experience, if you build without considering domain boundaries:

  • In B2B services, Company and CompanyUser tend to become hubs.
  • In C2C services, User tends to become the hub.

Redmine’s hub classes are Project, Issue, User, and AnonymousUser, with the user models ranking high. Even in B2C services, where models carry roughly half the relations of their B2B/C2C counterparts, users still emerge as hubs.

In the context of multimodularization, it’s true that most of the know-how comes from native apps—but modern native apps usually assume one user per device. There are exceptions (e.g. smart-TV apps where family members switch users, or apps on shared library terminals), but in most cases you don’t need to think much about a User- or Company-centric axis.

On the other hand, back-end systems inherently hold information about all users. In my opinion, the user model becomes noise in the graph when considering where to split feature boundaries. Since we’re going to determine feature boundaries by clustering the graph, we’ll exclude it as noise.
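As a minimal sketch of that filtering, using a small hypothetical edge list (not the real Redmine graph), removing the user models before counting degrees looks like this:

```python
# Sketch: drop edges touching the "noise" models (User, AnonymousUser)
# before computing undirected degrees. The edge list is a small
# hypothetical sample, not the full Redmine graph.
from collections import Counter

NOISE = {"User", "AnonymousUser"}

edges = [
    ("Issue", "Project"),
    ("Issue", "Journal"),
    ("Issue", "User"),
    ("Project", "Version"),
    ("Project", "User"),
    ("AnonymousUser", "Member"),
]

# Keep only edges where neither endpoint is a noise model
filtered = [(a, b) for a, b in edges if a not in NOISE and b not in NOISE]

# Undirected degree count on the filtered graph
degree = Counter()
for a, b in filtered:
    degree[a] += 1
    degree[b] += 1

print(filtered)      # [('Issue', 'Project'), ('Issue', 'Journal'), ('Project', 'Version')]
print(dict(degree))  # {'Issue': 2, 'Project': 2, 'Journal': 1, 'Version': 1}
```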

Listing all relations

The Active Record relation graph is a directed graph: edges like has_one and has_many carry direction, and their metadata indicates 1:1 or 1:N cardinality. For our purposes, though, we’ll assume that even if we ignore edge direction, the central classes still accumulate a large enough number of connections to stand out.

Because we’re treating the User model as noise in our EDA, we’ve removed any relations to User and AnonymousUser. Strictly speaking, a model like UserProfile could represent a cohesive feature (e.g. user settings) and be extracted as its own domain, but we’re skipping that detail for the sake of simplicity.

Here’s a Rake task to export relations as CSV:

# lib/tasks/relation_report_csv.rake
# Rake task to export ActiveRecord model relations as CSV,
# excluding User, AnonymousUser, and auto-generated HABTM join classes.
# Only one direction per model pair is output to avoid duplicates.

require 'csv'
require 'set'

namespace :relations do
  desc "Export model relations to CSV (excluding User, AnonymousUser, and HABTM join classes)"
  task export_csv: :environment do
    Rails.application.eager_load!

    models = ActiveRecord::Base.descendants
                      .select { |m| m.name.present? }
                      .reject { |m| ['User', 'AnonymousUser'].include?(m.name) }
                      .reject { |m| m.name.start_with?('HABTM_') }

    output_path = Rails.root.join('tmp', 'model_relations.csv')
    seen = Set.new

    CSV.open(output_path, 'wb') do |csv|
      csv << ['from_model', 'to_model', 'association_type']

      models.each do |model|
        model.reflect_on_all_associations.each do |assoc|
          src = model.name
          dst = assoc.class_name
          # Skip excluded models
          next if ['User', 'AnonymousUser'].include?(dst)
          next if dst.start_with?('HABTM_')

          # Use sorted pair to detect duplicates (undirected)
          pair_key = [src, dst].sort.join(':')
          next if seen.include?(pair_key)
          seen.add(pair_key)

          csv << [src, dst, assoc.macro]
        end
      end
    end

    puts "CSV exported to #{output_path}"
  end
end

Running bundle exec rake relations:export_csv produces a CSV like:

from_model,to_model,association_type
Doorkeeper::AccessToken,Doorkeeper::Application,belongs_to
Doorkeeper::AccessGrant,Doorkeeper::Application,belongs_to
WorkflowRule,Role,belongs_to
WorkflowRule,Tracker,belongs_to
WorkflowRule,IssueStatus,belongs_to
WikiRedirect,Wiki,belongs_to
WikiPage,Wiki,belongs_to
WikiPage,WikiContent,has_one
WikiPage,Attachment,has_many
WikiPage,WikiPage,belongs_to
WikiPage,Watcher,has_many
WikiPage,Principal,has_many
WikiContentVersion,WikiPage,belongs_to
WikiContent,WikiContentVersion,has_many
...

This yields a list of relations from one model to another. Since we treat the graph as undirected, a relation from model A to model B and one from B to A would be duplicates, so only one direction is kept.

Using NetworkX for community detection

Now that we’ve extracted the graph, we’ll move on to clustering it.

While we might prefer to stay in Ruby, Python’s ecosystem for data analysis is stronger, so we’ll use NetworkX for community detection. There is a Ruby port called NetworkX.rb, but it doesn’t currently support community detection.

Community detection is a clustering technique that groups together regions of a graph with strong interconnections. A typical web service has on the order of a thousand models, which is small compared to the graphs community-detection algorithms can handle, so differences between algorithms tend to be minor. We’ll use the Louvain method, which recent research has found to be effective.
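To get a feel for what Louvain does before pointing it at the real data, here is a toy example on a hand-built graph: two triangles joined by a single bridge edge. Since each triangle is densely connected internally, the method should recover the two triangles as communities:

```python
# Toy example: Louvain community detection on two triangles joined
# by a single bridge edge.
import networkx as nx
from networkx.algorithms.community import louvain_communities

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A")])  # triangle 1
G.add_edges_from([("X", "Y"), ("Y", "Z"), ("Z", "X")])  # triangle 2
G.add_edge("C", "X")  # bridge between the two clusters

communities = louvain_communities(G, seed=42)
print(sorted(sorted(c) for c in communities))
# Typically the two triangles: [['A', 'B', 'C'], ['X', 'Y', 'Z']]
```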

We then create and run a script like this:

# community_detection.py
# Python script using NetworkX to perform community detection on the
# exported model relations CSV (tmp/model_relations.csv).
# Using Louvain method for higher-quality community detection.

import csv
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Path to CSV generated by the rake task
csv_path = 'tmp/model_relations.csv'

# Initialize undirected graph
G = nx.Graph()

# Read CSV and build graph, skipping exclusions
with open(csv_path, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        src = row['from_model']
        dst = row['to_model']
        # Skip any excluded models
        if src in ('User', 'AnonymousUser') or dst in ('User', 'AnonymousUser'):
            continue
        if src.startswith('HABTM_') or dst.startswith('HABTM_'):
            continue

        G.add_edge(src, dst)

# Perform community detection using the Louvain method.
# Fixing the random seed makes the result reproducible;
# a resolution parameter can also be passed to tune community size.
communities = louvain_communities(G, seed=42)

# Output detected communities
print("Detected communities (Louvain):")
for idx, comm in enumerate(communities, start=1):
    print(f"Community {idx}: {', '.join(sorted(comm))}")

Example output:

Detected communities (Louvain):
Community 1: Doorkeeper::AccessGrant, Doorkeeper::AccessToken, Doorkeeper::Application
Community 2: IssueStatus, Tracker, WorkflowPermission, WorkflowRule, WorkflowTransition
Community 3: Document, DocumentCategory, Enumeration, IssuePriority, IssueQuery, Project, ProjectAdminQuery, ProjectQuery, Query, TimeEntry, TimeEntryActivity, TimeEntryQuery, UserQuery
Community 4: Attachment, Board, Comment, Commented, Container, EnabledModule, Issue, IssueRelation, Journal, JournalDetail, Journalized, Message, News, Principal, Reactable, Reaction, Version, Watchable, Watcher, Wiki, WikiContent, WikiContentVersion, WikiPage, WikiRedirect
Community 5: Import, ImportItem, IssueImport, TimeEntryImport, UserImport
Community 6: CustomField, CustomFieldEnumeration, CustomValue, Customized, DocumentCategoryCustomField, DocumentCustomField, GroupCustomField, IssueCustomField, IssuePriorityCustomField, ProjectCustomField, Role, TimeEntryActivityCustomField, TimeEntryCustomField, UserCustomField, VersionCustomField
Community 7: Change, Changeset, Repository, Repository::Bazaar, Repository::Cvs, Repository::Filesystem, Repository::Git, Repository::Mercurial, Repository::Subversion
Community 8: EmailAddress, Group, GroupAnonymous, GroupBuiltin, GroupNonMember, IssueCategory, Member, MemberRole

Even if you can’t claim that AI in the broad sense will disrupt an architect’s work, this kind of output does seem useful as supporting information when weighing domain boundaries.
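As one hypothetical way to use the output, you could mechanically turn each community into a candidate feature directory and review the proposal by hand. The feature names below are invented for illustration; Louvain only proposes the groupings, not the labels:

```python
# Sketch: turn detected communities into a candidate feature-first layout.
# The feature names are hand-picked for illustration; the community
# contents are abbreviated from the Redmine output above.
communities = {
    "oauth": ["Doorkeeper::AccessGrant", "Doorkeeper::AccessToken", "Doorkeeper::Application"],
    "workflow": ["IssueStatus", "Tracker", "WorkflowRule", "WorkflowTransition"],
    "import": ["Import", "ImportItem", "IssueImport", "TimeEntryImport"],
}

def proposed_layout(communities):
    """Map each named community to a candidate app/<feature>/ directory."""
    return {f"app/{name}/": sorted(models) for name, models in communities.items()}

for path, models in proposed_layout(communities).items():
    print(f"{path} <- {', '.join(models)}")
```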

Conclusion

That’s it for now. I haven’t yet tuned parameters or visualized the results—stay tuned for next time!