How to automatically determine domains without relying on humans
Once you have many teams, you want a modular monolith
There are several terms used to refer to dividing a program into modules:
- Multimodularization
- Co-location
- Modular monolith
“Multimodularization” is mainly used for native mobile apps, “co-location” is used primarily in web front-ends, and “modular monolith” is the term you’ll most often hear on the backend.
In general, talk of splitting into modules tends to come up sooner in client-side app development than on the server side. The motivations for splitting into modules include:
- (Relatively) common in client-side app development
  - Whether the language is statically typed
  - Whether the language requires a build step
- Common to both sides
  - Whether there are two or more teams
  - Wanting to protect code from LLM-based CLIs (e.g., Claude Code) (new!)
From the typing/build perspective, client-side development tends to use statically typed, compiled languages (Swift for iOS, Kotlin for Android), whereas the server side often uses dynamically typed, interpreted languages (Ruby, PHP, Python). Long-lived client apps accumulate complex dependencies over time, whereas server-side work, where each request only has to return an HTTP response within a few seconds for UX, often stays manageable without static types. That said, recent additions like Python’s type hints and Ruby’s RBS bring optional typing to scripting languages.
Regarding builds, client-side platforms sometimes demand high performance (e.g., Unity games in C#), so build steps are common. Build waiting can become a bottleneck, motivating multimodularization earlier on. On the server side, languages like Go still have builds, but they’re often used in microservice contexts where each service is its own build unit from the start.
Another motivation for multimodularization is team size: if Team A and Team B are changing the same code, merge conflicts become frequent. In scripting-language backends, build bottlenecks are rare, so the demand for a modular monolith usually arises only once you have two or more teams.
Recently, LLM-based CLIs like Claude Code have become powerful enough that they can accidentally break code, so in the AI era there’s even more demand to split modules on the backend and localize the impact of changes.
Is the domain boundary really something only humans can decide?
Suppose you’ve decided to split into modules. There are two main ways to think about directory structure:
- Technology-first
  - app/models/feature_a
  - app/models/feature_b
  - app/models/feature_c
- Feature-first
  - app/feature_a/controllers
  - app/feature_a/models
  - app/feature_a/views
Technology-first means cutting by technical layer (e.g., Rails’ default app/controllers, app/models, app/views). To split further by feature, you might do app/models/feature_a, app/models/feature_b.
Feature-first means structuring by feature at the top level: app/feature_a/, app/feature_b/, etc., each containing its own controllers, models, and views, like app/feature_a/controllers, app/feature_a/models, app/feature_a/views.
Technology-first is straightforward once you’ve decided on technical concepts, so it’s well-suited to 0→1 phases where you don’t yet know how the app will grow. As Android’s recommended architecture notes, the Domain Layer is often optional.
By contrast, feature-first can speed up CI, because you can run only the tests for the feature directories that actually changed. That makes it better suited to 10→100 scaling.
Technology-first has clear rules once the layering is established; feature-first can leave gray areas about which feature a given piece of code belongs to. As a rough Pareto-style rule of thumb, perhaps only 20% of code maps cleanly to a single feature, while the remaining 80% doesn’t. Some companies handle this by having a common (or others) directory alongside the feature-specific folders.
While there are well-established theories for technology-first (e.g., MVC), feature-first lacks mature theory. Some argue that only humans can set domain boundaries. Yet treating domain division as a human-only “sacred” task without applying scientific methods feels like missing an opportunity.
Could machines really decide feature (domain) boundaries? One approach is to cluster the ORM (Active Record) relationship graph to automatically split domains, which seems underexplored—so let’s try it.
The Active Record relations form a graph
Game backends sometimes use NoSQL for performance, but for most web services MySQL or PostgreSQL are still the main data stores. NewSQL systems that offer RDB interfaces with better scalability are also emerging.
In an RDB design, tables are connected by relations. In Active Record, you declare has_one, has_many, etc. These relations can be gleaned from database schemas (foreign key constraints) or conventions (aaa_id points to the aaa table). Active Record also supports more complex relations (single-table inheritance, polymorphic associations, delegated types), and its metaprogramming nature provides APIs to retrieve these relations.
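To make the naming-convention route concrete, here is a small sketch that infers edges from *_id column names alone. The schema dictionary and the infer_edges helper are hypothetical, and the table keys are kept singular to match the simplified "aaa_id points to aaa" wording above (real Rails tables are pluralized, which this sketch does not handle).

```python
def infer_edges(schema):
    """Infer (from_table, to_table) edges from *_id column names.

    `schema` maps a table name to its list of column names. A column named
    aaa_id is taken to point at table aaa, following the convention above.
    """
    edges = []
    for table, columns in schema.items():
        for column in columns:
            if column.endswith("_id") and len(column) > 3:
                target = column[:-3]
                # Ignore columns whose prefix is not a known table.
                if target in schema:
                    edges.append((table, target))
    return edges


# Hypothetical, heavily simplified schema for illustration only.
schema = {
    "project": ["id", "name", "parent_id"],
    "tracker": ["id", "name"],
    "issue": ["id", "project_id", "tracker_id", "author_id"],
}
print(infer_edges(schema))
```

Note that parent_id and author_id produce no edge here, because no table is called parent or author. Aliased relations like these are exactly where asking Active Record for its declared associations gives a more complete graph than naming conventions alone.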
Debates over ORM vs. raw SQL often center on security (e.g., SQL injection protection). But by defining relations in the ORM you also accumulate a graph structure that’s easy to analyze when considering architecture.
Extracting graph hubs (God classes) from Active Record
We’ll experiment on a codebase with enough models to analyze: Redmine (commit 068a2868ae8d5316a7c4cf9a3d1452dfab8e43a5).
First, let’s identify graph hubs (God classes) via a Rake task that lists each model’s association count:
# lib/tasks/relation_report.rake
# Rake task to list all ActiveRecord associations
# and output all models sorted by association count (descending).
# Excludes internal HABTM join classes from output.
namespace :relations do
  desc "List model relations sorted by association count"
  task report: :environment do
    # Eager load application models
    Rails.application.eager_load!

    # Collect all AR models, excluding HABTM join classes
    models = ActiveRecord::Base.descendants
                               .select { |m| m.name.present? && !m.name.start_with?("HABTM_") }

    # Map each model class to its association count
    association_counts = models.each_with_object({}) do |model, hash|
      count = model.reflect_on_all_associations.size
      hash[model] = count if count.positive?
    end

    # Sort models by association count descending
    sorted = association_counts.sort_by { |model, count| -count }

    puts "Model associations sorted by count (descending):\n"
    sorted.each do |model, count|
      puts "Model: #{model.name} (#{count} associations)"
      model.reflect_on_all_associations.each do |assoc|
        puts " #{assoc.macro} :#{assoc.name} -> #{assoc.class_name}"
      end
      puts
    end
  end
end
Running bundle exec rake relations:report yields:
Model associations sorted by count (descending):
Model: Project (26 associations)
 belongs_to :parent -> Project
 has_many :memberships -> Member
 has_many :members -> Member
 has_many :users -> User
 has_many :enabled_modules -> EnabledModule
 has_and_belongs_to_many :trackers -> Tracker
 has_many :issues -> Issue
 has_many :issue_changes -> Journal
 has_many :versions -> Version
 belongs_to :default_version -> Version
 belongs_to :default_assigned_to -> Principal
 has_many :time_entries -> TimeEntry
 has_many :time_entry_activities -> TimeEntryActivity
 has_many :queries -> Query
 has_many :documents -> Document
 has_many :news -> News
 has_many :issue_categories -> IssueCategory
 has_many :boards -> Board
 has_one :repository -> Repository
 has_many :repositories -> Repository
 has_many :changesets -> Changeset
 has_one :wiki -> Wiki
 has_and_belongs_to_many :issue_custom_fields -> IssueCustomField
 belongs_to :default_issue_query -> IssueQuery
 has_many :attachments -> Attachment
 has_many :custom_values -> CustomValue

Model: Issue (19 associations)
 belongs_to :parent -> Issue
 has_many :reactions -> Reaction
 belongs_to :project -> Project
 belongs_to :tracker -> Tracker
 belongs_to :status -> IssueStatus
 belongs_to :author -> User
 belongs_to :assigned_to -> Principal
 belongs_to :fixed_version -> Version
 belongs_to :priority -> IssuePriority
 belongs_to :category -> IssueCategory
 has_many :journals -> Journal
 has_many :time_entries -> TimeEntry
 has_and_belongs_to_many :changesets -> Changeset
 has_many :relations_from -> IssueRelation
 has_many :relations_to -> IssueRelation
 has_many :attachments -> Attachment
 has_many :custom_values -> CustomValue
 has_many :watchers -> Watcher
 has_many :watcher_users -> Principal

Model: User (14 associations)
 has_many :members -> Member
 has_many :memberships -> Member
 has_many :projects -> Project
 has_many :issue_categories -> IssueCategory
 has_one :email_address -> EmailAddress
 has_and_belongs_to_many :groups -> Group
 has_many :changesets -> Changeset
 has_one :preference -> UserPreference
 has_one :atom_token -> Token
 has_one :api_token -> Token
 has_many :email_addresses -> EmailAddress
 has_many :reactions -> Reaction
 belongs_to :auth_source -> AuthSource
 has_many :custom_values -> CustomValue

Model: AnonymousUser (14 associations)
 has_many :members -> Member
 has_many :memberships -> Member
 has_many :projects -> Project
 has_many :issue_categories -> IssueCategory
 has_one :email_address -> EmailAddress
 has_and_belongs_to_many :groups -> Group
 has_many :changesets -> Changeset
 has_one :preference -> UserPreference
 has_one :atom_token -> Token
 has_one :api_token -> Token
 has_many :email_addresses -> EmailAddress
 has_many :reactions -> Reaction
 belongs_to :auth_source -> AuthSource
 has_many :custom_values -> CustomValue

...
Project, Issue, User, and AnonymousUser emerge as hubs. (AnonymousUser is a subclass of User in Redmine, so it reports the same 14 associations.)
User and Company often become hubs
In data science contests like Kaggle, you start with Exploratory Data Analysis (EDA). Architecture design is similar: it helps to look at the overall shape of the business first. One common classification of business models is B2B, B2C, and C2C (plus hybrids); setting aside sole proprietorships, a service run by a company usually falls into one of these three.
From experience, if you build without considering domain boundaries:
- In B2B services, Company and CompanyUser tend to become hubs.
- In C2C services, User tends to become the hub.
Redmine’s hub classes are Project, Issue, User, and AnonymousUser, so the user models rank among the top hubs. In my experience, B2C services have roughly half as many relations as B2B or C2C services, but users still emerge as hubs.
In the context of multimodularization, it’s true that most of the know-how comes from native apps—but modern native apps usually assume one user per device. There are exceptions (e.g. smart-TV apps where family members switch users, or apps on shared library terminals), but in most cases you don’t need to think much about a User- or Company-centric axis.
On the other hand, back-end systems inherently hold information about all users. In my opinion, the user model becomes noise in the graph when considering where to split feature boundaries. Since we’re going to determine feature boundaries by clustering the graph, we’ll exclude it as noise.
Listing all relations
The Active Record relation graph is a directed graph—edges like has_one and has_many carry directionality, and there’s metadata indicating 1:1 or 1:N cardinality—but for our purposes we’ll assume that even if we ignore edge direction, the central classes still accumulate a sufficiently large number of connections.
Because we’re treating the User model as noise in our EDA, we’ve removed any relations to User and AnonymousUser. Strictly speaking, a model like UserProfile could represent a cohesive feature (e.g. user settings) and be extracted as its own domain, but we’re skipping that detail for the sake of simplicity.
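The two simplifications above (ignoring direction, dropping the user models) can be sketched in a few lines. This is an illustrative sketch rather than the code used for the results in this article: the undirected_degrees helper and the sample edge list are hypothetical, and skipping self-references is a choice made only for this example.

```python
from collections import Counter

# Models treated as noise, as described above.
NOISE = {"User", "AnonymousUser"}


def undirected_degrees(edges, noise=NOISE):
    """Count distinct neighbours per model, ignoring direction and noise nodes."""
    neighbours = {}
    for src, dst in edges:
        if src in noise or dst in noise:
            continue  # drop every relation that touches a noise model
        if src == dst:
            continue  # self-references connect no other model (sketch-only choice)
        # Record the pair in both directions so A->B and B->A count once.
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)
    return Counter({model: len(ns) for model, ns in neighbours.items()})


# Small hand-picked sample of directed relations.
edges = [
    ("Issue", "Project"),
    ("Project", "Issue"),    # reverse direction of the same pair
    ("Issue", "Tracker"),
    ("Issue", "User"),       # dropped as noise
    ("Project", "Version"),
    ("Issue", "Version"),
    ("Project", "Project"),  # self-reference, skipped
]
degrees = undirected_degrees(edges)
for model, degree in degrees.most_common():
    print(model, degree)
```

Even with direction ignored and the user edge removed, Issue and Project still have the most neighbours in this sample, which is the property the clustering step relies on.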
Here’s a Rake task to export relations as CSV:
# lib/tasks/relation_report_csv.rake
# Rake task to export ActiveRecord model relations as CSV,
# excluding User, AnonymousUser, and auto-generated HABTM join classes.
# Only one direction per model pair is output to avoid duplicates.
require 'csv'
require 'set'

namespace :relations do
  desc "Export model relations to CSV (excluding User, AnonymousUser, and HABTM join classes)"
  task export_csv: :environment do
    Rails.application.eager_load!

    models = ActiveRecord::Base.descendants
                               .select { |m| m.name.present? }
                               .reject { |m| ['User', 'AnonymousUser'].include?(m.name) }
                               .reject { |m| m.name.start_with?('HABTM_') }

    output_path = Rails.root.join('tmp', 'model_relations.csv')
    seen = Set.new

    CSV.open(output_path, 'wb') do |csv|
      csv << ['from_model', 'to_model', 'association_type']

      models.each do |model|
        model.reflect_on_all_associations.each do |assoc|
          src = model.name
          dst = assoc.class_name

          # Skip excluded models
          next if ['User', 'AnonymousUser'].include?(dst)
          next if dst.start_with?('HABTM_')

          # Use sorted pair to detect duplicates (undirected)
          pair_key = [src, dst].sort.join(':')
          next if seen.include?(pair_key)
          seen.add(pair_key)

          csv << [src, dst, assoc.macro]
        end
      end
    end

    puts "CSV exported to #{output_path}"
  end
end
Running bundle exec rake relations:export_csv produces a CSV like:
from_model,to_model,association_type
Doorkeeper::AccessToken,Doorkeeper::Application,belongs_to
Doorkeeper::AccessGrant,Doorkeeper::Application,belongs_to
WorkflowRule,Role,belongs_to
WorkflowRule,Tracker,belongs_to
WorkflowRule,IssueStatus,belongs_to
WikiRedirect,Wiki,belongs_to
WikiPage,Wiki,belongs_to
WikiPage,WikiContent,has_one
WikiPage,Attachment,has_many
WikiPage,WikiPage,belongs_to
WikiPage,Watcher,has_many
WikiPage,Principal,has_many
WikiContentVersion,WikiPage,belongs_to
WikiContent,WikiContentVersion,has_many
...
You can generate a list of relations from one model to another. In an undirected graph, the relation from model A to model B and from model B to model A would be duplicates, so we exclude one of them.
Using NetworkX for community detection
Now that we’ve extracted the graph, we’ll move on to clustering it.
While we might prefer to stay in Ruby, Python’s ecosystem for data analysis is stronger, so we’ll use NetworkX for community detection. There is a Ruby port called NetworkX.rb, but it doesn’t currently support community detection.
Community detection is a clustering technique that groups together regions of a graph with strong interconnections. In a typical web service, you might have on the order of a thousand models, which is small compared to the graphs community-detection algorithms can handle—so differences between algorithms tend to be minor. We’ll use the Louvain method, which recent research has found to be effective.
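The Louvain method works by greedily increasing a score called modularity, which rewards partitions whose communities contain more internal edges than their degrees alone would suggest. As a rough illustration of what is being optimized, here is a plain-Python sketch of the modularity of a given partition. The modularity function below is written for this article (it is not the NetworkX implementation) and assumes an undirected, unweighted edge list with no duplicate edges and no self-loops.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    community_of = {node: i for i, comm in enumerate(communities) for node in comm}
    internal = [0] * len(communities)
    degree_sum = [0] * len(communities)
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        # Each edge adds one to the degree of both of its endpoints.
        degree_sum[ca] += 1
        degree_sum[cb] += 1
        if ca == cb:
            internal[ca] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree_sum))


# Toy graph: two triangles joined by a single bridge edge (C, D).
toy_edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "D"),
]
q_split = modularity(toy_edges, [{"A", "B", "C"}, {"D", "E", "F"}])
q_single = modularity(toy_edges, [{"A", "B", "C", "D", "E", "F"}])
print(q_split, q_single)
```

Splitting the toy graph at the bridge scores higher than lumping everything into one community, which matches the intuition that a domain boundary should cut as few relations as possible.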
We then create and run a script like this:
# community_detection.py
# Python script using NetworkX to perform community detection on the
# exported model relations CSV (tmp/model_relations.csv).
# Using Louvain method for higher-quality community detection.
import csv

import networkx as nx
from networkx.algorithms.community import louvain_communities

# Path to CSV generated by the rake task
csv_path = 'tmp/model_relations.csv'

# Initialize undirected graph
G = nx.Graph()

# Read CSV and build graph, skipping exclusions
with open(csv_path, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        src = row['from_model']
        dst = row['to_model']
        # Skip any excluded models
        if src in ('User', 'AnonymousUser') or dst in ('User', 'AnonymousUser'):
            continue
        if src.startswith('HABTM_') or dst.startswith('HABTM_'):
            continue
        G.add_edge(src, dst)

# Perform community detection using Louvain method
# Optionally specify resolution parameter or random seed
communities = louvain_communities(G)

# Output detected communities
print("Detected communities (Louvain):")
for idx, comm in enumerate(communities, start=1):
    print(f"Community {idx}: {', '.join(sorted(comm))}")
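Example output (the Louvain method is randomized, so community membership and numbering can vary between runs unless a seed is fixed):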
Detected communities (Louvain):
Community 1: Doorkeeper::AccessGrant, Doorkeeper::AccessToken, Doorkeeper::Application
Community 2: IssueStatus, Tracker, WorkflowPermission, WorkflowRule, WorkflowTransition
Community 3: Document, DocumentCategory, Enumeration, IssuePriority, IssueQuery, Project, ProjectAdminQuery, ProjectQuery, Query, TimeEntry, TimeEntryActivity, TimeEntryQuery, UserQuery
Community 4: Attachment, Board, Comment, Commented, Container, EnabledModule, Issue, IssueRelation, Journal, JournalDetail, Journalized, Message, News, Principal, Reactable, Reaction, Version, Watchable, Watcher, Wiki, WikiContent, WikiContentVersion, WikiPage, WikiRedirect
Community 5: Import, ImportItem, IssueImport, TimeEntryImport, UserImport
Community 6: CustomField, CustomFieldEnumeration, CustomValue, Customized, DocumentCategoryCustomField, DocumentCustomField, GroupCustomField, IssueCustomField, IssuePriorityCustomField, ProjectCustomField, Role, TimeEntryActivityCustomField, TimeEntryCustomField, UserCustomField, VersionCustomField
Community 7: Change, Changeset, Repository, Repository::Bazaar, Repository::Cvs, Repository::Filesystem, Repository::Git, Repository::Mercurial, Repository::Subversion
Community 8: EmailAddress, Group, GroupAnonymous, GroupBuiltin, GroupNonMember, IssueCategory, Member, MemberRole
Even if you can’t say that AI in the broad sense will disrupt an architect’s work, this kind of machine-generated clustering does seem useful as supporting information when considering domain boundaries.
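One possible way to turn the clusters into that kind of supporting information is to count the relations that cross community boundaries, since those are the dependencies a module split would have to deal with. The sketch below is only an assumption about how one might do this: cross_community_edges is a hypothetical helper, and the sample is a small hand-picked subset of the relations and communities shown earlier.

```python
from collections import Counter


def cross_community_edges(edges, communities):
    """Count relations between each pair of distinct communities."""
    # Number communities from 1 to match the output format above.
    community_of = {
        node: i for i, comm in enumerate(communities, start=1) for node in comm
    }
    counts = Counter()
    for a, b in edges:
        ca, cb = community_of.get(a), community_of.get(b)
        # Skip unknown nodes and edges that stay inside one community.
        if ca is None or cb is None or ca == cb:
            continue
        counts[tuple(sorted((ca, cb)))] += 1
    return counts


# Hand-picked subset of the relations and communities listed earlier.
sample_communities = [
    {"IssueStatus", "Tracker", "WorkflowRule"},
    {"Issue", "Journal", "Version"},
]
sample_edges = [
    ("WorkflowRule", "Tracker"),
    ("Issue", "Tracker"),
    ("Issue", "IssueStatus"),
    ("Issue", "Journal"),
]
crossing = cross_community_edges(sample_edges, sample_communities)
for pair, count in crossing.most_common():
    print(pair, count)
```

Pairs of communities with many crossing relations are candidates for merging or for an explicit interface between modules; pairs with few are the cheaper places to draw a boundary.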
Conclusion
That’s it for now. I haven’t yet tuned parameters or visualized the results—stay tuned for next time!