Synchronization Protocol¶
This document provides a detailed technical explanation of the synchronization protocol used by argocd-agent to maintain consistency between the principal (control plane) and agents (workload clusters). The target audience is engineers and architects who need to understand the underlying mechanisms.
Overview¶
The argocd-agent synchronization protocol is built on a bidirectional streaming communication model that enables reliable data synchronization between a central principal and distributed agents. The protocol handles:
- Configuration distribution (managed mode) or status aggregation (autonomous mode)
- Failure recovery through resync mechanisms
- Conflict detection using checksums
- Event-driven updates with per-queue FIFO ordering
Architecture¶
Communication Model¶
The protocol uses a hub-and-spoke architecture where:
- Principal: Runs on the control plane, acts as the central coordination point
- Agents: Run on workload clusters, initiate all connections to the principal
- Connections: Always initiated by agents (never principal-to-agent)
- Streams: Bidirectional gRPC streams over HTTP/2 (with HTTP/1.1 websocket fallback)
┌─────────────────┐    gRPC/HTTP2     ┌─────────────────┐
│     Agent 1     │──────────────────►│    Principal    │
│   (Workload)    │                   │ (Control Plane) │
└─────────────────┘                   └─────────────────┘
                                               ▲
┌─────────────────┐                            │
│     Agent 2     │────────────────────────────┘
│   (Workload)    │
└─────────────────┘
Protocol Stack¶
┌─────────────────────────────────────┐
│ Application Events │ ← Create, Delete, SpecUpdate, Status, etc
├─────────────────────────────────────┤
│ CloudEvents Format │ ← Event envelope with metadata
├─────────────────────────────────────┤
│ gRPC Streaming │ ← Bidirectional streams
├─────────────────────────────────────┤
│ HTTP/2 (or HTTP/1+WS) │ ← Transport layer
├─────────────────────────────────────┤
│ TLS + mTLS │ ← Security layer
└─────────────────────────────────────┘
Communication Protocol¶
gRPC Service Definition¶
The core communication is defined by the EventStream service:
service EventStream {
    rpc Subscribe(stream Event) returns (stream Event);
    rpc Push(stream Event) returns (PushSummary);
    rpc Ping(PingRequest) returns (PongReply);
}
Connection Lifecycle¶
- Authentication: Agent presents a JWT token, optionally together with a client certificate
- Authorization: Principal validates agent identity and creates queue pair
- Stream Establishment: Bidirectional gRPC stream created
- Resync: Initial synchronization based on agent mode
- Event Processing: Continuous bidirectional event exchange
- Graceful Shutdown: Connection cleanup and queue removal
Event Format¶
All synchronization data is exchanged using CloudEvents format:
{
    "specversion": "1.0",
    "source": "agent-name" | "principal",
    "type": "io.argoproj.argocd-agent.event.*",
    "dataschema": "application" | "appproject" | "resource" | "resourceResync",
    "extensions": {
        "resourceid": "uuid",
        "eventid": "uuid"
    },
    "data": { /* event-specific payload */ }
}
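As an illustration, the envelope fields above can be modelled with a plain Go struct and the standard `encoding/json` package. This is only a sketch: the `Envelope` type, the `NewEnvelope` helper, and the choice of `spec-update` as the concrete suffix for the `*` wildcard in the event type are assumptions made for the example, not the agent's own types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope mirrors the fields shown in the event format above.
// It is a hypothetical type for illustration only.
type Envelope struct {
	SpecVersion string            `json:"specversion"`
	Source      string            `json:"source"`
	Type        string            `json:"type"`
	DataSchema  string            `json:"dataschema"`
	Extensions  map[string]string `json:"extensions"`
	Data        json.RawMessage   `json:"data"`
}

// NewEnvelope builds an envelope for one event. The event name is appended
// to the type prefix used in this document (assumed naming scheme).
func NewEnvelope(source, event, schema, resourceID, eventID string, data any) (Envelope, error) {
	raw, err := json.Marshal(data)
	if err != nil {
		return Envelope{}, err
	}
	return Envelope{
		SpecVersion: "1.0",
		Source:      source,
		Type:        "io.argoproj.argocd-agent.event." + event,
		DataSchema:  schema,
		Extensions:  map[string]string{"resourceid": resourceID, "eventid": eventID},
		Data:        raw,
	}, nil
}

func main() {
	env, err := NewEnvelope("principal", "spec-update", "application",
		"res-1", "evt-1", map[string]string{"name": "guestbook"})
	if err != nil {
		panic(err)
	}
	out, err := json.MarshalIndent(env, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The `resourceid` and `eventid` extensions are what later sections rely on for correlating and acknowledging events, so the sketch keeps them as required arguments.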
Event Types and Flow¶
Core Event Types¶
The protocol defines several event types for different synchronization scenarios:
Resource Management Events¶
- create: Create new resource (managed mode only)
- delete: Remove resource (managed mode only)
- spec-update: Update resource specification
- status-update: Update resource status
Synchronization Events¶
- request-synced-resource-list: Request list of managed resources
- response-synced-resource: Response with resource metadata
- request-update: Request latest version of specific resource
- request-resource-resync: Trigger full resync process
Control Events¶
- ping/pong: Keepalive mechanism
- processed: Event acknowledgment
Event Flow Patterns¶
Managed Mode Flow¶
Principal Agent
│ │
│─── create/update/delete ────► │ (Configuration)
│ │
│ ◄─── status-update ───────────│ (Status feedback)
│ │
│─── request-resource-resync ──► │ (On principal restart)
│ │
│ ◄─── request-synced-resource │ (Agent sends inventory)
│ -list │
│ │
│─── response-synced-resource ──►│ (For each resource)
Autonomous Mode Flow¶
Principal Agent
│ │
│ ◄─── create/update/delete ────│ (Configuration changes)
│ │
│─── status-update ────────────► │ (Status sync)
│ │
│─── request-synced-resource ──► │ (On principal restart)
│ -list │
│ │
│ ◄─── response-synced-resource │ (For each resource)
│ │
│─── request-update ───────────► │ (Request specific updates)
Synchronization Modes¶
Managed Mode¶
In managed mode, the principal is the source of truth for configuration:
Characteristics:
- Principal creates/updates/deletes resources
- Agent receives configuration and applies it locally
- Agent sends status updates back to principal
- Principal initiates resync after restarts
Event Flow:
- Configuration changes made on principal
- Principal sends create/update/delete events to agent
- Agent applies changes to local Argo CD
- Agent sends status-update events back to principal
- Principal updates UI/API with current status
Namespace Mapping:
- Namespace-based (default): Applications on principal are placed in namespaces named after target agents. Example: Applications in namespace production-cluster sync to agent production-cluster.
- Destination-based: Applications use spec.destination.name on the principal to specify the target agent, allowing multiple namespaces per agent. Example: An application with destination.name: production-cluster syncs to agent production-cluster regardless of its namespace.
See Agent Mapping Modes for detailed configuration.
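The two mapping modes can be sketched as a single lookup function. The `MappingMode` type, its constants, and the `App` struct below are hypothetical names chosen for the example; only the mapping rules themselves come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// MappingMode selects how an application is mapped to an agent.
// Type and constant names are hypothetical.
type MappingMode int

const (
	NamespaceBased   MappingMode = iota // agent name is the application's namespace
	DestinationBased                    // agent name is spec.destination.name
)

// App carries only the fields relevant to agent mapping.
type App struct {
	Namespace       string
	DestinationName string
}

// TargetAgent returns the name of the agent the application syncs to.
func TargetAgent(mode MappingMode, app App) (string, error) {
	switch mode {
	case NamespaceBased:
		if app.Namespace == "" {
			return "", errors.New("application has no namespace")
		}
		return app.Namespace, nil
	case DestinationBased:
		if app.DestinationName == "" {
			return "", errors.New("application has no destination name")
		}
		return app.DestinationName, nil
	default:
		return "", fmt.Errorf("unknown mapping mode %d", mode)
	}
}

func main() {
	app := App{Namespace: "team-a", DestinationName: "production-cluster"}
	byNamespace, _ := TargetAgent(NamespaceBased, app)
	byDestination, _ := TargetAgent(DestinationBased, app)
	fmt.Println(byNamespace, byDestination) // prints "team-a production-cluster"
}
```

The same application resolves to different agents depending on the mode, which is why the mode has to be chosen consistently for a given principal.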
Autonomous Mode¶
In autonomous mode, the agent is the source of truth for configuration:
Characteristics:
- Agent creates/updates/deletes resources locally
- Principal receives configuration updates and mirrors them
- Principal acts as read-only observer with limited control capabilities
- Agent initiates resync after restarts
Event Flow:
- Configuration changes made on agent (via Git, local API, etc.)
- Agent detects changes via informers
- Agent sends create/update/delete events to principal
- Principal mirrors changes for UI/API visibility
- Principal can still trigger sync/refresh operations
Resync Process¶
Purpose¶
The resync process ensures data consistency when either the principal or agent restarts and may have missed events.
Resync Triggers¶
Agent Restart (Managed Mode)¶
- Agent connects and is detected as needing resync
- Principal sends request-resource-resync event
- Agent responds with request-synced-resource-list containing checksum
- Principal compares checksums and sends missing/updated resources
Principal Restart (Autonomous Mode)¶
- Principal detects agent connection after restart
- Principal sends request-synced-resource-list with its checksum
- Agent compares checksums and sends response-synced-resource for each resource
- Principal rebuilds its state from agent responses
Principal Restart (Managed Mode)¶
- Principal detects agent connection after restart
- Principal sends request-resource-resync event
- Agent sends request-synced-resource-list with checksum
- Principal validates and sends any needed updates
Resync State Management¶
The principal maintains resync state to avoid redundant resync operations:
type resyncStatus struct {
    mu     sync.RWMutex
    agents map[string]bool // tracks which agents have been resynced
}
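A tracker like this is typically used as an atomic check-and-set on connect. The sketch below assumes hypothetical helper names (`newResyncStatus`, `markIfNeeded`, `isResynced`) that are not taken from the source; it only shows how the mutex and map shown above could work together.

```go
package main

import (
	"fmt"
	"sync"
)

type resyncStatus struct {
	mu     sync.RWMutex
	agents map[string]bool // tracks which agents have been resynced
}

// newResyncStatus returns an empty tracker (hypothetical constructor).
func newResyncStatus() *resyncStatus {
	return &resyncStatus{agents: make(map[string]bool)}
}

// markIfNeeded reports whether the agent still needs a resync and, if so,
// records it as resynced. Check and update happen under one write lock so
// two concurrent connections cannot both trigger a resync.
func (r *resyncStatus) markIfNeeded(agent string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.agents[agent] {
		return false
	}
	r.agents[agent] = true
	return true
}

// isResynced is a read-only query and only takes the read lock.
func (r *resyncStatus) isResynced(agent string) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.agents[agent]
}

func main() {
	status := newResyncStatus()
	fmt.Println(status.markIfNeeded("agent-1")) // first connection: resync needed
	fmt.Println(status.markIfNeeded("agent-1")) // reconnect: already resynced
	fmt.Println(status.isResynced("agent-2"))   // never seen
}
```

Assuming the map lives only in memory, it starts empty after a principal restart, so every agent is treated as needing a resync on its next connection.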
Checksum-based Sync Detection¶
Resource Identification¶
Resources are uniquely identified using a composite key:
type ResourceKey struct {
    Name      string // Resource name
    Namespace string // Resource namespace
    Kind      string // Resource type (Application, AppProject)
    UID       string // Source UID for tracking
}
Source UID Tracking¶
In managed mode, every resource synced to an agent carries a source UID annotation (argocd.argoproj.io/source-uid). This annotation records the Kubernetes UID of the resource as it existed on the principal at the time of creation. The agent uses it to detect identity changes between principal and agent copies of the same resource.
A source-UID mismatch occurs when the agent receives an event for a resource that already exists locally, but the incoming source UID differs from the one recorded in the annotation. This typically means the resource was deleted and recreated on the principal side (producing a new UID) without the agent being notified of the deletion.
Mismatch Policy¶
The agent's behavior on source-UID mismatch is configurable via --source-uid-mismatch-policy:
| Policy | Behavior |
|---|---|
| recreate | Delete the existing resource, then create the incoming one (default) |
| upsert | Update the existing resource in-place without deleting it |
The policy can be overridden per-resource using the annotation argocd.argoproj.io/source-uid-mismatch-policy set on the resource on the principal. The annotation takes precedence over the global flag. Unknown annotation values are logged as a warning and the global policy is used as a fallback.
The recreate default is appropriate for most resources, as it guarantees the agent copy is an exact reflection of the principal. The upsert policy is useful for sensitive resources (such as argocd-gpg-keys-cm) where destructive replacement is undesirable and in-place update is safe.
Mismatch handling applies to all managed resource types: Applications, AppProjects, Repositories, and GPG keys.
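The precedence rules (annotation first, global flag as fallback, warning on unknown values) can be sketched as follows. The `Policy` type and the `EffectivePolicy` function are hypothetical names for the example; the annotation key and the two policy values are the ones documented above.

```go
package main

import (
	"fmt"
	"log"
)

// Annotation key as documented above.
const policyAnnotation = "argocd.argoproj.io/source-uid-mismatch-policy"

// Policy is a hypothetical type for the two documented policy values.
type Policy string

const (
	PolicyRecreate Policy = "recreate"
	PolicyUpsert   Policy = "upsert"
)

// EffectivePolicy returns the per-resource annotation value if it is present
// and valid, and falls back to the global policy otherwise.
func EffectivePolicy(global Policy, annotations map[string]string) Policy {
	v, ok := annotations[policyAnnotation]
	if !ok {
		return global
	}
	switch Policy(v) {
	case PolicyRecreate, PolicyUpsert:
		return Policy(v)
	default:
		log.Printf("warning: unknown %s value %q, using global policy %q",
			policyAnnotation, v, global)
		return global
	}
}

func main() {
	gpgKeys := map[string]string{policyAnnotation: "upsert"}
	fmt.Println(EffectivePolicy(PolicyRecreate, gpgKeys)) // annotation wins
	fmt.Println(EffectivePolicy(PolicyRecreate, nil))     // global fallback
}
```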
Checksum Calculation¶
Checksums are calculated from resource keys to efficiently detect synchronization drift:
func (r *Resources) Checksum() []byte {
    resources := make([]string, 0, len(r.resources))
    for res := range r.resources {
        // Namespace omitted for cross-cluster compatibility
        resources = append(resources, fmt.Sprintf("%s/%s/%s",
            res.Kind, res.Name, res.UID))
    }
    sort.Strings(resources) // Ensure deterministic order
    hash := sha256.Sum256([]byte(strings.Join(resources, "")))
    return hash[:]
}
Sync Detection Flow¶
- Checksum Exchange: Peer sends checksum of all managed resources
- Comparison: Recipient compares with local checksum
- Delta Sync: If different, recipient requests specific resource updates
- Resource-level Checksums: Individual resources compared via spec checksums
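The first two steps of this flow can be illustrated with a standalone sketch. It reuses the checksum calculation shown earlier (sorted `Kind/Name/UID` strings hashed with SHA-256) on a plain slice; the `checksum` and `inSync` helpers are hypothetical names for the example, not functions from the source.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

type ResourceKey struct {
	Name      string
	Namespace string
	Kind      string
	UID       string
}

// checksum follows the calculation described above: one "Kind/Name/UID"
// string per resource, sorted, joined, and hashed with SHA-256.
func checksum(keys []ResourceKey) []byte {
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s/%s/%s", k.Kind, k.Name, k.UID))
	}
	sort.Strings(parts)
	sum := sha256.Sum256([]byte(strings.Join(parts, "")))
	return sum[:]
}

// inSync reports whether two inventories produce the same checksum.
func inSync(local, remote []ResourceKey) bool {
	return bytes.Equal(checksum(local), checksum(remote))
}

func main() {
	principal := []ResourceKey{
		{Name: "guestbook", Namespace: "production-cluster", Kind: "Application", UID: "uid-1"},
		{Name: "default", Namespace: "production-cluster", Kind: "AppProject", UID: "uid-2"},
	}
	// Same resources on the agent: different order and namespace, same identity.
	agent := []ResourceKey{
		{Name: "default", Namespace: "argocd", Kind: "AppProject", UID: "uid-2"},
		{Name: "guestbook", Namespace: "argocd", Kind: "Application", UID: "uid-1"},
	}
	fmt.Println("in sync:", inSync(principal, agent))

	// A recreated resource gets a new UID, so the checksums diverge.
	agent[1].UID = "uid-stale"
	fmt.Println("in sync after recreate:", inSync(principal, agent))
}
```

Because the namespace is left out of the hashed string, the principal-side and agent-side copies compare equal even though they live in different namespaces. A checksum mismatch only says that something differs; finding out what differs is the job of the delta and resource-level steps.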
Resource-level Sync¶
For individual resources, spec checksums detect when updates are needed:
type RequestUpdate struct {
    Name      string
    Namespace string
    UID       string
    Kind      string
    Checksum  []byte // SHA256 of resource spec
}
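A minimal sketch of the resource-level comparison follows. How the spec is serialized before hashing is not stated in this document, so the example assumes a JSON encoding via `encoding/json`; `specChecksum` and `needsUpdate` are hypothetical helper names.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

type RequestUpdate struct {
	Name      string
	Namespace string
	UID       string
	Kind      string
	Checksum  []byte // SHA256 of resource spec
}

// specChecksum hashes the JSON encoding of a spec (assumed serialization).
// encoding/json writes map keys in sorted order, so equal maps hash equally.
func specChecksum(spec any) ([]byte, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(raw)
	return sum[:], nil
}

// needsUpdate reports whether the requester's checksum differs from the
// checksum of the local spec, meaning a fresh copy should be sent.
func needsUpdate(req RequestUpdate, localSpec any) (bool, error) {
	local, err := specChecksum(localSpec)
	if err != nil {
		return false, err
	}
	return !bytes.Equal(req.Checksum, local), nil
}

func main() {
	spec := map[string]string{"project": "default", "targetRevision": "v1"}
	sum, err := specChecksum(spec)
	if err != nil {
		panic(err)
	}
	req := RequestUpdate{Name: "guestbook", Kind: "Application", UID: "uid-1", Checksum: sum}

	stale, _ := needsUpdate(req, spec)
	fmt.Println("update needed (unchanged spec):", stale)

	spec["targetRevision"] = "v2" // spec changes on the source-of-truth side
	stale, _ = needsUpdate(req, spec)
	fmt.Println("update needed (changed spec):", stale)
}
```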
Queue Management¶
Queue Architecture¶
Each agent connection has a dedicated queue pair:
type QueuePair struct {
    sendQ workqueue.TypedRateLimitingInterface[*cloudevents.Event]
    recvQ workqueue.TypedRateLimitingInterface[*cloudevents.Event]
}
Queue Operations¶
Send Queue (Principal → Agent)¶
- Purpose: Buffer outgoing events to agent
- Processing: FIFO order with rate limiting
- Blocking: Sender blocks if queue is full
Receive Queue (Agent → Principal)¶
- Purpose: Buffer incoming events from agent
- Processing: Parallel processing with semaphore control
- Ordering: Per-queue ordering maintained via named locks
Event Processing Pipeline¶
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Receive │───►│ Queue │───►│ Process │
│ Event │ │ Event │ │ Event │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐
│ Send │
│ Response │
└─────────────┘
Queue Lifecycle¶
- Creation: Queue pair created when agent connects
- Processing: Continuous event processing while connected
- Cleanup: Queues drained and removed on disconnect
- Persistence: Events may be held during brief disconnections
Error Handling and Reliability¶
Connection Resilience¶
Reconnection Logic¶
- Exponential Backoff: Agent implements backoff strategy for reconnections
- State Preservation: In-flight events preserved across reconnections
- Automatic Resume: Processing resumes where it left off
Error Classifications¶
- Temporary Errors: Network issues, temporary unavailability
- Permanent Errors: Authentication failures, invalid configurations
- Recoverable Errors: Resource conflicts, validation failures
Event Reliability¶
Delivery Guarantees¶
- At-least-once delivery: Events may be processed multiple times
- Ordering guarantees: Per-queue FIFO ordering maintained
- Idempotency: Event handlers designed to be idempotent
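With at-least-once delivery, the same event can arrive twice, so a handler must tolerate replays. One common way to get there, shown here purely as a sketch, is to remember the `eventid` of each applied event and skip duplicates. The source does not say how the agent's handlers achieve idempotency; the `dedup` type and its `apply` method are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// dedup remembers the IDs of events that were already applied.
// The set grows without bound here; a real implementation would expire entries.
type dedup struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newDedup() *dedup {
	return &dedup{seen: make(map[string]struct{})}
}

// apply runs fn at most once per event ID and reports whether fn ran.
func (d *dedup) apply(eventID string, fn func()) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.seen[eventID]; ok {
		return false // duplicate delivery, already applied
	}
	fn()
	d.seen[eventID] = struct{}{}
	return true
}

func main() {
	d := newDedup()
	applied := 0
	// evt-1 is delivered twice, as at-least-once delivery allows.
	for _, id := range []string{"evt-1", "evt-2", "evt-1"} {
		d.apply(id, func() { applied++ })
	}
	fmt.Println("applied:", applied)
}
```

An alternative that needs no bookkeeping is to make the operation itself convergent, for example by writing the full desired spec instead of applying an incremental change.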
Failure Scenarios¶
Agent Disconnection¶
- Send queue preserves pending events
- Receive queue continues processing
- On reconnection, resync process handles missed events
Principal Restart¶
- All agent connections dropped
- Agents automatically reconnect
- Resync process rebuilds principal state
Network Partitions¶
- Agents operate autonomously during partition
- Resync resolves conflicts when connectivity restored
- Checksums detect and resolve data drift
Monitoring and Observability¶
Metrics¶
- Connection metrics: Active connections, connection duration
- Event metrics: Events sent/received, processing latency
- Error metrics: Failed events, reconnection attempts
- Queue metrics: Queue depth, processing rate
Logging¶
- Structured logging: JSON format with contextual fields
- Trace IDs: Event correlation across components
- Debug levels: Configurable verbosity for troubleshooting
Performance Characteristics¶
Scalability¶
- Concurrent agents: Principal can handle hundreds of simultaneous agents
- Event throughput: Thousands of events per second per agent
- Memory usage: Bounded by queue sizes and connection count
Latency¶
- Event propagation: Sub-second latency for most events
- Resync duration: Proportional to resource count and network latency
- Connection establishment: Typically < 5 seconds including auth
Tuning Parameters¶
- Queue sizes: Configurable per-queue limits
- Processing concurrency: Adjustable semaphore limits
- Connection timeouts: Configurable keepalive and timeout settings