
adds containerd-shim-lcow-v1 shim #2627

Open
rawahars wants to merge 4 commits into microsoft:main from rawahars:lcow-shim-draft

Conversation

Contributor

@rawahars rawahars commented Mar 10, 2026

Summary

This pull request introduces a new implementation of the containerd-shim-lcow-v1 for Windows, used for running Linux Containers on Windows.

The changes cover the main entrypoint, shim manager logic, plugin registration, the shared service implementation, and a manifest for proper Windows compatibility. The major changes are:

  • The main entrypoint and supporting logic to start + serve the shim server.
  • The stubs for Task Service and Shimdiag service.
  • The implementation for Sandbox service.
  • The implementation for VM Controller which manages the VM lifecycle and operations.

This shim follows the upstream pattern of using containerd/pkg/shim to manage its lifecycle. It uses the plugin mechanism to register its services with containerd, and it is initialized during startup.

This PR depends on PR #2612 and PR #2616.

@rawahars rawahars requested a review from a team as a code owner March 10, 2026 11:31
// the guest Linux stacks – to the logrus logger tagged with the sandbox ID.
// This provides an out-of-band diagnostic snapshot without requiring the shim
// to be paused or restarted.
func ETWCallback(sourceID guid.GUID, state etw.ProviderState, level etw.Level, matchAnyKeyword uint64, matchAllKeyword uint64, filterData uintptr) {
Contributor

If we do the ETW setup here, do we need to export this function, or can we just have the plugin do the ETW registration and the callback registration?

events chan interface{}

// sandboxID is the unique identifier for the sandbox managed by this Service instance.
sandboxID string
Contributor

In LCOW's case the sandbox ID is always the UVM ID, right? Can we just call it that here, or at least say so in the comment?

Contributor Author

So, the UVM ID is not this identifier, though it is derived from it. This sandboxID corresponds to the ID which is stored and used by containerd; the VM's ID is "SandboxID@vm".

Maybe we can add this to the comment:

For LCOW shim, sandboxID corresponds 1-1 with the UtilityVM managed by the shim.

sandboxID string

// vmHcsDocument holds the HCS compute-system document used to create the VM.
vmHcsDocument *hcsschema.ComputeSystem
Contributor

I thought the whole point of the VM controller was to have things like this hidden away, so they are always up to date, always ref-counted, etc., when a modification is made?

span.AddAttributes(
trace.StringAttribute("sandbox-id", request.SandboxID),
trace.StringAttribute("bundle", request.BundlePath),
trace.StringAttribute("net-ns-path", request.NetnsPath))
Contributor

The property is netns, one word. Should we use a constants file for these property names, for consistency and easier updates later?

// atomicState is a concurrency-safe VM state holder backed by an atomic int32.
// All reads and writes go through atomic operations, so no mutex is required
// for state itself.
type atomicState struct {
Contributor

This is fine, but seems excessive.
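For reference, the pattern under discussion is roughly the following (a minimal sketch using sync/atomic; the state names and field layout here are illustrative, not the PR's exact code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// State enumerates illustrative VM lifecycle states.
type State int32

const (
	StateNotCreated State = iota
	StateCreated
	StateRunning
	StateStopped
)

// atomicState holds a State behind atomic loads/stores, so readers never
// need the Manager's mutex just to inspect the current state.
type atomicState struct {
	v atomic.Int32
}

func (s *atomicState) load() State    { return State(s.v.Load()) }
func (s *atomicState) store(st State) { s.v.Store(int32(st)) }

func main() {
	var s atomicState
	s.store(StateRunning)
	fmt.Println(s.load() == StateRunning)
}
```

The reviewer's point stands: if every access already happens under the Manager's mutex, a plain field is simpler and the atomics add nothing.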

// We will create the guest connection via GuestManager during StartVM.
c.guest = guestmanager.New(ctx, uvm)

c.vmState.store(StateCreated)
Contributor

See, you already hold the mutex here, so this store didn't need to be atomic.

c.mu.Lock()
defer c.mu.Unlock()

if c.vmState.load() == StateRunning {
Contributor

For these I'd prefer a single read, then handle the cases off that value, if we stay with atomics. If it's just a property guarded by the mutex, it's fine.

// use parent context to prevent the 2-minute timeout (set above) from overriding the terminate
// operation's timeout and erroring out prematurely
_ = c.uvm.Terminate(pCtx)
return fmt.Errorf("failed to start VM: %w", err)
Contributor

Shouldn't cases like this set the state to stopped as well? I wonder if you want any error from this method to set the state to stopped.

}

for _, driver := range drivers {
_ = driver
Contributor

Huh? This loop just discards each driver without doing anything.

In this commit, we are adding the initial draft of `containerd-shim-lcow-v1`. This shim is used for running Linux Containers on Windows.

As part of this commit, we are adding the following:
- The main entrypoint and supporting logic to start + serve the shim server.
- The stubs for Task Service and Shimdiag service.
- The implementation for `Sandbox` service.
- The implementation for `VM Controller` which manages the VM lifecycle and operations.

Signed-off-by: Harsh Rawat <harshrawat@microsoft.com>
@rawahars rawahars force-pushed the lcow-shim-draft branch 5 times, most recently from 6777136 to 1bb1938 on March 15, 2026 11:31
Signed-off-by: Harsh Rawat <harshrawat@microsoft.com>
events chan interface{}

// sandboxID is the unique identifier for the sandbox managed by this Service instance.
// For LCOW shim, sandboxID corresponds 1-1 with the UtilityVM managed by the shim.
Contributor

are we not planning on adding multi-pod support?

Contributor Author

So multi-pod support will be added using podControllers, where each controller owns the lifecycle of a specific pod. The sandboxID corresponds to the UVM which hosts the pods.

The flow from CRI would be:

- Call SandboxController.Create -> Creates the UVM aka Sandbox to host all the pods
- Call NewTask while passing in the SandboxID via WithSandbox -> Creates a pod in the sandbox

This way, the io.kubernetes.cri.sandbox-id annotation on the CreateTask request identifies the pod to which the container needs to be added.

Comment on lines +25 to +29
func (s *Service) diagTasksInternal(ctx context.Context, request *shimdiag.TasksRequest) (*shimdiag.TasksResponse, error) {
_ = ctx
_ = request
return nil, errdefs.ErrNotImplemented
}
Contributor

Suggested change
func (s *Service) diagTasksInternal(ctx context.Context, request *shimdiag.TasksRequest) (*shimdiag.TasksResponse, error) {
_ = ctx
_ = request
return nil, errdefs.ErrNotImplemented
}
func (s *Service) diagTasksInternal(_ context.Context, _ *shimdiag.TasksRequest) (*shimdiag.TasksResponse, error) {
return nil, errdefs.ErrNotImplemented
}

Same for the two below, and for the methods in service_task_internal.go.

// diagExecInHostInternal is the implementation for DiagExecInHost.
//
// It is used to create an exec session into the hosting UVM.
func (s *Service) diagExecInHostInternal(ctx context.Context, request *shimdiag.ExecProcessRequest) (*shimdiag.ExecProcessResponse, error) {
Contributor

+1

// document used to create an LCOW Utility VM.
func BuildSandboxConfig(
ctx context.Context,
owner string,
Contributor

Why hard-code the shim name here and not pass it in from the cmd/containerd-shim-lcow-v2 code? We do set different values for the owner in uvmboot and internal test code.

Contributor Author

Right. Let me revert it back.

vmState State

// mu guards the concurrent access to the Manager's fields and operations.
mu sync.Mutex
Contributor

sync.RWMutex might make sense here

Contributor Author

Awesome suggestion. I have incorporated the same.
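The suggestion amounts to the following shape (a sketch; the field names are illustrative, not the PR's exact ones): read-only accessors take RLock and can run concurrently, while mutations keep the exclusive Lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Manager sketches the sync.RWMutex split: many readers may hold RLock at
// once; a writer holding Lock excludes everyone.
type Manager struct {
	mu      sync.RWMutex
	vmState string
}

// State is read-only, so RLock lets concurrent callers read in parallel.
func (m *Manager) State() string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.vmState
}

// setState mutates, so it needs the exclusive lock.
func (m *Manager) setState(s string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.vmState = s
}

func main() {
	m := &Manager{}
	m.setState("running")
	fmt.Println(m.State())
}
```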

c.mu.Unlock()
return "", fmt.Errorf("cannot dump stacks: VM is in incorrect state %s", c.vmState)
}
c.mu.Unlock()
Contributor

probably best to keep c.mu locked while making the DumpStacks guest request so the uVM doesn't get mutated/terminated during the operation

Contributor Author

That makes sense. Implemented the change.

Comment on lines +67 to +68
Width = "width"
Height = "height"
Contributor

Do we use these anywhere besides the one location in service_task.go?

Contributor Author

Nope, they are not used elsewhere, but Justin was of the opinion that we should standardize them in logfields.

Version = "version"
ShimPid = "shim-pid"
TaskPid = "task-pid"
ExecSpanID = "exec-id"
Contributor

There's already an ExecID above, and ExecSpanID doesn't really make sense, since it's used to log the exec ID and not an associated trace/span.

Contributor Author

Makes sense. Implemented the change.

Signed-off-by: Harsh Rawat <harshrawat@microsoft.com>
Contributor

@shreyanshjain7174 shreyanshjain7174 left a comment

Review of the critical items — each comment is a question for clarification.

Comment on lines +64 to +69
func NewController() *Manager {
return &Manager{
logOutputDone: make(chan struct{}),
vmState: StateNotCreated,
}
}
Contributor

So this returns *Manager instead of the Controller interface?

Contributor Author

Yes, this aligns with Go’s principle of “accept interfaces, return concrete types.”
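A minimal illustration of that principle (the names here are illustrative, not the PR's actual types): the constructor returns the concrete type, while consumers that only need the behavior accept the interface.

```go
package main

import "fmt"

// Controller is the behavior consumers depend on.
type Controller interface {
	Start() error
}

// Manager is the concrete implementation.
type Manager struct{ started bool }

func (m *Manager) Start() error { m.started = true; return nil }

// NewController returns the concrete *Manager ("return concrete types"),
// which keeps the full method set available to the package that owns it.
func NewController() *Manager { return &Manager{} }

// boot only needs the behavior, so it accepts the interface
// ("accept interfaces").
func boot(c Controller) error { return c.Start() }

func main() {
	m := NewController()
	fmt.Println(boot(m) == nil, m.started)
}
```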

close(c.logOutputDone)
}
}()
}
Contributor

The if _, ok := <-c.logOutputDone; ok { close(c.logOutputDone) } pattern looks broken — logOutputDone is an unbuffered chan struct{} that nobody ever sends on, only closes. So the receive here blocks forever until someone else closes it, at which point ok == false and the close branch is skipped.

This means Wait() will hang on WCOW because logOutputDone is never signaled from this goroutine?

Should this just be close(c.logOutputDone) after wg.Wait(), matching the LCOW pattern in vm_lcow.go?

Contributor Author

Nice catch. This seems broken in the old shim too.
Fixed in both places.
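The fixed pattern, sketched with illustrative names: a single goroutine closes the channel exactly once after wg.Wait(), and Wait() simply receives from it. No one ever sends on the channel, so the receive-then-conditionally-close form from the broken version is never needed.

```go
package main

import (
	"fmt"
	"sync"
)

// logForwarder sketches the corrected shutdown signaling.
type logForwarder struct {
	wg            sync.WaitGroup
	logOutputDone chan struct{}
}

func (f *logForwarder) start(numWorkers int) {
	f.logOutputDone = make(chan struct{})
	for i := 0; i < numWorkers; i++ {
		f.wg.Add(1)
		go func() {
			defer f.wg.Done()
			// ... forward one log stream ...
		}()
	}
	// Close exactly once, after every worker has returned.
	go func() {
		f.wg.Wait()
		close(f.logOutputDone)
	}()
}

// Wait blocks until all log output processing has completed; receiving from
// a closed channel returns immediately.
func (f *logForwarder) Wait() { <-f.logOutputDone }

func main() {
	f := &logForwarder{}
	f.start(3)
	f.Wait()
	fmt.Println("all log workers done")
}
```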

logrus.WithError(err).Fatal("failed to listen for windows logging connections")
}

// Use a WaitGroup to track active log processing goroutines.
Contributor

logrus.Fatal calls os.Exit(1) — this kills the entire shim process if the log listener fails to bind. Isn't this a recoverable/non-critical failure? Should this use logrus.Error + close(c.logOutputDone) instead, so Wait() doesn't block and the shim can continue operating without log forwarding?

Contributor Author

Added the same.
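A sketch of the suggested graceful degradation, using the stdlib log package and a TCP listener purely for illustration (the real code uses logrus and a named-pipe listener; setupLogListener is a hypothetical name): on bind failure, log the error and close the done channel so Wait() does not hang, and the shim keeps running without log forwarding.

```go
package main

import (
	"fmt"
	"log"
	"net"
)

// setupLogListener returns nil on failure instead of exiting the process.
func setupLogListener(addr string, done chan struct{}) net.Listener {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		// A Fatal here would os.Exit(1) the whole shim; instead, record
		// the error and signal done so Wait() returns immediately.
		log.Printf("failed to listen for logging connections: %v", err)
		close(done)
		return nil
	}
	return ln
}

func main() {
	done := make(chan struct{})
	// A malformed address forces the error path for demonstration.
	ln := setupLogListener("bad:addr:port", done)
	<-done // returns immediately: the error path closed the channel
	fmt.Println(ln == nil)
}
```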

Args: request.Args,
Workdir: request.Workdir,
Terminal: request.Terminal,
Stdin: request.Stdin,
Contributor

DumpStacks only reads vmState and calls guest.DumpStacks() — there's no state mutation here. Should this be mu.RLock() / mu.RUnlock() like ExecIntoHost and Wait do, to avoid unnecessarily blocking concurrent readers?

Contributor Author

Yes, sure. We can implement that.

return "", nil
}

// Wait blocks until the VM exits and all log output processing has completed.
Contributor

Stats() takes mu.Lock() because of the lazy vmmemProcess initialization, but this blocks all other operations (including ExecIntoHost, Wait, DumpStacks). Would it make sense to initialize vmmemProcess during StartVM, or use a separate sync.Once for the lazy init so Stats() can use RLock for the state check?

Contributor Author

I like the idea of RLock in Stats, but the vmmemProcess init should remain in this method, as that simplifies the LM scenario.
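The combination the thread converged on, keeping the lazy in-method init while allowing the read lock, can be sketched as follows (Manager, findVmmem, and the fields are hypothetical stand-ins): sync.Once serializes the one-time lookup internally, so Stats() only ever needs RLock and no longer blocks concurrent readers.

```go
package main

import (
	"fmt"
	"sync"
)

type Manager struct {
	mu        sync.RWMutex
	running   bool
	vmmemOnce sync.Once
	vmmemPid  int
}

// findVmmem is a stand-in for the real vmmem process lookup.
func (m *Manager) findVmmem() int { return 4242 }

// Stats checks state under RLock; the one-time init is guarded by Once
// (which has its own internal mutex), so no exclusive lock is needed and
// concurrent Stats/Wait/DumpStacks calls do not serialize.
func (m *Manager) Stats() (int, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	if !m.running {
		return 0, fmt.Errorf("VM not running")
	}
	m.vmmemOnce.Do(func() { m.vmmemPid = m.findVmmem() })
	return m.vmmemPid, nil
}

func main() {
	m := &Manager{running: true}
	pid, err := m.Stats()
	fmt.Println(err == nil, pid)
}
```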

// Create the serve command.
cmd, err := newCommand(ctx, id, opts.Address, address, f)
if err != nil {
return params, err
Contributor

This uses stdlib log.Fatalf (not logrus), which calls os.Exit(1). Shouldn't this return the error instead — return params, fmt.Errorf("failed to create event: %w", err) — so the caller can handle it gracefully?

Contributor Author

Yes, sure that works.

Signed-off-by: Harsh Rawat <harshrawat@microsoft.com>