Trilogy AI Center of Excellence

Research

Explore our ongoing and completed research initiatives across key AI domains.

Autonomous Coding Agents

Investigating AI-driven software development with agent harnesses, coding assistants, review loops, and engineering workflows that can operate across real repositories.

Key Findings:

  • Claude Code, Codex, OpenCode, OpenHands, Pi, Cursor, Windsurf, and similar harnesses are converging on tool-using developer agents.
  • Effective agentic engineering requires repo-owned rules, repeatable review loops, and explicit handoff paths.
  • Agent performance must be evaluated on shipped work quality, recovery, cost, and auditability, not just benchmark task completion.
  • The strongest workflows combine human planning and review with autonomous execution inside bounded workspaces.

Status: Ongoing (Ephor Project)

Research Summary

This research initiative explores how autonomous coding agents transform software development when they are used on real repositories rather than toy prompts. The scope includes Claude Code, Codex, OpenCode, OpenHands, Pi, Cursor, Windsurf, and other agent harnesses that can inspect code, call tools, make edits, run tests, and participate in review loops.

The current framing is informed by OpenSymphony, Claude Code practice, Codex workflows, Cursor automations, and articles such as From Spec-Driven Work to Work Orchestration. The central question is how to turn AI coding from a chat interaction into a repeatable engineering system.

Key areas of investigation include:

  • Coding Harness Capability: Compare how agent tools handle repository navigation, edits, tests, shell commands, browser work, and review.
  • Quality of Results: Evaluate correctness, maintainability, security, and architectural fit of generated changes.
  • Human Review Loops: Study how planning, checkpoints, comments, and pull requests improve agent output.
  • Model and Tool Flexibility: Compare provider and harness switching across tasks that require different context, latency, and cost profiles.
  • Engineering Reliability: Track recovery, reproducibility, state management, and handoff quality across longer work sessions.

By addressing these areas, we aim to identify the practices, tools, and evaluation methods that make AI-driven software development dependable enough for production engineering teams.

Related Publications

  • Human-Near-the-Loop
  • Frontier Code Intelligence
  • Fixing Visual AI Slop
  • Why I'm Bullish on OpenAI
  • [How-To] Agent Factory
  • How to Use Claude Code like a Claude Code Engineer

Personal Agent Runtimes

Assess personal agent runtimes such as OpenClaw, Hermes, Operator, Manus, and browser/desktop agents that act inside user-controlled environments.

Key Findings:

  • Personal agents need durable memory, tool access, permission boundaries, and recovery paths to become reliable daily infrastructure.
  • Gateway and runtime design matters as much as model capability because credentials, local context, and approvals live at this layer.
  • Remote and local deployments create different trade-offs around cost, isolation, latency, and user trust.

Status: Ongoing

Research Summary

This research investigates personal agent runtimes: systems that operate on behalf of a user inside local, remote, desktop, browser, or chat-controlled environments. The scope includes OpenClaw, Hermes, GasTown, OpenAI Operator, Manus and OpenManus, browser-use patterns, and other runtimes that combine tools, memory, permissions, and execution state.

We explore generalist agents like OpenAI Operator and Manus, open-source counterparts such as OpenManus, browser automation frameworks such as Browser Use, and local/remote agent gateways that can coordinate with IDEs, terminals, calendars, chat systems, and web browsers.

The objective is to analyze the architecture, robustness, security model, user experience, and operational burden of these runtimes so teams can decide when personal agents should run locally, remotely, inside an IDE, behind a gateway, or as part of a larger automation platform.

Related Publications

  • The Plumbing Wars - Are Claude Managed Agents Worth It?
  • [Opinion] Microsoft Just Unified the Agent Stack, And Forgot the Personal Layer
  • Your first agent, done right
  • Give Your Brains Hands
  • [Technical Deep Dive] Hermes vs. OpenClaw: Two Approaches to Personal AI Infrastructure
  • [Deep Dive] Gastown

Agent Federation & Protocols

Agent Federation Protocols

Explore open protocols for AI interoperability (A2A, MCP, OGP) to enable reliable task handoff, tool access, and gateway federation across agents.

Key Findings:

  • Interoperability requires standardized task communication (A2A), agent data/tool access (MCP), and gateway federation (OGP).
  • A2A provides a framework for general agent coordination, discovery (Agent Cards), and secure communication.
  • OGP adds signed peer-to-peer messaging, bilateral trust, approval flows, and relay into local agent systems.
  • These protocols aim to enable collaboration between specialized agents, breaking down vendor/framework silos.
  • Adoption hurdles, security, and standardization competition are key challenges for these emerging coordination frameworks.

Status: Ongoing (Ephor Project)

Research Summary

This research focuses on the design, comparison, and evaluation of communication and coordination strategies for multi-agent systems (MAS), tackling the challenge of interoperability between heterogeneous agents and frameworks.

We analyze diverse intra-framework architectural patterns (e.g., graph-based orchestration, agent-centric messaging, structured teams) and investigate emerging inter-framework open standards critical for breaking down silos. Key protocols studied include A2A (Agent-to-Agent) for task-oriented communication and discovery, MCP (Model Context Protocol) for standardized tool/data access, and OGP (Open Gateway Protocol) for signed gateway-to-gateway federation, approval flows, and controlled relay into local agent systems.

The research evaluates how well these strategies and protocols support reliable knowledge exchange, dynamic task allocation, efficient coordination, and robust error handling. It pays particular attention to protocol compliance, cost, and reliability, the areas where gaps in current MAS benchmarks are most visible. We also explore adoption hurdles and security considerations for these evolving coordination frameworks.

Related Publications

  • [Framework] Breaking Up with OpenClaw: How OGP Learned to Play with Others
  • [Framework] How Two Agents Collaborated Without Sharing a Repo, Login, or Secret
  • [Framework] Why Shared Expert Knowledge Usually Fails, and the Federation Pattern That Could Make It Work
  • [Opinion] OGP Is the Walkie-Talkie for Agents
  • [Technical Deep Dive] OGP, A2A, and MCP: Three Lanes, Same Highway
  • [Case Study] Building a Protocol in Public: 100 Builds, 7 Days, and What Actually Works

Agent Security Boundaries & Sandboxing

Agent Security Boundaries

Explore execution boundaries, credential handling, tenant isolation, and simulated environments for safe agent operation.

Key Findings:

  • Agent runtimes must separate tool access, credentials, user approvals, and execution state.
  • Sandboxing choices affect both safety and productivity because agents need enough access to complete real work.
  • Supply-chain and remote-execution risks make auditability and isolation part of the core agent architecture.

Status: Ongoing

Research Summary

This research surveys the infrastructure and governance patterns needed to run AI agents safely. As agents interact with codebases, execute commands, access credentials, browse the web, and call enterprise tools, the boundary between productivity and risk becomes an architectural concern rather than a policy footnote.

We investigate containerization, virtualization, tenant isolation, remote execution, credential brokering, approval flows, and specialized cloud platforms like E2B. We also explore standardized development environment platforms such as Daytona that enable consistent and rapidly provisioned workspaces for agent execution.

The goal is to evaluate the capabilities, limitations, security guarantees, and integration patterns of these techniques across coding agents, personal agent runtimes, browser agents, and enterprise automations.
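
As a minimal sketch of how approval flows and auditability can sit in one layer, the Python below gates each tool call against a policy and records every decision. The names `ToolPolicy`, `ToolGate`, and the tool names in the usage note are hypothetical and chosen only for illustration; they do not come from E2B, Daytona, or any specific runtime.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    """Hypothetical per-agent policy; field names are illustrative only."""
    allowed: set[str]                                      # tools the agent may call at all
    needs_approval: set[str] = field(default_factory=set)  # subset that needs human sign-off


@dataclass
class ToolGate:
    """Checks tool calls against a policy and keeps an audit trail."""
    policy: ToolPolicy
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def check(self, tool: str, approved: bool = False) -> str:
        """Return 'deny', 'needs_approval', or 'allow' and record the decision."""
        if tool not in self.policy.allowed:
            decision = "deny"
        elif tool in self.policy.needs_approval and not approved:
            decision = "needs_approval"
        else:
            decision = "allow"
        self.audit_log.append((tool, decision))
        return decision
```

For example, a gate built with `ToolPolicy(allowed={"read_file", "run_shell"}, needs_approval={"run_shell"})` would allow `read_file` directly, hold `run_shell` until `approved=True` is passed, and deny any tool outside the allowed set. A real runtime would also need credential brokering and process isolation, which this sketch leaves out.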

Related Publications

  • Agent Vault keeps secrets out of AI agents' hands
  • Vercel Has a Confirmed Breach
  • The 15.7 Tbps DDoS That Should Scare AI Teams More Than Model Benchmarks
  • Agentic AI in the Wild: Lessons from Anthropic’s GTG-1002

Open Model Strategy & Training

Open Model Strategy

Track open and local model strategy across Qwen, Gemma, Kimi, DeepSeek, model routing, fine-tuning limits, and training economics.

Key Findings:

  • The practical model question has shifted from single-model selection to routing across frontier, open, local, and task-specialized models.
  • Fine-tuning is valuable only when it preserves base capability and beats retrieval or hybrid inference on cost, latency, and reliability.
  • Open-weight model progress is increasingly relevant for coding agents, multimodal workflows, and private enterprise deployments.

Status: Ongoing

Research Summary

This research tracks the fast-moving model strategy landscape with a practical enterprise lens. Rather than treating model development as one planned distillation project, the current work compares open-weight and closed models, local and hosted inference, training and retrieval, and single-model versus routed systems.

Key areas of investigation include:

  • Open Model Selection: Track Qwen, Gemma, Kimi, DeepSeek, OLMo, Granite, Nemotron, and other open or local candidates for coding, reasoning, multimodal, and workflow tasks.
  • Training vs. Retrieval: Evaluate when fine-tuning, adapters, synthetic data, retrieval, context engineering, or hybrid inference produce better results.
  • Model Routing Economics: Compare cost, latency, privacy, reliability, and quality when routing tasks across local, open, and frontier models.
  • Architecture Watch: Continue tracking architectures such as diffusion language models, Mamba-style hybrids, Titans-style memory, long-context models, and multimodal-native small models when they affect deployability.

The goal is to help teams choose the least expensive model strategy that is still good enough for the task, while preserving a path to train or adapt models when the evidence supports it.
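
The "least expensive model that is still good enough" rule can be sketched as a simple selection function. This is an illustration under stated assumptions, not the project's actual router: the `Candidate` type, the candidate names, and the cost and quality numbers are all hypothetical placeholders, and in practice the quality score would come from a task-specific evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str             # hypothetical label, not a real model ID
    cost_per_task: float  # assumed average cost in USD
    quality: float        # assumed eval score in [0, 1] for this task type


def pick_model(candidates: list[Candidate], min_quality: float) -> Candidate:
    """Return the cheapest candidate that clears the quality bar.

    Falls back to the highest-quality candidate when none clears it.
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    good_enough = [c for c in candidates if c.quality >= min_quality]
    if good_enough:
        return min(good_enough, key=lambda c: c.cost_per_task)
    return max(candidates, key=lambda c: c.quality)


# Placeholder numbers for illustration only.
CANDIDATES = [
    Candidate("local-small", 0.001, 0.62),
    Candidate("open-hosted", 0.004, 0.81),
    Candidate("frontier", 0.030, 0.93),
]
```

With these placeholder values, a quality bar of 0.8 selects `open-hosted`, while a bar no candidate meets falls back to `frontier`. Latency, privacy, and reliability constraints could be added as further filters before the cost comparison.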

Related Publications

  • The Gap Closes Again - and This Time It's on Chinese Silicon
  • Kimi K2.6 Is the Open Model Release OpenClaw Users Were Waiting For
  • Qwen 3.6 Open vs Opus 4.7 vs Gemma 4
  • Moonshot Kimi K2.5 on OpenRouter
  • [News Brief] The Resurgence of US Open LLMs
  • [News Brief] Three Significant Open Releases for AI

Reference-Free LLM Evaluation with G-Eval

LLM Evaluation Methods

Explore G-Eval metrics for reference-free evaluation using LLMs as evaluators. Examine challenges of reliability and objectivity.

Key Findings:

  • G-Eval enables reference-free evaluation using LLMs as judges, guided solely by task description and criteria.
  • G-Eval uses structured reasoning (Chain-of-Thought) to produce form-based scores that align strongly with human judgment.
  • Effective evaluation relies on high-quality, diverse datasets specifically generated for LLM-based assessment.
  • LLM-as-judge methods may exhibit bias towards LLM-generated text.
  • Future directions include exploring streamlined reasoning ('Chain-of-Draft') and iterative self-improvement.

Status: Ongoing (Ephor Project)

Research Summary

This research investigates the G-Eval framework, which uses advanced LLMs as judges for robust, reference-free evaluation of language tasks. A core focus is the development of high-quality, diverse synthetic evaluation datasets, generated using techniques like diverse prompting and role-play to create varied inputs (prompts, questions) without relying on reference outputs.

We explore how G-Eval utilizes LLMs’ reasoning capabilities to establish evaluation criteria and perform nuanced, multi-dimensional assessments that show stronger correlation with human judgment than traditional metrics. The integration of these synthetic datasets into evaluation pipelines is examined, alongside critical analysis of challenges such as potential evaluator bias and reliability, and strategies for mitigation.
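
The form-based scoring described above can be sketched in two small pieces: a prompt that gives the judge the task, the criterion, and the evaluation steps, and a score aggregated from the judge's output. The original G-Eval paper weights each candidate score by the probability the judge assigns to it; the sketch below assumes those probabilities have already been obtained from the judge model, and the function names and prompt wording are hypothetical rather than taken from any G-Eval implementation.

```python
def build_judge_prompt(task: str, criterion: str, steps: list[str], candidate: str) -> str:
    """Assemble a form-style judge prompt (illustrative wording only)."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, start=1))
    return (
        f"Task: {task}\n"
        f"Evaluation criterion: {criterion}\n"
        f"Evaluation steps:\n{numbered}\n\n"
        f"Text to evaluate:\n{candidate}\n\n"
        "Score (1-5):"
    )


def weighted_score(score_probs: dict[int, float]) -> float:
    """Probability-weighted score: sum of score * p(score), normalized.

    `score_probs` maps each possible score to the probability the judge
    assigned to it (assumed to be supplied by the caller).
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(score * p for score, p in score_probs.items()) / total
```

If a judge put probability 0.1 on a score of 3, 0.6 on 4, and 0.3 on 5, `weighted_score` would return about 4.2 instead of the single most likely score of 4, which gives finer-grained results when many outputs would otherwise tie.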

Related Publications

  • [Case Study] "Negative Prompting" for Code Review. Hype or Real?
  • The 7B vs 34B Reality: When DSPy Can't Save You
  • Useful or Not: Declarative Self-improving Python
  • Quantifying Expertise Inflation
  • Clash of the Titans
  • Agentic Retrieval Deepdive

Agent Reliability & Evaluation

Agent Reliability Evaluation

Benchmark agent reliability, step completion, tool-calling robustness, recovery, workflow cost, and human-agent collaboration quality.

Key Findings:

  • Capability demos do not predict operational reliability without measuring task completion, recovery, cost, and auditability.
  • Agent evaluation needs workflow-level measures, including step skipping, tool errors, routing quality, and human handoff.
  • A dual-layer evaluation framework separates individual agent competence from system-level coordination and runtime reliability.

Status: Ongoing (Ephor Project)

Research Summary

This research addresses critical gaps in current agent benchmarks, which often overlook operational reliability, step completion, recovery, cost, and human handoff. To enable more rigorous evaluation and optimize human-AI teamwork, we propose and investigate a Dual-Layer Evaluation Framework. This framework separately assesses:

  • (A) Individual Specialist Agent Competence (accuracy, efficiency, robustness, protocol interaction) in isolated tasks; and
  • (B) System-Level Coordination (task routing, workflow efficiency, communication effectiveness, error recovery, overall task success) in complex, collaborative scenarios.

This approach aims to provide deeper insights into both component strengths and emergent system behaviors.
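
One way to picture the two layers is to keep their records and metrics separate, as in the sketch below. The record types, field names, and the particular metrics chosen (per-agent accuracy for layer A; routing accuracy, recovery rate, and success rate for layer B) are assumptions made for illustration and are not the framework's published definitions.

```python
from dataclasses import dataclass


@dataclass
class AgentTaskResult:
    """Layer A record: one agent on one isolated task (hypothetical fields)."""
    agent: str
    correct: bool


@dataclass
class WorkflowRun:
    """Layer B record: one end-to-end multi-agent run (hypothetical fields)."""
    routed_correctly: bool
    errors: int
    errors_recovered: int
    succeeded: bool


def layer_a(results: list[AgentTaskResult]) -> dict[str, float]:
    """Per-agent accuracy on isolated tasks."""
    totals: dict[str, list[int]] = {}
    for r in results:
        hits, count = totals.setdefault(r.agent, [0, 0])
        totals[r.agent] = [hits + int(r.correct), count + 1]
    return {agent: hits / count for agent, (hits, count) in totals.items()}


def layer_b(runs: list[WorkflowRun]) -> dict[str, float]:
    """System-level routing accuracy, error recovery rate, and success rate."""
    if not runs:
        raise ValueError("no workflow runs to score")
    errors = sum(r.errors for r in runs)
    recovered = sum(r.errors_recovered for r in runs)
    return {
        "routing_accuracy": sum(r.routed_correctly for r in runs) / len(runs),
        "recovery_rate": recovered / errors if errors else 1.0,
        "success_rate": sum(r.succeeded for r in runs) / len(runs),
    }
```

Reporting the two dictionaries side by side makes it visible when strong individual agents still produce weak workflows, for example high layer A accuracy alongside a low layer B recovery rate.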

Related Publications

  • Taming Tool Calling with Kimi K2.5
  • [3Qs with AI CoE] Guest Chintan Parekh
  • A Practical Guide to LLM & Agent Evaluation
  • Useful or Not: DeepAgent
  • Agentic Frameworks
  • Navigating the Agent Framework Maze

AI Media Production Workflows

AI Media Production

Analyze AI media workflows for narrated videos, text-to-video, image generation, music models, and reusable production tooling.

Status: Ongoing (Ephor Project)

Research Summary

This research initiative focuses on AI media production workflows, including narrated article videos, text-to-video generation, image generation, music generation, and reusable tooling for turning source material into polished media assets.

Key activities will include:

  • Comparative Model Analysis: We will systematically evaluate image, music, and text-to-video models, assessing their strengths, weaknesses, and suitability for various applications.
  • Technical Specification Tracking: We will maintain a detailed record of technical specifications for key models. This includes parameters such as output resolution, frame rates, supported styles, and control mechanisms.
  • Results Curation and Evaluation: A significant part of this research will involve curating galleries of video outputs. These galleries will be used to compare models based on critical factors like:
    • Prompt Adherence: How faithfully the video output reflects the input text prompt.
    • Reasoning and Coherence: The logical consistency of depicted actions and scenes, and the model’s ability to infer and represent complex relationships.
    • Text Capabilities: The quality and accuracy of any text rendered within the generated videos.

By systematically investigating these aspects, we aim to provide valuable insights into the current capabilities and limitations of AI-generated media and the tooling required to make it production-useful.

Related Publications

  • Reve 2.0's Innovation in Image Generation
  • The Bug That Kept Cutting Our AI Videos Off Mid-Sentence
  • First Contact With Hyperframes
  • We Turn an Article Into a Narrated Video in the Time It Takes to Render (Part 1)
  • How the Machines Finally Learned to Draw
  • ChatGPT Images 2.0 Explained

Realtime Multimodal AI

Multimodal Model Capabilities

Exploring audio/video interactions with realtime streaming, ambient agents, desktop collaboration, and augmented reality.

Status: Ongoing

Research Summary

This research investigates real-time multimodal AI, focusing on seamless, context-aware interactions powered by live audio and video streaming. We analyze capabilities offered by platforms like the Gemini Multimodal Live API, which enable “ambient agents” (e.g., Ephor Coach, Highlight) that continuously understand and respond to their environment. A key focus is exploring applications of audio/video streaming and interaction specifically for Extended Reality (XR) environments, enhancing collaboration and contextual awareness across devices.

Related Publications

  • Late Interaction: ColBERT to Wholembed v3
  • [Deep Dive] Qwen 3.5 Brings Native Multimodality and Long Context to Small Open Models
  • Any Chatbot Can Become a Living Expert

Generating Engaging Visuals for Education

Educational Visual Generation

Presents a pipeline for generating adaptive educational visuals using LLM-driven multimodal models to enhance learning retention.

Key Findings:

  • AI enables generation of educational visuals (diagrams, animations, simulations, assets) directly and via code generation.
  • LLMs can generate code for specialized libraries (TikZ, Manim, p5.js, three.js), automating complex visual creation.
  • Multimodal models directly create assets (illustrations, textures) from prompts, augmenting traditional workflows.
  • A multi-tool strategy is advised: specialized libraries for precision and interactivity, prompt-based AI for asset generation.
  • Effective use requires clear guidelines, training, and mastering LLMs for both code and asset generation.

Status: Completed (Ephor Project)

Research Summary

This research explores AI-powered methods for generating engaging and adaptive educational visuals on demand. We investigate two primary approaches:

  1. Utilizing Large Language Models (LLMs) to generate code for specialized visualization libraries (e.g., TikZ, Manim, p5.js) to create precise diagrams, animations, and interactive elements; and
  2. Employing state-of-the-art multimodal generative models (like GPT-4o) for direct image creation from text prompts, suitable for illustrations and assets.

The project evaluates the effectiveness, feasibility, and integration strategies of these tool-based and direct generation techniques to enhance learning experiences and retention.

Related Publications

  • Generating Engaging Visuals for Education

Empowering Learners with AI Tutors

AI Tutoring

Develop adaptive AI-driven tutoring workflows providing personalized feedback and guidance to improve learning outcomes.

Key Findings:

  • AI Tutors provide personalized scaffolding, adaptive feedback, and tailored paths, showing significant learning gains.
  • Core intelligent tutoring system (ITS) components (Domain, Student, and Tutoring Models, plus the UI) enable dynamic adaptation to learner needs.
  • AI tutors can foster Self-Regulated Learning; Open Learner Models enhance student self-awareness.
  • Synergy with personalized and self-directed learning principles creates engaging and effective educational models.
  • Challenges include development complexity, pedagogical validation, data privacy, and balancing support vs. autonomy.

Status: Ongoing (Ephor Project)

Research Summary

This research investigates how AI Tutors can transform education through personalized and self-directed learning. By leveraging NLP, ML, and adaptive algorithms, AI tutors customize learning pathways, content delivery (including dynamic text and image generation), and feedback in real time. Our focus is on building and evaluating prototypes featuring these capabilities, including adaptive assessment modules, to enhance learner autonomy, engagement, and knowledge retention. We analyze the impact of these AI-driven approaches on learning outcomes and explore the technical and pedagogical considerations for effective implementation.

Related Publications

  • [How-To] Breaking the Speed Limit with Bedrock & Learners Lens
  • Ready User One: LearnLens
  • DSPy Unleashed: We Built a Self-Improving System That Teaches Anything to Anyone
  • Scientific Discourse for Builders
  • The One Rule That Made My AI Tutor 3× Cheaper (Without Losing Accuracy)
  • Beyond Adoption: Defining Real AI Impact at Trilogy

© 2025 Trilogy AI Center of Excellence. All Rights Reserved.
