Research Results & Evaluation

RQ1: LLM Interpretation Accuracy

This section evaluates how accurately the LLM can interpret natural-language delivery modification requests and convert them into structured, validated action schemas.

Overall Performance Metrics

F1-score: 0.9691
Precision: 0.94
Recall: 1.00
Average Confidence Score: 0.953

Outcome Classification by Action Type

The stacked bar chart provides a granular look at the classification outcomes across the 100 test messages, categorizing results into True Positives (TP), False Positives (FP), and False Negatives (FN). This distribution is critical for understanding the system's reliability in identifying actionable requests versus misinterpreting noise.
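The headline metrics relate to these outcome counts in the standard way. The following is a minimal sketch rather than the evaluation script used in the study: the function names are hypothetical, and the counts in the last lines are illustrative values chosen only because they give a precision of 0.94 and a recall of 1.0 (the actual TP/FP/FN counts are not stated here).

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) computed from outcome counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall reproduce the reported F1-score.
reported_f1 = f1_score(0.94, 1.0)
print(round(reported_f1, 4))  # 0.9691

# Illustrative counts only (assumption, not the study's data).
example_precision, example_recall = precision_recall(tp=47, fp=3, fn=0)
print(example_precision, example_recall)  # 0.94 1.0
```

A recall of 1.0 means no false negatives, so the gap between the F1-score and 1.0 comes entirely from false positives.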

Performance Across Operational Domains

The chart illustrates the performance of SmartAdmin across the four primary operational domains. The Average Confidence Score serves as a composite metric reflecting the system’s reliability in instruction matching, action selection, parameter extraction and schema formatting.

Key Findings: These results show that an instruction-centric RAG architecture effectively grounds LLM reasoning within uParcel's operational SOPs. The system still tends to assume parameters in ambiguous messages, specifically when domain keywords such as "profile" are missing, and this remains an area for refinement. Even so, the high overall F1-score (0.9691) confirms that SmartAdmin provides a robust and accurate foundation for automated logistics administration.

RQ2: Safety Mechanisms Effectiveness

This section evaluates the effectiveness of the safety mechanisms: SOP-based validation, risk classification, human-in-the-loop (HITL) review, and audit logging.

Safety Layer Performance

HITL Intervention Rate: 82.1%
Unsafe Actions Executed: 0
Activities/Actions Logged: 100%

Detailed Safety Metrics

Metric	Calculation	Result
False Negative Rate	Unsafe Actions Executed ÷ Total Unsafe Actions Submitted	0% (0/24)
HITL Intervention Rate	Actions Requiring Human Approval ÷ Total Actions	82.1% (23/28)
Traceability Completeness	Actions with Complete Logs ÷ Total Actions	100% (28/28)
Safety Test Pass Rate	Safety Test Cases Passed ÷ Total Safety Test Cases	100% (28/28)
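The percentages in the table follow directly from the counts shown in the Result column. As a quick arithmetic check (a sketch only; the helper name `rate` is hypothetical):

```python
def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator as a percentage, rounded to one decimal."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100 * numerator / denominator, 1)


# Counts taken from the Result column of the table above.
safety_metrics = {
    "False Negative Rate": rate(0, 24),
    "HITL Intervention Rate": rate(23, 28),
    "Traceability Completeness": rate(28, 28),
    "Safety Test Pass Rate": rate(28, 28),
}

for name, value in safety_metrics.items():
    print(f"{name}: {value}%")
```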

Key Findings:

Selective HITL Enforcement: The 82.1% HITL Intervention Rate is a direct result of the test design. Each of the 5 action domains (Edit Remarks, Address, Urgency, Job Analysis, Profile) included exactly one "whitelisted" scenario, so 5 of the 28 test actions were eligible for direct execution and the remaining 23 (23/28 = 82.1%) required human approval.
Whitelisted Execution Logic: In the 5 whitelisted cases (e.g., RQ2-R06, RQ2-A06), the system correctly bypassed the manual approval gate and executed the task directly via Playwright.
Persistent Traceability: Even when an action was whitelisted for direct execution, audit logging remained fully functional, capturing the identity, timestamp, and automation outcome for 100% of those events.
Default-Block Reliability: For the remaining 23 non-whitelisted or high-risk cases, the safety layer blocked or escalated every request to the admin dashboard (a 100% success rate), confirming the reliability of the role-based access control (RBAC).
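The behaviour described in these findings (default-block, a whitelist for direct execution, and logging on every path) can be sketched as follows. This is an illustrative model, not SmartAdmin's implementation: the class names, field names, and the idea of keying the whitelist by case ID are all assumptions, and "EXAMPLE-01" is a made-up identifier rather than a real test case.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Action:
    case_id: str        # e.g. "RQ2-R06"
    domain: str         # e.g. "Edit Remarks"
    requested_by: str   # identity recorded in the audit log


@dataclass
class SafetyGate:
    whitelist: set[str]  # hypothetical: case IDs allowed to run directly
    audit_log: list[dict] = field(default_factory=list)

    def route(self, action: Action) -> str:
        """Default-block: only whitelisted actions skip human approval."""
        outcome = "execute" if action.case_id in self.whitelist else "escalate"
        # Every decision is logged, including whitelisted direct executions.
        self.audit_log.append({
            "case_id": action.case_id,
            "identity": action.requested_by,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "outcome": outcome,
        })
        return outcome


gate = SafetyGate(whitelist={"RQ2-R06", "RQ2-A06"})
print(gate.route(Action("RQ2-R06", "Edit Remarks", "admin")))     # execute
print(gate.route(Action("EXAMPLE-01", "Address", "admin")))       # escalate
print(len(gate.audit_log))                                        # 2
```

Because the log entry is written before the outcome is returned, traceability does not depend on which branch is taken, which matches the 100% logging result reported for both whitelisted and escalated actions.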

RQ3: Workflow Efficiency & Time Savings

This section compares AI-assisted automation with baseline manual workflows for delivery modification tasks.

Time Reduction Metrics

Average Interaction Steps Reduction: 82.3056%
Average Completion Time Reduction: -25.622%

Interaction Steps Comparison

Across the 20 evaluated tasks, SmartAdmin consistently reduced interaction to a single confirmation step, while manual execution typically required between 4 and 12 discrete interactions per task. This yields an Average Human Effort Reduction (the Average Interaction Steps Reduction reported above) of 82.3056%, indicating that operators perform substantially fewer clicks and field edits when using the system.

Task Completion Time Comparison

The Task Completion Time chart compares raw completion times between manual and SmartAdmin executions for each task. For tasks T01 to T10, SmartAdmin completion time ranges from about 19 to 31 seconds, whereas manual completion can spike to 78 seconds for the most complex address changes. For tasks T11 to T20, manual completion drops to around 9 to 18 seconds while SmartAdmin stays within its typical 21 to 28 second window, which produces negative completion time reductions despite the lower interaction effort shown in the previous chart (Human Effort Reduction).
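The reduction figures in this section can be read as percentage changes relative to the manual baseline. The sketch below assumes the formula (manual - automated) / manual, which is not stated explicitly in the results, and it pairs the endpoint values quoted in the text purely for illustration; these pairings are not actual per-task measurements.

```python
def percent_reduction(manual: float, automated: float) -> float:
    """Reduction relative to the manual baseline, in percent.

    A negative value means the automated run needed more than the manual one.
    """
    if manual <= 0:
        raise ValueError("manual baseline must be positive")
    return 100 * (manual - automated) / manual


# Interaction steps: 4 to 12 manual steps versus 1 confirmation step.
print(percent_reduction(4, 1))                # 75.0
print(round(percent_reduction(12, 1), 1))     # 91.7

# Completion time in seconds, using quoted range endpoints (illustrative pairings).
print(round(percent_reduction(78, 31), 1))    # 60.3
print(round(percent_reduction(9, 21), 1))     # -133.3
```

Under this formula, a fast manual task that takes longer with automation produces a large negative percentage, while a slow manual task that is sped up can contribute at most 100%. If the reported average is a mean of per-task percentages, this asymmetry would explain how it can be negative even though some tasks improve substantially.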

Key Findings: RQ3 yields a mixed outcome. SmartAdmin substantially reduces manual interaction steps (82.3056% Average Human Effort Reduction) but currently increases average completion time (-25.622% Average Completion Time Reduction) across the full task set. AI-assisted workflow automation is therefore not superior to manual execution on every dimension; it trades reduced operator workload against added latency. These findings suggest that SmartAdmin is most valuable for higher-volume or more complex routine tasks, where click reduction and cognitive load matter. Very simple, low-latency operations may remain faster when performed manually unless further performance optimisations are implemented.