Refactor app_trace locking to use this function.
Further improves performance:
  No contention -> 134 cycles
  Recursion     -> 117 cycles
  Contention    -> 323 cycles