hdmark:
Can anyone who knows jump in and explain what is complicated about this, or share details on how it works?
People have demonstrated SLAM on smartphones for a while. Here's a video describing sensor fusion on Android, dated 6 years ago:
https://www.youtube.com/watch?v=C7JQ7Rpwn2k
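(For context, the "sensor fusion" in that video is largely about combining the gyroscope, which is fast but drifts, with the accelerometer, which is noisy but drift-free. A minimal sketch of a complementary filter in Python; the blend coefficient and axis choice are illustrative, not taken from the video:

    import math

    def complementary_filter(pitch, gyro_rate, accel_x, accel_z, dt, alpha=0.98):
        """Fuse gyro (fast, drifts) with accelerometer (noisy, drift-free).

        pitch: current pitch estimate in radians
        gyro_rate: angular velocity around the pitch axis (rad/s)
        accel_x, accel_z: accelerometer readings (m/s^2)
        dt: time step in seconds
        alpha: how much to trust the integrated gyro vs. the accel tilt
        """
        # Integrate the gyro for a low-noise short-term update...
        gyro_pitch = pitch + gyro_rate * dt
        # ...and compute an absolute (but noisy) tilt from gravity.
        accel_pitch = math.atan2(accel_x, accel_z)
        # Blend: gyro dominates at high frequency, accel corrects drift.
        return alpha * gyro_pitch + (1 - alpha) * accel_pitch

A Kalman filter does the same job more rigorously, but this captures the core idea: let the gyro dominate at high frequency and let gravity slowly pull the estimate back as it drifts.)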
I think what Dacuda isn't telling you is that the environment needs a certain amount of visual clutter. If you're standing in a small room with bare, featureless walls, you'll need to hang some posters. This is why Oculus, Microsoft, and everyone else use multiple cameras: at a minimum, the system needs to see features like corners, where the walls meet the ceiling or floor, in order to reliably extract pose information accurate enough for AR or VR.
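You can check the clutter requirement yourself: monocular SLAM front ends track corner-like image features, and a blank wall produces almost none. A rough sketch with OpenCV, where ORB stands in for whatever detector Dacuda actually uses and the feature cap is arbitrary:

    import cv2

    def trackable_feature_count(frame_bgr, max_features=500):
        """Count corner-like keypoints a SLAM front end could track.

        ORB is a stand-in for the actual detector; the point is
        that blank walls yield almost no keypoints.
        """
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        orb = cv2.ORB_create(nfeatures=max_features)
        keypoints = orb.detect(gray, None)
        return len(keypoints)

A poster-covered wall might return hundreds of keypoints; a bare painted wall can return close to zero, at which point pose tracking simply fails.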
IMO, it's not surprising they're running this on a phone. The real questions are: what are the platform requirements, and how much of the CPU/GPU is left over for your app? Mid-range and low-end phones have been improving at a pretty good pace, so most 64-bit phones will probably have adequate performance. As Jeff says, battery life and heat will be issues if the app depends on their SLAM engine running at full frame rate for very long.
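On the CPU/GPU budget question, one rough way to reason about it is per-frame time accounting: at 30 fps you get about 33 ms per frame, and whatever the SLAM engine doesn't consume is what's left for the app. A sketch, where slam_track is a hypothetical stand-in for whatever per-frame call their engine actually exposes:

    import time

    FRAME_BUDGET_S = 1 / 30  # ~33 ms per frame at 30 fps

    def measure_headroom(slam_track, frame):
        """Time one SLAM update and report what's left for the app.

        slam_track is a hypothetical stand-in for the vendor's
        per-frame tracking call; substitute the real API.
        """
        start = time.perf_counter()
        slam_track(frame)  # hypothetical engine call
        elapsed = time.perf_counter() - start
        return elapsed, FRAME_BUDGET_S - elapsed

If the headroom is routinely near zero, the app drops frames, and sustained full-rate tracking is exactly the kind of load that drains the battery and heats the SoC.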