Lock-free programs have their own issues to overcome. Using your own links:
http://www.eetimes.com/design/embedded/4214763/Is-lock-free-programming-practical-for-multicore-?pageNumber=2
What CAS does is to compare the current content of var with an expected value. If it is as expected, then var is updated to a new value. If it is not as expected, var is not updated. All of this is done as a single noninterruptable, nonpreemptable operation.
The way this is used in the development of lock-free code is to begin by copying the original value of a variable (or pointer) into expected. Then a second copy is used in a (possibly long and complex) calculation to arrive at a new value for the variable. But during the time of the processing, another task—perhaps on another core—may have already modified the variable from its original value. We would like to update the variable with the new value only if it has not been "tampered with" by another task in the interim. CAS does the combined check and update together, without the need for a locking mechanism.
Emphasis mine. So if two different threads want to access my geometry matrix, you are telling me I have to make a copy OF THE ENTIRE WORLD GEOMETRY? Good luck with that. Sure, if you assume infinite memory and trivial copy times, you could get away with it, but for large data structures the approach is infeasible. Regardless, memory usage will increase linearly with the number of threads. The more threads, the more memory you use. The more memory you use, the more likely data will be paged to disk. The more data paged to disk, the more likely you'll lose 100,000 or so clock cycles cleaning up the resulting page fault.
Secondly, you have the dreaded ABA problem:
http://en.wikipedia.org/wiki/ABA_problem
For simple operations (increment/decrement), we already have interlocked operations that are guaranteed to be atomic, so for those at least, locking isn't a significant issue.
For reference:
http://en.wikipedia.org/wiki/Lock-free_and_wait-free_algorithms
All the algorithms have their own problems. There's no magic bullet here.
Wait-Free: Slow, and eats RAM
Lock-Free: Executing thread slows down (Is the cost of decreased thread performance worth it in order to avoid locks? Depends on the app in question)
Obstruction-freedom (or Optimistic Concurrency Control, for you database guys): Sucks performance if a conflict is detected (fast otherwise, but the performance cost in the case of a conflict is SIGNIFICANTLY more than the cost of locks in most cases)