Here's what I see as the primary challenges that need to be solved:
- Syntactic expression of pipelines that enables re-use of processing blocks
- Too many technologies for a developer to learn:
  - Algorithms
  - Operating systems (for scheduling the algorithms to take advantage of multi-core architectures)
  - SIMD / vectorization (for taking advantage of special architectures for vision processing)
  - GPU usage, either for simple parallelism or with a deep learning framework
- The need to iterate quickly, as algorithms get updated and workloads shift with the addition of custom hardware
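
To make the first point concrete, here is a rough Python sketch of what a pipeline syntax with re-usable blocks could look like. The `Block` class, the `|` chaining, and the stage names `to_gray` and `threshold` are all hypothetical illustrations, not taken from any existing framework.

```python
from typing import Callable, List, Tuple


class Block:
    """A reusable processing stage wrapping a plain function (hypothetical)."""

    def __init__(self, fn: Callable, name: str = ""):
        self.fn = fn
        self.name = name or fn.__name__

    def __or__(self, other: "Block") -> "Block":
        # `a | b` builds a new block that runs a, then feeds its output to b.
        return Block(lambda x: other.fn(self.fn(x)), f"{self.name} | {other.name}")

    def __call__(self, x):
        return self.fn(x)


def to_gray(pixels: List[Tuple[int, int, int]]) -> List[int]:
    # Average the RGB channels of each pixel (integer division).
    return [(r + g + b) // 3 for r, g, b in pixels]


def threshold(values: List[int]) -> List[int]:
    # Binarize: 1 where the value exceeds 127, else 0.
    return [1 if v > 127 else 0 for v in values]


gray = Block(to_gray)
binarize = Block(threshold)

# The same blocks can be reused in any number of pipelines.
detect = gray | binarize
result = detect([(255, 255, 255), (0, 0, 0), (200, 100, 90)])
```

The point of a syntax like this is that each block stays independent of the pipeline it is used in, so where and how a stage runs (core, SIMD unit, GPU) could in principle be decided behind the block boundary.
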
I built the audio concurrency framework that's been running on all of Qualcomm's 7K (from the first Android phone onwards), 8K and 9K platforms. The challenges there were audio and voice concurrency, the need to handle different sampling rates, and a long list of post-processing blocks. I believe we solved them elegantly, with the right balance of re-configurability and simplicity, on resource-constrained (MHz and memory) platforms. I see similar challenges in CV-based solutions for ADAS.
It will be a fun ride.
Kuntal. 
