Sotto Voce: Differentially Private Federated Learning for Speech Recognition

Our goal is to build an end-to-end speech recognition system trained through differentially private learning over a federation of audio corpora. The result should be a robust, scalable, privacy-preserving implementation that resists membership inference and reconstruction attacks and that also offers confidential model construction to the user analyst.

Sotto Voce is a system for creating speech recognition models that addresses four fundamental privacy barriers to effective collaborative speech modeling. The framework lets speech data owners combine their training resources for increased model accuracy (1) without sharing raw data and (2) without revealing the set of models each partner needs to train. It also guarantees that the resulting models will not leak (3) who is in the data or (4) anything those speakers have been recorded saying.

Speech Recognition
The modern speech recognizer is a data-hungry beast. The tremendous increase in accuracy these systems have achieved in the last ten years is matched only by their demand for ever larger datasets. While industrial systems feast on thousands of hours of training data, recognition systems built for smaller niche use cases are forced to survive on tiny morsels. The disparity arises when systems are expected to work in scenarios and languages for which data is scarce. Small pockets of such data are owned by various organizations, but barriers commonly exist that prevent sharing. The inability to share translates into real-world financial costs of locating or creating alternative data sources. Federated learning with differential privacy provides a framework for building models collaboratively, without revealing or leaking the raw sensitive data outside the control of the individual data owners.
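To make the last point concrete, the following is a minimal sketch of one common recipe for differentially private federated averaging: each data owner computes a model update locally, the update is clipped to a fixed L2 norm, and Gaussian noise is added to the aggregate before averaging. It is an illustration under stated assumptions, not a description of how Sotto Voce itself is implemented. The function names, the `clip_norm` and `noise_multiplier` parameters, and the toy two-dimensional updates are all hypothetical, and the sketch omits the privacy accounting needed to turn a noise level into a formal privacy guarantee.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale an update so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_federated_average(client_updates, clip_norm, noise_multiplier, rng):
    """Average clipped client updates, adding Gaussian noise to their sum.

    Only the model updates reach the aggregator; the raw audio stays with
    each data owner. Clipping bounds any single owner's influence, and the
    noise (std = noise_multiplier * clip_norm) masks what remains.
    """
    n = len(client_updates)
    dim = len(client_updates[0])
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    summed = [sum(u[i] for u in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / n for x in noisy]


# Toy example: three data owners, each contributing a 2-parameter update
# (hypothetical values; a real update would be a full set of model weights).
rng = random.Random(0)
updates = [[0.5, -0.2], [3.0, 4.0], [-0.1, 0.3]]
new_delta = dp_federated_average(updates, clip_norm=1.0,
                                 noise_multiplier=1.1, rng=rng)
```

In this sketch the second owner's large update is scaled down to norm 1.0 before aggregation, so no single corpus can dominate the shared model. A larger `noise_multiplier` gives stronger privacy at the cost of a noisier average.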