ANALYZING AUDIO FEATURES TO DISTINGUISH HUMAN VOICE FROM AI

Authors

  • Muxriddin Abduganiev
  • Mamura Uzakova
  • Aygul Burxanova

Keywords:

speech synthesis, acoustic features, MFCC, zero crossing rate, spectral centroid, librosa, human-computer interaction, voice forensics

Abstract

Distinguishing a real human voice from AI-generated speech is becoming
increasingly important for security, forensics, and everyday technology.
This study analyzes three acoustic features: Mel-Frequency Cepstral
Coefficients (MFCC), Zero Crossing Rate (ZCR), and Spectral Centroid. The
audio was preprocessed in Python using the librosa library, with silent
segments removed. The results were clear: real human speech has a higher
spectral centroid and much greater variation in its MFCCs, whereas the ZCR
values did not differ greatly between the two. These results demonstrate
that simple audio features can be used effectively for automatic detection
of synthetic voices.
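
To make two of the features named in the abstract concrete, the following is a minimal NumPy sketch of frame-wise Zero Crossing Rate and Spectral Centroid, with a simple energy-based removal of silent frames. It is an illustration under stated assumptions, not the authors' pipeline: the abstract says only that librosa was used, and librosa provides comparable routines (`librosa.feature.zero_crossing_rate`, `librosa.feature.spectral_centroid`, `librosa.feature.mfcc`, `librosa.effects.split`). The helper names, the frame and hop sizes, the 40 dB silence threshold, and the synthetic test tone are all assumptions chosen for the example. MFCC is omitted for brevity.

```python
import numpy as np


def frame_signal(y, frame_length=2048, hop_length=512):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = (np.arange(frame_length)[None, :]
           + hop_length * np.arange(n_frames)[:, None])
    return y[idx]


def drop_silent_frames(frames, top_db=40.0):
    """Keep frames whose RMS level is within top_db of the loudest frame.

    Assumption: a simple stand-in for the silence removal the abstract
    mentions; the threshold is illustrative.
    """
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    ref = max(rms.max(), 1e-10)
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-10) / ref)
    return frames[level_db > -top_db]


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def spectral_centroid(frames, sr):
    """Magnitude-weighted mean frequency (Hz) of each frame."""
    window = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (mag @ freqs) / np.maximum(mag.sum(axis=1), 1e-12)


# Synthetic input (assumption): 0.5 s of silence followed by a 1 s, 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
y = np.concatenate([np.zeros(sr // 2), tone])

all_frames = frame_signal(y)
frames = drop_silent_frames(all_frames)
zcr = zero_crossing_rate(frames)
centroid = spectral_centroid(frames, sr)

print("frames kept:", frames.shape[0], "of", all_frames.shape[0])
print("median ZCR:", float(np.median(zcr)))
print("median spectral centroid (Hz):", float(np.median(centroid)))
```

For a pure tone the centroid sits near the tone's frequency and the ZCR near twice that frequency divided by the sample rate. Real speech spreads energy across many frequencies, which is why per-recording summaries such as the mean and variance of these frame-wise values are the kind of quantities the abstract compares.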

Published

2026-05-30