SpeeCheck: Self-Contained Speech Integrity Verification via Embedded Acoustic Fingerprints

Anonymous Authors

SpeeCheck system overview

Abstract

Advances in audio editing have made public speeches increasingly vulnerable to malicious tampering, raising concerns for social trust. Existing speech tampering detection methods remain insufficient: they often rely on external references or fail to balance sensitivity to attacks with robustness against benign operations such as compression. To tackle these challenges, we propose SpeeCheck, the first self-contained speech integrity verification framework. SpeeCheck can (i) effectively detect tampering attacks, (ii) remain robust under benign operations, and (iii) enable direct verification without external references. Our approach first uses multiscale feature extraction to capture speech features across different temporal resolutions. It then employs contrastive learning to generate fingerprints that can detect modifications at varying granularities. These fingerprints are designed to be robust to benign operations but to change significantly when malicious tampering occurs. To enable self-contained verification, the fingerprints are embedded into the audio itself as a watermark. Finally, during verification, SpeeCheck re-extracts the fingerprint from the audio and compares it with the fingerprint recovered from the embedded watermark to assess integrity. Extensive experiments demonstrate that SpeeCheck reliably detects tampering while maintaining robustness against common benign operations. Real-world evaluations further confirm its effectiveness in verifying speech integrity.
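
The final comparison step can be illustrated with a minimal sketch. It assumes that the fingerprint recomputed from the audio and the one decoded from the watermark are both available as equal-length bit sequences, and that they are compared by Hamming distance against a threshold, as the examples on this page suggest. The function names, the toy 8-bit fingerprints, and the threshold of 2 are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length bit sequences differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_intact(recomputed: Sequence[int], embedded: Sequence[int], threshold: int) -> bool:
    # Assumption: a distance at or below the threshold counts as intact.
    return hamming_distance(recomputed, embedded) <= threshold


# Toy 8-bit fingerprints (illustrative only; real fingerprints are longer).
embedded = [1, 0, 1, 1, 0, 0, 1, 0]            # decoded from the watermark
after_compression = [1, 0, 1, 1, 0, 1, 1, 0]   # benign change: one bit differs
after_tampering = [0, 1, 0, 1, 1, 1, 0, 0]     # malicious change: six bits differ

print(is_intact(after_compression, embedded, threshold=2))
print(is_intact(after_tampering, embedded, threshold=2))
```

The design goal stated in the abstract maps onto this check directly: benign operations should flip few bits and stay under the threshold, while tampering should flip many and exceed it.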

Real-world dataset examples

Original Audio (Audio1)

Transcript: "The board has decided they can not approve the new budget."

Hamming Distance: 12
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio1, Deletion)

Transcript: "The board has decided they can not approve the new budget."

Hamming Distance: 131
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio2)

Transcript: "Our analysis shows this investment is not a secure option."

Hamming Distance: 6
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio2, Silencing)

Transcript: "Our analysis shows this investment is not a secure option."

Hamming Distance: 107
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio3)

Transcript: "Based on the evidence, the suspect is innocent."

Hamming Distance: 1
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio3, Substitution)

Transcript: "Based on the evidence, the suspect is guilty."

Hamming Distance: 123
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio4)

Transcript: "Based on the evidence, the suspect is guilty."

Hamming Distance: 22
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio4, Substitution)

Transcript: "Based on the evidence, the suspect is innocent."

Hamming Distance: 118
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio5)

Transcript: "I never said she stole the company's data."

Hamming Distance: 11
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio5, Reordering)

Transcript: "She stole the company's data, I never said."

Hamming Distance: 128
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio6)

Transcript: "We will begin the product launch immediately."

Hamming Distance: 15
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio6, Substitution)

Transcript: "We will delay the product launch immediately."

Hamming Distance: 100
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio7)

Transcript: "We will delay the product launch immediately."

Hamming Distance: 9
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio7, Substitution)

Transcript: "We will begin the product launch immediately."

Hamming Distance: 121
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio8)

Transcript: "I believe it's a good idea, but we need more time."

Hamming Distance: 2
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio8, Splicing)

Transcript: "I never said I believe it's a good idea, but we need more time."

Hamming Distance: 125
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio9)

Transcript: "I never said she stole the company's data."

Hamming Distance: 15
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio9, Text-to-Speech)

Transcript: "This is authentic audio, not deepfake."

Hamming Distance: 135
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio10)

Transcript: "I never said she stole the company's data."

Hamming Distance: 21
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio10, Voice Conversion)

Transcript: "I never said she stole the company's data."

Note: Voice Timbre Changed

Hamming Distance: 126
(Threshold: 42 → Verdict: TAMPERED)
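
The verdicts listed above appear to follow a simple threshold rule on the Hamming distance. The sketch below re-applies that rule to the distances reported on this page. The `verdict` function and the handling of a distance exactly equal to the threshold are assumptions, since no example here falls on the boundary.

```python
THRESHOLD = 42  # threshold reported in the examples above

# (original distance, tampered distance) per example, copied from this page.
REPORTED = {
    "Audio1": (12, 131),
    "Audio2": (6, 107),
    "Audio3": (1, 123),
    "Audio4": (22, 118),
    "Audio5": (11, 128),
    "Audio6": (15, 100),
    "Audio7": (9, 121),
    "Audio8": (2, 125),
    "Audio9": (15, 135),
    "Audio10": (21, 126),
}


def verdict(distance: int, threshold: int = THRESHOLD) -> str:
    # Assumption: a distance equal to the threshold is treated as LEGIT.
    return "LEGIT" if distance <= threshold else "TAMPERED"


for name, (original, tampered) in REPORTED.items():
    print(f"{name}: original={verdict(original)}, tampered={verdict(tampered)}")

# Margin between the two groups across the listed examples.
max_original = max(original for original, _ in REPORTED.values())
min_tampered = min(tampered for _, tampered in REPORTED.values())
print(f"largest original distance: {max_original}")
print(f"smallest tampered distance: {min_tampered}")
```

Across the ten examples listed, the largest distance for an original clip is 22 and the smallest for a tampered clip is 100, so the threshold of 42 separates the two groups.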