VeriSpeak SDK - Standard Edition
Voice / Speaker identification for PC and Web solutions
VeriSpeak voice identification technology is intended for biometric systems developers and integrators. The text-dependent speaker recognition algorithm assures system security by checking both voice and phrase authenticity. Voiceprint templates can be matched in 1-to-1 (verification) and 1-to-many (identification) modes.
VeriSpeak is available as a software development kit that allows development of PC- and Web-based solutions on Microsoft Windows, Linux and Mac OS X platforms.
Advantages of VeriSpeak
- Text-dependent algorithm prevents unauthorized access with a covertly recorded user voice.
- Two-factor authentication by checking voice biometrics and pass phrase authenticity.
- Regular microphones are suitable for recording user voices.
- Available as a multiplatform SDK that supports multiple programming languages.
- Reasonable prices, flexible licensing and free customer support.
VeriSpeak Algorithm Features and Capabilities
All performance tests were made on Intel Core2 processor with 4 cores running at 2.66 GHz.
VeriSpeak is a PC-based speaker recognition algorithm designed for biometric system integrators. The VeriSpeak algorithm implements voice enrollment and voiceprint matching using proprietary sound processing technologies:
- Text-dependent algorithm. The text-dependent speaker recognition is based on saying the same phrase for enrollment and verification. The VeriSpeak algorithm determines if a voice sample matches the template that was extracted from a specific phrase. During enrollment, one or more phrases are requested from the person being enrolled. Later that person may be asked to pronounce a specific phrase for verification. This method assures protection against the use of a covertly recorded random phrase from that person.
- Two-factor authentication with a passphrase. The VeriSpeak voiceprint matching algorithm can be configured to work in a scenario where each user records a unique phrase (such as a passphrase or an answer to a "secret question" that is known only by the person being enrolled). Later a person is recognized by his or her own specific phrase with a high degree of accuracy. The overall system security increases as both voice authenticity and passphrase are checked.
- Liveness detection. A system may request each user to enroll a set of unique phrases. Later the user will be requested to say a specific phrase from the enrolled set. This way the system can ensure that a live person is being verified (as opposed to an impostor who uses a voice recording).
- Identification capability. VeriSpeak functions can be used in 1-to-1 matching (verification) and 1-to-many (identification) modes.
- Multiple samples of the same phrase. A template may store several voice records with the same phrase to improve recognition reliability. Certain natural voice variations (e.g. a hoarse voice) or environment changes (e.g. office and outdoors) can be stored in the same template.
- Fused matching. A system may ask users to pronounce several specific phrases during speaker verification or identification and match each audio sample against records in the database. The VeriSpeak algorithm can fuse the matching results for each phrase together to improve matching reliability.
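As an illustration only, the liveness challenge and fused matching described above could be orchestrated along the following lines. The function names `pick_challenge` and `fuse_scores`, the example scores, the averaging rule and the threshold are all assumptions made for this sketch. They are not part of the VeriSpeak API, and the fusion rule actually used by the algorithm is not documented here.

```python
import random
from statistics import mean


def pick_challenge(enrolled_phrases, count, rng=random):
    """Choose `count` distinct enrolled phrases to request from the user.

    Asking for phrases the user cannot predict is the liveness idea above.
    """
    if count > len(enrolled_phrases):
        raise ValueError("cannot request more phrases than were enrolled")
    return rng.sample(list(enrolled_phrases), count)


def fuse_scores(phrase_scores, threshold):
    """Fuse per-phrase matching scores into one accept/reject decision.

    ASSUMPTION: scores are simply averaged and a higher score means a
    better match; the real fusion rule may differ.
    """
    if not phrase_scores:
        raise ValueError("at least one phrase score is required")
    fused = mean(phrase_scores.values())
    return fused, fused >= threshold


if __name__ == "__main__":
    enrolled = ["phrase A", "phrase B", "phrase C", "phrase D"]
    challenge = pick_challenge(enrolled, 2)
    # Hypothetical scores that a matcher would return for each phrase.
    scores = {phrase: 60.0 for phrase in challenge}
    print(challenge, fuse_scores(scores, threshold=48.0))
```

In a security-sensitive deployment, passing `random.SystemRandom()` as `rng` would make the requested phrases harder to predict than the default generator.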
VeriSpeak Technology and SDK
- VeriSpeak Standard and Extended SDK. VeriSpeak Standard SDK is intended for PC-based biometric application development and the Extended SDK is suitable for developing Web-based biometric systems. Both SDKs include Voice Matcher and Extractor components; the Extended SDK also includes the Voice Client component.
- Recommendations and constraints for speaker recognition. VeriSpeak has certain requirements for microphone settings and position, as well as user behavior and environment. A passphrase should be kept secret and not pronounced in an environment where other people may hear it.
- System requirements. VeriSpeak-based software can be run on computers with x86 compatible processors (at least 2 GHz processor recommended). Windows, Linux and Mac OS X platforms are supported. A regular microphone is suitable for voice capture.
- Technical specifications. VeriSpeak extracts a template in 0.07 seconds and can match up to 88 voiceprints per second, when 5-second long voice samples are used. A single voiceprint template requires about 6.5 kilobytes.
- Reliability and performance tests. The VeriSpeak template extraction and matching algorithm has been tested with voice samples from the XM2VTS database and Neurotechnology internal databases.
- Download. A VeriSpeak 30-day SDK Trial is available for downloading.
Contents of VeriSpeak Standard and Extended SDK
VeriSpeak SDK is based on VeriSpeak PC-based voice recognition
technology and is intended for biometric systems developers and
integrators. The SDK allows rapid development of biometric applications
using functions from the VeriSpeak algorithm. VeriSpeak can be easily
integrated into the customer's security system. The integrator has
complete control over SDK data input and output.
VeriSpeak is available as the following SDKs:
- VeriSpeak Standard SDK is intended for PC-based
biometric application development. It includes Voice Matcher and
Extractor component licenses, programming samples and tutorials and
software documentation. The SDK enables the development of biometric
applications for Microsoft Windows, Linux or Mac OS X operating
systems.
- VeriSpeak Extended SDK is intended for biometric
Web-based and network application development. It
includes all features and components of the Standard SDK with the
addition of Voice Client component licenses, sample client
applications, tutorials and a ready-to-use matching server
component.
The table below compares VeriSpeak Standard SDK and VeriSpeak
Extended SDK. See the licensing
model for more information on specific license types.
Component licenses included with a specific SDK

| Component type | VeriSpeak Standard SDK | VeriSpeak Extended SDK |
|---|---|---|
| Voice Matcher | 1 single computer license | 1 single computer license |
| Voice Client | | 3 single computer licenses and 1 concurrent license |
| Voice Extractor | 1 single computer license | 1 single computer license |
| Matching Server | | + |
VeriSpeak SDK includes programming samples and
tutorials that show how to use the components of the SDK to perform
voice template extraction or matching against other templates. The
samples and tutorials are available for these programming languages and
platforms:
| | Windows 32 & 64 bit | Linux 32 & 64 bit | Mac OS X |
|---|---|---|---|
| Programming samples | | | |
| C/C++ | + | + | + |
| C# | + | | |
| Sun Java 2 | + | + | + |
| Visual Basic .NET | + | | |
| Delphi | + | | |
| Programming tutorials | | | |
| C | + | + | + |
| C# | + | | |
| Visual Basic .NET | + | | |
| Sun Java 2 | + | + | + |
| Delphi | + | | |
Basic Recommendations for Speaker Recognition
The speaker recognition accuracy of VeriSpeak and MegaMatcher
depends on the audio quality during enrollment and identification.
Certain constraints should be noted before or during algorithm
integration into a speaker recognition system, whereas others can be
overcome by enrollment with the same phrase in different environments.
Voice samples at least 2 seconds long are recommended
to assure recognition quality.
General Security
If the speaker recognition system is used in a scenario with unique
phrases for each user, each passphrase should be kept secret and not
pronounced in an environment where other people may hear it.
Microphones
There are no particular constraints on models or
manufacturers when using regular PC microphones, headsets or built-in
laptop microphones. However, these factors should be noted:
- The same microphone model is recommended (if
possible) for use during both enrollment and recognition, as different
models can produce different sound quality. Also some models may
introduce specific noise or distortion into the audio, or may include
certain hardware sound processing, which will not be present when using
a different model.
- The same microphone position and distance is
recommended during enrollment and recognition. Headsets
provide optimal distance between user and microphone; this distance is
recommended when non-headset microphones are used.
- Web cam built-in microphones should be used with
care, as they are usually positioned at a rather long
distance from the user and may provide lower sound quality. The sound
quality may be affected if users change their position relative to the
web cam.
Sound Settings
Recording settings must deliver clean, unprocessed sound, as some
audio software, hardware or drivers have sound modification
enabled by default. For example, Microsoft Windows usually has
sound boost enabled by default.
A sampling rate of at least 11,025 Hz with at least 16-bit
depth should be set for voice recording.
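A recording can be checked against these recommendations before it is passed on for enrollment. The sketch below uses only Python's standard `wave` module and assumes uncompressed PCM WAV input; the function name `check_recording` is hypothetical, and the 2-second minimum length comes from the general recommendation earlier in this section.

```python
import wave

MIN_SAMPLE_RATE_HZ = 11025   # recommended minimum sampling rate
MIN_SAMPLE_WIDTH_BYTES = 2   # 16-bit depth
MIN_DURATION_S = 2.0         # recommended minimum voice sample length


def check_recording(path):
    """Return a list of problems found in a PCM WAV file (empty if none)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        duration = wav.getnframes() / rate
    problems = []
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sampling rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if width < MIN_SAMPLE_WIDTH_BYTES:
        problems.append(f"bit depth {8 * width} is below 16 bits")
    if duration < MIN_DURATION_S:
        problems.append(f"duration {duration:.2f} s is below {MIN_DURATION_S} s")
    return problems
```

This only inspects the file header. It cannot detect sound boost or other processing applied by drivers, which still has to be disabled in the system sound settings.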
Environment Constraints
The VeriSpeak and MegaMatcher speaker recognition
algorithm is sensitive to background noise or loud
voices in the background that may interfere
with the user's voice and affect the recognition results. These
solutions may be considered to reduce or eliminate these problems:
- A silent environment for enrollment and
recognition.
- Several samples of the same phrase recorded in
different environments can be stored in a biometric template. Later the
user will be matched against these samples with much higher recognition
quality.
- Close-range microphones (like those in headsets)
that are not affected by distant sources of sound.
- Third-party or custom solutions for background noise reduction,
like using two separate microphones for recording user voice and
background sound, and later subtracting the background noise from the
recording.
User Behavior and Voice Changes
These natural voice changes do not occur often but
may affect speaker recognition accuracy:
- A temporarily hoarse voice caused by a cold or
other sickness
- Different emotional states that affect voice
(e.g. cheerful voice versus tired voice)
- Different pronunciation speeds during enrollment
and identification
The aforementioned voice and user behavior changes
can be managed in two ways:
- Separate enrollments for the altered voice, with
the records stored in the same person's template;
- Controlled neutral voice during enrollment and
identification.
VeriSpeak SDK System Requirements
- PC or Mac with an x86
(32-bit) or x86-64 (64-bit) compatible
processor. A 2 GHz or faster processor is recommended.
- At least 128 MB of free RAM should be available
for the application. Additional RAM is required for applications that
perform 1-to-many identification, as all biometric templates need to be
stored in RAM for matching. For example, 1,000 templates
(each containing 1 voice record) require about 6 MB of
additional RAM.
- Free space on hard disk drive (HDD):
- at least 1 GB required for the development.
- 100 MB required for VeriSpeak components deployment.
- Additional space would be required in these cases:
- VeriSpeak does not require the original voice sample to
be stored for matching; only the templates need to be stored.
However, storing voice samples on the hard drive for potential future
use is recommended.
- Usually a database engine runs on a separate computer
(back-end server). However, the database engine can be installed on the
same computer for standalone applications. In this case HDD space for
template storage must be available. For example, 10,000 templates
(each containing 1 voice record extracted from a 5-second long sample)
stored using a relational database would require about 64 MB
of free HDD space. Also, the database engine itself requires HDD space
for running. Please refer to the HDD space requirements from the database
engine providers.
- Microphone. Any microphone that is supported by
the operating system can be used.
- Database engine or connection with it. VeriSpeak
templates can be saved into any DB (including files) supporting binary
data saving. VeriSpeak Extended SDK contains the following support
modules for Matching Server:
- Microsoft SQL Server (only for Microsoft Windows platform);
- PostgreSQL (for Microsoft Windows and Linux platform);
- MySQL (for Microsoft Windows and Linux platforms);
- Oracle (for Microsoft Windows and Linux platforms);
- SQLite (for all platforms).
- Network/LAN connection (TCP/IP) for
client/server applications. A network connection is also required for
using the Matching Server component (included in VeriSpeak Extended SDK).
Communication with the Matching Server is not encrypted. Therefore, if
communication must be secured, a dedicated network (not accessible
outside the system) or a secured network (such as a VPN, which must be
configured using the operating system or third-party tools) is recommended.
- Microsoft Windows specific requirements:
- Microsoft Windows XP/Vista/7, 32-bit or 64-bit.
- Microsoft .NET framework 2.0 or newer (for .NET components
usage).
- One of following development environments for application
development:
- Microsoft Visual Studio 2005 SP1 or newer (for
application development under C/C++, C#, Visual Basic .Net)
- Sun Java 1.5 SDK or later
- Delphi 7
- Linux specific requirements:
- Linux 2.6 or newer kernel, 32-bit or 64-bit.
- glibc 2.7 or newer
- GStreamer 0.10.23 (with gst-plugin-base and gst-plugin-good)
or newer (for voice capture using microphone)
- libasound 1.0.x or newer (for voice capture)
- udev-143 or newer with libudev (for microphone usage)
- GTK+ 2.10.x or newer libs and dev packages (to run SDK
samples and applications based on them)
- GCC-4.0.x or newer (for application development)
- GNU Make 3.81 or newer (for application development)
- Sun Java 1.5 SDK or later (for application development with
Java)
- pkg-config-0.21 or newer (optional; only for Matching Server
database support modules compilation)
- Mac OS X specific requirements:
- Mac OS X (version 10.4 or newer)
- XCode 2.4 or newer (for application development)
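The database requirement above states that templates can be saved in any database that supports binary data. As a minimal sketch of that idea, the following uses Python's standard `sqlite3` module to store templates as BLOBs. The table name `voice_templates`, its columns and the helper names are assumptions made for illustration, and the template bytes are placeholders for data that would come from the SDK's extractor, which is not shown here.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a template store backed by SQLite."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS voice_templates ("
        "subject_id TEXT PRIMARY KEY, template BLOB NOT NULL)"
    )
    return conn


def save_template(conn, subject_id, template_bytes):
    """Insert or overwrite the template stored for one subject."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO voice_templates VALUES (?, ?)",
            (subject_id, template_bytes),
        )


def load_all_templates(conn):
    """Load every template into memory, as 1-to-many matching requires."""
    rows = conn.execute("SELECT subject_id, template FROM voice_templates")
    return {subject_id: bytes(blob) for subject_id, blob in rows}


if __name__ == "__main__":
    conn = open_store()
    save_template(conn, "alice", b"\x01\x02\x03")  # placeholder bytes
    print(load_all_templates(conn))
```

The same pattern should carry over to the other listed engines, since each offers a binary column type, although the SQL dialect and driver would differ.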
Technical Specifications
All specifications are given for
Intel Core2 processor with 4 cores running at 2.66 GHz.
At least 11,025 Hz sampling rate with at least 16-bit
depth should be configured during voice recording.
At least 2-second long voice samples are recommended
to assure recognition quality. Longer voice samples will improve the
recognition quality.
See also the whole list of recommendations and constraints for
speaker recognition.
All voice templates should be loaded into RAM before identification,
thus the maximum voice template database size is limited by the amount
of available RAM.
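A rough capacity estimate can be derived from the figures given in the system requirements: at least 128 MB of free RAM for the application and about 6 MB of additional RAM per 1,000 single-record templates. The sketch below assumes that memory use grows linearly with the number of voice records, which the requirements imply but do not state exactly; the function names are hypothetical.

```python
BASE_RAM_MB = 128            # free RAM recommended for the application
RAM_MB_PER_1000_RECORDS = 6  # approximate figure from the system requirements


def estimate_ram_mb(template_count, records_per_template=1):
    """Approximate RAM (MB) needed to hold all templates for identification."""
    records = template_count * records_per_template
    return BASE_RAM_MB + records * RAM_MB_PER_1000_RECORDS / 1000


def max_templates(available_ram_mb, records_per_template=1):
    """Approximate number of templates that fit into the given free RAM."""
    spare_mb = available_ram_mb - BASE_RAM_MB
    if spare_mb <= 0:
        return 0
    return int(spare_mb * 1000 / (RAM_MB_PER_1000_RECORDS * records_per_template))


if __name__ == "__main__":
    print(estimate_ram_mb(100_000))  # 100,000 single-record templates
    print(max_templates(1024))       # templates that fit into 1 GB of free RAM
```

These are planning estimates only. Actual memory use depends on voice sample length and on how many records each template holds.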
The VeriSpeak voice template matching algorithm can use more than
one processor core on multi-core processors, enabling
an increase in template matching speed. The template matching speeds in
the table below are given as a range, where the smaller number means
matching speed using 1 processor core, while the
larger number means matching speed using all 4 processor cores.
The specifications below depend linearly on
the voice sample length. For example, when using voice samples
that are 2 times shorter, template extraction will take 2 times less
time, the extracted template will be 2 times smaller, and the templates
will be matched 2 times faster.
VeriSpeak algorithm technical specifications (for 5-second long voice samples)

| Specification | Value |
|---|---|
| Voice template extraction time (milliseconds) | 60 - 90 |
| Matching speed (voiceprints per second) | 22 - 88 |
| Template size in database (1) (kilobytes) | 6.0 - 6.8 |

(1) When 1
voiceprint record is stored in a template. Template size increases
proportionally when multiple voiceprint records are stored in the same
template.
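The linear dependence on sample length can be applied to the 5-second figures to estimate values for other sample lengths. The sketch below assumes the scaling is exactly linear and that identification compares the sample against every template in the database; the function names are hypothetical, and the results are estimates rather than measured values.

```python
BASELINE_SAMPLE_S = 5.0          # sample length the specifications refer to
EXTRACTION_MS = (60.0, 90.0)     # template extraction time range
MATCHES_PER_S = (22.0, 88.0)     # 1 processor core .. all 4 cores
TEMPLATE_KB = (6.0, 6.8)         # template size range, 1 record per template


def scaled_specs(sample_seconds):
    """Scale the 5-second specifications linearly to another sample length."""
    k = sample_seconds / BASELINE_SAMPLE_S
    return {
        "extraction_ms": tuple(v * k for v in EXTRACTION_MS),
        "template_kb": tuple(v * k for v in TEMPLATE_KB),
        # Shorter samples give smaller templates, which match faster.
        "matches_per_s": tuple(v / k for v in MATCHES_PER_S),
    }


def identification_time_s(database_size, sample_seconds):
    """Seconds to scan the whole database: (using 4 cores, using 1 core)."""
    one_core, four_cores = scaled_specs(sample_seconds)["matches_per_s"]
    return database_size / four_cores, database_size / one_core


if __name__ == "__main__":
    print(scaled_specs(2.5))
    print(identification_time_s(1000, 5.0))
```

Under these assumptions, a full scan of 1,000 templates takes between roughly 11 and 45 seconds depending on the number of cores used. This suggests VeriSpeak identification is better suited to small databases.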
Reliability and Performance Tests
All tests were performed on an Intel Core2 processor with 4 cores running at
2.66 GHz.
The VeriSpeak algorithm has been tested with voice samples
taken from the XM2VTS
Database, as well as with voice samples from Neurotechnology
internal database.
These voice template extraction and matching
experiments were performed:
- Experiment 1 – used voice samples from the
XM2VTS database. All samples include the same
phrase pronounced by all subjects.
- Experiment 2 – used voice samples from
Neurotechnology internal voice database 1. All samples include the same
phrase pronounced by all subjects.
- Experiment 3 – used voice samples from
Neurotechnology internal voice database 2. Each subject pronounced a unique
phrase during his/her recording.
Template matching was performed using all 4 cores
of the processor.
Receiver operating characteristic (ROC)
curves are usually used to demonstrate the recognition quality of an
algorithm. ROC curves show the dependence of the false rejection rate (FRR)
on the false acceptance rate (FAR). ROC charts are provided for both databases:
[Figure: ROC chart for Experiment 1]
[Figure: ROC chart for Experiments 2 and 3]
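For readers unfamiliar with these measures, the sketch below shows how FAR and FRR are typically computed from matching scores at a given decision threshold, and how sweeping the threshold yields the points of an ROC curve. It assumes that a higher score means a better match and that a comparison is accepted when its score reaches the threshold. The scores are made-up illustration values, not VeriSpeak test data.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) at one threshold.

    FAR: share of impostor comparisons wrongly accepted.
    FRR: share of genuine comparisons wrongly rejected.
    """
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr


def roc_points(genuine_scores, impostor_scores):
    """Return (threshold, FAR, FRR) for every distinct observed score."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    return [(t, *far_frr(genuine_scores, impostor_scores, t)) for t in thresholds]


if __name__ == "__main__":
    genuine = [0.9, 0.8, 0.4]         # same-speaker comparisons (made up)
    impostor = [0.1, 0.5, 0.3, 0.2]   # different-speaker comparisons (made up)
    for threshold, far, frr in roc_points(genuine, impostor):
        print(f"threshold={threshold:.1f}  FAR={far:.2f}  FRR={frr:.2f}")
```

Raising the threshold lowers FAR at the cost of a higher FRR, which is the trade-off an ROC curve visualizes.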
VeriSpeak algorithm tests with XM2VTS and Neurotechnology internal databases

| | Exp. 1 | Exp. 2 | Exp. 3 |
|---|---|---|---|
| Total voice samples in the database | 2360 | 309 | 305 |
| Subjects in the database | 295 | 42 | 42 |
| Recording sessions per subject | 8 | 1 - 10 | 1 - 10 |
| Average voice sample length (seconds) | 7.112 | 4.975 | 6.214 |
| Average template extraction time (milliseconds) | 132 | 56 | 69 |
| Average voiceprint template size (kilobytes) | 8.7 | 6.7 | 7.8 |
| Template matching speed (voiceprints per second) | 64 | 136 | 88 |