I'm actually seeing drastically worse performance (i.e. it's slower) when moving from 3.33 to 3.34 or later in a persistent Perl process. The initial DESTROY phase is a lot faster, but this seems to come at the expense of later object creation.
To demonstrate this, I stored a large result set from our system (a graph structure with tens of thousands of nodes) in a Cache::FileCache. The test script retrieves the structure ('GET'), undefs it ('CLEAN'), and then repeats the cycle.
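The test script itself isn't included here, so the following is only a rough sketch of the GET/CLEAN cycle described above, under stated assumptions. A synthetic tree of plain hashes (the hypothetical build_graph) stands in for our real result set, and a Storable freeze/thaw round trip stands in for the Cache::FileCache get(). Benchmark's :hireswallclock tag is what produces fractional wallclock figures like the ones below. Plain hashes have no DESTROY method, so this harness only shows the timing method, not the regression itself; reproducing the problem would require freezing real objects from the affected module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:hireswallclock timediff timestr);
use Storable qw(freeze thaw);

# Hypothetical stand-in for the cached result set: $n_nodes hash-based
# nodes arranged as a binary tree (node $i points at nodes 2i+1 and 2i+2).
sub build_graph {
    my ($n_nodes) = @_;
    my @nodes = map { +{ id => $_, children => [] } } 0 .. $n_nodes - 1;
    for my $i (0 .. $n_nodes - 1) {
        for my $j ($i * 2 + 1, $i * 2 + 2) {
            push @{ $nodes[$i]{children} }, $nodes[$j] if $j < $n_nodes;
        }
    }
    return { root => $nodes[0], nodes => \@nodes };
}

# Run one block of code and print its timing in Benchmark's timestr format.
sub timed {
    my ($label, $code) = @_;
    my $t0 = Benchmark->new;
    $code->();
    my $t1 = Benchmark->new;
    print "$label: ", timestr(timediff($t1, $t0)), "\n";
}

# Assumption: a frozen blob in memory replaces the Cache::FileCache entry.
my $frozen = freeze(build_graph(50_000));

for my $attempt (0 .. 2) {
    print "$attempt :\n";
    my $graph;
    timed('GET',   sub { $graph = thaw($frozen) });   # rebuild the structure
    timed('CLEAN', sub { undef $graph });             # drop it, triggering destruction
}
```

To time the real retrieval path instead, the thaw($frozen) call would be replaced by a get() on the Cache::FileCache instance, using whatever key the data was stored under.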
Version 3.33
Init Cache: 0.000522137 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Retrival attempt:
0 :
GET: 3.16013 wallclock secs ( 2.91 usr + 0.04 sys = 2.95 CPU)
CLEAN: 32.3321 wallclock secs (30.80 usr + 0.01 sys = 30.81 CPU)
1 :
GET: 3.18668 wallclock secs ( 2.98 usr + 0.01 sys = 2.99 CPU)
CLEAN: 33.8066 wallclock secs (31.85 usr + 0.03 sys = 31.88 CPU)
2 :
GET: 3.33551 wallclock secs ( 3.09 usr + 0.02 sys = 3.11 CPU)
CLEAN: 34.509 wallclock secs (31.89 usr + 0.02 sys = 31.91 CPU)
Version 3.34 and later (same pattern up to 3.36)
Init Cache: 0.000521898 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Retrival attempt:
0 :
GET: 3.04257 wallclock secs ( 2.90 usr + 0.04 sys = 2.94 CPU)
CLEAN: 0.754264 wallclock secs ( 0.72 usr + 0.00 sys = 0.72 CPU)
1 :
GET: 91.0177 wallclock secs (85.86 usr + 0.04 sys = 85.90 CPU)
CLEAN: 0.805352 wallclock secs ( 0.73 usr + 0.00 sys = 0.73 CPU)
2 :
GET: 81.2487 wallclock secs (76.66 usr + 0.05 sys = 76.71 CPU)
CLEAN: 0.793248 wallclock secs ( 0.75 usr + 0.00 sys = 0.75 CPU)
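As you can see, version 3.33 has very slow destruction (about 32 to 35 seconds per CLEAN), but the runs are predictable. In 3.34 and later (tested up to the current 3.36), destruction is much faster (under a second per CLEAN), but at the expense of predictability: only the first GET takes about 3 seconds, while subsequent GETs take 80 to 90 seconds. Over the three cycles, the net wallclock time rises from roughly 110 seconds to roughly 178 seconds.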
Devel::DProf segfaults, and Devel::Profiler currently produces an invalid tmon.out for some reason, so I haven't been able to profile this any further yet. A sample dataset and the test script are available if needed.