Thread

Posted on Wed Jan 17 16:12:19 2007 by pete1234
Memory consumption.

Hi, thanks for providing this module, which works quite well. I have a requirement to parse some 25,000 spreadsheets. Due to what I can only see as memory leak in parseexcel it will run out of memory in a 1GB machine after about 300 files. (Test case is by simply calling the parse method and moving to the next file). The memory is freed at the end of the script execution, but builds steadily at about 2mb per file parsed. I can't run it in batches as I need to do some unique identification of the data. (I'm not doing this bit at the moment, its not the problem).

Please assist me if I'm being stupid, this isn't my area of expertise, I don't see an obvious close method. I've hacked in the Devel::leak module into a test script and it does indeed seem to read and leak memory for every cell in the spreadsheet. (I can observe what I think are leaked objects with random strings like "PackedIdx", "Spreadsheet::ParseExcel::Cell", and sometimes my input data). I've tried some workarounds, like to =(); nullify my objects on the script side and also tried the weaken() methods, without avail.

Are there any known workarounds for processing large sets of sheets? Thanks for your help.
Direct Responses: 4065 | Write a response
Posted on Wed Jan 17 17:01:37 2007 by jmcnamara in response to 4064
Re: Memory consumption.
Direct Responses: 4127 | Write a response
Posted on Tue Jan 23 12:10:50 2007 by pete1234 in response to 4065
Re: Memory consumption.

Thanks very much for your excellent links. The cell handler approach allowed about ten times the amount of files to be processed. (but this still isn't nearly enough).

I think the fork solution could work, but the code is non trivial already enough to make me wary about doing that. I think I'm going to have to convert them to csv using OLE, and then run the main program on that input.

Is there any work going on to implement the suggestion in the second link to weaken/destroy the stuctures?

Thanks again for your help.
Direct Responses: 4187 | Write a response
Posted on Mon Jan 29 17:03:47 2007 by jmcnamara in response to 4127
Re: Memory consumption.

Spreadsheet::ParseExcel isn't my module but I know a little bit about how it works.

I had a look at trying to weaken() the links internally but it looked a bit tricky to figure out all of the circular links.

Instead, I wrote a small function to take an ParseExcel object created by theSpreadsheet::ParseExcel::Workbook::Parse() method and to free the circular links manually. Here it is:

############################################################################### # # free_parser() # # Break the circular references in the Spreadsheet::ParseExcel object so the # memory can be garbage collected by perl. The circular references are caused # by links to the parent from the child objects. # sub free_parser { my $parser = $_[0]; # Free the parent _Book and Cell Format references. for my $worksheet ( @{ $parser->{Worksheet} } ) { $worksheet->{_Book} = undef; for my $cell ( @{ $worksheet->{Cells} } ) { for my $obj ( @{$cell} ) { my $type = ref $obj; next unless $type; if ( $type eq 'Spreadsheet::ParseExcel::Cell' ) { $obj->{Format} = undef; } } } } # Free the Font refs from Format for my $format ( @{ $parser->{Format} } ) { $format->{Font} = undef; } # Free the parent refs in the callback tables. for my $key ( keys %{ $parser->{_Excel}->{FuncTbl} } ) { $parser->{_Excel}->{FuncTbl}->{$key} = undef; } $parser = undef; }

You can use it as follows.

... my $excel = Spreadsheet::ParseExcel::Workbook->Parse($file); foreach my $sheet ( @{ $excel->{Worksheet} } ) { # Do some work here. } # Finished parsing current file. free_parser($excel); ...

I tested it with a small number of Excel files of different types but there may still be some links that aren't freed. Let me know how you get on.


John.
--
Direct Responses: 4188 | Write a response
Posted on Mon Jan 29 17:35:25 2007 by pete1234 in response to 4187
Re: Memory consumption.

Thanks very much, I'm actually running the process at with the csv idea. It'll run for a day or so, but I'll be sure to try this and see how it performs as I'd rather not have the extra conversion step long term.

Cheers.
Direct Responses: 4225 | Write a response
Posted on Fri Feb 2 12:54:24 2007 by pete1234 in response to 4188
Re: Memory consumption.

Hi John, thanks for your efforts. The function seems to lessen the impact, but it still leaks memory. (and with amount of input, almost any loss is fatal over time). Sample output from Devel::Leak is below, I don't know if it helps. I've had to go down the csv route because of time pressure. Thanks very much anyway.

new 0xa0ac214 : SV = PVHV(0xa0a8930) at 0xa0ac214 REFCNT = 7 FLAGS = (PADBUSY,PADMY,OBJECT,SHAREKEYS) IV = 20 NV = 0 STASH = 0x9c5fb84 "Spreadsheet::ParseExcel::Format" ARRAY = 0xa0ad478 (0:17, 1:12, 2:2, 4:1) hash quality = 88.5% KEYS = 20 FILL = 15 MAX = 31 RITER = -1 EITER = 0x0

new 0xa0ac220 : SV = RV(0xa0a5b38) at 0xa0ac220 REFCNT = 1 FLAGS = (ROK) RV = 0xa0ac034

new 0xa0ac22c : SV = PVAV(0xa0ac554) at 0xa0ac22c REFCNT = 1 FLAGS = () IV = 0 NV = 0 ARRAY = 0xa0ab3a0 FILL = 3 MAX = 3 ARYLEN = 0x0 FLAGS = (REAL)

new 0xa0ac238 : SV = PVAV(0xa0ac4d0) at 0xa0ac238 REFCNT = 1 FLAGS = () IV = 0 NV = 0 ARRAY = 0xa0ab378 FILL = 3 MAX = 3 ARYLEN = 0x0 FLAGS = (REAL)

new 0xa0ac244 : SV = IV(0xa0aa104) at 0xa0ac244 REFCNT = 1 FLAGS = (IOK,pIOK) IV = 64

new 0xa0ac250 : SV = IV(0xa0aa108) at 0xa0ac250 REFCNT = 1 FLAGS = (IOK,pIOK) IV = 64

...
# Test 3 got: "53285" (./dsu_man.pl at line 167)
# Expected: "33706" (Nuber of SVs created unexpected)

Direct Responses: 4269 | Write a response
Posted on Wed Feb 7 01:25:59 2007 by jmcnamara in response to 4225
Re: Memory consumption.

I can't reproduce this but I'm still interested in tracking it down. Could you write to me at jmcnamara at cpan . org.

John.
--
Direct Responses: 4275 | Write a response
Posted on Wed Feb 7 09:43:39 2007 by jdaily in response to 4269
Re: Memory consumption.
I haven't worked with this module in a few years. But when I did, I remember having the same issue. I tweaked the module to store the data in a hash, as oppose to the object that it was using. This worked well enough for my needs. Not sure if you have tried this, but if not, it's worth a shot.
Write a response