Slow Performance adding rows/cells to spreadsheet

Feb 17, 2010 at 3:45 PM

First let me start off on a good note - using this library of wrappers has been very easy.  I was slogging through the OpenXML documentation and while I had something going it took a lot more lines of code than it does using this library.

That said, however, I'm in a situation where I wish to add about 250,000 rows to a spreadsheet.  I'm finding the performance unbearable.  On my laptop, it takes about ~5.2 seconds to write out 1,000 rows, about ~14.8 seconds to 2,000 rows, about ~53 seconds for 3,000 rows, ~113 seconds for 4,000 rows and ~2000 seconds for 5,000.  At there I gave up timing further.  I want to write out about 250,000 rows and save it in under 10 seconds -- or at least under a minute, which is an eternity in itself.

Here's my code snippet - i basically lifted it from one of the samples provided in the library:

            using (var package = SpreadsheetDocumentWrapper.Open(_xlsxFile))
            {
                var part = package.WorkbookPart.WorksheetParts["Sheet1"];

                for (int i = row; i < 5000; i++)
                {
                    part.Worksheet.SetCellValue(new GridReference(i, 0), "TEXT");
                    part.Worksheet.SetCellValue(new GridReference(i, 1), 12.50);
                    part.Worksheet.SetCellValue(new GridReference(i, 2), "TEXT 2");
                    part.Worksheet.SetCellValue(new GridReference(i, 3), "TEXT 3");
                    part.Worksheet.SetCellValue(new GridReference(i, 4), 1.25);
                    part.Worksheet.SetCellValue(new GridReference(i, 5), 25);
                    part.Worksheet.SetCellValue(new GridReference(i, 6), 12500.50);
                    part.Worksheet.SetCellValue(new GridReference(i, 7), 12.12);
                    part.Worksheet.SetCellValue(new GridReference(i, 8), .15);
                    part.Worksheet.SetCellValue(new GridReference(i, 9), .05);
                    part.Worksheet.SetCellValue(new GridReference(i, 10), 123);
                }
            }

This sample is strictly writing out 5,000 records of the same thing - obviously this is just a test snippet - I will be writing out different values in the real world.  But, I'm finding that writing 2,000 rows is very very slow ... It would take hours to save out a mere 250,000 rows.

Is there a different approach I could be using that would achieve usable results?  Or, do I need to go back to using the core OpenXML API's?

thanks,

-tedvz

Coordinator
Feb 18, 2010 at 9:20 AM
Edited Feb 18, 2010 at 9:21 AM

When ExtremeML populates a cell in the manner you have implemented, it needs to go through several steps:

  • check whether the specified row exists and, if necessary, create it
  • check whether the specified cell in the specified row exists and, if necessary, create it

This process first creates the wrapper instances and adds them to the object graph that represents the worksheet data. Then, when the package is disposed, it creates the underlying SDK objects for all the new row and cell objects which, in turn, creates the Xml document elements that represent them. This approach has been chosen because it offers a way to write relatively simple client code, but I admit that it is not especially efficient in terms of resources and is also not optimized for performance. It works well for small spreadsheets but I can see that it might struggle with your scenario.

If you were programming the same operation against the SDK, there would be some gain because you would eliminate all the ExtremeML wrapper objects and you could also tune your algorithm to best suit the way the SDK works. Moreover, the fact that you know, in advance, that all the cell assignments are for new cells in new rows would enable your code to eliminate all the existense tests. However, even with all these potential gains, I would be surprised if resorting to the SDK allows you to achieve a significant improvement, because it too uses wrapper classes that maintain an in-memory model of the underlying Xml document, so it has similar steps to perform when adding new worksheet content.

It would be interesting to see a performance comparison between the two approaches. If your comparison reveals a significant potential gain, I would be happy to look at extending ExtremeML with some kind of bulk insert method, that bypasses all the individual steps.

However, it is also worth noting that some other users of the SDK have experienced severe performance and memory consumption problems:

http://social.msdn.microsoft.com/Forums/en-US/oxmlsdk/thread/1d93e595-2436-47e1-96d9-1cdf14dc52d3/

http://blog.goyello.com/2009/08/25/read-before-using-it-open-xml-sdk-performance-analysis/

 

 

Feb 18, 2010 at 9:25 AM

Hello,

tedvz wrote:This sample is strictly writing out 5,000 records of the same thing - obviously this is just a test snippet - I will be writing out different values in the real world.  But, I'm finding that writing 2,000 rows is very very slow ... It would take hours to save out a mere 250,000 rows.Is there a different approach I could be using that would achieve usable results?  Or, do I need to go back to using the core OpenXML API's?

just a suggestion: would you be able to use a profiler (like http://www.jetbrains.com/profiler/, free trial for 10 days) to get a better idea of what takes time ?

cheers,

-- Thibaut

 

Coordinator
Feb 18, 2010 at 10:04 AM

Thanks Thibaut - that's definitely the quickest way to pinpoint the source of the problem.

Apr 26, 2010 at 9:30 PM

I'm also very excited about ExtremeML and the power it offers! I'm running into the same issues that tedvz is with exporting rather large ADO.NET datasets to Excel. In my case I am dealing with 300 + columns, and only 1000 rows, needless to say a nice chunk of data. Below is the snippet I've come up with (utilizing the bare bones OpenXML API's) to achive a Fast Load using a Sql Data Reader. It would be awesome for ExtremeML to offer a "Fast Load" option, assuming the known risks of column ordinals.

This Method seems to perform very well....but does seem to be a bit of a memory hog...

        private void FastLoadExcelWorksheet(string filePath, string sheetName, SqlDataReader dataReader)
        {
            bool wsExists = WorksheetExists(filePath, sheetName);
            if (!wsExists)
                InsertWorksheet(filePath, sheetName);

           using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, true))
            {
                WorkbookPart wbPart = document.WorkbookPart;

                // Get the sheet reference from the workbook.
                WorksheetPart worksheetPart = InitializeWorksheet(wbPart, sheetName);

                //create header row
                Row hdrRow = new Row();
                worksheetPart.Worksheet.First().AppendChild(hdrRow);

                //Create the header row.
                for (int lFieldIndex = 0; lFieldIndex < dataReader.FieldCount; lFieldIndex++)
                {
                    // create text cell with field header name.
                    string headerCol = dataReader.GetName(lFieldIndex);
                    hdrRow.AppendChild(CreateTextCell(headerCol));
                }

                List<Type> numericTypes = new List<Type>(new Type[] { typeof(short), typeof(int), typeof(long), typeof(float), typeof(double), typeof(decimal), typeof(ushort), typeof(uint), typeof(ulong) });
                while (dataReader.Read())
                {
                    //create data row(s)
                    Row lDataRow = new Row();
                    worksheetPart.Worksheet.First().AppendChild(lDataRow);

                    //dump data with correct data type
                    for (int lFieldIndex = 0; lFieldIndex < dataReader.FieldCount; lFieldIndex++)
                    {
                        object lValue = dataReader[lFieldIndex];
                        if (numericTypes.Contains(dataReader.GetFieldType(lFieldIndex)))
                        {
                            lDataRow.AppendChild(new Cell() { CellValue = new CellValue(lValue.ToString()) });
                        }
                        else
                        {
                            lDataRow.AppendChild(CreateTextCell(lValue.ToString()));
                        }
                    }
                }

                worksheetPart.Worksheet.Save();
                document.WorkbookPart.Workbook.Save();

            }
        }

        private Cell CreateTextCell(string aText)
        {
            //Create a new inline string cell.
            Cell lCell = new Cell();
            lCell.DataType = CellValues.InlineString;
            //Add text to the text cell.
            InlineString lInlineString = new InlineString();
            Text lText = new Text();
            lText.Text = aText;
            lInlineString.AppendChild(lText);
            lCell.AppendChild(lInlineString);
            return lCell;
        }
        private static bool DeleteWorksheet(string filePath, string sheetName)
        {

            using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, true))
            {
                WorkbookPart wbPart = document.WorkbookPart;

                Sheet theSheet = wbPart.Workbook.Descendants<Sheet>().
                  Where(s => s.Name == sheetName).FirstOrDefault();
                if (theSheet == null)
                {
                    // The specified sheet doesn't exist.
                    return false;
                }

                // Remove the sheet reference from the workbook.
                WorksheetPart worksheetPart = (WorksheetPart)(wbPart.GetPartById(theSheet.Id));
                theSheet.Remove();

                // Delete the worksheet part.
                wbPart.DeletePart(worksheetPart);

                // Save the workbook.
                wbPart.Workbook.Save();
            }
            return true;
        }
        private static void InsertWorksheet(string filePath, string sheetName)
        {
            if (WorksheetExists(filePath, sheetName))
            {
                DeleteWorksheet(filePath, sheetName);
            }

            // Open the document for editing.
            using (SpreadsheetDocument spreadSheet = SpreadsheetDocument.Open(filePath, true))
            {
                // Add a blank WorksheetPart.
                WorksheetPart wsPart = spreadSheet.WorkbookPart.AddNewPart<WorksheetPart>();
                wsPart.Worksheet = new Worksheet(new SheetData());
                wsPart.Worksheet.Save();

                Sheets sheets = spreadSheet.WorkbookPart.Workbook.GetFirstChild<Sheets>();
                string relationshipId = spreadSheet.WorkbookPart.GetIdOfPart(wsPart);

                // Get a unique ID for the new worksheet.
                uint sheetId = 1;
                if (sheets.Elements<Sheet>().Count() > 0)
                {
                    sheetId = sheets.Elements<Sheet>().Select(s => s.SheetId.Value).Max() + 1;
                }

                // Give the new worksheet a name.
                if (string.IsNullOrEmpty(sheetName))
                {
                    sheetName = "Sheet" + sheetId;
                }

                // Append the new worksheet and associate it with the workbook.
                Sheet sheet = new Sheet() { Id = relationshipId, SheetId = sheetId, Name = sheetName };
                sheets.Append(sheet);
                spreadSheet.WorkbookPart.Workbook.Save();
            }
        }

        private static bool WorksheetExists(string filePath, string sheetName)
        {
            bool wsExists = false;
            if (string.IsNullOrEmpty(sheetName))
                return wsExists;

            using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, true))
            {

                WorkbookPart wbPart = document.WorkbookPart;

                Sheet theSheet = wbPart.Workbook.Descendants<Sheet>().
                  Where(s => s.Name == sheetName).FirstOrDefault();

                if (theSheet != null)
                {
                    // The specified sheet doesn't exist.
                    wsExists = true;
                }
                return wsExists;
            }
        }
Coordinator
Apr 27, 2010 at 6:32 AM

@jstone923: Many thanks for posting this proposal - it looks like a very feasible solution and is certainly worth some investigation about integrating the approach into the Fill() method.

I will spend some time this week looking at how this might be tackled (possibly using a new method overload) but it would be interesting to see your performance benchmarks for the two approaches.

Thanks again for your suggestion.

Tim 

Jan 16, 2011 at 2:12 PM

This solutions looks good, but i haven't tested it (for an DataTable import with about 8 columns and 45000 rows).

Tim have you dealt with this approach?

Dec 3, 2013 at 4:14 PM
I am facing the same problem with an object array that is approximately 100 columns x 50,000 rows. I was using the Excel interop services, which allows you to populate an entire Excel range from a table with a single call, but has the downside that if your application dies, it will leave Excel running. It also requires Excel to be installed on the target computer. I tried ExtremeML to get around the use of Excel, but having to populate each cell individually takes far too long. Writing this array with Excel takes about a second. The code snippet for writing the data array through Excel interop is:
            public void WriteToExcel(string sheet, string beginCell, object[,] data, string NumberFormat = "")
            {

                // This routine writes a data array into an Excel worksheet in the open workbook. If the Excel
                // Application is not already running start it.
                //
                // The application checks the names of each worksheet in the workbook. If the specified worksheet
                // already exists, it will write the data into that worksheet, overwriting any data already in the sheet.
                // If the worksheet does not exist, it creates the sheet, then writes the data.
                // The data goes into a range stating at "beginCell". The size of the data range is set by the
                // size of the data array.

                if (!xlRunning)
                {
                    StartExcel();
                }

                Worksheet m_xlWorksheet;
                int m_nNumberOfWorksheets;
                int m_nWorksheetIndex;
                m_nNumberOfWorksheets = xlWorkbook.Worksheets.Count;
                m_xlWorksheet = (Worksheet)xlWorkbook.Worksheets[1];
                for (m_nWorksheetIndex = 1; m_nWorksheetIndex < m_nNumberOfWorksheets + 1; m_nWorksheetIndex++)
                {
                    m_xlWorksheet = (Worksheet)xlWorkbook.Worksheets[m_nWorksheetIndex];
                    if (m_xlWorksheet.Name == sheet) break;
                }

                if (m_nWorksheetIndex == m_nNumberOfWorksheets + 1)     // if we didn't find the right sheet, create it
                {
                    m_xlWorksheet = (Worksheet)xlWorkbook.Worksheets.Add();
                    m_xlWorksheet.Name = sheet;
                }

                Range m_xlRange;
                m_xlRange = (Range)m_xlWorksheet.get_Range(beginCell);
                m_xlRange = m_xlRange.Resize[data.GetUpperBound(0) + 1, data.GetUpperBound(1) + 1];
                m_xlRange.Value = data;
                if ("" != NumberFormat)
                {
                    m_xlRange.NumberFormat = NumberFormat;
                }
                m_xlWorksheet.UsedRange.Columns.AutoFit();
            } // End of WriteToExcel