Desired functionality

Dec 2, 2009 at 10:30 AM
Edited Dec 2, 2009 at 10:59 AM

Hello!

First of all thank you for this great work!

I've just started using your library, and here is 2 proposals.

1) Make SpreadsheetDocumentWrapper class not sealed, and it's constructor protectable, or even public.

I need to open document from Stream, and, after modifications, get this stream in final state. So that I can, for example, send it as an email attachment.

This can be achieved by manually constructing SpreadsheetWorker(Stream, true) and pass it to SpreadsheetDocumentWrapper. Also it is good to have more "Open" methods.

2) Then populating tables, pin cells in columns to the left and to the right from table columns. So that they stay on their place, and don't move down as cells below table do. This is how Excel populates tables.

This is what I need mostly.

Thank you!

 

Coordinator
Dec 2, 2009 at 11:03 AM

Hello andrej_,

Thanks for your suggestions and I am glad you are finding the library useful.

[1] Yes, I agree that the SpreadsheetDocumentWrapper class could be more accessible. There were some design criteria that made it desirable to seal the lower level wrapper classes, but there is no reason why this class should be sealed. Actually, I am planning to add support for streamed output in a forthcoming release, but it's still worthwhile to open this class. I will include this change in the next upload (within the next few days).

[2] Can you explain a little more about what you propose here? I tried very hard to emulate the Excel functionality, but maybe I missed something. Your clarification (maybe with an example) will help me make ExtremeML more Excel-like.

Thanks again,

Tim

Dec 2, 2009 at 11:44 AM
Edited Dec 2, 2009 at 11:52 AM

[2] Here is a template and a worksheet populated from it

a b c d
2




3





4

5



a b c d
2
xxx 1111 2222 3333

xxx 1111 2222 3333

xxx 1111 2222 3333

xxx 1111 2222 3333

xxx 1111 2222 3333






4

5



As you can see, row with 3 is deleted, and row with 4 moved below table. This way no data can be adjacent to the table. The expected result is

a b c d
2
xxx 1111 2222 3333
3
xxx 1111 2222 3333
4
xxx 1111 2222 3333

xxx 1111 2222 3333

xxx 1111 2222 3333








5



So content in all columns beside table columns remains on place, and only cells below table moved down. And we can have 2 tables side by side, for example..

Dec 2, 2009 at 12:05 PM
Edited Dec 2, 2009 at 12:07 PM

[3] Another convinient way to populate templates is to use DefinedNames.

WorkbookPart.Workbook.DefinedNames.DefinedNames[int] is the way to get it by index. It would be nice to get it by Name as we get Tables.

And, to work with DefinedName, we need an ability to get 4 coordinates of it borders, a "must have" :)

Also it is good to have an ability to set it's value, as with cells (all cells in a named range get this value).

And to Fill DefinedName just like we Fill Tables.

 

 

Coordinator
Dec 2, 2009 at 12:15 PM

Thanks for the clarification.

Now that I understand the problem, I can at least explain why it was done this way, although I don't have an immediate solution, as there are some complex things happening while populating a table that are not immediately obvious from the above example.

In fact, your example illustrates two separate problems. The first problem is that Excel tables require at least one data row, so we are forced to create all templates with an empty data row. When populating such a table, the ExtremeML code has to decide what to do with the empty row (delete it or use it). But, remember that ExtremeML also allows data to be appended to an existing populated table (without an empty row), so the algorithm needs to support this scenario too. One other consideration is what happens to an empty template table when rows are programmatically inserted into the underlying worksheet at the location of empty row. In the end, to keep the data injection algorithm as simple as possible, I chose to delete the empty row, as this was apparently the only way that the same code could be used for all situations. However, I agree that one side effect of this design is that any template data adjacent to the empty row will be lost. I will look again and see if there is a way to refine this process.

The second issue, as you described, is that template data adjacent to a table below the empty row will be shifted down when the table expands. And, for the same reason, it is not currently possible to put two tables adjacent to each other. Clearly the correct solution is the one you describe - that table inserts do not shift entire rows down, but instead shift just the cells beneath them. However, this is a complex operation in OpenXML. I will give some thought as to how it might be achieved, but I can't promise it will be part of the next release.

Thanks again for your feedback. It has been very helpful.

Coordinator
Dec 2, 2009 at 12:19 PM

@andrej_:

Re your comment [3] ... yes, I agree. Full support for defined names is planned and coming soon.

Please continue to post your suggestions here. It will help me to make improvements. Thanks.

Dec 2, 2009 at 1:06 PM
Edited Dec 2, 2009 at 1:07 PM

[2] Yes, I understand that it is a big change in algorithm. From manipulating rows as a whole instances, we get to maniputating parts of them.

Move part of a row n-indices lower. For this, we need to add n new rows at the bottom, and shift cells (part of a row) between i and i+n rows in a bottom-up i-loop, incrementing cell indices. By the way, when some cell is moved, any reference to it (in a formula in another cell, ect) must be updated.

Then where will be n original rows with empty part in a table columns, and you fill them with data.

The first problem of empty row will be solved automatically. This is just another empty part of a row.

 

Thanks again for your work, will be waiting your new releases!

Andrej.

Dec 2, 2009 at 4:12 PM
Edited Dec 2, 2009 at 4:20 PM

I think the most optimal way to implement cell shifting and referential integrity is to represent references between elements as references between objects in memory. For example, if Cell object c1 references another Cell object c2, the place of c2 in the sheet doesn't matter, and we do not need to update any references when we put c2 in another row. It is best not to keep row and column number in a cell, but to make this properties calculatable, from parent element - row, which has an index in a sheet and can find index of a cell in it'a list of cells. So, to move a cell, we just insert it in a new rows's list in place of another cell (which gets discarded). Or we can just copy value from cell to cell, keepeng references. If we change row index, no need to update cells either. Also, about named ranges. They are essentially left-top, and write-bottom cells. So named range updates automatically then it's border moves. The thing is, no update of any xml should be done until Save() is called. Before Save we just operate objects in memory, with efficient methods of accessing and modifying the OBJECT GRAPH. By the way, for collections of objects that we get by name, rather then by index, it is worth to implenet Dictionary or SortedDictionary map instead of List. I don't know much about sdk or your library yet, maybe this is rougthly the case things are implemented. Just sharing thoughts about the subject.. And sorry for my english, not much writing experience )

Coordinator
Dec 2, 2009 at 4:45 PM

Actually the referential integrity is already implemented. For example, if you look at the CellWrapper class, you will see that it implements the IAddressDependency interface, which is a standard interface that must be implemented by any document element that might suffer from changes in the worksheet geometry. All objects that implement this interface are registered in a List<>, which is exposed via the AddressDependencies property of the WorksheetPartWrapper class.

Then, whenever an operation is conducted that affects worksheet geometry (like inserting or deleting rows or columns), a single call is made to notify all affected elements, so that they can update their addressing.

This mechanism has already been implemented for many elements that are affected by worksheet addressing (like cells, hyperlinks, shared strings etc), so it is already safe to insert rows and columns into a worksheet without risk of damaging referential integrity. However, I have not yet done the work for cell formulas, as this will require some parsing logic to separate out their embedded cell address. I intend to use regular expressions for this, but I have not made much progress on it yet, as I am currently occupied on a number of other improvements.

I like your suggestion about deferring addressing updates until Save() but I have already tried this approach and it is not possible because worksheets use an optimization technique call sparse table markup, which means that row and cell numbers are not necessarily contiguous. In the end, I was forced to implement the IAddressDependency mechanism, which perpetually keeps the addresses in sync.

Regarding named ranges, you're right. They are top-left and bottom-right addressing structures. I have already developed the CellRange class (look in the ExtremeML.Spreadsheet.Address namespace for all addressing classes) that handles all of the calculation that will be required for named ranges. I think that the work that remains to be done is just to create a wrapper class and write the method that will return it by name.

I had some more thoughts about the table insertion problem. It is a bit more complex than it appears, but maybe you have some ideas?

If a worksheet contains one table, we have agreed that insertion should only affect the columns in which the table resides. In such cases, all cells below the table will shift down. But what happens if there are two tables on the same worksheet, one above the other, and they do not exactly share the same columns. For example, table one occupies the range A1:D5 and table 2 occupies the range B7:H10. When inserting rows into table 1, we must shift down the cells below it, but this will destroy the integrity of table 2. Any suggestions?

Dec 2, 2009 at 6:11 PM
Edited Dec 2, 2009 at 6:44 PM

I absolutely agree with you that key concept in programmatic generation of documents is a correct template creation. In template that you described, user stated that upper table should keep no more than that number of rows (to A6). So, fiil only that rows. You can optionally throw an exception, but I wouldn't. Let it be up to template creator.

Regarding gaps in row's list of cells, I would keep in RowWrapper the first index on non empty cell, and List<> of cells from that index to the right. If there can be gaps inside it, I would keep null in place for simplicity. In practice I think it it quite acceptable. So cells and rows (wrapers) move in their corresponding Lists, without much loop attribute update.. And on Save() SDK (wraped) objects updated. I have a feeling that life would be much easier (and faster) that way, but you know more details than I on this subject.

And about formulas. I don't guite understand. You tell some integrity is supported. Is where any mean of reference other than by formula? Maybe, simple references are updated? "=A5", ="A5+B6" ? It would be great even with simple formulas.

About named ranges, be sure to add methods GetValue<T>, SetValue, as with cells. Because with templates, the best practice would be to reference individual key cells by name, rather then by coordinates. Get returns the value of top-leftmost cell.

Coordinator
Dec 2, 2009 at 6:58 PM

I agree that the template designer carries a certain responsibility to create a design that is robust, but maybe your suggestion regarding the two-table scenario will be too restrictive. Imagine you wanted to create a worksheet template with 2 empty tables, one above the other. Your objective is that the tables should be close together when the worksheet is populated (perhaps separated by a single empty row), as this will be easiest for the end user to read. But the problem is that you don't know how may rows of data will be injected into the top table. One day it might be 5 rows, but another day it may be 500 rows. This means that it is impossible to create a finished result that always has one empty row separating the two tables.

If you create this scenario in ExtremeML today, it will work perfectly, because table injection always expands the table by inserting new rows into the worksheet. But, as you have discovered, it presents problems if the template contains data adjacent to the table. I would like to solve the problem you discovered without creating a new one. I think this requires more analysis, but I am confident a good solution can be found where both scenarios will work successfully.

In the current release there is no referential integrity for cell formulas - not even simple formulas. So, if you create a cell containing a formula below a table and then fill the table, the cell formula will be destroyed. Nonetheless, adding referential integrity for cell formulas is quite straightforward and will be added as soon as I can find the time to work on it.

Yes, named ranges will have the full capabilities that cells currently have (and some other features that I am currently adding, like RichText support). I will let you know as soon as I have a working prototype. And, of course, I will  still be happy to add subsequent improvements.

Dec 3, 2009 at 5:21 AM
Edited Dec 3, 2009 at 6:43 AM

Yes, I agree. This policy is to "stop on obstacle" (if an obstacle is not completely below the table). Here is an alternative - "move obstacle".

You still move groups of cells, but that group is dynamic from row to row, spanning entire obstacles. It is best be illustrated on the example (row numbers are to the right).

        A         B          C          D         E          F          G          H           I
          J       K
         s
         s
         s
         s
         s
         s
         s
         s
         s
         s
      1
         s
Столбец1 Столбец2 Столбец3 Столбец4          s
         s
         s
         s
         s
      2
         s



         s
         s
         s
         s
         s
      3
         s




         s
         s
         s
         s
         s
      4
         s

Столбец1 Столбец2 Столбец3 Столбец4 Столбец5          s
         s
         s
      5
         s






         s          s
         s
      6
         s






         s
         s
         s
      7
         s






         s
         s
         s
      8
Столбец1 Столбец2 Столбец3 Столбец4 Столбец5 Столбец6 Столбец7 Столбец8 Столбец9 Столбец10       9










      10










      11










      12
Столбец1 Столбец2 Столбец3






      13










      14
         s
Dec 3, 2009 at 5:46 AM
Edited Dec 3, 2009 at 6:44 AM

Cells with "s" are never moved.

When you populate Table4 (lowest), only A:C, Row>14 move down (filling empty row doesn't move anything).

Table3 : A-J, Row>=11 move down.

Table2 : C7:G8, and Table3 move down (A-J, Row>=9). Note that H1:*8 stay on place. After Table3 moved, H9:J9 will be new empty cells.

Table1 : B4:E4, B5:G8, A9:J* move down.

The point is, moving an object triggers moving all objects below it. And when you move an object, only data directly below it gets moved (plus data below triggered objects).

There can be Table0 in G1:I2, without problem. Populating it would move Tables 2 (hense table 3 and 4) down. The only thing is, when populating Table1 after Table0, it would be best not to move Table2, until we aproach 1-row distance to Table2. You see what I mean? Table2 gets moved only if we expand Table1 and distance from Table1 is 1 row or we expand Table2 and distance from it is 2 rows. That would be a complete solution. But, if it is too complicated for the beginning, we can move Table2 on any expansion of Table0 and Table1, for a good start. In practice I wouldn't do such overlapping table templates.

A B C D E F G H I J K






Столбец1 Столбец2 Столбец3
1

Столбец1 Столбец2 Столбец3 Столбец4




2










3










4


Столбец1 Столбец2 Столбец3 Столбец4 Столбец5


5










6










7










8
Столбец1 Столбец2 Столбец3 Столбец4 Столбец5 Столбец6 Столбец7 Столбец8 Столбец9 Столбец10 9










10










11










12
Столбец1 Столбец2 Столбец3






13










14

 

What do you think?

And regarding formula. If there would be support for simplest formula cell updates (=A5), we could solve all our tasks simply by reference moving cells by static cells (in our temlate), and reference static cells by complicated formulas.

 

Coordinator
Dec 3, 2009 at 7:32 AM

That's a brilliant piece of analysis on the table problem and a very innovative solution to the cell formulas problem.

I think that it will still take some careful thought to code the table algorithm without breaking existing functionality, but I agree that it seems to cover all the requirements we have identified. I will start writing a pseudo code plan for this as soon as I get time.

But first, I am in the middle of three complex updates that I would like to stabilize and release before starting any new additions. However, the cell formula idea is so simple to implement that I can probably add that to the current task list and release it in the next few days. In fact, if you are waiting for this functionality, I may have time to create the FormulaWrapper class and send it to you separately.

Thanks again - your ideas have made a great contribution to this project.

Dec 3, 2009 at 8:09 AM

No worries, I'll wait a few days. And I'm glad to help in development of such a promising project, thank you

 

Coordinator
Dec 11, 2009 at 8:00 PM

@andrej_:

I have just uploaded a source code & runtime update, together with updated documentation, which includes the following changes:

- Regex bug fixed in CellReference.Parse().
- Class SpreadsheetDocumentWrapper is no longer sealed (and has protected instead of private constructor).
- Created struct GridReference for passing cell coordinates to methods. (some method overloads are now deprecated).
- Implemented support for cell formulas
- Implemented support for absolute and relative addressing in cell references, areas, row spans & column spans

These changes provide almost complete support for cell formulas, as described in a new section in the user guide.

Please note that this latest update is built against the August 2009 release of the OpenXML SDK. The next priority is to update ExtremeML for compatibility with the December 2009 release, which has a number of breaking changes. This update should be complete within the next 5 days.