The "Polymer Beamline" A2 of HASYLAB, Hamburg, is a "1st generation device", and its data analysis is a "1st generation analysis": complex and heterogeneous. Although the migration from VMS to UNIX has started, only the old file formats are in use. A decision on a new file format has not yet been made.
In the past five years there has been a trend to migrate from VMS to WINDOWS 3.11. Several programs have been ported to WINDOWS, but it turned out that UNIX (especially LINUX) is much more convenient for experiment control and data processing.
The two most versatile programs, MICKI (for DIDA format files) and OTOKO (for EMBL format files), are available in versions for VMS and WINDOWS. Some users who have recorded data on image plates use ImagePro on their PCs for data analysis. Under UNIX, data evaluation is generally performed using an interpreter such as IDL or PV-WAVE. File format filters and evaluation routines are under construction.
Data from our gas-filled detectors are recorded in files using a "DIDA" format. Since beamline A2 emerged from an EMBL-beamline, there are routines for conversion between DIDA and EMBL format. Image plates have gained a certain importance, and so has the GEL format, which is written by our image plate scanner. There is a routine to convert GEL file images to DIDA format (gel2vpf).
I have written routines that read GEL files and DIDA files into a PV-WAVE "structure". After processing, the data are written in a very simple binary format that every PV-WAVE interpreter can read, regardless of platform. I do not intend to push this format, but I would like to write filters for a common interchange format for the SAS community.
Many people use DIDA, but only a few have been able to interpret it, because documentation is lacking. Therefore I would like to give some insight into the format.
DIDA was implemented by Rainer Steube: "The general structure has been taken from the Australian DDF-package, which originally has been written in RATFOR (by PTR et al.)" The structure of DIDA files has never been clearly documented, and R. Steube never handed out the program sources. My first knowledge of DIDA therefore came solely from "reverse engineering" numerous files. Only two years ago did I receive a program listing of the DIDA procedures, which helped to reveal the format.
DIDA files are organized in pages of fixed length, but the page size (ClusSize) may differ from one DIDA file to the next. The page size and other general information are stored in the first page (I call it the sesame). The remaining pages of the file are either catalogue pages or data pages. The arrangement of catalogue pages and data pages is fixed throughout the whole DIDA file.
The next block after the sesame is the first catalogue page. It holds a certain number of catalogue entries (CatSize), each of which describes a unique data page. After each catalogue page we find exactly CatSize data pages, and after the block of data pages there is the next catalogue page. The reference between a catalogue entry and the corresponding data page is obvious. Thus catalogue pages are stored and retrieved by sequential access.
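The page arithmetic implied by this layout can be sketched as follows. This is a minimal illustration, not taken from the DIDA sources: the function names are hypothetical, pages are counted from zero with the sesame as page 0, and it is assumed that the j-th entry of a catalogue page refers to the j-th data page following it.

```python
def catalogue_page_index(k, cat_size):
    """Page index of the k-th catalogue page (k counted from 0).

    Page 0 is the sesame; each catalogue page is followed by
    cat_size data pages, so catalogue pages recur every cat_size + 1 pages.
    """
    return 1 + k * (cat_size + 1)


def data_page_index(k, j, cat_size):
    """Page index of the data page described by entry j of catalogue page k.

    Assumption: entry j (counted from 0) refers to the j-th data page
    after its catalogue page.
    """
    if not 0 <= j < cat_size:
        raise IndexError("catalogue entry number out of range")
    return catalogue_page_index(k, cat_size) + 1 + j


def page_offset(page_index, clus_size):
    """Byte offset of a page, given the page size (ClusSize) from the sesame."""
    return page_index * clus_size
```

With a hypothetical CatSize of 4, the first catalogue page is page 1, its data pages are pages 2 to 5, and the second catalogue page is page 6.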
The catalogue entry describes the logical contents of the data page by a vector with three components, ( OBJ, KEY, INDEX ). All data pages with the same OBJ belong to the same scattering pattern. For the KEY component I found only three distinct values in the files that I analyzed. KEY = -1 indicates that the page holds an identification string. KEY = 0 indicates that the page holds information about the internal structure of the scattering pattern and the number format used for storing the data. For both keys one page is generally sufficient, and therefore the only INDEX found in the file is 1. Under KEY = 1 the scattering intensities are stored; this generally requires several pages, whose ordinal number is given in INDEX. This means that data pages are stored and retrieved by random access.
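How the ( OBJ, KEY, INDEX ) vectors can be used to collect the pages of each scattering pattern is sketched below. This is an illustration under assumptions: the record type, the `page` field (the page index of the data page) and the function name are hypothetical, and only the three KEY values reported above are handled.

```python
from collections import defaultdict, namedtuple

# Hypothetical record for one catalogue entry; "page" is the index of
# the data page that the entry describes.
CatalogueEntry = namedtuple("CatalogueEntry", "obj key index page")

KEY_IDENT = -1   # identification string
KEY_LAYOUT = 0   # internal structure and number format
KEY_DATA = 1     # scattering intensities, possibly several pages


def group_by_pattern(entries):
    """Group catalogue entries by OBJ, i.e. by scattering pattern."""
    patterns = defaultdict(lambda: {"ident": None, "layout": None, "data": []})
    for entry in entries:
        slot = patterns[entry.obj]
        if entry.key == KEY_IDENT:
            slot["ident"] = entry.page
        elif entry.key == KEY_LAYOUT:
            slot["layout"] = entry.page
        elif entry.key == KEY_DATA:
            slot["data"].append((entry.index, entry.page))
    # Order the intensity pages of each pattern by their INDEX ordinal.
    for slot in patterns.values():
        slot["data"] = [page for _, page in sorted(slot["data"])]
    return dict(patterns)
```

Because the intensity pages are sorted by INDEX rather than by their position in the file, the sketch does not rely on the pages of one pattern being stored contiguously.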
There is another field in the catalogue entry, which is used for bookkeeping purposes.
The complete structure of the DIDA format will not be spelled out in prose here. It can be retrieved directly from my documented program listings of a general DIDA analyzer and of a special filter that reads ".cpf" and ".vpf" files, both written in the IDL / PV-WAVE dialect. These listings make the importance of the "Endian problem" and its solution obvious: any filter program that interprets binary numbers and is used for transport across platforms has to test the byte order ("Big Endian" or "Little Endian"). If the data originate "from the other party", it has to rearrange the bytes in the numbers.
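A byte-order test of this kind might look like the following sketch. It assumes that the file carries a 32-bit integer field whose expected value is known in advance; this marker is hypothetical, since the field actually used for the test in DIDA or GEL files is not specified here. The function names are likewise illustrative.

```python
import struct


def detect_byte_order(raw4, expected):
    """Return '<' (little endian) or '>' (big endian) for a 4-byte marker.

    raw4 holds the four bytes of a marker field as found in the file,
    expected is the value that field must have (hypothetical marker).
    The marker should not read the same in both byte orders.
    """
    for order in ("<", ">"):
        if struct.unpack(order + "i", raw4)[0] == expected:
            return order
    raise ValueError("marker matches neither byte order")


def read_int32_array(raw, order):
    """Decode a run of 32-bit integers using the detected byte order."""
    count = len(raw) // 4
    return list(struct.unpack(f"{order}{count}i", raw[:count * 4]))
```

Passing the detected order to every subsequent decoding step performs the byte rearrangement implicitly, so the same filter reads files written on either kind of platform.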
Provision for an endian test is made in the definition of the GEL format. Thus my routine for reading scattering patterns written in GEL format under PV-WAVE can easily detect an endian switch.
Although the SAS analysis itself is not the main topic of the canSAS workshop, my analysis programs may be of some interest. I intend to write my new programs using PV-WAVE. Every pattern is a structure (i.e. a record), and my "program" is simply a collection of functions and procedures called from the PV-WAVE prompt. Applying a function to a structure generates a modified structure. I do not intend to write a GUI, since my major interest is the development of analysis methods and their adaptation to special problems.
Writing a GUI and hooking in methods (procedures and functions) might be a task for a central software engineer, if we wanted to define a common software interface (and programming language). In this case we could develop a user-friendly SAS analysis tool for solving standard problems (not for the advanced scientist, but for a "SAS analytical service").
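The working style described here, in which each function takes a pattern record and returns a modified record, can be illustrated in Python rather than PV-WAVE. The record fields and the two operations below are hypothetical and serve only to show the principle.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Pattern:
    """Hypothetical record for one scattering pattern."""
    title: str
    intensity: Tuple[float, ...]


def scale(pattern, factor):
    """Return a new pattern with all intensities multiplied by factor."""
    return replace(pattern, intensity=tuple(v * factor for v in pattern.intensity))


def subtract_background(pattern, background):
    """Return a new pattern with a constant background removed."""
    return replace(pattern, intensity=tuple(v - background for v in pattern.intensity))
```

Since every function returns a new record and leaves its input unchanged, calls can be chained at an interactive prompt, for example `scale(subtract_background(p, 1.0), 2.0)`.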