[138] in Information Retrieval

home help back first fref pref prev next nref lref last post

Code Review - some additional comments

daemon@ATHENA.MIT.EDU (Mitchell N Charity)
Fri Jan 22 15:04:20 1993

Date: Fri, 22 Jan 93 15:03:30 EST
From: mcharity@hq.LCS.MIT.EDU (Mitchell N Charity)
To: elibdev@Athena.MIT.EDU
Reply-To: mcharity@lcs.mit.edu

At today's code review, there was some interest in seeing this...

-- included message --
From mcharity Wed Dec 23 20:44:15 1992 EST
Subject: MARC routines

Here are some random comments on marc.h and marclib.c ...

Regards,
Mitchell
-------------------
general

Note the existence of several ftp'able C style guides / coding standards.
Those I have include:
  Recommended C Style and Coding Standards - various
  How To Steal Code (or Inventing The Wheel Only Once) - Henry Spencer
  cstyle-ESA-OZ.txt

Religious annotation:
It has been said that programs are a way to communicate with other people,
and are only incidentally compilable by a computer.
Given that, all the usual rules of writing clear english apply.
One can write code (even in C:) which clearly and concisely says
what is being done, and why.
It can "read like a novel".
It should never be necessary to emulate a compiler or computer in order to
understand what is happening.
Flame off :).


Makefile

Note existence of etags (emacs tags).
After etags is run on sources (% etags foo.h foo.c ...),
emacs can then do search and query-replace over all the files
(M-x tags-search, M-x tags-query-replace).
A common idiom is to include in the Makefile:
etags:
	etags ${FILES}

Note existence of makedepend.
When makedepend is run on sources (% makedepend foo.h foo.c ...),
it creates #include dependency information and appends it to the Makefile.
A common idiom is to include in the Makefile:
depend:
	makedepend ${FILES}



marc.h

A header file can contain a description of the abstraction being provided.
It defines the externally visible aspects of a package.
It can describe abstraction, its assumptions, limits, use, etc.
This information might instead go in a document or man page.
Documentation duplication should be avoided:
 - variable names may not accurately reflect their use
 - nearby comments will _sometimes_ not match code
 - comments in another file will _often_ not match code
 - comments in 2 or more files will usually disagree w each other and the code.

While a header describes an interface/abstraction, it should avoid
revealing implementation details which the consumer of the abstraction
does not need to know about.
These are included, with a description of the implementation, in the .c file.

Headers can be thought of as defining an interface, and often include
data type declarations and procedure prototypes.

     /* MARC leader positions, etc */

     /* leader data positions*/
     #define RECLEN_P 0
     #define STATUS_P 5
     #define TYPE_P 6
     #define LEVEL_P 7
     #define BASE_P 12
     #define ENC_P 17
     #define DCF_P 18
     #define LINK_P 19
     #define	DIR_P 24	/* position of the directory */
     #define	DIRLEN 12	/* length of a directory entry */
     #define	RECLEN_L 5	/* length of the record length field */

Using tabs in source code has the problem that different editors/printers
will give them different sizes.

Another way of naming might be something like:
#define LEADER_OFFSET_RECLEN 0
#define LEADER_OFFSET_STATUS 5
#define LEADER_OFFSET_TYPE   6
#define LEADER_OFFSET_LEVEL  7
...

     /* special MARC characters */

     #define FIELD 0x1f
     #define END_FIELD 0x1e
     #define END_REC 0x1d

Names can convey type information,
and through their similarity, clarify relationships.
#define FIELD_BEG_CHAR  0x1f
#define FIELD_END_CHAR  0x1e
#define RECORD_END_CHAR 0x1d

Declaring objects with the type intended can be clearer and safer than
relying on type coercion.  Declaring as chars:
#define FIELD_BEG_CHAR  '\037'
#define FIELD_END_CHAR  '\036'
#define RECORD_END_CHAR '\035'


     /* miscellaneous */

     #define MAXTAGS 1000
     #define	MAXMARC	65000	/* we don't expect records larger than 20K */
     #define	MINMARC	36	/* leader plus one directory entry */

It is almost a stereotypical critique of unix that limits are hardwired
into programs, and frequently prove insufficient and hard to change.
Unfortunately, given C's manual memory management, avoiding static limits
is often too expensive and complexifying.

Note that MAXTAGS is only used in mrMapMarc, to size three arrays
which could be dynamically allocated (malloced) instead.

One might label this section "Fundamental limits" and give it a high
profile location.

     /*  errors */
     #define GOOD 0
     #define TAG_NOT_FOUND -1
     #define OCC_NOT_FOUND -2
     #define SFD_NOT_FOUND -3
     #define MEMORY_ERROR -4
     #define READ_FAIL  -5
     #define READ_ZERO -6
     #define BAD_MARC_SIZE -7
     #define TOO_MANY_TAGS -8
     #define BAD_OCC_REQ -9
     #define NO_TAG -10

     /* the structure of our dreams */

     struct marcRec 
     {
       int length;	    /* leader data */
       char status;
       char type;
       char level;
       char encod;
       char dcf;
       char linking;
       int *tags;	    /* array of tag numbers */
       int *tagOffs;	    /* array of offsets where above tags begin */
       int *tagLens;	    /* array of lengths of the tags defined above */
       char *data;	    /* the actual data of the tags */
       int numTags;	    /* total number of tags */
       int curTag;	    /* current tag in the structure */
       int curFldOcc;    /* which member of the array is the current tag */
       int curTagOcc;    /* tag occurrence of the current tag (if read by FindTagNam) */
       char curSfd;	    /* current subfield in the structure */
       int curSfdOcc;    /*sfd occurrence (if read by FindSfdNam) */
       char *curSfdPos;  /* position of subfield within tag */
       char *tagData;    /* actual data of tag */
       char *sfdData;    /* actual data of subfield */
     };

A large structure's organization can be clarified by adding substructure.
One might use anonymous structs like:
struct marcRec 
{
  struct {	    /* leader data */
    int length;
    char status;
    char type;
    char level;
    char encod;
    char dcf;
    char linking;
  } leader;
  struct {
    int *tags;	    /* array of tag numbers */
    int *tagOffs;	    /* array of offsets where above tags begin */
    int *tagLens;	    /* array of lengths of the tags defined above */
    char *data;	    /* the actual data of the tags */
  } record;       
...
Access is then like: struct marcRec *foo; foo->leader.status = ...

The tagss, tagOffs, tagLens might better be represented as:
struct {
  int number;
  int offset_in_record;
  int length;
} mr_tag_entry;
struct marcRec {
...
  mr_tag_entry *tags;
...
}
This makes their relationship explicit.
The triple mallocs/frees/etc collapse to one.
Access can be cleaner:
  strncpy(foo,&marc->data[marc->tagOffs[pos]],marc->tagLens[pos]);
becomes
  mr_tag_entry *tag = ...;
  strncpy(foo,&marc->data[tag->offset],tag->length);

The curXXX concept, having state associated with access, worries me.
If I access a field, and then do something else, I have to be
sure that the something else doesnt access a field, or my result
will be stomped on.
It doesnt look like the state is actually being _used_ for anything,
so the mrFind... could just return a pointer or somesuch.

A common idiom is to replace
  struct foo { ... };
with
  typedef struct { ... } foo;
This permits one to replace
  bar () {
    struct foo x;
with
  bar () {
    foo x;

It can be useful to specify which parts of the structure are public,
(subject to direct use, can be relied upon, implementor cannot easily change),
and which parts are private, secrets of the implementation, and subject
to arbitrary change.

Headers generally contain function prototypes.
Example: int mrGetField (struct marcRec*,int);
Where compatibility with very old compilers is needed,
this is #ifdef'ed as in
#ifdef ANCIENT_COMPILER
int mrGetField();
#else
int mrGetField (struct marcRec*,int);
#endif
If the return value is other than int, a declaration is required for
correctness.
The full argument prototype permits additional type checking,
implicit type casting, and various compiler optimizations.

marclib.c

mrInit

When one can initialize structure elements to zero,
one can bzero(&foo,sizeof(foo)) to avoid forgetting elements.

         marc->tags=(int *) malloc(sizeof(int));
         if (marc->tags==NULL)
             return MEMORY_ERROR;
      [...]
         free(marc->tags);

A common idiom is:
  marc->tags = NULL;
[...]
  if(marc->tags != NULL) free(marc->tags);

Combined statements can sometimes be clearer through brevity.
  if((marc->tags=(int*)malloc(sizeof(int)))==NULL) return MEMORY_ERROR;
  if((marc->tagLens=(int*)malloc(sizeof(int)))==NULL) return MEMORY_ERROR;
  if((marc->tagOffs=(int*)malloc(sizeof(int)))==NULL) return MEMORY_ERROR;
or
  if((marc->tags=(int*)malloc(sizeof(int)))==NULL) ||
     (marc->tagLens=(int*)malloc(sizeof(int)))==NULL) ||
     (marc->tagOffs=(int*)malloc(sizeof(int)))==NULL) return MEMORY_ERROR;
They can also be impenetrable.

         int GetI(); /* returns the int expressed by the ascii chars */
This should be declared at the file level, so you only need one of them,
and any mismatch with the actual GetI can be detected.

     /* get the position of the tag data (base address of data) */
         baseadr=GetI(&rawRec[BASE_P],5);
     /* length of the data portion of the record */
         marc->length-=baseadr;
When comments are being added to explain code,
it is often better to modify/rename the code.
When there are comments every line, something is wrong.
The style guides give various commenting conventions.

         ...=GetI(&rawRec[BASE_P],5);
It has been said that the only numbers which should directly appear
in code are 0 and 1, and 1 is questionable.:)  All others should be named.

         baseaddr=GetI(&rawRec[BASE_P],5);
To understand this line, one needs to look at the definition of GetI,
and understand MARC record organization.
An alternative might be:
         baseoffset = atoi_bounded(&rawRec[OFFSET_OF_BASE_OFFSET],
                                   LENGTH_OF_BASE_OFFSET);
One ends up trying to maximize clarity by trading off obscurity against 
verbosity and effort.
The use of the "atoi" name can give clarity by engendering expectations,
but these can bite if atoi_foo then behaves differently than atoi in some way.
One approach to expectation management is to call it "psuedo_atoi".:)

Procedures and data which are internal to the abstraction,
ie no other files use them, should be declared static.
This _assures_ that noone else uses them,
and allows greater compiler optimization.

mrFindTagNam (marc,tag,occ)

It looks like the occ'th argument is one based (ie, the first is 1, 2nd 2,...).
By weak convention, indexes are usually 0 based.

-- end of included message --

home help back first fref pref prev next nref lref last post