Traditional file design
44.1 Purpose
A file is a repository for a set of related data records, a program, a document, a spreadsheet, an image, an object, or some other logical entity. Most information systems create, process, and/or manage data files. This chapter focuses on design options for data files.
44.2 Strengths, weaknesses, and limitations
The strengths and weaknesses of specific file organizations are briefly described in context.
44.3 Inputs and related ideas
A database (Chapter 45) is a set of related files. The individual files that make up a database are defined using the techniques described in this chapter. The first step in designing a data file is to compile the relevant logical data structures using such tools as data flow diagrams (Chapter 24), entity relationship diagrams (Chapter 26), data normalization (Chapter 28), or Warnier-Orr diagrams (Chapter 33). Typically the data elements and composites (the data structures) are documented in the data dictionary (Chapter 25). Data structures are discussed in Chapter 43.
44.4 Concepts
A file can hold data, a program, a document, a spreadsheet, an image, an object, or virtually any imaginable logical entity. This chapter focuses on data files.
44.4.1 The data hierarchy
The data in a file are typically structured using the standard field/record/file data hierarchy. A field is a data element that cannot be logically decomposed. (A field’s individual digits or characters have logical meaning only in aggregate.) Logically related fields are grouped to form records. A file is a set of related records. To put it another way, fields hold attributes, each record holds a single occurrence, and a file holds all the occurrences of an entity. The key field (or key attribute) uniquely identifies a single record (a single occurrence of the entity), distinguishing it from all other records.
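To make the hierarchy concrete, the following Python sketch models a small student file; the entity, field names, and values are illustrative assumptions rather than part of any particular system.

```python
# A minimal sketch of the field/record/file hierarchy (illustrative names and values).
from dataclasses import dataclass

@dataclass
class StudentRecord:            # one record = one occurrence of the "student" entity
    student_id: str             # key field: uniquely identifies this occurrence
    name: str                   # field (attribute)
    major: str                  # field (attribute)

# A file is a set of related records -- modeled here as a simple list.
student_file = [
    StudentRecord("111223333", "Ada Lopez", "CIS"),
    StudentRecord("444556666", "Ben Chu", "ACC"),
]

# The key field distinguishes one record from all the others.
by_key = {record.student_id: record for record in student_file}
print(by_key["444556666"].name)   # -> Ben Chu
```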
44.4.2 File types
A master file holds permanent data that are managed or maintained over time. A transaction file holds data that describe current transactions. Often the data in a transaction file are used to update (or maintain) a master file. A temporary file holds intermediate results and exists for only a brief time; for example, in a bill processing application a temporary file might be used to hold a set of records between the sorting and processing steps. A backup file is a copy of a master or transaction file used to recover data if disaster strikes. A history file (or archive) holds already processed transactions or non-current master records.
44.4.3 Logical and physical I/O
Logically, a single record is read (input), its fields processed, and the results written (output). Then a new logical input/process/output cycle begins. The physical input/output process is a bit more complex, however.
44.4.3.1 Open and close
When a data file is first created, its name and physical secondary storage location are noted in a directory maintained by the operating system. Subsequently, in response to a user (or application program) command, the operating system opens the file by searching the directory by name, extracting the file’s secondary storage address, and (perhaps) transferring all or part of the file into memory. Once a file is opened, its records can be accessed by the application program. The active link to the file is broken by a close command.
44.4.3.2 Physical and logical records
Programmers and users visualize a logical record that holds the set of related fields needed to complete a single input/process/output cycle. A physical record, in contrast, is the unit of data that moves between the peripheral device and main memory. Note that the logical record and the physical record can be different.
44.4.3.2.1 Blocking
For example, most disks are divided into concentric circles called tracks, which, in turn, are divided into fixed-length sectors, and it is the contents of a sector that move between memory and the disk’s surface. Assume the sector size is 512 bytes and picture a series of 100-byte logical records. If a single 100-byte logical record is stored in each 512-byte sector, 412 bytes per sector are unused. However, if five of those 100-byte logical records are blocked to form a 500-byte physical record and that physical record is stored in a single sector, only 12 bytes per sector are wasted.
With one block per sector, data move between the disk’s surface and the computer’s memory one block at a time. Software (the database management system, the operating system, a device driver, or an access method) is used to assemble a program’s output logical records to form blocks, which are subsequently transferred to disk. The same software disassembles input blocks to get the logical records the program needs.
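The following Python sketch shows the idea, assuming fixed-length 100-byte records and 512-byte sectors as in the example above; the zero-byte padding scheme is an illustrative assumption.

```python
SECTOR_SIZE = 512
RECORD_SIZE = 100
BLOCKING_FACTOR = SECTOR_SIZE // RECORD_SIZE   # 5 records per block, 12 slack bytes

def build_blocks(logical_records):
    """Assemble fixed-length logical records into sector-sized blocks (output path)."""
    blocks = []
    for i in range(0, len(logical_records), BLOCKING_FACTOR):
        block = b"".join(logical_records[i:i + BLOCKING_FACTOR])
        blocks.append(block + b"\x00" * (SECTOR_SIZE - len(block)))  # pad the slack bytes
    return blocks

def deblock(block):
    """Disassemble one block back into its logical records (input path)."""
    return [block[i:i + RECORD_SIZE]
            for i in range(0, BLOCKING_FACTOR * RECORD_SIZE, RECORD_SIZE)]

records = [bytes([65 + n]) * RECORD_SIZE for n in range(7)]   # seven dummy 100-byte records
blocks = build_blocks(records)                                # two blocks (5 + 2 records)
assert deblock(blocks[0])[0] == records[0]
```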
44.4.3.2.2 Spanned records
On a large file, even a few slack bytes per sector can add up to a great deal of wasted space. Additional space efficiency can be obtained by using spanned records. For example, imagine that five complete 100-byte records plus the first 12 bytes of the sixth record are stored in the first sector. The second sector contains the last 88 bytes of the sixth record, the next four 100-byte records, and the first 24 bytes of record 11, and so on. Spanned records are also used when the logical record length exceeds the sector size.
The problem with spanned records is that two or more physical I/O operations might be required to access a single logical record. In the example above, the first 12 bytes of record number 6 are stored in the first sector and the remaining 88 bytes are stored in the second sector, so the only way to get logical record number 6 is to read both sectors. Because only one sector can be read at a time, two physical I/O operations are needed, and each I/O operation takes time. Using spanned records sacrifices speed for storage efficiency.
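A short Python sketch makes the arithmetic explicit; it assumes records are packed end to end with no slack, as in the example above.

```python
SECTOR_SIZE = 512
RECORD_SIZE = 100

def sectors_for_record(record_number):
    """Return the first and last sector that must be read to reassemble one spanned record."""
    first_byte = record_number * RECORD_SIZE          # records are packed end to end
    last_byte = first_byte + RECORD_SIZE - 1
    return first_byte // SECTOR_SIZE, last_byte // SECTOR_SIZE

# The sixth record (record number 5) occupies bytes 500 through 599, so its first
# 12 bytes fall in sector 0 and its last 88 bytes in sector 1 -- two physical reads.
print(sectors_for_record(5))   # -> (0, 1)
```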
44.4.3.3 Primitives
Each peripheral device is controlled by its own set of primitive commands. Consequently, if data are to be transferred between the peripheral device and main memory, the user’s logical I/O request must be translated into an appropriate set of primitive commands. For example, physically reading data from a disk is a two-step operation:
1. Move the access mechanism to the target track (seek).
2. Copy (read) the contents of the target sector into memory.
Logically, the user requests a record. Physically, that request is translated into two physical primitives: seek and read (or seek and write). The translation from a logical request to physical I/O commands is typically performed by the operating system, an access method, or a device driver and is transparent to the user.
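The sketch below suggests how such a translation might look; the disk geometry and the device.seek and device.read calls are hypothetical stand-ins for real driver primitives, not the commands of any actual device.

```python
SECTORS_PER_TRACK = 32   # assumed disk geometry (illustrative only)

def logical_read(relative_sector, device):
    """Translate a logical request for one sector into two device primitives.
    device.seek(track) and device.read(sector) are hypothetical driver calls."""
    track = relative_sector // SECTORS_PER_TRACK    # which concentric circle
    sector = relative_sector % SECTORS_PER_TRACK    # which slice of that track
    device.seek(track)                              # primitive 1: move the access mechanism
    return device.read(sector)                      # primitive 2: copy the sector into memory
```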
44.4.4 File organizations
Logically, the data in a file are read and written one record at a time. Consequently, it must be possible to distinguish the individual records. The task of distinguishing the records is greatly simplified if they are stored using a consistent set of rules defined by a file organization. Sequential, random, and indexed sequential are three common file organizations.
44.4.4.1 Sequential files
On a physical sequential file, records are read and written in physical storage order. For example, transactions might be captured, stored on disk or magnetic tape, and subsequently processed in time order. Often, transactions or master file records are sorted on a key field and then processed in key sequence.
Sequential files are excellent for high activity applications in which a large percentage of the records are processed each time the application is launched. Traditional batch processing tasks (such as preparing monthly bills or a weekly payroll) are good examples, and generating a control breaks report (Chapter 47) is an inherently sequential activity. Sequential files should not be used if the system requires quick access to specific records, however.
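The Python sketch below illustrates the high-activity pattern: one pass over a master file and a transaction file, both sorted on the same key. The field names, and the assumption that every transaction matches a master record, are illustrative simplifications.

```python
def sequential_update(master_records, transactions):
    """Apply sorted transactions to a sorted master file in a single sequential pass."""
    updated = []
    t = 0
    for master in master_records:              # key order == physical storage order
        while t < len(transactions) and transactions[t]["key"] == master["key"]:
            master["balance"] += transactions[t]["amount"]   # apply one transaction
            t += 1
        updated.append(master)
    return updated

masters = [{"key": 101, "balance": 50.0}, {"key": 102, "balance": 0.0}]
trans   = [{"key": 101, "amount": 25.0}, {"key": 102, "amount": 10.0}]
print(sequential_update(masters, trans))       # every master record is visited once
```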
44.4.4.2 Random or direct access files
The records on a direct access or random access file can be read or written in any order. Because a program can directly access a specific record, response time is very good. As a general rule, master files that support interactive or real-time applications should be organized randomly rather than sequentially.
Compared to a sequential file, a random access file has more overhead and thus needs more space to hold the same amount of data. Additionally, average processing time per transaction is higher because it takes longer to process a given number of direct access records than an equivalent number of sequential records. The response time advantage of direct access applies to a specific transaction, not to the average of all transactions.
44.4.4.2.1 Indexes
One way to achieve random access is to maintain an index that lists the record key and the associated physical disk address for each record in the file. Depending on the system, the index is maintained by the operating system, by an access method, by the database management system, or by other support software. Often, the index is read when the file is opened and held in memory until the file is closed. When a logical read or write command is issued, the index is searched by key for the record’s physical location.
An index can also be used to support logical sequential processing. Logical sequential processing implies key order, but not necessarily physical storage order. For example, the records in a database might be processed in index order even though they are not stored in physical sequential order.
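A minimal sketch of both uses of an index follows; the keys and disk addresses are invented, and disk.read_at is a hypothetical stand-in for whatever the access method actually does.

```python
# Index: logical key -> physical disk address (the addresses here are invented).
index = {"A1047": 5120, "C0291": 1024, "B3310": 9728}

def read_record(key, disk):
    """Random access: look up the key, then read the record at its physical address."""
    return disk.read_at(index[key])     # disk.read_at is a hypothetical stand-in

# Logical sequential processing: visit the records in key order via the index,
# even though their physical addresses are not in ascending order.
for key in sorted(index):
    print(key, index[key])
```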
44.4.4.2.2 Relative addressing
Relative addressing is an alternative to maintaining an index. One approach is to assign each record a number indicating its position relative to the beginning of the file; the first record is 0, the second record is 1, the third record is 2, and so on. The record’s relative record number might be used as a key, or the record’s logical key might be translated to a relative record number using an algorithm (see Section 44.4.4.2.3).
Instead of a relative record number, some systems view the data in a file as a string of bytes or characters and distinguish the records (and fields) by relative byte address. A relative byte (or relative character) address represents the byte’s location relative to the beginning of the file; in effect, relative byte addresses resemble main memory addresses. Relative byte addresses are easily mapped to relative record addresses; for example, if the record length is 100 bytes, records begin at relative byte 0, relative byte 100, and so on.
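For fixed-length records, the mapping between the two forms of relative address is simple arithmetic, as the sketch below suggests.

```python
RECORD_SIZE = 100   # fixed-length records assumed

def relative_byte_address(relative_record_number):
    """Map a relative record number to the byte address where the record begins."""
    return relative_record_number * RECORD_SIZE

def relative_record_number(relative_byte):
    """Map a relative byte address back to the record that contains it."""
    return relative_byte // RECORD_SIZE

print(relative_byte_address(2))     # the third record begins at relative byte 200
print(relative_record_number(250))  # relative byte 250 falls inside record 2
```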
44.4.4.2.3 Hashing
Using a logical key as a relative record number means assigning one physical storage slot for every possible key. For example, using a social security number as a key implies 999,999,999 possible records (there is no social security number 000-00-0000). The result is a significant waste of storage space; for example, a university that enrolls 20,000 students would use only a fraction of the available storage slots to hold student records. Consequently, logical keys are converted into relative addresses using hashing (or scatter storage) algorithms.
Truncation algorithms select a portion of the logical key as the relative address. For example, using the last four digits of a social security number implies a need for only 10,000 record slots (0000 through 9999). Folding algorithms partition the logical key and add the parts. For example, adding the first three, the middle three, and the last three digits of a social security number generates a number between 0 and 2,997. A variation of the folding technique partitions the logical key, multiplies the parts, and uses the product (or a portion of the product) as a relative record number.
Perhaps the most common type of hashing algorithm is the division/remainder method. Start with a reasonable estimate of the expected number of records to be stored, and divide the logical key by that value. The remainder, a number between 0 and one less than the divisor, becomes the relative record number. Note that the divisor should be an odd number; ideally, it is a prime number slightly larger than the estimated number of records.
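The three techniques might be sketched in Python as follows; the nine-digit key, the 20,000-record estimate, and the prime divisor 20,011 are illustrative assumptions.

```python
def truncate(ssn):
    """Truncation: keep the last four digits of the key as the relative address."""
    return int(ssn[-4:])

def fold(ssn):
    """Folding: partition the nine-digit key into three groups of three and add them."""
    return int(ssn[0:3]) + int(ssn[3:6]) + int(ssn[6:9])

def divide_remainder(key, divisor=20011):
    """Division/remainder: the remainder becomes the relative record number.
    20011 is a prime slightly larger than an assumed estimate of 20,000 records."""
    return int(key) % divisor

ssn = "123456789"
print(truncate(ssn), fold(ssn), divide_remainder(ssn))   # -> 6789 1368 8930
```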
Using a hashing algorithm generates collisions (two or more logical keys that yield the same relative record number). When collisions occur, a secondary algorithm is used to determine where the new record is stored. Sometimes, the next available storage location is used. Sometimes, the new record is displaced by a constant number of record slots. Sometimes, a second hashing algorithm is used.
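The sketch below pairs division/remainder hashing with the simplest secondary rule, storing the colliding record in the next available slot; the tiny table size is purely illustrative.

```python
TABLE_SIZE = 11                   # a small prime number of record slots, for illustration
slots = [None] * TABLE_SIZE       # one storage slot per relative record number

def store(key, record):
    """Hash the key; on a collision, probe forward to the next available slot."""
    home = key % TABLE_SIZE
    for offset in range(TABLE_SIZE):
        slot = (home + offset) % TABLE_SIZE
        if slots[slot] is None or slots[slot][0] == key:
            slots[slot] = (key, record)
            return slot
    raise RuntimeError("no free storage slots")

print(store(14, "first record"))    # 14 % 11 = 3
print(store(25, "second record"))   # 25 % 11 = 3 -> collision, stored in slot 4
```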
44.4.4.2.4 Chaining
An alternative to maintaining an index or using a hashing algorithm is to maintain a linked list (Chapter 43) of logical keys. This technique is called chaining.
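The following sketch suggests the idea: each stored record carries the disk address of the record with the next higher key, so the chain can be walked in logical key order. The keys and addresses are invented.

```python
# Chaining sketch: physical address -> (logical key, address of the next key in sequence).
records = {
    300: ("A10", 700),
    700: ("B22", 100),
    100: ("C31", None),     # end of the chain
}
head = 300                  # address of the record with the lowest key

address = head
while address is not None:  # walk the linked list of keys in logical key order
    key, address = records[address]
    print(key)              # -> A10, B22, C31
```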
44.4.4.3 Indexed sequential files
An indexed sequential file is a compromise between sequential and random access. Records are physically stored in key order, so they can be accessed sequentially. Additionally, an index is maintained relating logical keys to physical disk addresses, so the data can be accessed randomly. Examples include IBM’s ISAM and VSAM and Control Data’s SCOPE.
Problems occur when records are added to or deleted from an indexed sequential file. Rather than moving all the records, new records are stored in an overflow area and the space associated with deleted records is unused. Over time, slack space and overflow data can severely impact the efficiency of both sequential and direct access, so frequent file reorganization is necessary. Do not use indexed sequential files when the data are volatile (in other words, when records are frequently added or deleted).
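The toy sketch below (not ISAM or VSAM, just an illustration of the structure) shows a prime area kept in key order plus an overflow area, and why sequential access degrades as the overflow area grows.

```python
# A toy indexed sequential structure: a prime area kept in key order plus an
# overflow area for later additions (illustrative only, not ISAM or VSAM).
prime_area = [(100, "Ann"), (200, "Bob"), (300, "Cho")]     # stored in key order
overflow   = []                                             # new records land here
index      = {key: ("prime", pos) for pos, (key, _) in enumerate(prime_area)}

def insert(key, data):
    """Avoid shifting the prime area: append the new record to overflow and index it."""
    overflow.append((key, data))
    index[key] = ("overflow", len(overflow) - 1)

def read_sequential():
    """Sequential access must now merge the prime and overflow areas by key."""
    return sorted(prime_area + overflow)

insert(250, "Dee")
print(index[250])           # random access via the index -> ('overflow', 0)
print(read_sequential())    # key order: 100, 200, 250, 300
```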
44.5 Key terms
- Attribute —
- A property of an entity.
- Backup file —
- A file that holds a copy of a master or transaction file; backup files are used to recover data if disaster strikes.
- Block —
- Two or more logical records stored together as part of the same physical record.
- Chaining —
- Maintaining a linked list of the logical keys of the records in a file.
- Collision —
- An event that occurs when two or more logical keys input to a hashing algorithm yield the same relative address.
- Data dictionary —
- A collection of data about the data.
- Data element —
- An attribute that cannot be logically decomposed.
- Data structure —
- A set of related data elements.
- Database —
- A set of related files.
- Direct access (random access) —
- Reading records from or writing records to a file in any order.
- Directory —
- A list of the names and addresses of every file stored on a disk (or other secondary storage device).
- Entity —
- An object (a person, group, place, thing, or activity) about which data are stored.
- Field —
- A data element physically stored on some medium.
- File —
- A set of related records.
- File name —
- A unique logical identifier assigned to a file (usually by the user).
- Hashing —
- Using an algorithm to convert a logical key to a relative address.
- History file (archive) —
- A file that holds already processed transactions or no longer current master records.
- Index —
- A list of the record keys and the associated physical disk addresses for each record in a file.
- Indexed sequential file —
- A file on which records are stored in key order and an index is maintained, thus allowing the records to be accessed sequentially or randomly.
- Key —
- The attribute or group of attributes that uniquely distinguishes one occurrence of an entity.
- Logical record —
- The set of related fields needed to complete a single input/process/output cycle.
- Master file —
- A file that holds permanent data that are accessed over a period of time.
- Occurrence —
- A single instance of an entity.
- Physical record —
- The unit of data that moves between the peripheral device and main memory.
- Primitive —
- A command that tells a peripheral device to perform one of its basic functions.
- Random access (direct access) —
- Reading records from or writing records to a file in any order.
- Record —
- The set of fields associated with an occurrence of an entity.
- Relative addressing —
- Assigning each record (or byte) in a file to an address that represents its position relative to the beginning of the file.
- Sequential access —
- Reading records from or writing records to a file in key and/or physical storage order.
- Spanned record —
- A logical record that extends over two or more physical records.
- Temporary file —
- A file that holds intermediate results and exists for only a brief time.
- Transaction file —
- A file that holds current data; a transaction file is often used to update (or maintain) a master file.
44.6 Software
Most programming languages support the file organizations described in this chapter.