Legacy Systems
Data Structures
Data structures constitute the physical and logical arrangement of data in files and databases. Understanding how data are organized and accessed is central to understanding transaction processing. Data structures have two fundamental components: organization and access method. Organization refers to the way records are physically arranged on the secondary storage device (for example, a disk). This may be either sequential or random. The records in sequential files are stored in contiguous locations that occupy a specified area of disk space. Records in random files are stored without regard for their physical relationship to other records of the same file. In fact, random files may have records distributed throughout a disk. The access method is the technique used to locate records and to navigate through the database or file.
No single structure is best for all processing tasks, and selecting a structure involves a trade-off between desirable features. The file operation requirements that influence the selection of the data structure are listed in Table 2-2.
In the following section, we examine several data structures that are used in flat-file systems. Recall from Chapter 1 that the flat-file model describes an environment in which individual data files are not integrated with other files. End users in this environment own their data files rather than share them with other users. Data processing is thus performed by stand-alone applications rather than integrated systems. The flat-file approach is a single-view model that characterizes legacy systems in which data files are structured, formatted, and arranged to suit the specific needs of the owner or primary user of the system. Such structuring, however, may omit or corrupt data attributes that are essential to other users, thus preventing successful integration of systems across the organization.
SEQUENTIAL STRUCTURE
Figure 2-37 illustrates the sequential structure, which is typically called the sequential access method. Under this arrangement, for example, the record with key value 1875 is placed in the physical storage space immediately following the record with key value 1874. Thus, all records in the file lie in contiguous storage spaces in a specified sequence (ascending or descending) arranged by their primary key.
Sequential files are simple and easy to process. The application starts at the beginning of the file and processes each record in sequence. Of the file processing operations in Table 2-2, this approach is efficient for Operations 4 and 5, which are, respectively, reading an entire file and finding the next record in the file. Also, when a large portion of the file (perhaps 20 percent or more) is to be processed in one operation, the sequential structure is efficient for record updating (Operation 3 in Table 2-2). Sequential files are not efficient when the user is interested in locating only one or a few records on a file. A simple analogy can be made with an audiocassette. If you want to listen to only the tenth song on the tape, you must fast-forward to the point at which you think the song is located and then press the play button. Some searching is usually required to find the beginning of the song. However, if you are interested in hearing all the songs on the tape, you can simply play it from the beginning. An example of a sequential file application is payroll processing, whereby 100 percent of the employee records on the payroll file are processed each payroll period. Magnetic tape is a cheap, effective, and commonly used storage medium for sequential files. Sequential files may also be stored on magnetic disks.
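As a rough illustration of this trade-off, the following Python sketch processes a hypothetical payroll file sequentially. The record layout and field names are invented for the example; the point is that reading the entire file is a single pass in key order, whereas locating one record may require scanning from the beginning.

```python
# A minimal sketch of sequential file processing (hypothetical fixed-length
# payroll records; the field layout is illustrative, not from the text).
from dataclasses import dataclass

@dataclass
class PayrollRecord:
    employee_number: int   # primary key; the file is sorted on this field
    hours_worked: float
    pay_rate: float

def process_entire_file(records):
    """Operations 4 and 5: read every record in stored (key) order."""
    for rec in records:                      # contiguous, ascending by key
        gross_pay = rec.hours_worked * rec.pay_rate
        print(rec.employee_number, gross_pay)

def find_record(records, key):
    """Locating a single record forces a scan from the start of the file."""
    for rec in records:
        if rec.employee_number == key:
            return rec
    return None                              # reached EOF without a match
```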
The sequential access method does not permit accessing a record directly. Applications that require direct access operations need a different data structure. The techniques described next address this need.
DIRECT ACCESS STRUCTURES
Direct access structures store data at a unique location, known as an address, on a hard disk or floppy disk. The disk address is a numeric value that represents the cylinder, surface, and block location on the disk. The operating system uses this address to store and retrieve the data record. Using our music analogy again, the direct access approach is similar to the way songs are stored on a compact disc. If the listener chooses, he or she can select a specific song directly without searching through all the other songs.
An important part of the direct access approach is in determining the disk address, which is based on the record’s primary key. Bank account numbers, social security numbers, credit card numbers, and license plate numbers are examples of primary keys that are translated into addresses to store and retrieve data by different business applications. The following techniques are examples of data structures that have direct access capability.
INDEXED STRUCTURE
An indexed structure is so named because, in addition to the actual data file, there exists a separate index that is itself a file of record addresses. This index contains the numeric value of the physical disk storage location (cylinder, surface, and record block) for each record in the associated data file. The data file itself may be organized either sequentially or randomly. Figure 2-38 presents an example of an indexed random file.
Records in an indexed random file are dispersed throughout a disk without regard for their physical proximity to other related records. In fact, records belonging to the same file may reside on different disks. A record’s physical location is unimportant as long as the operating system software can find it when needed. Searching the index for the desired key value, reading the corresponding storage location (address), and then moving the disk read/write head to the address location accomplish this. When a new record is added to the file, the data management software selects a vacant disk location, stores the record, and adds the new address to the index.
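The index lookup and record addition just described can be sketched as follows. The (cylinder, surface, block) address tuples and the use of in-memory dictionaries to stand in for the index file and the data file are illustrative assumptions, not features of any particular file system.

```python
# A minimal sketch of an indexed random file: a separate index maps each key
# to a hypothetical (cylinder, surface, block) disk address.
index = {}          # key value -> disk address; kept as a separate index file
disk = {}           # simulates the randomly organized data file

def add_record(key, record, free_locations):
    address = free_locations.pop()   # any vacant disk location will do
    disk[address] = record           # store the record wherever space exists
    index[key] = address             # add the new address to the index

def read_record(key):
    address = index.get(key)         # 1. search the index for the key value
    if address is None:
        return None                  # key not on file
    return disk[address]             # 2. move the read/write head to the address

# Usage: physical proximity is irrelevant as long as the index is correct.
free = [(99, 3, 17), (12, 0, 4), (45, 6, 210)]
add_record(1875, {"name": "Jones"}, free)
print(read_record(1875))
```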
The physical organization of the index itself may be either sequential (by key value) or random.
Random indexes are easier to maintain, in terms of adding records, because new key records are simply added to the end of the index without regard to their sequence. Indexes in sequential order are more difficult to maintain because new record keys must be inserted between existing keys. One advantage of a sequential index is that it can be searched rapidly. Because of its logical arrangement, algorithms can be used to speed the search through the index to find a key value. This becomes particularly important for large data files with associated large indexes.
The principal advantage of indexed random files is in operations involving the processing of individual records (Operations 1, 2, 3, and 6 in Table 2-2). Another advantage is their efficient use of disk storage. Records may be placed wherever there is space without concern for maintaining contiguous storage locations. However, random files are not efficient structures for operations that involve processing a large portion of a file. A great deal of access time may be required to access an entire file of records that are randomly dispersed throughout the storage device. Sequential files are more efficient for this purpose.
VIRTUAL STORAGE ACCESS METHOD STRUCTURE
The virtual storage access method (VSAM) structure is used for very large files that require routine batch processing and a moderate degree of individual record processing. For instance, the customer file of a public utility company will be processed in batch mode for billing purposes and directly accessed in response to individual customer queries. Because of its sequential organization, the VSAM structure can be searched sequentially for efficient batch processing. Figure 2-39 illustrates how VSAM uses indexes to allow direct access processing.
The VSAM structure is used for files that often occupy several cylinders of contiguous storage on a disk. To find a specific record location, the VSAM file uses a number of indexes that describe in summarized form the contents of each cylinder. For example, in Figure 2-39, we are searching for a record with the key value 2546. The access method goes first to the overall file index, which contains only the highest key value for each cylinder in the file, and determines that Record 2546 is somewhere on Cylinder 99. A quick scan of the surface index for Cylinder 99 reveals that the record is on Surface 3 of Cylinder 99.
VSAM indexes do not provide an exact physical address for a single record. However, they identify the disk track on which the record in question resides. The last step is to search the identified track sequentially to find the record with key value 2546.
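A rough Python sketch of this multi-level search, using hypothetical index contents modeled on the Figure 2-39 example, might look like this. Each index entry holds the highest key value for a cylinder or surface, so the search picks the first entry whose high key is at least the search key and then scans the identified track sequentially.

```python
# A minimal sketch of the VSAM lookup: file index -> surface index -> track scan.
# Index contents, surfaces, and the record on (cylinder 99, surface 3) are assumed.
file_index = [(1999, 98), (2999, 99), (3999, 100)]                  # (highest key, cylinder)
surface_index = {99: [(2199, 0), (2399, 1), (2499, 2), (2699, 3)]}  # per-cylinder (highest key, surface)
tracks = {(99, 3): [(2501, "..."), (2546, "target record"), (2630, "...")]}

def vsam_find(key):
    cylinder = next(c for high, c in file_index if key <= high)         # overall file index
    surface = next(s for high, s in surface_index[cylinder] if key <= high)
    for rec_key, rec in tracks[(cylinder, surface)]:                    # sequential track search
        if rec_key == key:
            return rec
    return None

print(vsam_find(2546))   # -> "target record"
```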
The VSAM structure is moderately effective for Operations 1 and 3 in Table 2-2. Because VSAM must read multiple indexes and search the track sequentially, the average access time for a single record is slower than that of the indexed sequential or indexed random structures. Direct access speed is sacrificed to achieve very efficient performance in Operations 4, 5, and 6.
The greatest disadvantage with the VSAM structure is that it does not perform record insertion operations (Operation 2) efficiently. Because the VSAM file is organized sequentially, inserting a new record into the file requires the physical relocation of all the records located beyond the point of insertion. The indexes that describe this physical arrangement must, therefore, also be updated with each insertion. This is extremely time-consuming and disruptive. One method of dealing with this problem is to store new records in an overflow area that is physically separate from the other data records in the file. Figure 2-40 shows how this is done.
A VSAM file has three physical components: the indexes, the prime data storage area, and the overflow area. Rather than inserting a new record directly into the prime area, the data management software places it in a randomly selected location in the overflow area. It then records the address of the location in a special field (called a pointer) in the prime area. Later, when searching for the record, the indexes direct the access method to the track location on which the record should reside. The pointer at that location reveals the record’s actual location in the overflow area. Thus, accessing a record may involve searching the indexes, searching the track in the prime data area, and finally searching the overflow area. This slows data access time for both direct access and batch processing.
Periodically, the VSAM file must be reorganized by integrating the overflow records into the prime area and then reconstructing the indexes. This involves time, cost, and disruption to operations. Therefore, when a file is highly volatile (records are added or deleted frequently), the maintenance burden associated with the VSAM approach tends to render it impractical. However, for large, stable files that need both direct access and batch processing, the VSAM structure is a popular option.
HASHING STRUCTURE
A hashing structure employs an algorithm that converts the primary key of a record directly into a storage address. Hashing eliminates the need for a separate index. By calculating the address, rather than reading it from an index, records can be retrieved more quickly. Figure 2-41 illustrates the hashing approach.
This example assumes an inventory file with 100,000 inventory items. The algorithm divides the inventory number (the primary key) into a prime number. Recall that a prime number is one that can be divided only by itself and 1 without leaving a residual value. Thus, the calculation will always produce a value that can be translated into a storage location. Hence, the residual 6.27215705 becomes Cylinder 272, Surface 15, and Record 705. The hashing structure uses a random file organization because the process of calculating residuals and converting them into storage locations produces widely dispersed record addresses.
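The calculation can be sketched as follows. The prime constant and the way the fractional digits are split into cylinder, surface, and record numbers are assumptions made for illustration; the text supplies only the example residual 6.27215705.

```python
# A minimal sketch of the hashing calculation: divide the primary key into a
# prime number and translate the fractional residual into a storage address.
# The prime constant and the digit split below are hypothetical.
PRIME = 999_983   # an arbitrary large prime chosen for illustration

def hash_address(inventory_number):
    quotient = PRIME / inventory_number           # divide the key into the prime
    residual = quotient - int(quotient)           # keep the fractional residual
    digits = f"{residual:.8f}"[2:]                # eight digits after the point
    cylinder = int(digits[0:3])                   # e.g. 272
    surface = int(digits[3:5])                    # e.g. 15
    record = int(digits[5:8])                     # e.g. 705
    return cylinder, surface, record

print(hash_address(15943))   # the same key always hashes to the same address
```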
The principal advantage of hashing is access speed. Calculating a record’s address is faster than searching for it through an index. This structure is suited to applications that require rapid access to individual records in performing Operations 1, 2, 3, and 6 in Table 2-2.
The hashing structure has two significant disadvantages. First, this technique does not use storage space efficiently. The storage location chosen for a record is a mathematical function of its primary key value. The algorithm will never select some disk locations because they do not correspond to legitimate key values. As much as one-third of the disk pack may be wasted.
The second disadvantage is the reverse of the first. Different record keys may generate the same (or similar) residual, which translates into the same address. This is called a collision because two records cannot be stored at the same location. One solution to this problem is to randomly select a location for the second record and place a pointer to it from the first (the calculated) location. The dark arrow in Figure 2-41 represents this technique.
The collision problem slows access to records. Locating a record displaced in this manner involves first calculating its theoretical address, searching that location, and then determining the actual address from the pointer contained in the record at that location. This has an additional implication for Operation 7 in Table 2-2—deleting a record from a file. If the first record is deleted from the file, the pointer to the second (collision) record will also be deleted and the address of the second record will be lost. This can be dealt with in two ways: (1) After deleting the first record, the collision record can be physically relocated to its calculated address, which is now vacant, or (2) the first record is marked deleted but is left in place to preserve the pointer to the collision record.
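The collision and deletion handling described above might be sketched as follows. The address values, record layout, and the choice of the mark-as-deleted option (the second approach) are illustrative assumptions, and only one collision per home address is handled to keep the example short.

```python
# A sketch of collision handling (the dark arrow in Figure 2-41): the colliding
# record goes to a randomly selected vacant location, and the calculated (home)
# location keeps a pointer to it.
import random

disk = {}   # address -> {"key", "data", "pointer", "deleted"}

def store(key, data, home_address, vacant_addresses):
    if home_address not in disk:
        disk[home_address] = {"key": key, "data": data,
                              "pointer": None, "deleted": False}
    else:  # collision: a different key hashed to the same address
        overflow = vacant_addresses.pop(random.randrange(len(vacant_addresses)))
        disk[overflow] = {"key": key, "data": data,
                          "pointer": None, "deleted": False}
        disk[home_address]["pointer"] = overflow   # trail from the home address

def retrieve(key, home_address):
    slot = disk.get(home_address)
    while slot is not None:
        if slot["key"] == key and not slot["deleted"]:
            return slot["data"]
        slot = disk.get(slot["pointer"])           # follow the pointer chain
    return None

def delete(key, home_address):
    # Option 2 from the text: mark the record deleted but leave it in place,
    # preserving the pointer to any collision record stored elsewhere.
    slot = disk.get(home_address)
    if slot is not None and slot["key"] == key:
        slot["deleted"] = True
```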
POINTER STRUCTURE
Figure 2-42 presents the pointer structure, which in this example is used to create a linked-list file. This approach stores in a field of one record the address (pointer) of a related record. The pointers provide
connections between the records. In this example, Record 124 points to the location of Record 125; Record 125 points to 126; and so on. As each record is processed, the computer program reads the pointer field to locate the next one. The last record in the list contains an EOF marker. The records in this type of file are spread over the entire disk without concern for their physical proximity with other related records. Pointers used in this way make efficient use of disk space and are efficient structures for applications that involve Operations 4, 5, and 6 in Table 2-2.
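A minimal sketch of traversing such a linked-list file, with hypothetical disk addresses and record contents, is shown below. Each record stores the address of the next related record, and the last record carries an EOF marker.

```python
# A sketch of the linked-list file in Figure 2-42: records are scattered across
# the disk, and the pointer field in each record locates the next one.
EOF = None

disk = {
    (10, 2, 7): {"key": 124, "data": "...", "next": (83, 0, 1)},
    (83, 0, 1): {"key": 125, "data": "...", "next": (5, 4, 9)},
    (5, 4, 9):  {"key": 126, "data": "...", "next": EOF},   # end of the list
}

def read_list(start_address):
    """Follow the pointer field from record to record until EOF."""
    address = start_address
    while address is not EOF:
        record = disk[address]
        yield record["key"]
        address = record["next"]      # pointer to the next related record

print(list(read_list((10, 2, 7))))    # -> [124, 125, 126]
```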
Types of Pointers
Figure 2-43 shows three types of pointers: physical address, relative address, and logical key pointers. A physical address pointer contains the actual disk storage location (cylinder, surface, and record number) that the disk controller needs. This physical address allows the system to access the record directly without obtaining further information. This method has the advantage of speed, because it does not need to be manipulated further to determine a record’s location. However, it also has two disadvantages: First, if the related record is moved from one disk location to another, the pointer must be changed. This is a problem when disks are periodically reorganized or copied. Second, the physical pointers bear no logical relationship to the records they identify. If a pointer is lost or destroyed and cannot be recovered, the record it references is also lost.
A relative address pointer contains the relative position of a record in the file. For example, the pointer could specify the 135th record in the file. This must be further manipulated to convert it to the actual physical address. The conversion software calculates this by using the physical address of the beginning of the file, the length of each record in the file, and the relative address of the record being sought.
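The conversion is simple arithmetic. In the sketch below, the file’s starting address and record length are assumed values chosen only to show the calculation.

```python
# A minimal sketch of converting a relative address pointer into a physical
# location. The file start address and record length are hypothetical.
FILE_START = 40_960     # physical address where the file begins (illustrative)
RECORD_LENGTH = 256     # fixed length of each record in bytes (illustrative)

def physical_address(relative_position):
    """The 135th record lies 134 full records past the start of the file."""
    return FILE_START + (relative_position - 1) * RECORD_LENGTH

print(physical_address(135))   # -> 40960 + 134 * 256
```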
A logical key pointer contains the primary key of the related record. This key value is then converted into the record’s physical address by a hashing algorithm.
Batch Processing Using Sequential Files
The most basic computer-processing configuration is batch mode using sequential file structures. Figure 2-44 illustrates this method.
Each program in a batch system is called a run. In this example, there is an edit run, an AR file update run, an inventory file update run, and two intermediate sort runs. The entire file or batch of records is processed through each run before it moves to the next run. When the last run finishes processing the batch, the session terminates.
A prominent feature of this system is the use of sequential files, which are simple to create and maintain. Although sequential files are still used by organizations for backup purposes, their presence in data processing is declining. This file structure is effective for managing large files, such as those used by federal and state agencies that have a high activity ratio. The activity ratio of a file is defined as the percentage of records on the file that are processed each time the file is accessed. For example, a federal payroll file has an activity ratio of 1:1. Each time the payroll file is accessed (payday), all the records on it are processed because everyone gets a paycheck.
The sequential files in the system shown in Figure 2-44 are represented in the flowchart as tapes, but remember that disks are also a common medium for sequential files. The operational description that follows applies equally to both types of media.
KEYSTROKE
The first step in this process is keystroke. In this example, clerks transcribe source documents (sales orders) to magnetic tape for processing later. The transaction file created in this step contains data about customer sales. These data are used to update the appropriate customer and inventory records. As an internal control measure, the keystroke clerks calculate control totals for the batch based on the total sales amount and total number of inventory items sold. The system uses this information to maintain the integrity of the process. After every processing run, control totals are recalculated and compared to the previously calculated value. Thus, if a record is incompletely processed, lost, or processed more than once, the batch totals calculated after the run will not equal the beginning batch totals. Once the system detects an out-of-balance condition, it sends error reports to users and data control personnel.
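The control-total check might be sketched as follows; the field names and the batch contents are hypothetical.

```python
# A minimal sketch of batch control totals: total sales amount and total items
# sold are computed at keystroke time and recalculated after every run.
def control_totals(transactions):
    total_sales = sum(t["invoice_amount"] for t in transactions)
    total_items = sum(t["quantity_sold"] for t in transactions)
    return total_sales, total_items

batch = [{"invoice_amount": 120.00, "quantity_sold": 2},
         {"invoice_amount": 75.50, "quantity_sold": 1}]

expected = control_totals(batch)          # calculated by the keystroke clerks

# ... edit run, sort runs, update runs ...

if control_totals(batch) != expected:     # recalculated after each run
    print("Out-of-balance condition: send error report to users and data control")
```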
EDIT RUN
At a predetermined time each day, the data processing department executes this batch system. The edit program is the first to be run. This program identifies clerical errors in the batch and automatically removes these records from the transaction file. Error records go to a separate error file, where an authorized person corrects and resubmits them for processing with the next day’s batch. The edit program recalculates the batch total to reflect changes due to the removal of error records. The resulting clean transaction file then moves to the next program in the system.
SORT RUNS
Before updating a sequential master file, the transaction file must be sorted and placed in the same sequence as the master file. Figure 2-45 presents record structures for the sales order transaction file and two associated master files, AR and inventory.
Notice that the record structure for the sales order file contains a primary key (PK)—a unique identifier—and two secondary key (SK) fields, ACCOUNT NUMBER and INVENTORY NUMBER. ACCOUNT NUMBER is used to identify the customer account to be updated in the AR master file. INVENTORY NUMBER is the key for locating the inventory record to be updated in the inventory master file. To simplify the example, we assume that each sale is for a single item of inventory. Because the AR update run comes first in the sequence, the sales order file must first be sorted by ACCOUNT NUMBER and then sorted by INVENTORY NUMBER to update inventory.
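The two sort runs amount to re-sequencing the same transaction batch on different keys, as in the following sketch (field names are illustrative):

```python
# A minimal sketch of the sort runs: the sales order batch is sequenced by
# ACCOUNT NUMBER for the AR update and by INVENTORY NUMBER for the inventory update.
sales_orders = [
    {"sales_order_number": 101, "account_number": 5532, "inventory_number": 17},
    {"sales_order_number": 102, "account_number": 1210, "inventory_number": 42},
    {"sales_order_number": 103, "account_number": 7305, "inventory_number": 8},
]

by_account = sorted(sales_orders, key=lambda r: r["account_number"])      # AR update run
by_inventory = sorted(sales_orders, key=lambda r: r["inventory_number"])  # inventory update run
```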
UPDATE RUNS
Updating a master file record involves changing the value of one or more of its variable fields to reflect the effects of a transaction. The system in our example performs two separate update procedures. The AR update program recalculates customer balances by adding the value stored in the INVOICE AMOUNT field of a transaction record to the CURRENT BALANCE field value in the associated AR record. The inventory update program reduces inventory levels by deducting the QUANTITY SOLD value of a transaction record from the QUANTITY ON HAND field value of the associated inventory record.
SEQUENTIAL FILE BACKUP PROCEDURES
An important characteristic of the sequential file update process is that it produces a new physical master file. The new file contains all of the records from the original file, including those that were updated by transactions and those that were not updated. The original master continues to exist. This feature provides an automatic backup capability called the grandparent–parent–child approach. The parent is the original master file, and the child is the newly created (updated) file. With the next batch of transactions, the child becomes the parent, the original parent becomes the grandparent (the backup), and a new child is created. Should the current master file (the child) become lost, damaged, or corrupted by erroneous data, a new child can be created from the original parent and the corresponding transaction file.
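A minimal sketch of this rotation, with hypothetical file names and a placeholder for the update run itself, is shown below. The point is only that each run writes a new master while prior generations are retained as backup.

```python
# A sketch of the grandparent-parent-child rotation (file names are hypothetical;
# the copy is a stand-in for the real sequential update run).
import shutil

def run_update(parent_master, transaction_file, child_master):
    """Placeholder for the update run that writes a new master file."""
    shutil.copy(parent_master, child_master)

generations = ["master_gen1.dat"]              # oldest first, newest last

def process_batch(transaction_file):
    parent = generations[-1]                   # current master becomes the parent
    child = f"master_gen{len(generations) + 1}.dat"
    run_update(parent, transaction_file, child)
    generations.append(child)                  # the child is now the current master
    # If the child is later lost or corrupted, it can be recreated by
    # reprocessing transaction_file against the retained parent.
```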
Batch Processing Using Direct Access Files
Changing the file structures from sequential to direct access greatly simplifies the system. Figure 2-46 shows a system that is functionally equivalent to the one presented in Figure 2-44 but has been redesigned to use direct access files. Notice the use of disk symbols to represent direct access storage device media.
The shift to direct access files causes two noteworthy changes to this system. The first is the elimination of the sort programs. A disadvantage of sequential file updating is the need to sort the transaction file before each update run. Sorting consumes a good deal of computer time and is error-prone when very
large files are involved. Using direct access files eliminates the need to sort transactions into a predetermined sequence.
The second change is the elimination of automatic file backup in this system. Direct access update does not produce a new physical master as a by-product of the process. Instead, changes to field values are made to the original physical file. Providing file backup requires separate procedures.
DIRECT ACCESS FILE UPDATE AND BACKUP PROCEDURES
Each record in a direct access file is assigned a unique disk location or address that is determined by its key value. Because only a single valid location exists for each record, updating the record must occur in place, using a destructive update approach similar to that used for database tables. See Figure 2-29 and the associated discussion in the chapter for details.
Direct access file backup issues are also similar to modern database systems. Because the destructive update approach leaves no backup copy of the original master file, only the current value is available to the user. If the current master becomes damaged or corrupted in some way, no backup version exists from which to reconstruct the file. To preserve adequate accounting records, special backup procedures need to be implemented like those illustrated in Figure 2-30.
General Logic for File Update
SEQUENTIAL FILE UPDATE
The logic of a sequential file update procedure is based on the following assumptions and conditions:
1. The transaction (T) file contains fewer records than the master (M) file. An organization may have thousands of customers listed in its customer (AR) file, but only a small percentage of these customers actually purchased goods during the period represented by the current batch of transactions.
2. More than one transaction record may correspond to a given master file record. For example, a department store sells its RCA 27-inch TV to several customers during a 1-day special offer. All of these transactions are separate records in the batch and must be processed against the same inventory master file record.
3. Both transaction file and master file must be in the same sequential order. For purposes of illustration, we will assume this to be ascending order.
4. The master file is presumed to be correct. Therefore, any sequencing irregularities are presumed to be errors in the transaction file and will cause the update process to terminate abnormally.
With these assumptions in mind, let’s walk through the update logic presented in Figure 2-47. This logic is divided into three sections: start-up, update loop, and end procedures.
Start-Up
The process begins by reading the first transaction (T) and the first master (M) record from their respective files in the computer’s memory. The T and M records in memory are designated as the current records.
Update Loop
The first step in the update loop is to compare the key fields of both records. One of three possible conditions will exist: T = M, T > M, or T < M.
T = M. When the key of T is equal to that of M, the transaction record matches the master record. Having found the correct master, the program updates the master from the transaction. The update program then reads another T record and compares the keys. If they are equal, the master is updated again. This continues until the key values change; recall that under Assumption 2, there may be many Ts for any M record.
T > M. The normal change in the key value relationship is for T to become greater than M. This is so because both T and M are sorted in ascending order (Assumption 3). The T > M relation signifies that processing on the current master record is complete. The updated master (currently stored in computer memory) is then written to a new master file—the child—and a new M record is read from the original (parent) file.
Because the transaction file represents a subset of the master (Assumption 1), there normally will be gaps between the key values of the transaction records. Figure 2-48 illustrates this with sample transactions and corresponding master file records. Notice the gap between Key 1 and Key 4 in the transaction file. When Key 4 is read into memory, the condition T > M exists until Master Record 4 is read. Before this, Master Records 2 and 3 are read into memory and then immediately written to the new master with- out modification.
T < M. The T < M key relationship signifies an error condition. The key of the current T record should only be smaller than that of the current M record if a T record is out of sequence (Assumption 4). For an
illustration, refer to Figure 2-48. Notice that the record with Key Value 10 in the transaction file is out of sequence. This goes undetected until the next T record (Key 7) is read. At this point, the computer’s memory contains M Record 10 (M10), and the previous M records (M6 through M9) have been read and then written, unchanged, to the new master. Reading Record T7 produces the condition T < M. It is now impossible to update Records M7 and M9 from their corresponding T records. The sequential file update process can only move forward through the files. Skipped records cannot be recovered and updated out of sequence. Because of this, the update process will be incomplete, and the data processing department must execute special error procedures to remedy the problem.
End Procedures
When the last T record is read and processed, the update loop procedure is complete. This is signaled by a special EOF record in the transaction file. At this point, one of two possible conditions will exist:
1. The M record in memory is the last record on the M file, or
2. There are still unprocessed records on the M file.
Assuming the second condition is true, the remaining records on the M file must be copied to the new master file. Some file structures indicate EOF with a record containing high key values (that is, the key field is filled with 9s) to force a permanent T > M condition, in which case all remaining M records will be read and copied to the new file. Other structures use a special EOF marker. The logic in this example assumes the latter approach. When the EOF condition is reached for the master and all the M records are copied, the update procedure is terminated.
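Putting the three sections together, the logic of Figure 2-47 might be sketched as follows. The tuple-based record format, the apply function, and the sample AR data are assumptions made for illustration; the structure of the loop follows the start-up, update loop, and end procedures described above.

```python
# A minimal sketch of the sequential update logic, assuming each record is a
# (key, data) tuple and both files are already sorted in ascending key order.
def sequential_update(transactions, old_master, apply):
    new_master = []
    t_iter, m_iter = iter(transactions), iter(old_master)

    t = next(t_iter, None)                 # start-up: read the first T record
    m = next(m_iter, None)                 # start-up: read the first M record

    while t is not None:                   # update loop until the T file hits EOF
        if m is None or t[0] < m[0]:       # T < M: transaction out of sequence
            raise ValueError(f"out-of-sequence or unmatched transaction {t[0]}")
        if t[0] == m[0]:                   # T = M: update the master in memory;
            m = (m[0], apply(m[1], t[1]))  #   there may be several Ts per M record
            t = next(t_iter, None)
        else:                              # T > M: done with this master record,
            new_master.append(m)           #   write it to the new (child) file
            m = next(m_iter, None)         #   and read the next parent record

    while m is not None:                   # end procedures: copy the remaining,
        new_master.append(m)               #   unprocessed M records to the child
        m = next(m_iter, None)
    return new_master

# Usage: add invoice amounts to AR current balances.
ar_master = [(1, 100.0), (2, 50.0), (3, 0.0), (4, 20.0)]
sales = [(1, 25.0), (1, 10.0), (4, 5.0)]
print(sequential_update(sales, ar_master, lambda bal, amt: bal + amt))
# -> [(1, 135.0), (2, 50.0), (3, 0.0), (4, 25.0)]
```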
DIRECT ACCESS FILE UPDATE
Figure 2-49 presents the general logic for updating direct access files. The two sample files of data pro- vided will be used to illustrate this process. Notice that this logic is simpler than that used for sequential files. There are three reasons for this. First, because record sequencing is irrelevant, the logic does not need to consider the relationship between the T and M key values. Consequently, the update program does not need to deal explicitly with the problem of multiple T records for a given M record. Second, unprocessed master records are not copied to a new master file. Third, complex procedures for searching the master file and retrieving the desired M record are performed by the computer’s operating system rather than by the update program.
The transaction file in Figure 2-49 is read from top to bottom. Each record is processed as it is encountered and without regard for its key sequence. First, T9 is read into memory. The operating system then searches for and retrieves the corresponding record (M9) from the master file. The current record is updated in memory and immediately written back to its original location on the master file. Records T2, T5, and T3 are all processed in the same manner. Finally, the second T9 transaction is processed just as the first, resulting in M9 being updated twice.
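A minimal sketch of this logic follows, using a dictionary to stand in for the operating system’s record retrieval and hypothetical keys and quantities.

```python
# A sketch of the direct access update: each T record is processed as it is
# encountered, the matching M record is retrieved by key, updated in memory,
# and written back to its original location (destructive update).
master = {9: 40, 2: 12, 5: 7, 3: 30}                       # key -> quantity on hand

transactions = [(9, 1), (2, 3), (5, 2), (3, 1), (9, 4)]    # (key, quantity sold)

for key, qty_sold in transactions:        # read top to bottom, in any key order
    record = master[key]                  # operating system locates the record
    master[key] = record - qty_sold       # update in memory, write back in place

print(master)   # -> {9: 35, 2: 9, 5: 5, 3: 29}  (M9 is updated twice)
```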