Data structures

Data structures

43.1 Purpose

A data structure is a way of organizing data that considers both the data items and their relationships to each other. Selecting the right data structure can improve memory utilization, improve processing efficiency, and reduce development costs.

43.2 Strengths, weaknesses, and limitations

This chapter introduces several basic principles of data structures. The strengths and weaknesses of individual data structures will be discussed in context.

43.3 Inputs and related ideas

The study of data structures is an essential component of most computer science curricula. Consequently, this chapter is aimed at non-computer science graduates who have not been exposed to formal data structures. The emphasis is on terminology and visualization rather than mathematics and formal logic. Several relevant key terms are introduced in Chapter 25. Files and databases are discussed in Chapters 44 and 45, respectively.

43.4 Concepts

A data structure is a way of organizing multiple data items (usually, data elements or records) that considers both the data items and their relationships to each other. Selecting the right data structure can improve memory utilization, improve processing efficiency, and reduce development costs.

43.4.1 Algorithms and recursion

The study of data structures is tightly linked to algorithms for efficiently creating a structure, inserting data into or deleting data from the structure, finding a specific data item, and traversing the structure. Sort and search algorithms are of particular interest.

A subroutine is considered recursive if it calls itself or if it initiates a circular chain of calls that returns eventually to itself. Many data structures are inherently recursive, so the concept of recursion appears in numerous algorithms and definitions.

43.4.2 Data elements, arrays, and pointers

A data element is the most basic unit of data that has logical meaning. A single data element might hold an integer number, a floating-point number, a character string, a Boolean value (true or false), or any other data type. In this chapter, the term data item will be used to designate a data element, a data composite, or a record.

An array is an elementary data structure that resembles a table. A one-dimensional array is sometimes called a vector; a two-dimensional array is sometimes called a matrix. Typically, one data element is stored in each array cell, and the cells are distinguished by subscripts. A transformation or mapping function is used to convert the subscripts to memory addresses. Arrays are supported by most programming languages and, thus, are used to implement several data structures.

A pointer is a link to a data item. Typically, each data item contains the data plus a pointer in the form of the address or the key of the target data item. An access vector is a list of pointers providing access to a set of data items.

43.4.3 Lists

The most basic data structure is a list (Figure 43.1). Each entry in the list is called a node, and each node holds a single data item. In an ordered list, the nodes are stored in data value or key order.

43.4.3.1 Simple lists

The simplest type of list is a set of data items stored in consecutive memory locations. Such lists serve as a basis for comparison for more sophisticated data structures. For example, searching a simple list for a specific data item means, on average, testing half the list’s entries. The efficiency of a search algorithm can be measured by comparing the number of tests it needs to find the target data item to the average number of tests required in a simple list.

43.4.3.2 Linked lists

In a simple list, inserting a node means shifting the existing nodes to make room. Similarly, deleting a node means shifting nodes to fill in the freed space. Such data movement is inefficient.

One solution is to create a linked list (Figure 43.2). In a linked list, each node contains data plus a pointer to the next node. Note that the data items need not be stored in adjacent memory locations because the pointers define the list’s logical order. To insert a node into a linked list, locate the prior node, change its pointer to the new node, and set the new node’s pointer to the next node (Figure 43.3). To delete a node from a linked list, change the appropriate pointer to “jump over” the deleted node (Figure 43.4).

Figure 43.1 An ordered list.

Figure 43.2 A linked list.

Figure 43.3 Inserting a node into a linked list

Figure 43.4 Deleting a node from a linked list.

In a singly linked list, each node points only to the next node. In a circular linked list, the last node points back to the first node. A doubly linked list contains both forward and backward pointers; in other words, each node points to the previous node and to the next node. In a multi-linked list, each node contains two or more pointers to different data items; for example, a personnel record might contain pointers to a department record and a skills inventory record.

Linked lists are used in a variety of applications. In many operating systems, the set of control blocks that hold the data necessary for dispatching and memory allocation is implemented as a linked list. In MS-DOS, the file allocation table is a linked list of clusters.

43.4.3.3 Stacks

A stack is a special type of linked list in which all insertions and deletions occur at the top. Access to the stack is controlled by a single pointer (Figure 43.5). Adding an entry to the top of the stack is called pushing the stack. Removing an entry from the top is called popping the stack. Because insertions and deletions occur only at the top, the last item added to the stack is the first item removed from the stack (last in, first out). Stacks are used for a variety of applications, including tracking procedure calls to ensure that returns are executed in the proper order, evaluating arithmetic expressions, and parsing.

Figure 43.5 Access to a stack is controlled by a single pointer.

43.4.3.4 Queues

A queue is a special type of linked list in which insertions occur at the rear and deletions occur at the front. Access to a queue is controlled by two pointers (Figure 43.6), and the first item added to a queue is the first item removed (first in, first out).

43.4.4 Trees

A list is one-dimensional. A tree is a two-dimensional data structure in which the nodes form a hierarchy (Figure 43.7). Because memory addresses are one-dimensional, implementing a tree requires a mapping function.

43.4.4.1 Terminology

A tree’s top (or base) node is called the root node. The root node is the parent to one or more level-2 child nodes, and each of those children is (potentially) a parent to children of its own. A parent of a parent (or the parent of an ancestor) is called an ancestor; a child of a child (or the child of a descend-ant) is called a descendant. Nodes that share the same level are called siblings. A branch is a link between a parent and a child. A leaf (or leaf node) is a node with no branches. A subtree is a subset of a tree that is itself a tree. A tree can be defined recursively because each node is the root node of a subtree.

Figure 43.6 Access to a queue is controlled by two pointers.

Figure 43.7 A tree.

43.4.4.2 Binary trees

In a binary tree, each node has two branches and holds a data item plus two pointers, left and right (Figure 43.8). Often, the data items are stored in default order, with the left child holding a value less than the parent and the right child holding a value greater than the parent. To search the tree for a particular data item, start at the top of the tree, go left if the search key is less than the node’s value, and go right if the search key is greater than the node’s value. Then repeat the process at the new node. Insertions are made at the appropriate leaf node. A node is deleted by modifying a pointer.

43.4.4.3 Multi-way trees

In a multi-way tree, each node holds n (two or more) values and can have (n + 1) branches. For example, if the root node holds two values, it can have three branches and thus point to three children (Figure 43.9), each of which can hold up to two values and have up to three branches. All values in the left node are less than the parent node’s minimum value. All values in the right node are greater than the parent node’s maximum value. All values in the center node lie between the parent node’s values.

43.4.4.4 B-trees, or balanced multi-way trees

A B-tree, or balanced multi-way tree is a multi-way tree that has all its leaf nodes at the same level. Balance is achieved by building the tree from the bottom up. New entries are placed in a leaf node if possible. If the target leaf node is full, the leaf is split and the median value is moved up to the parent. B-trees are commonly used to index records on a secondary storage device without overflow.

Figure 43.8 A binary tree.

Figure 43.9 A multi-way tree.

Figure 43.10 An undirected graph.

43.4.5 Graphs and networks

In a tree, the nodes form a hierarchy. Graphs and networks are less restrictive.

A graph is a set of nodes (or vertexes) linked by a set of edges. The edges on an undirected graph have no direction (Figure 43.10); in other words, it is possible to move between two nodes in any direction as long as they are connected by an edge.

On a directed graph, or digraph, each edge (or arc) has a direction (Figure 43.11). A given node’s indegree is the number of entering arcs, and its outdegree is the number of exiting arcs. A source is a node of indegree 0, and a sink is a node of outdegree 0.

A path is a sequence of edges that links a set of nodes; on a digraph, the path’s direction is significant. A cycle is a path that leads from a node back to the same node. In many operating systems, the set of requests for resources is modeled as a graph, and the graph is evaluated for cycles, which imply deadlock.

Figure 43.11 A directed graph.

On a network, or weighted graph, the edges have values. For example, a communication network might be modeled as a weighted graph with such values as Baud rate, distance, or cost associated with each edge.

The process of traversing a graph or network involves identifying subtrees or spanning trees within the graph, and then working with the subtrees. A minimum spanning tree is a subtree or spanning tree for which the sum of arc weights is minimal. For example, a project network (Chapter 21) is a network and the critical path is a minimum spanning tree.

43.5 Key terms

Access vector — A list of pointers providing access to a set of data items.

Algorithm — A rule for arriving at an answer in a finite number of steps.

Ancestor — A parent of a parent (or an ancestor).

Arc — An edge on a directed graph.

Array — An elementary data structure that resembles a table; typically, one data element is stored in each array cell and the cells are distinguished by subscripts.

Attribute — A property of an entity.

Binary tree — A special type of tree in which each node has two branches.

Branch — On a tree, a link between a parent and a child.

Child — An immediate lower-level node in a tree.

Circular linked list — A linked list in which the last node points back to the first node.

Cycle — On a graph, a path that leads from a node back to the same node.

Data element — An attribute that cannot be logically decomposed; the most basic unit of data that has logical meaning.

Data structure — A way of organizing data that considers both the data items and their relationships to each other.

Descendant — A child of a child (or a descendant).

Directed graph (digraph) — A graph on which each edge (or arc) has a direction.

Doubly linked list — A linked list in which each node contains both forward and backward pointers.

Edge — On a graph, a link between two nodes.

Entity — An object (a person, group, place, thing, or activity) about which data are stored.

Field — A data element physically stored on some medium.

File — A set of related records.

Graph — A set of nodes (or vertexes) linked by a set of edges.

Indegree — On a directed graph, the number of arcs entering a given node.

Key — The attribute or group of attributes that uniquely distinguishes one occurrence of an entity.

Leaf (leaf node) — On a tree, a node with no branches.

Linked list — A list in which each node contains data plus a pointer to the next node.

List — A series of nodes each of which holds a single data item; the most basic data structure.

Matrix — A two-dimensional array.

Minimum spanning tree — Within a graph, a subtree or spanning tree for which the sum of arc weights is minimal.

Multi-linked list — A linked list in which each node contains two or more pointers, thus providing access to two or more other nodes.

Multi-way tree — A tree in which each node holds n (two or more) values and can have (n + 1) branches.

Network (weighted graph) — A graph on which the edges have values.

Node — An entry in a list; often, a single data element or a single record.

Occurrence — A single instance of an entity.

Ordered list — A list in which the nodes are stored in data value or key order.

Outdegree — On a directed graph, the number of arcs exiting from a given node.

Parent — The immediate higher-level node in a tree.

Path — On a graph, a sequence of edges that links a set of nodes; on a digraph, the path’s direction is significant.

Pointer — A link to a data item; typically, a key value or an address.

Pop — To remove an entry from the top of a stack.

Push — To add an entry to the top of a stack.

Queue — A special type of linked list in which insertions occur at the rear and deletions occur at the front.

Record — The set of fields associated with an occurrence of an entity.

Recursion — A subroutine calling itself; a subroutine initiating a circular chain of calls that returns eventually to itself.

Root (root node) — A tree’s top (or base) node.

Siblings — Two or more nodes that share the same level.

Singly linked list — A linked list in which each node points only to the next node.

Sink — On a directed graph, a node of outdegree 0.

Source — On a directed graph, a node of indegree 0.

Stack — A special type of linked list in which all insertions and deletions occur at the top.

Subtree (spanning tree) — A tree within a graph; a subset of a tree that is itself a tree.

Tree — A two-dimensional, hierarchical data structure; a tree can be defined recursively because each node is the root node of a subtree.

Undirected graph — A graph on which the edges have no direction.

Vector — A one-dimensional array.

43.6 Software

Not applicable.

43.7 References

1. Bamford, C. and Curran, P., Data Structures, Files and Databases, MacMillan Education, London, 1987

2. Horowitz, E. and Sahni, S., Fundamentals of Data Structures in Pascal, Computer Science Press, Rockville, MD, 1984.

3. Knuth, D. E., The Art of Computer Programming, vol. 1. Fundamental Algorithms, 2nd. ed., Addison-Wesley, Reading, MA, 1973.

4. Singh, B. and Naps, T. L., Introduction to Data Structures, West Publishing, Minneapolis/St. Paul, MN, 1985.

Search This Blog

information systems course