I came to Berkeley in Fall 2012 as an undergraduate in Industrial Engineering and Operations Research (IEOR). Starting my fifth semester I began to take courses in core CS, upper division mathematics, and graduate-level operations research. I spent this past summer developing constant-factor approximation algorithms for scheduling MapReduce jobs (at the University of Maryland, College Park), and this December I applied to PhD programs across the country (majority CS with some OR). I’ve been offered admission to Cornell’s Operations Research PhD program, and I’ve been invited to the visit day for Caltech’s Computing + Mathematical Sciences PhD program.
Given that I’ve yet to start graduate school, I haven’t determined where my research will end up leading. Here’s a non-exhaustive list of interests:
- Algorithms for graphs.
- Computational biology.
- Algorithms for unconstrained optimization.
- Algorithms for constrained optimization (particularly interior point methods).
- Constant-factor approximation algorithms.
- Theoretical aspects of distributed and parallel computation.
- Control (as in “control theory”) of high performance computing systems.
My main reason for taking CS 267 is to make progress on the last two points on that list. I think this course is a good starting point precisely because it is application focused; understanding what high performance computing actually looks like is critical in selecting research projects in the HPC domain which have both theoretical significance and practical interest.
The rest of this post will provide a high level overview of an exciting paper that appeared in a recent supercomputing conference: Parallel Distributed Memory Construction of Suffix and Longest Common Prefix Arrays (SC ’15).
An Application Area : Building Data Structures for Computational Biology
Why does Computational Biology need special data structures? What is the data structure considered in the paper?
Many problems in computational biology involve searching a massive reference text for exact or inexact matches of a large number of shorter query strings. One of the most common reference texts is the human genome – represented as a sequence of approximately 10^9 characters. The set of query strings varies greatly with the application in mind, but will often contain more than 10^6 “short reads” of approximately 100 characters each.
Under these circumstances, we cannot afford to search the entire reference text for each query. Instead, we preprocess the reference text to support querying that is linear not in the size of the reference text, but in the size of the query string. That is, given a single short read of 100 characters and a reference text of 10^9 characters, we can correctly find a match (or determine that one does not exist) in approximately 100 steps, rather than 1,000,000,000 steps.
The most fundamental data structure for this type of work is the Suffix Tree. The suffix tree of a string X is characterized by the fact that every suffix of X is spelled out by a unique path from the root to a leaf corresponding to that suffix. There are other requirements for suffix trees, but “every suffix gets a unique root-to-leaf path” is the most conceptually important. To get an idea of what the suffix tree of a given word looks like, check out this app on VisuAlgo.*
Suffix trees are great, but link-based data structures take up more memory and can be slightly slower than appropriately designed array-based data structures. In view of this, Manber and Meyers (1990) proposed an array-based alternative to suffix trees that they called Suffix Arrays.
The suffix array “SA” of a string X satisfies the property that SA[i] is the location in X of the i-th smallest suffix of X (where a stuff X[a:end] is smaller than X[b:end] if X[a:end] comes before X[b:end] in an alphabetically sorted list). So SA identifies the suffix of X that (in the sorted-list sense!) is smaller than every other suffix in X. Similarly, SA[len(X)-1] identifies the suffix of X that is larger than every other suffix in X. More generally, if you wanted the i-th smallest suffix of X in hand (i.e. not just an index specifying a location), you would write X[SA[i]:end]. You can also play with Suffix Arrays on VisuAlgo.net.
On its own, a suffix array does not replace a suffix tree (that is, there exist distinct strings over the same alphabet with identical suffix arrays, while the same is not true for suffix trees). To remedy this, Manber and Meyers proposed an additional data structure that can be combined with suffix arrays to capture all the information contained in suffix trees. The supplementary data structure is called the Longest Common Prefix (LCP) Array, and is defined as follows (for a reference text “X”)
LCP[i] = length_of_longest_common_prefix(X[SA[i-1]:end], X[SA[i]:end].
That is, LCP[i] is the number of uninterrupted matches among leading characters of the (i-1)-st smallest and i-th smallest suffixes. (Again, “smallest” here means “earliest in the order of an alphabetically sorted list”.) LCP does not have an interpretation, and is usually initialized to 0.
That’s a lot of background, but we’re now ready to dive into the paper itself! As is clear from the title, the authors developed a parallel algorithm for constructing suffix arrays and LCP arrays.
Introductory stuff you can gloss over unless you’re really interested or find yourself confused.
* You can use any set of letters you want in the VisuAlgo app, but your string must end with a unique delimiting character (they use the dollar sign, $). The delimiting character is necessary because some strings don’t have suffix trees. One simple example is the word “banana”. Banana violates one of the suffix tree requirements (I won’t say which) since the suffix “na” is a prefix of the suffix “nana”.
Pretty much all algorithms for suffix trees, suffix arrays, and LCP arrays get around this by appending a delimiting character (usually “$”) to the reference text. The delimiting character must satisfy the property that it is the lexicographically smallest character among all possible characters in the string. The dollar sign is used as the delimiting character precisely because [$ < anyLetterOrNumber] will evaluate to True for the default ASCII-ID based character ordering.
Highlights from Parallel Distributed Memory Construction of Suffix and Longest Common Prefix Arrays (Flick and Aluru)
Where does this work fit in relation to existing literature?
*We use abbreviate “parallel distributed memory” by PDM.
General Suffix Array and LCP Array Construction
The original Suffix & LCP Array paper (by Manber and Meyers, 1990) showed how to construct these arrays in O(n log(n)) time on a sequential processor. Subsequent work by Ko and Aluru (2003) developed an O(n) scheme for sequential processors. Independent work by Karkkainen and Sanders (also 2003) developed a very different algorithm that attained the same O(n) runtime.
PDM suffix array construction was first studied by Futamura, Aluru, and Kurtz in 2001. Their approach suffered from two drawbacks (1) the entire reference text needed to be kept in memory at each processor, and (2) algorithm runtime was O(n^2) in the worst case. In 2014, Abdelhadi et al. extended this algorithm to the MPI model.
The problem has also been examined for shared-memory machines (SMM’s). Two research groups have examined SMM suffix array construction with traditional CPU’s, but both saw speedup factors < 2. In 2014, Shun published a paper (Fast Parallel Computation of Longest Common Prefixes) detailing a number of methods to construct the LCP array given the suffix array. Shun’s algorithms demonstrated speedup factors between 20 and 50. The suffix array construction problem under SMM’s has also been studied on GPU’s. Deo and Keely saw speedups up to 35x, and Osipov saw speedups between 6x and 18x.
Side-by-Side Comparison to Closest Related Work
In 2007, Kulla and Sanders developed a PDM suffix array construction algorithm that they described as “completely independent of the model of computation.” The implementation they analyzed in detail was based on the MPI model. The interested reader is encouraged to review the original paper, but suffice it to say (1) the runtime dependence on n was no worse than O(n log(n)), and (2) the algorithm included an additive complexity bottleneck – on the order of p^2 * log(p) * log(n) where p is the number of processors used in parallel.
The primary paper considered in this post (i.e. Flick and Aluru, 2015) is similar to the Kulla-Sanders paper, but differs in the following ways (1) Flick and Aluru construct the LCP array concurrently with the suffix array, where Kulla and Sanders do not construct the LCP array, and (2) Flick and Aluru remove the additive bottleneck term : p^2 log(p) * log(n), at the expense of introducing a multiplicative factor of log(n). The second of these distinctions enables the algorithm of Flick and Aluru to efficiently scale to thousands of processors.
What architecture is the algorithm designed for? How well does the algorithm scale?
The algorithm was designed for (1) high-core-count, (2) distributed memory machines. Their experimental environment consisted of 100 identical nodes, each with the following specifications:
- 2 Intel E5 2650 processors (2GHz, 8 cores/CPU)
- 128GB memory
- Node-to-node connections with 40GBit Infiniband.
So in total, the cluster contained 1600 cores.
Asymptotically, the only runtime dependence on processors comes from the time to sort n items in parallel on p processors.
How well does the algorithm empirically perform?
The speedup factors substantially exceed those in all known prior work. The tables and figures below summarize the performance of the algorithm in a variety of ways.
Table 1 shows how alternative suffix + LCP array construction methods compare to the authors’ proposed method. divsusort is the fastest-known open source implementation of a sequential suffix + LCP array algorithm. mkESA is one of the shared-memory algorithms mentioned earlier (with speedup < 2). cloudSACA is a recent implementation of the 2001 O(n^2) parallel algorithm discussed earlier.
Table 2 shows how runtime of the algorithm scales with processors in absolute time. The most striking fact is the ability to create the suffix array of the entire human genome (3 billion base-pairs) in under 5 seconds (though this is without the LCP array).
The plot below demonstrates just how well the proposed algorithm scales. As stated before, divsusort is the fastest-known open source implementation of a sequential suffix + LCP array algorithm (it represents the dotted horizontal line near the bottom of the graph. The plot demonstrates that when using 64 of the 100 available nodes, the proposed algorithm outperforms divusort by a factor greater than 100 (the paper states that the precise factor was just over 110). When using 64 nodes (1024 cores), the local inputs were only 3MB in size.
Do the authors address any open problems that remain for suffix array or LCP array construction?
The authors suggest exploring how empirical performance of their algorithm could be enhanced with hybrid parallelism. Local operations (particularly the sorting that occurs at each node) may be possible to speed up with shared-memory parallel implementations (via OpenMP) or GPU’s (via CUDA or OpenCL).