**HKUST-TCSC-2000-01**-
Jürgen Albert, Dora Giammarresi and Derick Wood.

Normal Form Algorithms for Extended Context-Free Grammars.We investigate the complexity of a variety of normal-form transformations for extended context-free grammars, where by extended we mean that the set of right-hand sides for each nonterminal in such a grammar is a regular set. The study is motivated by the implementation project GraMa which will provide a C++ toolkit for the symbolic manipulation of context-free objects just as Grail does for regular objects.

Our results generalize known complexity bounds for context-free grammars but do so in nontrivial ways. Specifically, we introduce a new representation scheme for extended context-free grammars (the symbol-threaded expression forest), a new normal form for these grammars (dot normal form) and new regular expression algorithms.

This report has been submitted for journal publication; a preliminary version appeared at WIA '98.

**HKUST-TCSC-2000-02**-
Anne Brüggemann-Klein and Derick Wood.

Caterpillars, Context, Tree Automata and Tree Pattern Matching.We present a novel, yet simple, technique for the specification of context in structured documents that we call caterpillar expressions. Although we are primarily applying this technique in the specification of context-dependent style sheets for HTML, SGML and XML documents, it can also be used for query specification for structured documents, as we shall demonstrate, and for the specification of computer program transformations.

From a conceptual point of view, structured documents are trees, and one of the oldest and best-established techniques to process trees and, hence, structured documents are tree automata. We present a number of theoretical results that allow us to compare the expressive power of caterpillar expressions and caterpillar automata, their companions, to the expressive power of tree automata. In particular, we demonstrate that each caterpillar expression describes a regular tree language that is, hence, recognizable by a tree automaton.

Finally, we employ caterpillar expressions for tree pattern matching. We demonstrate that caterpillar automata are able to solve tree-pattern-matching problems for some, but not all, types of tree inclusion that Kilpeläinen investigated in his PhD thesis. In simulating tree pattern matching with caterpillar automata, we reprove some of Kilpeläinen's results in a uniform framework.

This report will appear in the post-proceedings of the DLT '99 conference published by World Scientific Publishing Co., Singapore.

**HKUST-TCSC-2000-03**-
Sunil Arya and Ho-Yam Addy Fu.

Expected-Case Complexity of Approximate Nearest Neighbor Searching.Most research in algorithms for geometric query problems has focused on their worst-case performance. But when information on the query distribution is available, the alternative paradigm of designing and analyzing algorithms from the perspective of expected-case performance appears more attractive. We study the approximate nearest neighbor problem from this point of view.

As a first step in this direction, we assume that the query points are chosen uniformly from a hypercube that encloses all the data points; however, we make no assumption on the distribution of data points. We investigate three simple variants of partition trees: sliding-midpoint, balance-split, and hybrid-split trees. We show that with these simple tree-based data structures, it is possible to achieve linear space and logarithmic or polylogarithmic query time in the expected case. In contrast, the data structures known to achieve linear space and logarithmic query time in the worst case are complex, and algorithms on them run more slowly in practice. Moreover, for the sliding-midpoint tree, we prove that it achieves optimal expected query time under reasonable assumptions.

**HKUST-TCSC-2000-04**-
Vicky Choi and Mordecai Golin

Lopsided Trees: Analyses, Algorithms, and Applications.Lopsided trees are rooted, ordered, trees in which the length of an edge from a node to its

*i*th child depends upon the value of*i.*These trees model a variety of problems and have therefore been extensively studied. In this paper we combine analytic and combinatorial techniques to address three open problems on such trees:- Given
*n*, efficiently construct a lopsided tree with*n*leaves that has minimal external-path-length. - Express the cost of the minimal external-path-length tree
as a function of
*n.* - Calculate exactly how many nodes of depth
*<= x*exist in the infinite lopsided tree.

Lopsided trees model

*Varn codes*, prefix free codes in which the letters of the encoding alphabet can have different lengths. The solutions to the first and second problems above solve corresponding open problems on Varn codes. The solution to the third problem can be used to model the performance of broadcasting algorithms in the postal model of communication. Finding these solutions requires generalizing the definition of Fibonacci numbers and then using Mellin-transform techniques.Keywords: Varn Codes, Fibonacci Recurrences, Mellin Transforms, Postal Model.

- Given
**HKUST-TCSC-2000-05**-
Phil Bradford, Mordecai J. Golin, Lawrence L. Larmore and Wojciech Rytter.

Prefix-Free Codes for Unequal Letter Costs: Dynamic Programming with the Monge Property.In this paper we discuss the problem of finding optimal prefix-free codes for unequal letter costs, a variation of the classical Huffman coding problem. Our problem consists of finding a minimal cost prefix-free code in which the encoding alphabet consists of unequal cost (length) letters, with lengths

*alpha*and*beta*. The most efficient algorithm known previously requires*O(n*time to construct such a minimal-cost set of^{2+max(alpha,beta)})*n*codewords, provided*alpha*and*beta*are integers. In this paper we provide an*O(n*time algorithm. Our improvement comes from the use of a more sophisticated modeling of the problem, combined with the observation that the problem possesses a "Monge property" and that the SMAWK algorithm on monotone matrices can therefore be applied. Keywords: Dynamic Programming, Huffman Codes, Lopsided Trees, Monge Matrix, Monotone Matrix, Prefix-Free Codes.^{max(alpha,beta)}) **HKUST-TCSC-2000-06**-
Zhongping Qin, Alexander Wolff, Yinfeng Xu, and Binhai Zhu

New Algorithms for Two-Label Point LabelingGiven a label shape

*L*and a set of*n*points in the plane, the two-label point-labeling problem consists of placing*2n*non-intersecting translated copies of*L*of maximum size such that each point touches two unique copies--its labels.In this paper we give new and simple approximation algorithms for

*L*an axis-parallel square or a circle. For squares we improve the best previously known approximation factor from*1/3*to*1/2*. For circles the improvement from*1/2*to roughly 0.5125 is less significant, but the fact that*1/2*is not best possible is interesting in its own right. For the decision version of the latter problem we have an NP-hardness proof that also shows that it is NP-hard to approximate the label size beyond a factor of*approx 0.7321*. We keep the*O(n logn)*time and*O(n)*space bounds of the previous algorithms.The algorithm we propose for the two-square labeling problem actually solves a different problem. It labels points with rectangles of height-width ratio 2 of maximum size in one of four positions. It is the first point-labeling algorithm that combines the following properties. It allows more than two label positions

*and*solves the size maximization problem optimally in polynomial time. This algorithm also improves the approximation factor of the only known algorithm for Knuth and Raghunathan's Metafont labeling problem from*1/3*to*1/2*. **HKUST-TCSC-2000-07**-
Anne Brüggemann-Klein and Derick Wood.

Regularly Extended Two-Way Nondeterministic Tree Automata.We establish that regularly extended two-way nondeterministic tree automata with unranked alphabets have the same expressive power as regularly extended nondeterministic tree automata with unranked alphabets. We obtain this result by establishing regularly extended versions of a congruence on trees and of a congruence on, so called, views. Our motivation for the study of these tree models is the Extensible Markup Language (XML), a metalanguage for defining document grammars. Such grammars have regular sets of right-hand sides for their productions and tree automata provide an alternative and useful modeling tool for them. In particular, we believe that they provide a useful computational model for what we call caterpillar expressions.

This is a preliminary version of the paper presented at CIAA 2000.

**HKUST-TCSC-2000-08**-
Anne Brüggemann-Klein and Derick Wood.

Caterpillars: A Context Specification Technique.We present a novel, yet simple, technique for the specification of context in structured documents that we call caterpillar expressions. Although we are primarily applying this technique in the specification of context-dependent style sheets for HTML, SGML and XML documents, it can also be used for query specification for structured documents, as we shall demonstrate, and for the specification of computer program transformations.

From a conceptual point of view, structured documents are trees, and one of the oldest and best-established techniques to process trees and, hence, structured documents are tree automata. We present a number of theoretical results that allow us to compare the expressive power of caterpillar expressions and caterpillar automata, their companions, to the expressive power of tree automata. In particular, we demonstrate that each caterpillar expression describes a regular tree language that is, hence, recognizable by a tree automaton.

Finally, we employ caterpillar expressions for tree pattern matching. We demonstrate that caterpillar automata are able to solve tree-pattern-matching problems for some, but not all, types of tree inclusion that Kilpeläinen investigated in his PhD thesis. In simulating tree pattern matching with caterpillar automata, we reprove some of Kilpeläinen's results in a uniform framework.

This is a preliminary version of the paper that will appear in

*Markup Languages.* **HKUST-TCSC-2000-09**-
Anne Brüggemann-Klein and Derick Wood.

Document Engineering with Extensible Abstract Document Structures.Document engineering is the systematic development of document presentations, representations and document tools. We particularly address document engineering for the Extensible Markup Language (XML); both its Document Type Definitions (DTDs) and its DTD-conforming documents.

Our thesis is that document engineering should be firmly based on explicit, formal document models. This precept is well accepted and practiced in, for example, the database community but it is, apparently, less accepted in the XML community. We want to change this position.

As a first step in the demonstration of the value of explicit, formal document models we derive and discuss an explicit, formal model for XML DTDs and their DTD-conforming documents. This new model captures what can be considered to be the essence of an XML DTD and its DTD-conforming documents.

This paper is a preliminary version of our PODDP 2000 presentation.

**HKUST-TCSC-2000-10**-
Anne Brüggemann-Klein and Derick Wood.

The Regularity of Two-Way Nondeterministic Tree Automata Languages.We establish that regularly extended two-way nondeterministic tree automata with unranked alphabets have the same expressive power as regularly extended nondeterministic tree automata with unranked alphabets. We obtain this result by establishing regularly extended versions of a congruence on trees and of a congruence on, so called, views. Our motivation for the study of these tree models is the Extensible Markup Language (XML), a metalanguage for defining document grammars. Such grammars have regular sets of right-hand sides for their productions and tree automata provide an alternative and useful modeling tool for them. In particular, we believe that they provide a useful computational model for what we call caterpillar expressions.

This full version has been selected for possible publication in the

*International Journal of Foundations of Computer Science;*an extended abstract was presented at CIAA 2000.

Web page maintained by Siu-Wing Cheng