A Fast Regular Expression Indexing Engine

even obvious, methods to speed up a regex matching engine. ? using a carefully designed index … the following example to see how a regex engine can help … – by J Cho – Cited by 24 – Related articles – All 15 versions

More PDF Content

A Fast Regular Expression Indexing Engine
Page 1
A Fast Regular Expression Indexing Engine Junghoo Cho University of California, Los Angeles cho@cs.ucla.edu Sridhar Rajagopalan IBM Almaden Research sridhar@almaden.ibm.com Abstract In this paper, we describe the design, architecture, and the lessons learned from the implementation of a fast reg- ular expression indexing engine FREE. FREE uses a pre- built index to identify the text data units which may con- tain a matching string and only examines these further. In this way, FREE shows orders of magnitude performance im- provement in certain cases over standard regular expression matching systems, such as lex, awk and grep [18, 4]. 1 Introduction The amount of data in text databases and the world wide web continues to grow at a prodigious rate. 1 This unprece- dented increase in the size of modern datasets has made otherwise simple and well understood tasks challenging. A good instance of these challenges is the matching of reg- ular patterns (i.e. find all substrings which satisfy a given regular expression) in a corpus. For reasonably large text databases, say, 100G bytes, matching a regular expression takes several days. This paper addresses scale and perfor- mance issues when we match regular expressions (a regex) against a large corpus. We propose and consider one of the simplest, perhaps even obvious, methods to speed up a regex matching engine – using a carefully designed index to restrict the amount of data which needs to be examined. Even in this elementary context, there are a fairly large set of interesting choices and design decisions to make. We first illustrate how one can use an index to speed up a regex matching. Example 1.1 The following regex describes all the URLs which point to MP3 files on the web. 2 From the regex, it is apparent that any web page contain- ing such a URL should contain the substring .mp3 as well. 1 An Inktomi and NEC joint study [12] announced in early 2000 that the web had exceeded 1,000,000,000 pages. 2 In Table 1 we summarize the basic syntax of a regular expression and our shorthand. Thus, by looking up the string .mp3 in an index, and only examining pages which contain the string, we could signif- icantly reduce both the processing and the I/O workload. ? The example leaves behind several unanswered ques- tions. For example, should the system also look up href= from the index? For what strings should the sys- tem create index entries? Should it create an entry for

Download A Fast Regular Expression Indexing Engine pdf from oak.cs.ucla.edu, 12 pages, 113.48KB.