Boorex Engine

Introduction

The Boorex engine searches text data for patterns using regular expressions. It works as follows:

  • the administrator defines a set of regular expressions and assigns an ID to each expression or group of expressions;
  • the administrator uses expression IDs and the pattern search operator in preconditions and result conditions;
  • the engine loads and compiles the expressions;
  • during execution the engine searches for patterns, evaluates the corresponding conditions, and calculates engine results accordingly;
  • engine results can be used further along the pipeline as needed.

A Boorex engine is defined with a <boorex> node. Example of a Boorex engine:

<mppd>
    <engines>

        <boorex id="select_body_part">
            <i id="by_file_name">\.gif$</i>
            <i id="by_mime_type">^image/gif$</i>

            <result id="is_match">
                <case>
                    <condition>
                            $body_part.file_name ~= by_file_name
                        $OR $body_part.mime_type ~= by_mime_type
                    </condition>
                    <result>yes</result>
                </case>
                <result>no</result>
            </result>
        </boorex>

        <mysql>
            <query id="save_body_part">
                <precondition>$engines.select_body_part.is_match $EQ yes</precondition>
                ...
            </query>
        </mysql>

    </engines>
</mppd>

In this example the boorex engine select_body_part defines two regular expressions and calculates the is_match result depending on whether these expressions are found in the body part file name or the body part MIME type. The engine's is_match result is then used in the precondition of another engine, mysql.save_body_part.

The Boorex engine doesn't have an action as a separate logical part. Instead, its functionality is provided by the pattern search operator, which is used in preconditions and result conditions.

Options

The following options are available for boorex engine:

Option    Properties
<encoding> Description: Internal encoding used by this engine. Possible values are:
  • ASCII -- the engine works only with ASCII characters;
  • Unicode -- the engine works with international characters, but this requires about 4x more RAM; expressions must be specified in UTF-8 and matched against UTF-8 text;
Type: Constant.
Default: ASCII
Attributes: none
<defaults> Description: Defines default values for expression attributes and the way defaults are applied. Possible values are:
  • global -- defaults are defined once and remain the same while all expressions are loaded; this should be fine for most cases;
  • positional -- attributes explicitly specified for an expression override the defaults for subsequent expressions;

An empty value is also allowed for this option, in which case the default value is used.

Type: Constant.
Default: global
Attributes:
  • id -- defines default expression ID; default value is match;
  • options -- defines default expression options; default value is perl,match_perl,match_not_null,match_single_line,match_not_dot_newline,match_any;
<inline>
<i>
Description: The value of the option specifies a single regular expression. Multiple options are allowed. The syntax of a regular expression and the way it is searched for in text are defined by the expression's options attribute. By default Perl regular expressions are expected.
Type: Constant. Regular expression.
Default: No default.
Attributes:
  • id -- defines expression ID; default value is defined with <defaults> option; for details see Expression ID;
  • options -- defines expression options; default value is defined with <defaults> option; for details see Expression options;
<file>
<f>
Description: Defines path to a file with regular expressions. For details see Expression file.
Type: Constant. File path.
Default: No default.
Attributes:
  • id -- defines default expression ID for expressions from the file; default value is defined with <defaults> option;
  • options -- defines default expression options for expressions from the file; default value is defined with <defaults> option;
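
As an illustration of how these options fit together, here is a minimal sketch. The engine ID, the expression IDs and the file path are hypothetical, and the comments describe the behaviour implied by the option descriptions above rather than verified output.

```xml
<mppd>
    <engines>
        <boorex id="classify_attachment">
            <!-- Match against UTF-8 text; needs about 4x more RAM than ASCII. -->
            <encoding>Unicode</encoding>

            <!-- Positional defaults: attributes set explicitly on an expression -->
            <!-- are expected to become the defaults for later expressions.      -->
            <defaults id="image" options="perl,match_perl,match_not_null,icase">positional</defaults>

            <!-- No id attribute: uses the initial default ID "image". -->
            <i>\.gif$</i>
            <!-- Explicit ID "archive"; with positional defaults it should carry over. -->
            <i id="archive">\.zip$</i>
            <!-- No id attribute: expected to use "archive" now. -->
            <i>\.rar$</i>

            <!-- Hypothetical path; expressions in the file without their own ID get "blocked". -->
            <f id="blocked">/etc/esm/boorex/blocked.txt</f>
        </boorex>
    </engines>
</mppd>
```

With <defaults> set to global instead, the third expression would be expected to keep the ID "image".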

Expression ID

Each pattern expression has an ID associated with it. If the ID is not specified explicitly, it is set to the current default expression ID. The special value "*" (asterisk) instructs the engine to use the current default, but this is only useful in an expression file. Multiple expressions may share the same ID. In this case they are tried in turn, in the order they appear in the XML, until one matches. This is equivalent to one big expression that combines all expressions with the same ID using the "|" (alternation) operator. The expression ID is used on the right-hand side of the pattern search operator to refer to a particular expression.
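
A minimal sketch of several expressions sharing one ID; the engine ID, expression ID and patterns are hypothetical. According to the description above, the three expressions should behave like the single expression \.gif$|\.png$|\.jpe?g$.

```xml
<mppd>
    <engines>
        <boorex id="detect_image">
            <!-- Three expressions with the same ID are tried in turn until one matches. -->
            <i id="image_name">\.gif$</i>
            <i id="image_name">\.png$</i>
            <i id="image_name">\.jpe?g$</i>

            <result id="is_image">
                <case>
                    <!-- The right-hand side refers to the shared ID. -->
                    <condition>$body_part.file_name ~= image_name</condition>
                    <result>yes</result>
                </case>
                <result>no</result>
            </result>
        </boorex>
    </engines>
</mppd>
```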

Expression Options

Each pattern expression has options associated with it. Options affect the way a pattern is interpreted and searched for. Options are given as a list of comma-separated strings, each of which denotes a single option. For example: "perl,match_perl,match_not_null,match_single_line". Each option string has a short synonym; for example, "icase" has the short synonym "i". Some options have an effect only if another option (or options) is set. These are called the preconditions of an option.

A few special directives specify how the options of an expression are merged with the default options:

  • A single "*" (asterisk) character as an option instructs the implementation to add all default options to the expression options. This is useful when one wants to use the default options, add or remove a few others, and avoid repeating the long list of defaults. For example, "*,icase" means "use the defaults and match case-insensitively".
  • A "!" (exclamation mark) character before an option instructs the implementation to remove this option from the expression options. For example, "*,!match_single_line" means "use the defaults but without match_single_line".
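
The merge directives can be sketched as follows. The engine ID, expression IDs and patterns are hypothetical; the comments restate what the directives above are described to do.

```xml
<mppd>
    <engines>
        <boorex id="subject_filter">
            <!-- No options attribute: the current default options apply unchanged. -->
            <i id="invoice">^Invoice \d+$</i>

            <!-- Default options plus case-insensitive matching. -->
            <i id="rolex" options="*,icase">rolex</i>

            <!-- Default options with match_single_line removed. -->
            <i id="signature" options="*,!match_single_line">^Regards,$</i>

            <!-- The same two changes written with short synonyms: I = match_single_line, i = icase. -->
            <i id="mortgage" options="*,!I,i">^mortgage offer$</i>
        </boorex>
    </engines>
</mppd>
```

Without the leading "*", only the options listed explicitly would be expected to apply.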

Most of the syntax and match flag options available in the Boost.Regex library are mirrored by Expression Options. The following table lists all supported options, with explanations taken from the Boost.Regex documentation:

Option    Properties
literal Description: Treat the string as a literal (no special characters).
Synonym: l
Preconditions:
icase Description: Specifies that matching of regular expressions against a character container sequence shall be performed without regard to case.
Synonym: i
Preconditions:
collate Description: Specifies that character ranges of the form [a-b] should be locale sensitive.
Synonym: c
Preconditions: all except literal
optimize Description: Specifies that the regular expression engine should pay more attention to the speed with which regular expressions are matched, and less to the speed with which regular expression objects are constructed. Otherwise it has no detectable effect on the program output. This currently has no effect for Boost.Regex.
Synonym: o
Preconditions:
bk_plus_qm Description: When set then \? acts as a zero-or-one repeat operator, and \+ acts as a one-or-more repeat operator.
Synonym: bq
Preconditions: basic or sed or grep or emacs
bk_vbar Description: When set then \| acts as the alternation operator.
Synonym: bv
Preconditions: basic or sed or grep or emacs
no_intervals Description: When set then bounded repeats such as a{2,3} are not permitted.
Synonym: v
Preconditions: basic or sed or grep or emacs
no_char_classes Description: When set then character classes such as [[:alnum:]] are not allowed.
Synonym: e
Preconditions: basic or sed or grep or emacs
no_escape_in_lists Description: When set this makes the escape character ordinary inside lists, so that [\b] would match either '\' or 'b'. This bit is on by default for POSIX-Extended regular expressions, but can be unset to force escapes to be recognised inside lists.
Synonym: p
Preconditions: extended or egrep or awk or basic or sed or grep or emacs
no_mod_m Description: Normally Boost.Regex behaves as if the Perl m-modifier is on, so the assertions ^ and $ match after and before embedded newlines respectively. Setting this flag is equivalent to prefixing the expression with (?-m).
Synonym: m
Preconditions: ECMAScript or perl or JavaScript or JScript or normal
mod_x Description: Turns on the perl x-modifier: causes unescaped whitespace in the expression to be ignored.
Synonym: x
Preconditions: ECMAScript or perl or JavaScript or JScript or normal
mod_s Description: Normally whether Boost.Regex will match "." against a newline character is determined by the match flag match_not_dot_newline. Specifying this flag is equivalent to prefixing the expression with (?s) and therefore causes "." to match a newline character regardless of whether match_not_dot_newline is set in the match flags.
Synonym: s
Preconditions: ECMAScript or perl or JavaScript or JScript or normal
no_mod_s Description: Normally whether Boost.Regex will match "." against a newline character is determined by the match flag match_not_dot_newline. Specifying this flag is equivalent to prefixing the expression with (?-s) and therefore causes "." not to match a newline character regardless of whether match_not_dot_newline is set in the match flags.
Synonym: nd
Preconditions: ECMAScript or perl or JavaScript or JScript or normal
basic Description: Specifies that the grammar recognized by the regular expression engine is the same as that used by POSIX basic regular expressions in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Base Definitions and Headers, Section 9, Regular Expressions (FWD.1).
Synonym: b
Preconditions:
extended Description: Specifies that the grammar recognized by the regular expression engine is the same as that used by POSIX extended regular expressions in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Base Definitions and Headers, Section 9, Regular Expressions (FWD.1).

Refer to the POSIX extended regular expression guide for more information. In addition some perl-style escape sequences are supported (The POSIX standard specifies that only "special" characters may be escaped, all other escape sequences result in undefined behavior).

Synonym: X
Preconditions:
normal Description: As ECMAScript.
Synonym: N
Preconditions:
emacs Description: Specifies that the grammar recognised is the superset of the POSIX-Basic syntax used by the emacs program.
Synonym: E
Preconditions:
awk Description: Specifies that the grammar recognized by the regular expression engine is the same as that used by POSIX utility awk in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Shells and Utilities, Section 4, awk (FWD.1).

That is to say: the same as POSIX extended syntax, but with escape sequences in character classes permitted. In addition some perl-style escape sequences are supported (actually the awk syntax only requires \a \b \t \v \f \n and \r to be recognised, all other Perl-style escape sequences invoke undefined behavior according to the POSIX standard, but are in fact recognised by Boost.Regex).

Synonym: A
Preconditions:
grep Description: Specifies that the grammar recognized by the regular expression engine is the same as that used by POSIX utility grep in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Shells and Utilities, Section 4, Utilities, grep (FWD.1).

That is to say, the same as POSIX basic syntax, but with the newline character acting as an alternation character; the expression is treated as a newline separated list of alternatives.

Synonym: G
Preconditions:
egrep Description: Specifies that the grammar recognized by the regular expression engine is the same as that used by POSIX utility grep when given the -E option in IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Shells and Utilities, Section 4, Utilities, grep (FWD.1).

That is to say, the same as POSIX extended syntax, but with the newline character acting as an alternation character in addition to "|".

Synonym: EG
Preconditions:
sed Description: As basic.
Synonym: S
Preconditions:
perl Description: As ECMAScript.
Synonym: P
Preconditions:
ECMAScript Description: Specifies that the grammar recognized by the regular expression engine uses its normal semantics: that is the same as that given in the ECMA-262, ECMAScript Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects (FWD.1).

This is functionally identical to the Perl regular expression syntax. Boost.Regex also recognizes all of the perl-compatible (?...) extensions in this mode.

Synonym: ES
Preconditions:
JavaScript Description: As ECMAScript.
Synonym: J
Preconditions:
JScript Description: As ECMAScript.
Synonym: JS
Preconditions:
match_not_bol Description: Specifies that the expression "^" should not be matched against the sub-sequence [first,first).
Synonym: nb
Preconditions:
match_not_eol Description: Specifies that the expression "$" should not be matched against the sub-sequence [last,last).
Synonym: nl
Preconditions:
match_not_bob Description: Specifies that the expressions "\A" and "\`" should not match against the sub-sequence [first,first).
Synonym: no
Preconditions:
match_not_eob Description: Specifies that the expressions "\'", "\z" and "\Z" should not match against the sub-sequence [last,last).
Synonym: ne
Preconditions:
match_not_bow Description: Specifies that the expressions "\<" and "\b" should not be matched against the sub-sequence [first,first).
Synonym: nw
Preconditions:
match_not_eow Description: Specifies that the expressions "\>" and "\b" should not be matched against the sub-sequence [last,last).
Synonym: NW
Preconditions:
match_not_dot_newline Description: Specifies that the expression "." does not match a newline character. This is the inverse of Perl's /s modifier.
Synonym: nn
Preconditions:
match_not_dot_null Description: Specifies that the expression "." does not match a character null '\0'.
Synonym: f
Preconditions:
match_prev_avail Description: Specifies that --first is a valid iterator position, when this flag is set then the flags match_not_bol and match_not_bow are ignored by the regular expression algorithms (RE.7) and iterators (RE.8).
Synonym: a
Preconditions:
match_any Description: Specifies that if more than one match is possible then any match is an acceptable result: this will still find the leftmost match, but may not find the "best" match at that position. Use this flag if you care about the speed of matching, but don't care what was matched (only whether there is one or not).
Synonym: y
Preconditions:
match_not_null Description: Specifies that the expression can not be matched against an empty sequence.
Synonym: L
Preconditions:
match_continuous Description: Specifies that the expression must match a sub-sequence that begins at first.
Synonym: u
Preconditions:
match_partial Description: Specifies that if no match can be found, then it is acceptable to return a match [from, last) such that from != last, if there could exist some longer sequence of characters [from, to) of which [from, last) is a prefix, and which would result in a full match. This flag is used when matching incomplete or very long texts; see the Boost.Regex partial matches documentation for more information.
Synonym: r
Preconditions:
match_not_initial_null Description: Don't match initial null.
Synonym: h
Preconditions:
match_all Description: Must find the whole of input even if match_any is set.
Synonym: d
Preconditions:
match_perl Description: Specifies that the expression should be matched according to the Perl matching rules, irrespective of what kind of expression was compiled.
Synonym: ml
Preconditions:
match_posix Description: Specifies that the expression should be matched according to the POSIX leftmost-longest rule, regardless of what kind of expression was compiled. Be warned that these rules do not work well with many Perl-specific features such as non-greedy repeats.
Synonym: mx
Preconditions:
match_nosubs Description: Makes the expression behave as if it had no marked subexpressions, no matter how many capturing groups are actually present.
Synonym: ms
Preconditions:
match_single_line Description: Equivalent to the inverse of Perl's /m modifier; prevents ^ from matching after an embedded newline character (so that it only matches at the start of the text being matched), and $ from matching before an embedded newline (so that it only matches at the end of the text being matched).
Synonym: I
Preconditions:

Expression File

If pattern expressions are loaded from a file, the file must follow the format specified in this section.

An expression file is a text file. Each line specifies a single expression, optionally followed by an expression ID and/or expression options. A line whose first non-whitespace character is '#' is treated as a comment and ignored. Empty lines and lines containing only whitespace are ignored too.

An expression line consists of three ordered whitespace-separated fields:

  1. Pattern expression.
    The first, mandatory field is the pattern expression itself. If the pattern contains whitespace or starts with a '/' or '#' character, it must be quoted with '/' characters. A legacy format is also supported, which allows one or more of the following options to be specified right after the closing '/' quotation character:
    i – equivalent to icase;
    m – equivalent to !match_single_line;
    s – equivalent to !match_not_dot_newline.
    All legacy options are combined with the default options and are processed before the usual expression options. The expression must be in the format defined by the expression options. Examples of expressions:
    \.gif$
    /with whitespace/
    /legacy/ims
  2. Expression ID.
    The second, optional field defines the expression ID. If this field is omitted or contains the '*' character, the default ID is used. The field can be omitted only if the expression options are omitted too; to keep the default ID while changing the expression options, use the '*' character in the expression ID field. To avoid collisions, the '/' character is not allowed within an expression ID. Examples of expressions with IDs:
    \.pdf$ pdf
    /with whitespace/ ws
    /legacy/ims leg
  3. Expression options.
    The third, optional field defines the expression options as a list of comma-separated option strings. No whitespace is allowed between options. Expression options are described in the Expression Options section above. If the options field is omitted, the default options are used. Examples of expressions with IDs and options:
    ^Subject:.*rolex.$       rolex       *,!match_single_line,icase
    ^Subject:.*mortgage.*$   mortgage    *,!I,i

Pattern search operator

The pattern search operator is a binary boolean operator that evaluates to true if a pattern is found in a text and to false otherwise. The operator can be used in preconditions and result conditions of the Boorex engine. It is written as a sequence of two characters: "~=" (tilde and equals sign). The left-hand side of the operator is a template for the text to be searched. The right-hand side is a template that must evaluate to an existing expression ID; the corresponding expression is used for searching. Usually the right-hand side is just a constant equal to one of the expression IDs. For example:

    $body_part.text ~= secret         -- searches the body part text for the expression with ID "secret"
    $recipient ~= distribution_list   -- searches the recipient address for the expression with ID "distribution_list"
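
A minimal sketch of the operator inside a result condition, combining two searches with $AND. The engine ID, result ID and the two patterns are hypothetical; the operators and macros are the ones already used elsewhere on this page.

```xml
<mppd>
    <engines>
        <boorex id="detect_leak">
            <!-- Hypothetical patterns for the two expression IDs used above. -->
            <i id="secret" options="*,icase">confidential|top secret</i>
            <i id="distribution_list" options="*,icase">^all-staff@</i>

            <result id="is_leak">
                <case>
                    <!-- Left-hand side: text to search; right-hand side: expression ID. -->
                    <condition>
                        $body_part.text ~= secret
                        $AND $recipient ~= distribution_list
                    </condition>
                    <result>yes</result>
                </case>
                <result>no</result>
            </result>
        </boorex>
    </engines>
</mppd>
```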

Runtime patterns

Runtime patterns allow patterns to be specified as templates that are evaluated, compiled and matched at runtime. In particular, this allows patterns to be retrieved from a database on a per-group or per-user basis and matched at runtime. The use-case scenario is as follows:

  • the administrator defines rules for how patterns are retrieved; patterns may be retrieved using a MySQL or HTTP query or another ESM capability; constant patterns may also be used;
  • the administrator defines a target for matching and rules for how matches (or mismatches) are handled;
  • while processing a message, mppd retrieves a pattern or a list of patterns and compiles them;
  • mppd matches the patterns against the target on a first-match basis and takes further actions depending on the match results, as defined by the administrator.

The following options are used for runtime patterns:

<mppd>
    <engines>
        ...
        <boorex>

            <!-- Specifies:                                            -->
            <!--     Single pattern id, options and expression.        -->
            <!--     Multiple patterns are allowed in which case they  -->
            <!--     are matched in turns on first-match principle.    -->
            <!-- Value: template that evaluates to pattern expression. -->
            <!-- Attributes:                                           -->
            <!--     id                                                -->
            <!--         template that evaluates to pattern id         -->
            <!--     options                                           -->
            <!--         template that evaluates to pattern syntax and -->
            <!--         matching options.                             -->
            <!-- Default: no default.                                  -->
            <expression id="$engines.mysql.get_pattern.id" options="$engines.mysql.get_pattern.options">$engines.mysql.get_pattern.expression</expression>

            <!-- Specifies:                                            -->
            <!--     List of patterns of the format equivalent to      -->
            <!--     expression file format.                           -->
            <!--     Patterns are matched in turns on first-match      -->
            <!--     principle. Multiple lists may be specified.       -->
            <!-- Value: template that evaluates to list of patterns.   -->
            <!-- Attributes:                                           -->
            <!--     id                                                -->
            <!--         template that evaluates to default pattern id -->
            <!--     options                                           -->
            <!--         template that evaluates to pattern default    -->
            <!--         syntax and matching options.                  -->
            <!-- Default: no default.                                  -->
            <expression_list id="$engines.mysql.get_pattern_list.id" options="$engines.mysql.get_pattern_list.options">$engines.mysql.get_pattern_list.list</expression_list>

            <!-- Specifies: Target data to match patterns against.     -->
            <!-- Value: template that evaluates to a UTF8 string.      -->
            <!-- Default:                                              -->
            <!--     no default. If not specified runtime patterns     -->
            <!--     won't be matched                                  -->
            <data>$headers.subject</data>

        </boorex>
        ...
    </engines>
</mppd>

To check whether any expression matched, use the $boorex_first_matched macro in the <result> nodes of the boorex engine. To get the matched data value and the matched pattern ID, use the $boorex_first_matched_value and $boorex_first_matched_id macros respectively. For example:

<mppd>
    <engines>
        <boorex>
            <expression id="case_token_pattern">...</expression>
            <expression id="opportunity_token_pattern">...</expression>
            <result id="crm_object_type">
                <case>
                    <condition>$boorex_first_matched $EQ yes $AND $boorex_first_matched_id $EQ case_token_pattern</condition>
                    <result>Case</result>
                </case>
                <case>
                    <condition>$boorex_first_matched $EQ yes $AND $boorex_first_matched_id $EQ opportunity_token_pattern</condition>
                    <result>Opportunity</result>
                </case>
                <result>$empty</result>
            </result>
            <result id="crm_object_id">
                <result>$boorex_first_matched_value</result>
            </result>
        </boorex>
    </engines>
</mppd>

Sub-matches

Runtime patterns support regular expression sub-matches. To check whether a particular sub-match matched, use ${boorex_first_matched index}, where index must evaluate to the integer index of the sub-match. Sub-match indexing starts from 1 (one). Index zero refers to the whole match and is thus equivalent to the macro without arguments. To get a sub-match value, use the ${boorex_first_matched_value index} macro with the same indexing scheme. For example, to extract the "initial" subject without preceding "Re:", "Fwd:" and combinations of them, one may use the following config:

<mppd>
    <engines>
        <boorex>
            <expression options="*,icase">^[\\s\\t]*(?:(?:Res|Re|Fwd|Fw)(?:\\[[\\d]*\\])?:[\\s\\t]*)*(.*)\$</expression>
            <data>$headers.subject</data>
            <result id="initial_subject">
                <result>${boorex_first_matched_value 1}</result>
            </result>
        </boorex>
    </engines>
</mppd>
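
The sub-match check macro can be combined with the value macro, as in the sketch below. The pattern, the expression ID and the result ID are hypothetical. The doubled backslashes follow the escaping used in the example above, and the sketch assumes that ${boorex_first_matched index} evaluates to yes on a match, like the form without arguments.

```xml
<mppd>
    <engines>
        <boorex>
            <!-- Hypothetical pattern: captures the digits of a token such as [CASE-1234]. -->
            <expression id="case_token_pattern">\\[CASE-(\\d+)\\]</expression>
            <data>$headers.subject</data>

            <result id="case_number">
                <case>
                    <!-- Use the captured digits only if sub-match 1 took part in the match. -->
                    <condition>${boorex_first_matched 1} $EQ yes</condition>
                    <result>${boorex_first_matched_value 1}</result>
                </case>
                <result>$empty</result>
            </result>
        </boorex>
    </engines>
</mppd>
```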
