Skip to content

Commit e1d7434

Browse files
author
Esben Sparre Andreasen
committed
JS: add query js/useless-regexp-character-escape
1 parent 25130f2 commit e1d7434

14 files changed

+530
-11
lines changed

change-notes/1.23/analysis-javascript.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818
| Loop bound injection (`js/loop-bound-injection`) | security, external/cwe/cwe-834 | Highlights loops where a user-controlled object with an arbitrary .length value can trick the server to loop indefinitely. Results are not shown on LGTM by default. |
1919
| Suspicious method name (`js/suspicious-method-name-declaration`) | correctness, typescript, methods | Highlights suspiciously named methods where the developer likely meant to write a constructor or function. Results are shown on LGTM by default. |
2020
| Use of returnless function (`js/use-of-returnless-function`) | maintainability, correctness | Highlights calls where the return value is used, but the callee never returns a value. Results are shown on LGTM by default. |
21-
21+
| Useless regular expression character escape (`js/useless-regexp-character-escape`) | correctness, security, external/cwe/cwe-20 | Highlights regular expression strings with useless character escapes, indicating a possible violation of [CWE-20](https://cwe.mitre.org/data/definitions/20.html). Results are shown on LGTM by default. |
2222

2323
## Changes to existing queries
2424

javascript/config/suites/javascript/security

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44
+ semmlecode-javascript-queries/Security/CWE-020/IncompleteUrlSubstringSanitization.ql: /Security/CWE/CWE-020
55
+ semmlecode-javascript-queries/Security/CWE-020/IncorrectSuffixCheck.ql: /Security/CWE/CWE-020
66
+ semmlecode-javascript-queries/Security/CWE-020/MissingRegExpAnchor.ql: /Security/CWE/CWE-020
7+
+ semmlecode-javascript-queries/Security/CWE-020/UselessRegExpCharacterEscape.ql: /Security/CWE/CWE-020
78
+ semmlecode-javascript-queries/Security/CWE-022/TaintedPath.ql: /Security/CWE/CWE-022
89
+ semmlecode-javascript-queries/Security/CWE-022/ZipSlip.ql: /Security/CWE/CWE-022
910
+ semmlecode-javascript-queries/Security/CWE-078/CommandInjection.ql: /Security/CWE/CWE-078

javascript/ql/src/Security/CWE-020/IncompleteHostnameRegExp.ql

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@
1111
*/
1212

1313
import javascript
14+
import semmle.javascript.CharacterEscapes
1415

1516
/**
1617
* Holds if `pattern` is a regular expression pattern for URLs with a host matched by `hostPart`,
@@ -40,7 +41,9 @@ where
4041
)
4142
) and
4243
// ignore patterns with capture groups after the TLD
43-
not pattern.regexpMatch("(?i).*[.](" + RegExpPatterns::commonTLD() + ").*[(][?]:.*[)].*")
44+
not pattern.regexpMatch("(?i).*[.](" + RegExpPatterns::commonTLD() + ").*[(][?]:.*[)].*") and
45+
// avoid double reporting
46+
not CharacterEscapes::hasALikelyRegExpPatternMistake(re)
4447
select re,
4548
"This " + kind + " has an unescaped '.' before '" + hostPart +
4649
"', so it might match more hosts than expected.", aux, "here"
Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,86 @@
1+
<!DOCTYPE qhelp PUBLIC
2+
"-//Semmle//qhelp//EN"
3+
"qhelp.dtd">
4+
<qhelp>
5+
6+
<overview>
7+
<p>
8+
9+
When a character in a string literal or regular expression
10+
literal is preceded by a backslash, it is interpreted as part of an
11+
escape sequence. For example, the escape sequence <code>\n</code> in a
12+
string literal corresponds to a single <code>newline</code> character,
13+
and not the <code>\</code> and <code>n</code> characters.
14+
15+
However, not all characters change meaning when used in an
16+
escape sequence. In this case, the backslash just makes the character
17+
appear to mean something else, and the backslash actually has no
18+
effect. For example, the escape sequence <code>\k</code> in a string
19+
literal just means <code>k</code>.
20+
21+
Such superfluous escape sequences are usually benign, and
22+
do not change the behavior of the program.
23+
24+
</p>
25+
26+
<p>
27+
28+
The set of characters that change meaning when in escape
29+
sequences is different for regular expression literals and string
30+
literals.
31+
32+
This can be problematic when a regular expression literal
33+
is turned into a regular expression that is built from one or more
34+
string literals. The problem occurs when a regular expression escape
35+
sequence loses its special meaning in a string literal.
36+
37+
</p>
38+
39+
</overview>
40+
41+
<recommendation>
42+
<p>
43+
44+
Ensure that the right amount of backslashes is used when
45+
escaping characters in strings, template literals and regular
46+
expressions.
47+
48+
Pay special attention to the number of backslashes when
49+
rewriting a regular expression as a string literal.
50+
51+
</p>
52+
</recommendation>
53+
54+
<example>
55+
56+
<p>
57+
58+
The following example code checks that a string is
59+
<code>"my-marker"</code>, possibly surrounded by white space:
60+
61+
</p>
62+
63+
<sample src="examples/UselessRegExpCharacterEscape_bad_1.js"/>
64+
65+
<p>
66+
67+
However, the check does not work properly for white space
68+
as the two <code>\s</code> occurrences are semantically equivalent to
69+
just <code>s</code>, meaning that the check will succeed for strings
70+
like <code>"smy-markers"</code> instead of <code>" my-marker
71+
"</code>.
72+
73+
Address these shortcomings by either using a regular
74+
expression literal (<code>/(^\s*)my-marker(\s*$)/</code>), or by
75+
adding extra backslashes
76+
(<code>'(^\\s*)my-marker(\\s*$)'</code>).
77+
78+
</p>
79+
80+
</example>
81+
82+
<references>
83+
<li>MDN: <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Escaping">Regular expression escape notation</a></li>
84+
<li>MDN: <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#Escape_notation">String escape notation</a></li>
85+
</references>
86+
</qhelp>
Lines changed: 156 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,156 @@
1+
/**
2+
* @name Useless regular-expression character escape
3+
* @description Prepending a backslash to an ordinary character in a string
4+
* does not have any effect, and may make regular expressions constructed from this string
5+
* behave unexpectedly.
6+
* @kind problem
7+
* @problem.severity error
8+
* @precision high
9+
* @id js/useless-regexp-character-escape
10+
* @tags correctness
11+
* security
12+
* external/cwe/cwe-20
13+
*/
14+
15+
import javascript
16+
import semmle.javascript.CharacterEscapes::CharacterEscapes
17+
18+
newtype TRegExpPatternMistake =
19+
/**
20+
* A character escape mistake in regular expression string `src`
21+
* for the character `char` at `index` in `rawStringNode`, explained
22+
* by `mistake`.
23+
*/
24+
TIdentityEscapeInStringMistake(
25+
RegExpPatternSource src, string char, string mistake, ASTNode rawStringNode, int index
26+
) {
27+
char = getALikelyRegExpPatternMistake(src, mistake, rawStringNode, index)
28+
} or
29+
/**
30+
* A backslash-escaped 'b' at `index` of `rawStringNode` in the
31+
* regular expression string `src`, indicating intent to use the
32+
* word-boundary assertion '\b'.
33+
*/
34+
TBackspaceInStringMistake(RegExpPatternSource src, ASTNode rawStringNode, int index) {
35+
exists(string raw, string cooked |
36+
exists(StringLiteral lit | lit = rawStringNode |
37+
rawStringNode = src.asExpr() and
38+
raw = lit.getRawValue() and
39+
cooked = lit.getStringValue()
40+
)
41+
or
42+
exists(TemplateElement elem | elem = rawStringNode |
43+
rawStringNode = src.asExpr().(TemplateLiteral).getAnElement() and
44+
raw = elem.getRawValue() and
45+
cooked = elem.getStringValue()
46+
)
47+
|
48+
"b" = getAnEscapedCharacter(raw, index) and
49+
// except if the string is exactly \b
50+
cooked.length() != 1
51+
)
52+
}
53+
54+
/**
55+
* A character escape mistake in a regular expression string.
56+
*
57+
* Implementation note: the main purpose of this class is to associate an
58+
* exact character ___location with an alert message, in the name of
59+
* user-friendly alerts. The implementation can be simplified
60+
* significantly by only using the enclosing string ___location as the alert
61+
* ___location.
62+
*/
63+
class RegExpPatternMistake extends TRegExpPatternMistake {
64+
/**
65+
* Holds if this element is at the specified ___location.
66+
* The ___location spans column `startcolumn` of line `startline` to
67+
* column `endcolumn` of line `endline` in file `filepath`.
68+
* For more information, see
69+
* [Locations](https://help.semmle.com/QL/learn-ql/ql/locations.html).
70+
*/
71+
predicate hasLocationInfo(
72+
string filepath, int startline, int startcolumn, int endline, int endcolumn
73+
) {
74+
exists(int srcStartcolumn, int srcEndcolumn, int index |
75+
index = getIndex() and
76+
getRawStringNode()
77+
.getLocation()
78+
.hasLocationInfo(filepath, startline, srcStartcolumn, endline, srcEndcolumn)
79+
|
80+
(
81+
if startline = endline
82+
then startcolumn = srcStartcolumn + index - 1 and endcolumn = srcStartcolumn + index
83+
else (
84+
startcolumn = srcStartcolumn and endcolumn = srcEndcolumn
85+
)
86+
)
87+
)
88+
}
89+
90+
/** Gets a textual representation of this element. */
91+
string toString() { result = getMessage() }
92+
93+
abstract ASTNode getRawStringNode();
94+
95+
abstract RegExpPatternSource getSrc();
96+
97+
abstract int getIndex();
98+
99+
abstract string getMessage();
100+
}
101+
102+
/**
103+
* An identity-escaped character that indicates programmer intent to
104+
* do something special in a regular expression.
105+
*/
106+
class IdentityEscapeInStringMistake extends RegExpPatternMistake, TIdentityEscapeInStringMistake {
107+
RegExpPatternSource src;
108+
109+
string char;
110+
111+
string mistake;
112+
113+
int index;
114+
115+
ASTNode rawStringNode;
116+
117+
IdentityEscapeInStringMistake() {
118+
this = TIdentityEscapeInStringMistake(src, char, mistake, rawStringNode, index)
119+
}
120+
121+
override string getMessage() {
122+
result = "'\\" + char + "' is equivalent to just '" + char + "', so the sequence " + mistake
123+
}
124+
125+
override int getIndex() { result = index }
126+
127+
override RegExpPatternSource getSrc() { result = src }
128+
129+
override ASTNode getRawStringNode() { result = rawStringNode }
130+
}
131+
132+
/**
133+
* A backspace as '\b' in a regular expression string, indicating
134+
* programmer intent to use the word-boundary assertion '\b'.
135+
*/
136+
class BackspaceInStringMistake extends RegExpPatternMistake, TBackspaceInStringMistake {
137+
RegExpPatternSource src;
138+
139+
int index;
140+
141+
ASTNode rawStringNode;
142+
143+
BackspaceInStringMistake() { this = TBackspaceInStringMistake(src, rawStringNode, index) }
144+
145+
override string getMessage() { result = "'\\b' is a backspace, and not a word-boundary assertion" }
146+
147+
override int getIndex() { result = index }
148+
149+
override RegExpPatternSource getSrc() { result = src }
150+
151+
override ASTNode getRawStringNode() { result = rawStringNode }
152+
}
153+
154+
from RegExpPatternMistake mistake
155+
select mistake, "The escape sequence " + mistake.getMessage() + " when it is used in a $@.",
156+
mistake.getSrc().getAParse(), "regular expression"
Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
let regex = new RegExp('(^\s*)my-marker(\s*$)'),
2+
isMyMarkerText = regex.test(text);
Lines changed: 97 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,97 @@
1+
/**
2+
* Provides classes for reasoning about character escapes in literals.
3+
*/
4+
5+
import javascript
6+
7+
module CharacterEscapes {
8+
/**
9+
* Provides sets of characters (as strings) with specific string/regexp characteristics.
10+
*/
11+
private module Sets {
12+
string sharedEscapeChars() { result = "fnrtvux0\\" }
13+
14+
string regexpAssertionChars() { result = "bB" }
15+
16+
string regexpCharClassChars() { result = "cdDpPsSwW" }
17+
18+
string regexpBackreferenceChars() { result = "123456789k" }
19+
20+
string regexpMetaChars() { result = "^$*+?.()|{}[]-" }
21+
}
22+
23+
/**
24+
* Gets the `i`th character of `raw`, which is preceded by an uneven number of backslashes.
25+
*/
26+
bindingset[raw]
27+
string getAnEscapedCharacter(string raw, int i) {
28+
result = raw.regexpFind("(?<=(^|[^\\\\])\\\\(\\\\{2}){0,10}).", _, i)
29+
}
30+
31+
/**
32+
* Holds if `n` is delimited by `delim` and contains `rawStringNode` with the raw string value `raw`.
33+
*/
34+
private predicate hasRawStringAndQuote(
35+
DataFlow::ValueNode n, string delim, ASTNode rawStringNode, string raw
36+
) {
37+
rawStringNode = n.asExpr() and
38+
raw = rawStringNode.(StringLiteral).getRawValue() and
39+
delim = raw.charAt(0)
40+
or
41+
rawStringNode = n.asExpr().(TemplateLiteral).getAnElement() and
42+
raw = rawStringNode.(TemplateElement).getRawValue() and
43+
delim = "`"
44+
or
45+
rawStringNode = n.asExpr() and
46+
raw = rawStringNode.(RegExpLiteral).getRawValue() and
47+
delim = "/"
48+
}
49+
50+
/**
51+
* Gets a character in `n` that is preceded by a single useless backslash.
52+
*
53+
* The character is the `i`th character of `rawStringNode`'s raw string value.
54+
*/
55+
string getAnIdentityEscapedCharacter(DataFlow::Node n, ASTNode rawStringNode, int i) {
56+
exists(string delim, string raw, string additionalEscapeChars |
57+
hasRawStringAndQuote(n, delim, rawStringNode, raw) and
58+
if rawStringNode instanceof RegExpLiteral
59+
then
60+
additionalEscapeChars = Sets::regexpMetaChars() + Sets::regexpAssertionChars() + Sets::regexpCharClassChars() +
61+
Sets::regexpBackreferenceChars()
62+
else additionalEscapeChars = "b"
63+
|
64+
result = getAnEscapedCharacter(raw, i) and
65+
not result = (Sets::sharedEscapeChars() + delim + additionalEscapeChars).charAt(_)
66+
)
67+
}
68+
69+
/**
70+
* Holds if `src` likely contains a regular expression mistake, to be reported by `js/useless-regexp-character-escape`.
71+
*/
72+
predicate hasALikelyRegExpPatternMistake(RegExpPatternSource src) {
73+
exists(getALikelyRegExpPatternMistake(src, _, _, _))
74+
}
75+
76+
/**
77+
* Gets a character in `n` that is preceded by a single useless backslash, resulting in a likely regular expression mistake explained by `mistake`.
78+
*
79+
* The character is the `i`th character of the raw string value of `rawStringNode`.
80+
*/
81+
string getALikelyRegExpPatternMistake(
82+
RegExpPatternSource src, string mistake, ASTNode rawStringNode, int i
83+
) {
84+
result = getAnIdentityEscapedCharacter(src, rawStringNode, i) and
85+
(
86+
result = Sets::regexpAssertionChars().charAt(_) and mistake = "is not an assertion"
87+
or
88+
result = Sets::regexpCharClassChars().charAt(_) and mistake = "is not a character class"
89+
or
90+
result = Sets::regexpBackreferenceChars().charAt(_) and mistake = "is not a backreference"
91+
or
92+
// conservative formulation: we do not know in general if the sequence is enclosed in a character class `[...]`
93+
result = Sets::regexpMetaChars().charAt(_) and
94+
mistake = "may still represent a meta-character"
95+
)
96+
}
97+
}

javascript/ql/test/query-tests/Security/CWE-020/IncompleteHostnameRegExp.expected

Lines changed: 2 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -14,15 +14,11 @@
1414
| tst-IncompleteHostnameRegExp.js:38:2:38:44 | /^(http ... p\\/f\\// | This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:38:2:38:44 | /^(http ... p\\/f\\// | here |
1515
| tst-IncompleteHostnameRegExp.js:39:2:39:33 | /^(http ... om\\/)/g | This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:39:2:39:33 | /^(http ... om\\/)/g | here |
1616
| tst-IncompleteHostnameRegExp.js:40:2:40:30 | /^https ... le.com/ | This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:40:2:40:30 | /^https ... le.com/ | here |
17-
| tst-IncompleteHostnameRegExp.js:41:13:41:68 | '^http: ... e\\.com' | This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:41:13:41:68 | '^http: ... e\\.com' | here |
18-
| tst-IncompleteHostnameRegExp.js:41:41:41:68 | '^https ... e\\.com' | This string, which is used as a regular expression $@, has an unescaped '.' before 'example.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:41:13:41:68 | '^http: ... e\\.com' | here |
19-
| tst-IncompleteHostnameRegExp.js:42:13:42:62 | '^http[ ... \\/(.+)' | This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:42:13:42:62 | '^http[ ... \\/(.+)' | here |
2017
| tst-IncompleteHostnameRegExp.js:43:2:43:33 | /^https ... e.com$/ | This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:43:2:43:33 | /^https ... e.com$/ | here |
2118
| tst-IncompleteHostnameRegExp.js:44:9:44:101 | '^proto ... ernal)' | This regular expression has an unescaped '.' before 'example-b.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:44:9:44:101 | '^proto ... ernal)' | here |
2219
| tst-IncompleteHostnameRegExp.js:46:2:46:29 | /^(exam ... e.com)/ | This regular expression has an unescaped '.' before 'dev\|example.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:46:2:46:29 | /^(exam ... e.com)/ | here |
23-
| tst-IncompleteHostnameRegExp.js:48:13:48:68 | '^http: ... e\\.com' | This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:48:13:48:68 | '^http: ... e\\.com' | here |
24-
| tst-IncompleteHostnameRegExp.js:48:41:48:68 | '^https ... e\\.com' | This string, which is used as a regular expression $@, has an unescaped '.' before 'example.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:48:13:48:68 | '^http: ... e\\.com' | here |
20+
| tst-IncompleteHostnameRegExp.js:48:13:48:68 | '^http: ... \\\\.com' | This regular expression has an unescaped '.' before 'example\\.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:48:13:48:68 | '^http: ... \\\\.com' | here |
21+
| tst-IncompleteHostnameRegExp.js:48:41:48:68 | '^https ... \\\\.com' | This string, which is used as a regular expression $@, has an unescaped '.' before 'example\\.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:48:13:48:68 | '^http: ... \\\\.com' | here |
2522
| tst-IncompleteHostnameRegExp.js:53:13:53:36 | 'test.' ... e.com$' | This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected. | tst-IncompleteHostnameRegExp.js:53:13:53:36 | 'test.' ... e.com$' | here |
2623
| tst-SemiAnchoredRegExp.js:30:2:30:23 | /^good. ... er.com/ | This regular expression has an unescaped '.' before 'com\|better.com', so it might match more hosts than expected. | tst-SemiAnchoredRegExp.js:30:2:30:23 | /^good. ... er.com/ | here |
2724
| tst-SemiAnchoredRegExp.js:66:13:66:34 | '^good. ... er.com' | This regular expression has an unescaped '.' before 'com\|better.com', so it might match more hosts than expected. | tst-SemiAnchoredRegExp.js:66:13:66:34 | '^good. ... er.com' | here |
28-
| tst-SemiAnchoredRegExp.js:67:13:67:36 | '^good\\ ... r\\.com' | This regular expression has an unescaped '.' before 'com\|better.com', so it might match more hosts than expected. | tst-SemiAnchoredRegExp.js:67:13:67:36 | '^good\\ ... r\\.com' | here |

0 commit comments

Comments
 (0)