
Whitespace matching regex - Java

nasanasas 2020. 8. 26. 07:57


The Java API for regular expressions states that \s will match whitespace. So the regex \\s\\s should match two whitespace characters.

Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);
while (matcher.find()) matcher.replaceAll(" ");

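The aim of this is to replace every instance of two consecutive whitespace characters with a single space. However, this does not actually work.

Am I seriously misunderstanding regexes or the term "whitespace"?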


Yes, you need to grab the result of matcher.replaceAll():

String result = matcher.replaceAll(" ");
System.out.println(result);
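
To make the point above concrete: Java strings are immutable, so Matcher.replaceAll cannot change modLine in place; it returns a new string. Here is a minimal sketch of that, where the class name ReplaceAllResult and the method collapsePairs are made up for illustration and are not part of the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceAllResult {
    private static final Pattern TWO_WS = Pattern.compile("\\s\\s");

    // Returns a copy of the input with each pair of whitespace characters
    // replaced by a single space. The input string itself is never modified.
    static String collapsePairs(String input) {
        Matcher matcher = TWO_WS.matcher(input);
        return matcher.replaceAll(" ");
    }

    public static void main(String[] args) {
        String modLine = "a  b  c";

        collapsePairs(modLine);            // result discarded, as in the question
        System.out.println("\"" + modLine + "\"");   // still "a  b  c"

        modLine = collapsePairs(modLine);  // keep the returned string
        System.out.println("\"" + modLine + "\"");   // now "a b c"
    }
}
```

No loop around find() is needed, since a single replaceAll call already scans the whole input.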

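You can't use \s in Java to match white space on its own native character set, because Java doesn't support the Unicode white space property, even though doing so is strictly required to meet UTS#18's RL1.2. What it does have is not standards-conforming, alas.

Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ (GeneralCategory=Separator), and the remaining 6 are \p{Cc} (GeneralCategory=Control).

White space is a pretty stable property, and those same ones have been around virtually forever. Even so, Java has no property that conforms to the Unicode Standard for these, so you instead have to use code like this: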

String whitespace_chars =  ""       /* dummy empty string for homogeneity */
                        + "\\u0009" // CHARACTER TABULATION
                        + "\\u000A" // LINE FEED (LF)
                        + "\\u000B" // LINE TABULATION
                        + "\\u000C" // FORM FEED (FF)
                        + "\\u000D" // CARRIAGE RETURN (CR)
                        + "\\u0020" // SPACE
                        + "\\u0085" // NEXT LINE (NEL) 
                        + "\\u00A0" // NO-BREAK SPACE
                        + "\\u1680" // OGHAM SPACE MARK
                        + "\\u180E" // MONGOLIAN VOWEL SEPARATOR
                        + "\\u2000" // EN QUAD 
                        + "\\u2001" // EM QUAD 
                        + "\\u2002" // EN SPACE
                        + "\\u2003" // EM SPACE
                        + "\\u2004" // THREE-PER-EM SPACE
                        + "\\u2005" // FOUR-PER-EM SPACE
                        + "\\u2006" // SIX-PER-EM SPACE
                        + "\\u2007" // FIGURE SPACE
                        + "\\u2008" // PUNCTUATION SPACE
                        + "\\u2009" // THIN SPACE
                        + "\\u200A" // HAIR SPACE
                        + "\\u2028" // LINE SEPARATOR
                        + "\\u2029" // PARAGRAPH SEPARATOR
                        + "\\u202F" // NARROW NO-BREAK SPACE
                        + "\\u205F" // MEDIUM MATHEMATICAL SPACE
                        + "\\u3000" // IDEOGRAPHIC SPACE
                        ;        
/* A \s that actually works for Java’s native character set: Unicode */
String     whitespace_charclass = "["  + whitespace_chars + "]";    
/* A \S that actually works for  Java’s native character set: Unicode */
String not_whitespace_charclass = "[^" + whitespace_chars + "]";

Now you can just use whitespace_charclass + "+" as the pattern in your replaceAll.
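
A usage sketch of that idea, using an abbreviated character list to keep it short. The class name UnicodeWhitespace and both method names are illustrative. As an aside that is not part of the answer above: on Java 7 and later, the Pattern.UNICODE_CHARACTER_CLASS flag is documented to make \s follow the Unicode White_Space property, so the second method shows that route for comparison.

```java
import java.util.regex.Pattern;

class UnicodeWhitespace {
    // Abbreviated list for illustration. The doubled backslashes pass the
    // \uXXXX escapes through to the regex engine instead of the Java compiler.
    private static final String WHITESPACE_CHARS =
            "\\u0009\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0"
          + "\\u1680\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000";

    // Hand-built character class, one or more times.
    private static final Pattern RUN =
            Pattern.compile("[" + WHITESPACE_CHARS + "]+");

    // Assumes Java 7+: the flag switches \s to the Unicode definition.
    private static final Pattern RUN_WITH_FLAG =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    static String collapse(String s) {
        return RUN.matcher(s).replaceAll(" ");
    }

    static String collapseWithFlag(String s) {
        return RUN_WITH_FLAG.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        // NO-BREAK SPACE, IDEOGRAPHIC SPACE, then an ordinary space.
        String s = "a\u00A0\u3000 b";
        System.out.println(collapse(s));          // a b
        System.out.println(collapseWithFlag(s));  // a b
    }
}
```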


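Sorry about all that. Java's regexes just don't work very well on its own native character set, so you really have to jump through exotic hoops to make them work.

And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!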

Yes, it’s possible, and yes, it’s a mind-numbing mess. That’s being charitable, even. The easiest way to get a standards-conforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.

If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them to conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.


For Java (not PHP, not JavaScript, not any other):

txt.replaceAll("\\p{javaSpaceChar}{2,}"," ")

When I sent a question to the RegexBuddy (regex developer application) forum, I got a more exact reply to my \s Java question:

"Message author: Jan Goyvaerts

In Java, the shorthands \s, \d, and \w only include ASCII characters. ... This is not a bug in Java, but simply one of the many things you need to be aware of when working with regular expressions. To match all Unicode whitespace as well as line breaks, you can use [\s\p{Z}] in Java. RegexBuddy does not yet support Java-specific properties such as \p{javaSpaceChar} (which matches the exact same characters as [\s\p{Z}]).

... \s\s will match two spaces, if the input is ASCII only. The real problem is with the OP's code, as is pointed out by the accepted answer in that question."
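
The classes named in the quoted reply can be spot-checked directly. The sketch below (the class name WhitespaceClasses and its helper are illustrative) tests only a tab and a no-break space, so it is not a full comparison. As far as I know, \p{javaSpaceChar} delegates to Character.isSpaceChar, which does not accept a tab, so it may behave more like \p{Z} than like [\s\p{Z}]:

```java
class WhitespaceClasses {
    // True if the single character c matches the given regex as a whole.
    static boolean matches(String regex, char c) {
        return String.valueOf(c).matches(regex);
    }

    public static void main(String[] args) {
        char tab = '\t';
        char nbsp = '\u00A0';  // NO-BREAK SPACE
        String[] classes = { "\\s", "\\p{Z}", "\\p{javaSpaceChar}", "[\\s\\p{Z}]" };

        for (String regex : classes) {
            System.out.println(regex
                    + "  tab=" + matches(regex, tab)
                    + "  nbsp=" + matches(regex, nbsp));
        }
    }
}
```

If tabs or line breaks need to be covered as well, the union [\s\p{Z}] from the quote looks like the safer choice of the two.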


Seems to work for me:

String s = "  a   b      c";
System.out.println("\""  + s.replaceAll("\\s\\s", " ") + "\"");

will print:

" a  b   c"

I think you intended to do this instead of your code:

Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(s);
String result = s;  // fall back to the input when nothing matches
if (matcher.find()) {
    result = matcher.replaceAll(" ");
}

System.out.println(result);
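
A note on the " a  b   c" output earlier in this answer: \s\s consumes whitespace two characters at a time, so a run of six shrinks to three rather than to one. If the goal is a single space per run, a quantifier may be a better fit. A minimal sketch, with class and method names made up for illustration:

```java
class PairsVersusRuns {
    // Replaces each PAIR of whitespace characters with one space.
    static String pairs(String s) {
        return s.replaceAll("\\s\\s", " ");
    }

    // Replaces each RUN of two or more whitespace characters with one space.
    static String runs(String s) {
        return s.replaceAll("\\s{2,}", " ");
    }

    public static void main(String[] args) {
        String s = "  a   b      c";
        System.out.println("\"" + pairs(s) + "\"");  // " a  b   c"
        System.out.println("\"" + runs(s) + "\"");   // " a b c"
    }
}
```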

Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);

boolean flag = true;
while(flag)
{
 //Update your original search text with the result of the replace
 modLine = matcher.replaceAll(" ");
 //reset matcher to look at this "new" text
 matcher = whitespace.matcher(modLine);
 //search again ... and if no match , set flag to false to exit, else run again
 if(!matcher.find())
 {
 flag = false;
 }
}

For your purpose you can use this snippet:

import org.apache.commons.lang3.StringUtils;
String normalized = StringUtils.normalizeSpace(string);

This will normalize each run of whitespace to a single space and strip leading and trailing whitespace as well.
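
If adding Commons Lang is not an option, a rough standard-library approximation could look like the sketch below. This is not the library's implementation, and its handling of non-ASCII whitespace may differ; the class and method names are hypothetical.

```java
class NormalizeSpaceSketch {
    // Trims the ends, then collapses every run of whitespace to one space.
    // Only covers what Java's default \s and String.trim() treat as whitespace.
    static String normalize(String s) {
        if (s == null) {
            return null;
        }
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println("\"" + normalize("  a   b \t c  ") + "\"");  // "a b c"
    }
}
```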

String sampleString = "Hello    world!";
String pairsReplaced = sampleString.replaceAll("\\s{2}", " ");  // replaces each pair of consecutive whitespace characters

String runsReplaced = sampleString.replaceAll("\\s{2,}", " "); // replaces each run of two or more whitespace characters


Use of whitespace in REs is a pain, but I believe they work. The OP's problem can also be solved using StringTokenizer or the split() method. However, to use an RE (uncomment the println() to view how the matcher is breaking up the String), here is some sample code:

import java.util.regex.*;

public class Two21WS {
    private String  str = "";
    private Pattern pattern = Pattern.compile ("\\s{2,}");  // multiple spaces

    public Two21WS (String s) {
            StringBuffer sb = new StringBuffer();
            Matcher matcher = pattern.matcher (s);
            int startNext = 0;
            while (matcher.find (startNext)) {
                    if (startNext == 0)
                            sb.append (s.substring (0, matcher.start()));
                    else
                            sb.append (s.substring (startNext, matcher.start()));
                    sb.append (" ");
                    startNext = matcher.end();
                    //System.out.println ("Start, end = " + matcher.start()+", "+matcher.end() +
                    //                      ", sb: \"" + sb.toString() + "\"");
            }
            sb.append (s.substring (startNext));
            str = sb.toString();
    }

    public String toString () {
            return str;
    }

    public static void main (String[] args) {
            String tester = " a    b      cdef     gh  ij   kl";
            System.out.println ("Initial: \"" + tester + "\"");
            System.out.println ("Two21WS: \"" + new Two21WS(tester) + "\"");
    }
}

It produces the following (compile with javac and run at the command prompt):

% java Two21WS
Initial: " a    b      cdef     gh  ij   kl"
Two21WS: " a b cdef gh ij kl"
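
The split() route mentioned at the start of this answer could be sketched as follows. This is one possible reading of that suggestion, not code from the answer, and the class and method names are illustrative. It splits on runs of two or more whitespace characters and joins the pieces with a single space (String.join needs Java 8 or later):

```java
class SplitJoinSketch {
    // A limit of -1 keeps trailing empty strings, so a trailing run of
    // whitespace still turns into one space instead of being dropped.
    static String collapseRuns(String s) {
        return String.join(" ", s.split("\\s{2,}", -1));
    }

    public static void main(String[] args) {
        String tester = " a    b      cdef     gh  ij   kl";
        System.out.println("\"" + collapseRuns(tester) + "\"");
    }
}
```

For the tester string this should give the same result as the Two21WS class, with the single leading space left alone because it is not a run of two or more.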

Reference URL: https://stackoverflow.com/questions/4731055/whitespace-matching-regex-java
