Remove all non-"word characters" from a String in Java, leaving accented characters?


nasanasas 2020. 11. 13. 08:21


Apparently Java's regex flavor counts umlauts and other special characters as "non-word characters" when I use a regex.

        "TESTÜTEST".replaceAll( "\\W", "" )

returns "TESTTEST" for me. What I want is for only the characters that are truly not "word characters" to be removed. Is there any way to do this without having something like

         "[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]"

only to realize I forgot ô?

Use [^\p{L}\p{Nd}]+, which matches all (Unicode) characters that are neither letters nor (decimal) digits.

In Java:

String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");
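To make the difference concrete, here is a small sketch comparing the question's `\W` with the Unicode-aware class above. The class and method names are made up for illustration. The third variant uses the embedded flag `(?U)` (`Pattern.UNICODE_CHARACTER_CLASS`, available since Java 7), which is not part of the answer above but may be a closer match to "word characters" because it keeps the underscore.

```java
class WordCharDemo {
    // ASCII-only notion of "word character": accented letters are removed too.
    static String stripAscii(String s) {
        return s.replaceAll("\\W", "");
    }

    // Keeps any Unicode letter or decimal digit; note that '_' is removed here.
    static String stripLettersDigits(String s) {
        return s.replaceAll("[^\\p{L}\\p{Nd}]+", "");
    }

    // Assumption: Java 7+. (?U) makes \w and \W follow Unicode definitions.
    static String stripUnicodeWord(String s) {
        return s.replaceAll("(?U)\\W", "");
    }

    public static void main(String[] args) {
        // U-umlaut and o-circumflex are written as escapes to stay encoding-safe.
        String input = "TEST-\u00DC_TEST \u00F4!";
        System.out.println(stripAscii(input));
        System.out.println(stripLettersDigits(input));
        System.out.println(stripUnicodeWord(input));
    }
}
```

Which variant fits depends on whether the underscore should count as a word character for your use case.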

Edit:

I changed \p{N} to \p{Nd} because the former also matches some number symbols such as ¼; the latter doesn't. See it on regex101.com.
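A quick way to check the \p{N} versus \p{Nd} distinction in Java itself is sketched below. The class and method names are hypothetical. The sample string contains the vulgar fraction one quarter (U+00BC, category No) and an Arabic-Indic digit three (U+0663, category Nd).

```java
class NumberClassDemo {
    // \p{N} covers every numeric category, including fractions and other number symbols.
    static String keepLettersAndNumbers(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}]+", "");
    }

    // \p{Nd} covers decimal digits only (in any script).
    static String keepLettersAndDigits(String s) {
        return s.replaceAll("[^\\p{L}\\p{Nd}]+", "");
    }

    public static void main(String[] args) {
        String input = "a\u00BCb 7\u0663";
        System.out.println(keepLettersAndNumbers(input)); // the fraction survives
        System.out.println(keepLettersAndDigits(input));  // the fraction is removed
    }
}
```

Note that \p{Nd} still keeps non-ASCII decimal digits; if only 0-9 should survive, an explicit range would be needed instead.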

I was trying to achieve the exact opposite when I bumped into this thread. I know it's quite old, but here's my solution nonetheless. You can use Unicode blocks in the regex. In this case, compile the following code (with the right imports):

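import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "äêìóblah";
// \p{InLatin-1Supplement} matches any character in the Latin-1 Supplement block (U+0080 to U+00FF)
Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+");
Matcher m = p.matcher(s);
System.out.println(m.find());                        // is there at least one such run?
System.out.println(s.replaceAll(p.pattern(), "#"));  // replace each run with '#'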

You should see the following output:

true

#blah

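Best,

Sometimes you don't want to simply remove the characters, but only remove the accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL: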

import java.text.Normalizer;
import java.text.Normalizer.Form;

import org.apache.commons.lang.StringUtils;

/**
 * Utility class for String manipulation.
 * 
 * @author Stefan Haberl
 */
public abstract class TextUtils {
    private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
    private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue",
            "sz" };

    /**
     * Normalizes a String by removing all accents to original 127 US-ASCII
     * characters. This method handles German umlauts and "sharp-s" correctly
     * 
     * @param s
     *            The String to normalize
     * @return The normalized String
     */
    public static String normalize(String s) {
        if (s == null)
            return null;

        String n = null;

        n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
        n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");

        return n;
    }

    /**
     * Returns a clean representation of a String which might be used safely
     * within an URL. Slugs are a more human friendly form of URL encoding a
     * String.
     * <p>
     * The method first normalizes a String, then converts it to lowercase and
     * removes ASCII characters, which might be problematic in URLs:
     * <ul>
     * <li>all whitespaces
     * <li>dots ('.')
     * <li>(semi-)colons (';' and ':')
     * <li>equals ('=')
     * <li>ampersands ('&')
     * <li>slashes ('/')
     * <li>angle brackets ('<' and '>')
     * </ul>
     * 
     * @param s
     *            The String to slugify
     * @return The slugified String
     * @see #normalize(String)
     */
    public static String slugify(String s) {

        if (s == null)
            return null;

        String n = normalize(s);
        n = StringUtils.lowerCase(n);
        n = n.replaceAll("[\\s.:;&=<>/]", "");

        return n;
    }
}

Being a German speaker, I've included proper handling of German umlauts as well. The list should be easy to extend for other languages.
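For readers who don't have Apache Commons Lang on the classpath, the same idea can be sketched with the standard library only. This is an illustrative rewrite under that assumption, not the author's class: `SlugSketch` and its constants are hypothetical names, and plain `String.replace` stands in for `StringUtils.replaceEachRepeatedly`.

```java
import java.text.Normalizer;
import java.util.Locale;

class SlugSketch {
    // German umlauts and sharp-s (written as escapes) with their transliterations.
    private static final String[] SEARCH =
        { "\u00C4", "\u00E4", "\u00D6", "\u00F6", "\u00DC", "\u00FC", "\u00DF" };
    private static final String[] REPLACE =
        { "Ae", "ae", "Oe", "oe", "Ue", "ue", "sz" };

    static String normalize(String s) {
        if (s == null) return null;
        String n = s;
        for (int i = 0; i < SEARCH.length; i++) {
            n = n.replace(SEARCH[i], REPLACE[i]);
        }
        // NFD splits accented letters into base letter + combining mark;
        // dropping non-ASCII then removes the marks and any other non-ASCII character.
        return Normalizer.normalize(n, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
    }

    static String slugify(String s) {
        if (s == null) return null;
        return normalize(s).toLowerCase(Locale.ROOT).replaceAll("[\\s.:;&=<>/]", "");
    }

    public static void main(String[] args) {
        // Input reads "Uber Cafe: Grosse" with umlauts, e-acute and sharp-s.
        System.out.println(slugify("\u00DCber Caf\u00E9: Gr\u00F6\u00DFe"));
    }
}
```

Non-ASCII characters that have no decomposition and no entry in the list are dropped entirely, so extending the list per language matters.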

HTH

Edit: Note that it may be unsafe to include the returned String in a URL. You should at least HTML-encode it to prevent XSS attacks.
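As a rough illustration of what HTML-encoding means here, the sketch below escapes the five characters that are significant in HTML text and attribute values. The class and method names are hypothetical, and in a real project an established library escaper is probably a safer choice than a hand-written one.

```java
class HtmlEscapeSketch {
    // Replaces &, <, >, double quote and single quote with HTML entities.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<a href=\"x\">T&C</a>"));
    }
}
```

This matters for the class above because slugify does not strip quote characters, so a slug can still contain them.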

Well, here is one solution I ended up with, but I hope there's a more elegant one...

StringBuilder result = new StringBuilder();
for(int i=0; i<name.length(); i++) {
    char tmpChar = name.charAt( i );
    if (Character.isLetterOrDigit( tmpChar) || tmpChar == '_' ) {
        result.append( tmpChar );
    }
}

result ends up with the desired result...
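A somewhat more compact variant of the same loop, assuming Java 8 or later, iterates over code points instead of chars. The class and method names are made up for this sketch. The practical difference is that letters outside the Basic Multilingual Plane, which are stored as surrogate pairs, are kept rather than dropped.

```java
class CodePointFilter {
    // Keeps letters, digits and '_' using the code-point overload of isLetterOrDigit.
    static String keepWordChars(String name) {
        return name.codePoints()
                .filter(cp -> Character.isLetterOrDigit(cp) || cp == '_')
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // U-umlaut written as an escape to stay encoding-safe.
        System.out.println(keepWordChars("TEST-\u00DC_TEST!"));
    }
}
```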

You might want to remove the accents and diacritic signs first, then on each character position check if the "simplified" string is an ascii letter - if it is, the original position shall contain word characters, if not, it can be removed.
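One possible reading of that suggestion is sketched below: each code point is decomposed with NFD, its combining marks are stripped, and the original character is kept only if what remains is an ASCII letter or digit. The names are hypothetical, and treating digits as word characters is an assumption added here.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class AccentAwareFilter {
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");
    private static final Pattern ASCII_WORD = Pattern.compile("[A-Za-z0-9]+");

    static String keepWordChars(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String original = new String(Character.toChars(cp));
            // "Simplify" the character: decompose it, then drop the accents.
            String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
            String simplified = MARKS.matcher(decomposed).replaceAll("");
            if (ASCII_WORD.matcher(simplified).matches()) {
                out.appendCodePoint(cp); // keep the ORIGINAL, accented character
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepWordChars("TEST-\u00DCTEST!"));
    }
}
```

A limitation of this approach is that letters with no canonical decomposition, such as the German sharp-s, do not simplify to an ASCII letter and are therefore removed.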

You can use StringUtils from apache

Reference URL: https://stackoverflow.com/questions/1611979/remove-all-non-word-characters-from-a-string-in-java-leaving-accented-charact
