Based on the Wikipedia article on the Jaro-Winkler distance:
http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
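For reference, the two values returned by the code below can be written out as formulas. This is only a sketch read off the return statements of `jaroDistance` and `jaroWinklerDistance`. The symbols are my own shorthand and not part of the code: $s_1$ and $s_2$ are the two strings, $m$ is the number of matching characters, $t$ is the halved transposition count, $\ell$ is the common-prefix length and $p$ is the scaling factor (hard-coded as 0.1 in the code).

```latex
% Assumes amsmath for the cases environment and \text.
\[
d_j =
\begin{cases}
0 & \text{if } m = 0, \\[4pt]
\dfrac{1}{3}\left( \dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m} \right) & \text{otherwise,}
\end{cases}
\]
\[
d_w = d_j + \ell \, p \, (1 - d_j), \qquad p = 0.1, \quad 0 \le \ell \le 4 .
\]
% Two equal characters count as a match only if their positions differ by at most
\[
\left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1 .
\]
```

Both values lie between 0 and 1, and 1 means the strings are identical, so despite the method names they behave as similarity scores.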
import java.util.Arrays;
public class JaroDistance {
public static double jaroDistance(String source, String target) {
int slen = source.length();
int tlen = target.length();
int i, j;
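// the match window: matching characters may be at most this many positions apart
// (clamped at 0 so that one-character strings can still match)
int matchWindow = Math.max(0, Math.max(slen, tlen)/2 - 1);
//the number of matching characters
int m = 0;
//the number of matched characters that are out of order (halved after counting)
int t = 0;
boolean[] smatched = new boolean[slen];
boolean[] tmatched = new boolean[tlen];
Arrays.fill(smatched, false);
Arrays.fill(tmatched, false);
if (slen == 0) {
return tlen == 0 ? 1.0 : 0.0;
}
for (i = 0; i < slen; ++i) {
int start = Math.max(0, i-matchWindow);
// end is exclusive, so add 1 to include position i+matchWindow
int end = Math.min(i+matchWindow+1, tlen);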
for (j = start; j < end; j++) {
if (tmatched[j]) continue;
if (source.charAt(i) != target.charAt(j))
continue;
smatched[i] = true;
tmatched[j] = true;
++m;
break;
}
}
if (m == 0) return 0.0;
j = 0;
for (i = 0; i < slen; ++i) {
if (!smatched[i]) continue;
while (!tmatched[j]) ++j;
if (source.charAt(i) != target.charAt(j))
++t;
++j;
}
// each transposition was counted twice, once for each of the two swapped characters
t = t / 2;
return ((double)m/slen + (double)m/tlen + (double)(m-t)/m) / 3.0;
}
public static double jaroWinklerDistance(String source, String target) {
// length of the common prefix, capped at 4 characters
int max = Math.min(4, Math.min(source.length(), target.length()));
int len = 0;
while (len < max && source.charAt(len) == target.charAt(len)) {
++len;
}
double jaro = jaroDistance(source, target);
return jaro + 0.1 * len * (1.0 - jaro);
}
public static void main(String[] args) {
String source = "MARTHA";
String target = "MARHTA";
System.out.println("Dj = " + jaroDistance(source, target));
System.out.println("Dw = " + jaroWinklerDistance(source, target));
System.out.println();
source = "DICKSONX";
target = "DIXON";
System.out.println("Dj = " + jaroDistance(source, target));
System.out.println("Dw = " + jaroWinklerDistance(source, target));
System.out.println();
}
}
Output:
Dj = 0.9444444444444445
Dw = 0.9611111111111111
Dj = 0.7666666666666666
Dw = 0.8133333333333332
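
The printed values can be checked by hand. The sketch below is my own arithmetic, not taken from the Wikipedia article. It uses $m$ for the number of matching characters, $t$ for the halved transposition count and $\ell$ for the common-prefix length, all counted from the two test pairs in `main`.

```latex
% MARTHA / MARHTA: all 6 characters match, T and H are swapped (t = 1), common prefix "MAR" (l = 3)
\[
d_j = \frac{1}{3}\left( \frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6} \right) = \frac{17}{18} \approx 0.9444,
\qquad
d_w = \frac{17}{18} + 3 \cdot 0.1 \cdot \frac{1}{18} = \frac{173}{180} \approx 0.9611
\]
% DICKSONX / DIXON: D, I, O, N match (m = 4), none out of order (t = 0), common prefix "DI" (l = 2)
\[
d_j = \frac{1}{3}\left( \frac{4}{8} + \frac{4}{5} + \frac{4 - 0}{4} \right) = \frac{23}{30} \approx 0.7667,
\qquad
d_w = \frac{23}{30} + 2 \cdot 0.1 \cdot \frac{7}{30} = \frac{61}{75} \approx 0.8133
\]
```

The X in DIXON stays unmatched because the only X in DICKSONX is at index 7, outside the match window of 3 around index 2.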