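Jaccard token distance
The Jaccard token distance between two names is computed from the number of tokens the names have in common and the total number of tokens in the two names:
J(s, t) = 1 - |S ∩ T| / (|S| + |T| - |S ∩ T|)
where S and T are the token lists of s and t.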
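Jaccard distance (simplified)
To reduce the computational complexity, the denominator can be replaced by the plain sum of the two token counts:
J(s, t) = 1 - 2 * |S ∩ T| / (|S| + |T|)
This form is also known as the Sørensen-Dice distance.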
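Jaccard distance (weighted)
In the weighted Jaccard distance, each token x gets a weight that decreases with its frequency freq(x) in the list of names:
w(x) = 1 / (ln(freq(x)) + 1)
The token counts of the simplified formula are then replaced by sums of weights:
J(s, t) = 1 - 2 * Σ_{x ∈ S ∩ T} w(x) / (Σ_{x ∈ S} w(x) + Σ_{x ∈ T} w(x))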
The corresponding Jaccard similarity functions are just the last part of each formula above, i.e. the fraction that is subtracted from 1.
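The relationship between similarity and distance can be sketched as follows. This is a minimal illustration, not part of the original listing: the class name `JaccardSimilaritySketch` and the helper `tokens` are hypothetical, and it assumes that names are split on whitespace and that repeated tokens are counted once (it uses sets, unlike the list-based code below).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardSimilaritySketch {

    // Split a name on whitespace and keep each distinct token once.
    static Set<String> tokens(String name) {
        return new HashSet<>(Arrays.asList(name.trim().split("\\s+")));
    }

    // Similarity: |S ∩ T| / |S ∪ T|, the fraction subtracted from 1 in the distance.
    public static double similarity(String source, String target) {
        Set<String> s = tokens(source);
        Set<String> t = tokens(target);
        Set<String> common = new HashSet<>(s);
        common.retainAll(t);
        int union = s.size() + t.size() - common.size();
        return (double) common.size() / union;
    }

    // Distance is the complement of the similarity.
    public static double distance(String source, String target) {
        return 1.0 - similarity(source, target);
    }

    public static void main(String[] args) {
        String s1 = "AAE HOLDING";
        String s2 = "AAE TECHNOLOGY INTERNATIONAL";
        System.out.println(similarity(s1, s2)); // prints 0.25
        System.out.println(distance(s1, s2));   // prints 0.75
    }
}
```

For names without repeated tokens, `distance` here gives the same value as the first distance formula; the two differ only when a token occurs more than once in a name.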
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JaccardDistance {

    // Token weights, filled by JaccardWeight() and read by Jaccard3().
    public static Map<String, Double> weightMap = new HashMap<String, Double>();

    /**
     * Returns the distinct tokens that occur in both strings.
     * Tokens are separated by single spaces.
     * @param source first name
     * @param target second name
     * @return tokens common to source and target, without duplicates
     */
    public static List<String> intersection(String source, String target) {
        List<String> slist = Arrays.asList(source.split(" "));
        List<String> tlist = Arrays.asList(target.split(" "));
        List<String> intersection = new ArrayList<String>();
        for (String s : slist) {
            if (tlist.contains(s)) {
                if (!intersection.contains(s)) {
                    intersection.add(s);
                }
            }
        }
        return intersection;
    }

    /**
     * Jaccard token distance:
     * J(s,t) = 1 - |intersection(s,t)| / (|s| + |t| - |intersection(s,t)|)
     * Here |s| and |t| count every token, including repeated ones.
     * @param source first name
     * @param target second name
     * @return distance in [0, 1]; 1.0 means no common token
     */
    public static double Jaccard1(String source, String target) {
        List<String> slist = Arrays.asList(source.split(" "));
        List<String> tlist = Arrays.asList(target.split(" "));
        List<String> intersection = intersection(source, target);
        return (double) 1 - intersection.size() / (double) (slist.size() + tlist.size() - intersection.size());
    }

    /**
     * Simplified Jaccard distance:
     * J(s,t) = 1 - 2 * |intersection(s,t)| / (|s| + |t|)
     * @param source first name
     * @param target second name
     * @return distance in [0, 1]; 1.0 means no common token
     */
    public static double Jaccard2(String source, String target) {
        List<String> slist = Arrays.asList(source.split(" "));
        List<String> tlist = Arrays.asList(target.split(" "));
        List<String> intersection = intersection(source, target);
        return (double) 1 - 2 * intersection.size() / (double) (slist.size() + tlist.size());
    }

    /**
     * Computes a weight for every token in the given names and stores it in weightMap.
     * The weight of a token is 1 / (ln(freq) + 1), where freq is the number of
     * times the token occurs across all names, so frequent tokens weigh less.
     * @param stringList the names to collect token frequencies from
     */
    public static void JaccardWeight(List<String> stringList) {
        // Count how often each token occurs across all names.
        Map<String, Integer> freqMap = new HashMap<String, Integer>();
        for (String string : stringList) {
            List<String> slist = Arrays.asList(string.split(" "));
            for (String s : slist) {
                s = s.trim();
                if (freqMap.containsKey(s)) {
                    freqMap.put(s, freqMap.get(s) + 1);
                } else {
                    freqMap.put(s, 1);
                }
            }
        }
        // Turn frequencies into weights; Math.log is the natural logarithm.
        for (String key : freqMap.keySet()) {
            int freq = freqMap.get(key);
            double weight = (double) 1 / (Math.log(freq) + 1);
            weightMap.put(key, weight);
        }
    }

    /**
     * Weighted Jaccard distance:
     * J(s,t) = 1 - 2 * weight(intersection(s,t)) / (weight(s) + weight(t))
     * JaccardWeight() must have been called with names that contain every token
     * of source and target; otherwise weightMap.get() returns null and the
     * unboxing throws a NullPointerException.
     * @param source first name
     * @param target second name
     * @return weighted distance in [0, 1]
     */
    public static double Jaccard3(String source, String target) {
        List<String> slist = Arrays.asList(source.split(" "));
        List<String> tlist = Arrays.asList(target.split(" "));
        List<String> intersection = intersection(source, target);
        double intersectionWeight = 0;
        double sourceWeight = 0;
        double targetWeight = 0;
        for (String s : intersection) {
            intersectionWeight += weightMap.get(s);
        }
        for (String s : slist) {
            sourceWeight += weightMap.get(s);
        }
        for (String s : tlist) {
            targetWeight += weightMap.get(s);
        }
        return 1 - 2 * intersectionWeight / (sourceWeight + targetWeight);
    }

    // Compare three sample company names with all three distances.
    public static void main(String[] args) {
        String s1 = "AAE HOLDING";
        String s2 = "AAE TECHNOLOGY INTERNATIONAL";
        String s3 = "AGRIPA HOLDING";
        System.out.println("J1(s1, s2) = " + Jaccard1(s1, s2));
        System.out.println("J1(s1, s3) = " + Jaccard1(s1, s3));
        System.out.println("J1(s2, s3) = " + Jaccard1(s2, s3));
        System.out.println();
        System.out.println("J2(s1, s2) = " + Jaccard2(s1, s2));
        System.out.println("J2(s1, s3) = " + Jaccard2(s1, s3));
        System.out.println("J2(s2, s3) = " + Jaccard2(s2, s3));
        System.out.println();
        // The weights must be computed before Jaccard3 is called.
        List<String> stringList = new ArrayList<String>();
        Collections.addAll(stringList, s1, s2, s3);
        JaccardWeight(stringList);
        System.out.println(weightMap);
        System.out.println("J3(s1, s2) = " + Jaccard3(s1, s2));
        System.out.println("J3(s1, s3) = " + Jaccard3(s1, s3));
        System.out.println("J3(s2, s3) = " + Jaccard3(s2, s3));
    }
}
Output:
J1(s1, s2) = 0.75
J1(s1, s3) = 0.6666666666666667
J1(s2, s3) = 1.0

J2(s1, s2) = 0.6
J2(s1, s3) = 0.5
J2(s2, s3) = 1.0

{AAE=0.5906161091496412, TECHNOLOGY=1.0, AGRIPA=1.0, INTERNATIONAL=1.0, HOLDING=0.5906161091496412}
J3(s1, s2) = 0.6868293431358082
J3(s1, s3) = 0.5738467337473576
J3(s2, s3) = 1.0
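As a sanity check, the weighted result for s1 and s2 can be reproduced by hand from the printed weight map. The derivation below is a worked sketch: it relies on Math.log being the natural logarithm and rounds intermediate values to four decimals, so the last digits differ slightly from the program output.

```latex
\begin{align*}
w(\text{AAE}) = w(\text{HOLDING}) &= \frac{1}{\ln 2 + 1} \approx 0.5906 \\
J_3(s_1, s_2) &= 1 - \frac{2 \cdot 0.5906}{(0.5906 + 0.5906) + (0.5906 + 1 + 1)} \\
              &= 1 - \frac{1.1812}{3.7718} \approx 0.6868
\end{align*}
```

AAE and HOLDING each occur in two of the three names, so they weigh less than the tokens that occur only once. That is why sharing AAE lowers the weighted distance of s1 and s2 less than it would if AAE were unique to those two names.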