Fuzzy data operations
    1.
    Granted invention patent

    Publication number: US11615093B2

    Publication date: 2023-03-28

    Application number: US15434777

    Filing date: 2017-02-16

    Inventor: Arlen Anderson

    Abstract: A method for clustering data elements stored in a data storage system includes reading data elements from the data storage system. Clusters of data elements are formed with each data element being a member of at least one cluster. At least one data element is associated with two or more clusters. Membership of the data element belonging to respective ones of the two or more clusters is represented by a measure of ambiguity. Information is stored in the data storage system to represent the formed clusters.
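
The partial membership described in this abstract can be illustrated with a minimal sketch. It is not the patented method: the word-overlap similarity, the `threshold` parameter, and the function names are assumptions chosen for illustration. Each element joins every cluster whose representative is similar enough, and its measure of ambiguity for a cluster is its share of the total similarity, so the measures for one element sum to 1.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity of the word sets of two strings (illustrative choice)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def fuzzy_assign(elements, representatives, threshold=0.5):
    """Return {element: {representative: ambiguity measure}}."""
    memberships = {}
    for element in elements:
        scores = {rep: word_overlap(element, rep) for rep in representatives}
        close = {rep: s for rep, s in scores.items() if s >= threshold}
        total = sum(close.values())
        if total == 0:
            # Every element belongs to at least one cluster: fall back to
            # the single best-scoring representative with full membership.
            best = max(scores, key=scores.get)
            memberships[element] = {best: 1.0}
            continue
        memberships[element] = {rep: s / total for rep, s in close.items()}
    return memberships


reps = ["acme trading ltd", "acme holdings ltd", "zenith bank plc"]
result = fuzzy_assign(["acme ltd", "zenith bank"], reps)
# "acme ltd" is shared between two clusters; "zenith bank" belongs to one.
```

Normalizing the scores is only one way to express ambiguity; the abstract does not fix a formula.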

    Profiling data with location information
    2.
    Invention application
    Legal status: in force

    Publication number: US20140114968A1

    Publication date: 2014-04-24

    Application number: US14059590

    Filing date: 2013-10-22

    Inventor: Arlen Anderson

    Abstract: Profiling data includes processing an accessed collection of records, including: generating, for a first set of distinct values appearing in a first set of one or more fields, corresponding location information; generating, for the first set of fields, a corresponding list of entries identifying a distinct value from the first set of distinct values and the location information for the distinct value; generating, for a second set of one or more fields, a corresponding list of entries, with each entry identifying a distinct value from a second set of distinct values appearing in the second set of fields; and generating result information, based at least in part on: locating at least one record of the collection using the location information for at least one value appearing in the first set of fields, and determining at least one value appearing in the second set of fields of the located record.
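
A minimal sketch of the idea, under stated assumptions: records are Python dicts, the "location information" is a list of record indices, and the function names are hypothetical. The first function builds the list of entries (distinct value plus locations) for one field; the second uses those locations to jump straight to the matching records and read a second field.

```python
from collections import defaultdict


def census_with_locations(records, field):
    """Return {distinct value: [indices of the records holding it]}."""
    locations = defaultdict(list)
    for index, record in enumerate(records):
        locations[record[field]].append(index)
    return dict(locations)


def co_occurring_values(records, locations, value, other_field):
    """Locate records through the stored indices and read another field."""
    return sorted({records[i][other_field] for i in locations.get(value, [])})


records = [
    {"state": "MA", "zip": "02139"},
    {"state": "MA", "zip": "02140"},
    {"state": "NY", "zip": "10001"},
    {"state": "MA", "zip": "02139"},
]
state_locations = census_with_locations(records, "state")
ma_zips = co_occurring_values(records, state_locations, "MA", "zip")
```

Keeping the indices alongside each distinct value avoids rescanning the whole collection when a question about one value comes up later.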

    Profiling data with source tracking

    Publication number: US10719511B2

    Publication date: 2020-07-21

    Application number: US15431008

    Filing date: 2017-02-13

    Inventor: Arlen Anderson

    Abstract: Profiling data includes accessing multiple collections of records to store quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, each including a value appearing in the selected field and a count of the number of records in which the value appears. Processing the quantitative information of two or more collections includes: merging the value count entries of corresponding lists for at least one field from each of a first collection and a second collection to generate a combined list of value count entries, and aggregating value count entries of the combined list of value count entries to generate a list of distinct field value entries identifying a distinct value and including information quantifying a number of records in which the distinct value appears for each of the two or more collections.
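
The merge-then-aggregate step can be sketched as follows. This is an illustration only: the dict-based record layout, the source names, and the function names are assumptions, not details from the patent.

```python
from collections import Counter


def census(records, field):
    """Value count entries for one field: {value: number of records}."""
    return Counter(record[field] for record in records)


def merge_censuses(censuses):
    """Combine per-source censuses into {value: {source: count}}."""
    # Combined list of value count entries, each tagged with its source.
    combined = [
        (value, source, count)
        for source, counts in censuses.items()
        for value, count in counts.items()
    ]
    # Aggregate into one entry per distinct value, with a count per source.
    merged = {}
    for value, source, count in combined:
        entry = merged.setdefault(value, {name: 0 for name in censuses})
        entry[source] += count
    return merged


first = [{"country": "US"}, {"country": "US"}, {"country": "FR"}]
second = [{"country": "US"}, {"country": "DE"}]
merged = merge_censuses({
    "first": census(first, "country"),
    "second": census(second, "country"),
})
```

Because each distinct value keeps a separate count per source, values that appear in only one collection are easy to spot (their count for the other source is zero).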

    Generating data pattern information

    Publication number: US09652513B2

    Publication date: 2017-05-16

    Application number: US14954434

    Filing date: 2015-11-30

    Inventor: Arlen Anderson

    Abstract: A data storage system stores at least one dataset including a plurality of records. A data processing system, coupled to the data storage system, processes the plurality of records to produce codes representing data patterns in the records, the processing including: for each of multiple records in the plurality of records, associating with the record a code encoding one or more elements, wherein each element represents a state or property of a corresponding field or combination of fields as one of a set of element values, and, for at least one element of at least a first code, the number of element values in the set is smaller than the total number of data values that occur in the corresponding field or combination of fields over all of the plurality of records in the dataset.
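
One simple reading of this abstract, offered as a sketch rather than the patented encoding: each field is reduced to one of three element values (populated, blank, null), and the per-field elements are concatenated into a code for the record. The three-state alphabet and the function names are assumptions for illustration.

```python
from collections import Counter


def field_state(value):
    """Map any data value to one of three element values."""
    if value is None:
        return "N"  # null or missing
    if str(value).strip() == "":
        return "B"  # blank
    return "P"      # populated


def pattern_code(record, fields):
    """Concatenate one element per field into the record's pattern code."""
    return "".join(field_state(record.get(name)) for name in fields)


records = [
    {"name": "Ann", "phone": "555-0100"},
    {"name": "Bob", "phone": ""},
    {"name": "Cy", "phone": None},
]
code_counts = Counter(pattern_code(r, ["name", "phone"]) for r in records)
```

However many distinct names or phone numbers occur, each element takes only three possible values, so counting codes gives a compact picture of how the dataset is populated.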

    Managing an archive for approximate string matching
    5.
    Granted invention patent
    Legal status: in force

    Publication number: US09563721B2

    Publication date: 2017-02-07

    Application number: US14325007

    Filing date: 2014-07-07

    Inventor: Arlen Anderson

    Abstract: In one aspect, in general, a method is described for managing an archive for determining approximate matches associated with strings occurring in records. The method includes: processing records to determine a set of string representations that correspond to strings occurring in the records; generating, for each of at least some of the string representations in the set, a plurality of close representations that are each generated from at least some of the same characters in the string; and storing entries in the archive that each represent a potential approximate match between at least two strings based on their respective close representations.
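
A minimal sketch of an archive of potential matches, assuming the "close representations" of a string are the string itself plus every variant with one character deleted. That choice, and the function names, are assumptions for illustration; the abstract only requires that close representations be built from some of the same characters.

```python
from collections import defaultdict
from itertools import combinations


def close_representations(s):
    """The string itself plus every variant with one character removed."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


def build_archive(strings):
    """Return pairs of strings that share at least one close representation."""
    by_representation = defaultdict(set)
    for s in set(strings):
        for rep in close_representations(s):
            by_representation[rep].add(s)
    archive = set()
    for group in by_representation.values():
        # Each pair in a group is a potential approximate match.
        archive.update(combinations(sorted(group), 2))
    return archive


archive = build_archive(["smith", "smyth", "smit", "jones"])
```

Pairs in the archive are only candidates; a later scoring step would decide whether each pair is a true match.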

    Characterizing data sources in a data storage system

    Publication number: US20230169053A1

    Publication date: 2023-06-01

    Application number: US17860568

    Filing date: 2022-07-08

    Inventor: Arlen Anderson

    CPC classification number: G06F16/215 G06F16/2365 G06F16/22

    Abstract: Characterizing data includes: reading data from an interface to a data storage system, and storing two or more sets of summary data summarizing data stored in different respective data sources in the data storage system; and processing the stored sets of summary data to generate system information characterizing data from multiple data sources in the data storage system. The processing includes: analyzing the stored sets of summary data to select two or more data sources that store data satisfying predetermined criteria, and generating the system information including information identifying a potential relationship between fields of records included in different data sources based at least in part on comparison between values from a stored set of summary data summarizing a first of the selected data sources and values from a stored set of summary data summarizing a second of the selected data sources.
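
As a rough illustration of comparing summary data across sources, the sketch below stores the set of distinct values per field as the summary, keeps fields with enough distinct values (a stand-in for the "predetermined criteria"), and reports field pairs from different sources whose values largely overlap. The thresholds, the overlap formula, and all names are assumptions, not details from the application.

```python
def summarize(records, source):
    """Summary data for one source: {(source, field): set of distinct values}."""
    summary = {}
    for record in records:
        for field, value in record.items():
            summary.setdefault((source, field), set()).add(value)
    return summary


def potential_relationships(summaries, min_distinct=2, min_overlap=0.8):
    """Return (field_a, field_b, overlap) for cross-source field pairs."""
    selected = {k: v for k, v in summaries.items() if len(v) >= min_distinct}
    keys = sorted(selected)
    found = []
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if a[0] == b[0]:
                continue  # only compare fields from different sources
            va, vb = selected[a], selected[b]
            overlap = len(va & vb) / min(len(va), len(vb))
            if overlap >= min_overlap:
                found.append((a, b, overlap))
    return found


customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
orders = [{"order": 10, "cust": 1}, {"order": 11, "cust": 2}, {"order": 12, "cust": 2}]
summaries = {**summarize(customers, "customers"), **summarize(orders, "orders")}
links = potential_relationships(summaries)
```

A high overlap only suggests a relationship, such as a possible key and foreign key pair; it does not prove one.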

    Profiling data with location information

    Publication number: US09990362B2

    Publication date: 2018-06-05

    Application number: US14859502

    Filing date: 2015-09-21

    Inventor: Arlen Anderson

    Abstract: Profiling data includes processing an accessed collection of records, including: generating, for a first set of distinct values appearing in a first set of one or more fields, corresponding location information; generating, for the first set of fields, a corresponding list of entries identifying a distinct value from the first set of distinct values and the location information for the distinct value; generating, for a second set of one or more fields, a corresponding list of entries, with each entry identifying a distinct value from a second set of distinct values appearing in the second set of fields; and generating result information, based at least in part on: locating at least one record of the collection using the location information for at least one value appearing in the first set of fields, and determining at least one value appearing in the second set of fields of the located record.

    Profiling data with location information

    Publication number: US09323749B2

    Publication date: 2016-04-26

    Application number: US14059590

    Filing date: 2013-10-22

    Inventor: Arlen Anderson

    Abstract: Profiling data includes processing an accessed collection of records, including: generating, for a first set of distinct values appearing in a first set of one or more fields, corresponding location information; generating, for the first set of fields, a corresponding list of entries identifying a distinct value from the first set of distinct values and the location information for the distinct value; generating, for a second set of one or more fields, a corresponding list of entries, with each entry identifying a distinct value from a second set of distinct values appearing in the second set of fields; and generating result information, based at least in part on: locating at least one record of the collection using the location information for at least one value appearing in the first set of fields, and determining at least one value appearing in the second set of fields of the located record.

    Data clustering based on variant token networks
    9.
    Granted invention patent
    Legal status: in force

    Publication number: US09037589B2

    Publication date: 2015-05-19

    Application number: US13678038

    Filing date: 2012-11-15

    Inventor: Arlen Anderson

    Abstract: Received data records, each including one or more values in one or more fields, are processed to identify one or more data clusters. The processing includes: identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields; generating a network representing the identified tokens, with nodes of the network representing tokens and edges of the network each representing a variant relationship between tokens; and generating a graphical representation of the network with different subsets of nodes distinguished based at least in part on values associated with nodes, where a value associated with a particular node quantifies a count of a number of instances of the token represented by that particular node appearing within the received data records.
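
The token network can be sketched as a node-count table plus an edge set. In this illustration, tokens are lowercase whitespace-separated words and two tokens are variants when they differ by a single inserted, deleted, or substituted character. Both choices, and the function names, are assumptions; the abstract does not define the variant relationship.

```python
from collections import Counter
from itertools import combinations


def is_variant(a, b):
    """True if a and b differ by exactly one edit (insert, delete, substitute)."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = sorted((a, b), key=len)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def token_network(records):
    """Return (node counts, edges) for the tokens found in the records."""
    counts = Counter(token for text in records for token in text.lower().split())
    edges = {
        (a, b) for a, b in combinations(sorted(counts), 2) if is_variant(a, b)
    }
    return counts, edges


counts, edges = token_network(
    ["Main Street", "Main Stret", "Main Street", "Maine Street"]
)
```

A rendering layer could then size or color each node by its count, so that a frequent token such as "street" stands out from a rare variant such as "stret".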

    Data clustering, segmentation, and parallelization

    Publication number: US20200320102A1

    Publication date: 2020-10-08

    Application number: US16676704

    Filing date: 2019-11-07

    Inventor: Arlen Anderson

    Abstract: A first set of original records is processed by a first processing entity to generate a second set of records that includes the original records and one or more copies of each original record, each original record including one or more fields. The processing of each of at least some of the original records includes: generating at least one copy of the original record, and associating a first segment value with the original record and associating a second segment value with the copy. The method also includes partitioning the second set of records among a plurality of recipient processing entities based on the segment values associated with the records in the second set, and, at each recipient processing entity, performing an operation based on one or more data values of the records received at the recipient processing entity to generate results.
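
The replicate-and-partition flow can be sketched in a single process. This is an illustration under assumptions: `segment_values` is a hypothetical caller-supplied function returning a non-empty list of segment values per record, workers are simulated as dictionary keys, and the per-worker operation is a simple count. In this sketch a record gets copies only when it has more than one segment value.

```python
import zlib
from collections import Counter, defaultdict


def replicate(records, segment_values):
    """Pair each original record and each of its copies with one segment value."""
    segmented = []
    for record in records:
        segments = segment_values(record)
        segmented.append((segments[0], record))        # original record
        for segment in segments[1:]:
            segmented.append((segment, dict(record)))  # one copy per extra segment
    return segmented


def partition(segmented, n_workers):
    """Route records to workers by a stable hash of the segment value."""
    parts = defaultdict(list)
    for segment, record in segmented:
        worker = zlib.crc32(segment.encode("utf-8")) % n_workers
        parts[worker].append((segment, record))
    return dict(parts)


def count_by_segment(part):
    """Example per-worker operation: count the records seen in each segment."""
    return Counter(segment for segment, _ in part)


records = [{"name": "john smith"}, {"name": "jon smith"}, {"name": "smith j"}]
initials = lambda r: sorted({word[0] for word in r["name"].split()})
parts = partition(replicate(records, initials), n_workers=4)
results = {worker: count_by_segment(part) for worker, part in parts.items()}
```

Copying a record into every segment it might match means that records whose fields are swapped or misspelled still meet on at least one worker, at the cost of processing more records in total.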
