Package 'chromseq'

Title: Split Chromosome 'Fasta' File
Description: Chromosome files in the 'Fasta' format usually contain large sequences like human genome. Sometimes users have to split these chromosomes into different files according to their chromosome number. The 'chromseq' can help to handle this. So the selected chromosome sequence can be used for downstream analysis like motif finding. Howard Y. Chang(2019) <doi:10.1038/s41587-019-0206-z>.
Authors: Shaoqian Ma [aut, cre]
Maintainer: Shaoqian Ma <[email protected]>
License: Artistic-2.0
Version: 0.1.3
Built: 2025-02-28 05:47:21 UTC
Source: https://github.com/msq-123/chromseq

Help Index


Sampled Fasta file of chromosome sequence from hg19 blacklist

Description

This dataset is sampled from The hg19 blacklist. For splitting a chromosome Fasta file, sometimes the Fasta identifier is too complicated to manipulate. This data can be used to show how to simplify the Fasta identifier.

Usage

data(id)

Format

A character sequence with 20 elements

References

Satpathy A T, Granja J M, Yost K E, et al. (2019) Nature biotechnology 37,925–936. (PubMed)

Examples

data(id)

Make a list file from large chromosome Fasta file

Description

Make a list file from large chromosome Fasta file

Usage

readToList(id = id, text = text, con = con)

Arguments

id

The id list made from subFasID function

text

Large character read in by readLines function from Fasta file

con

A connection object or a character string, the connection must refer to the same Fasta file as text

Value

Chromosome Fasta file in list format.

Examples

data("text")
id <- subFasID(text = text)
fil <- tempfile(fileext = ".data")
write(text,file = fil)
con0 <- file(fil, "r")
tex <- readToList(id,text = text,con = con0)

Replace tedious chromosome identifier into simple format

Description

Make the chromosome id starting with ">" into simple format like ">chr:1091194-1093520...",this is helpful for sorting the chromosome according to their number

Usage

replaceText(type = "text", input = input)

Arguments

type

This can be either "text" or "list", The previous is a large character containing each line of the Fasta file, the latter is a list in which each element contains a unit of Fasta file

input

The large character or list containing ids that need to be simplified

Value

The large character or list of Chromosome Fasta file with simplified id.

Author(s)

Shaoqian Ma

Examples

data("id")
simpleID<- replaceText(type = "text",input = id)

Sort the chromosome list according to the chromosome number

Description

Sort the chromosome list according to the chromosome number

Usage

sortList(id = id, tex = tex, chrsig = "single")

Arguments

id

The identifier list of the Fasta file made by subFasID

tex

A chromosome Fasta file in list format made by readToList function

chrsig

The number of characters of the chromosome, either "single" or "double", the previous means a single character following "chr" in the Fasta identifier, the latter means two characters following "chr" in the Fasta identifier. eg."chr1,chrX,chrY,chrM" is "single";"chr10,chr11" is "double". If you want to obtain both "single" and "double" sorted list of chromosome, try "single" and "double" respectively

Value

The sorted chromosome Fasta file in list format.

Examples

data("tex")
data("text")
text<- replaceText(type = "text",input = text)
id <- subFasID(text = text)
tex2<- sortList(id=id,tex = tex,chrsig = "single")
tex3 <- sortList(id=id,tex = tex,chrsig = "double")

Split all chromosomes from the sorted chromosome list

Description

Split all chromosomes from the sorted chromosome list

Usage

splitChr(tex = tex, chr = chr, sex = FALSE, outdir = ".")

Arguments

tex

The sorted chromosome list made by sortList function.

chr

The chromosome number sequence, if the chromosome list is "single" which means a single character following "chr" in the Fasta identifier, be sure starting with 1 and ending with 9; if the chromosome list is "double" which means two characters following "chr" in the Fasta identifier, be sure that starting with 10 but the ending can be changed.

sex

Whether to output the sex chromosomes like X chromosome and Y chromosome.

outdir

The output directory.

Value

Write the splitted chromosome Fasta file to separated txt files according to the chromosome number.

Author(s)

Shaoqian Ma

Examples

data(tex)
data(text)
#Simplify the Fasta id
text<- replaceText(type = "text",input = text)
#Subtract id
id <- subFasID(text = text)
#Sort the fasta according to the chromosome number in id
tex2<- sortList(id=id,tex = tex,chrsig = "single")
tex3 <- sortList(id=id,tex = tex,chrsig = "double")
outdir <- tempdir()
#Output the results
splitChr(tex = tex2,chr=seq(1,9),sex = TRUE,outdir = outdir)
splitChr(tex = tex3,chr=seq(10,22),sex = FALSE,outdir = outdir)

Subtract chromosome ids from Fasta file

Description

Subtract chromosome ids from Fasta file

Usage

subFasID(text = text)

Arguments

text

Large character read by readLines from chromosome Fasta file.

Value

The id list of the Fasta file.

Examples

data("text")
text<- replaceText(type = "text",input = text)
id <- subFasID(text = text)

Fasta file of chromosome sequence produced from sequence character

Description

Data from "Three representative inter and intra-subspecific crosses reveal the genetic architecture of reproductive isolation in rice."

Usage

data(tex)

Format

A large list containing 62 elements.

References

Li, G. et al. (2017) The Plant Journal 92, 349–362. (PubMed)

Examples

data(tex)

Fasta file of chromosome sequence

Description

A downsampled dataset containing the hg19 chromosome sequence from the hg19 blacklist. The hg19 blacklist is obtained from the supplementary dataset from "Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion." The dataset is sent to the UCSC Table Browser for obtaining the corresponding sequence file. The sequence file is processed with replaceText function to simplify the fasta id. To best illustate the usage, the sequence file is downsampled.

Usage

data(text)

Format

A character sequence with 2099 elements.

References

Satpathy A T, Granja J M, Yost K E, et al. (2019) Nature biotechnology 37, 925–936. (PubMed)

Examples

data(text)